CN110362831B - Target user identification method, device, electronic equipment and storage medium - Google Patents


Publication number
CN110362831B
Authority
CN
China
Prior art keywords
user
characteristic value
target
barrage
day
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910649162.0A
Other languages
Chinese (zh)
Other versions
CN110362831A (en)
Inventor
王非池
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Caimeng Technology Co ltd
Original Assignee
Guangzhou Caimeng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Caimeng Technology Co ltd
Priority to CN201910649162.0A
Publication of CN110362831A
Application granted
Publication of CN110362831B
Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/442Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N21/44213Monitoring of end-user related data
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/442Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N21/44213Monitoring of end-user related data
    • H04N21/44222Analytics of user selections, e.g. selection of programs or purchase activity
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/478Supplemental services, e.g. displaying phone caller identification, shopping application
    • H04N21/4788Supplemental services, e.g. displaying phone caller identification, shopping application communicating with other users, e.g. chatting
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Social Psychology (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A target user identification method, applied to the technical field of communications, comprises the following steps: acquiring text characteristics of all barrages released by a user on the current day; constructing a logistic regression model according to the text characteristics to obtain a characteristic value of the user on the current day; acquiring historical barrage data of the user; and judging whether the user is a target user based on the characteristic value of the user on the current day and the historical barrage data. The invention also discloses a target user identification device, electronic equipment and a storage medium. By combining the historical barrage data released by the user, the user is identified comprehensively, which improves the accuracy of target user identification and avoids the problem of normal users being unable to use the barrage function to enter barrage messages.

Description

Target user identification method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a target user identification method, apparatus, electronic device, and storage medium.
Background
The barrage (bullet-screen comment) is one of the most direct means of user interaction on a live-streaming platform. A high-quality barrage culture makes it easier for users to form a good impression of the platform, which in turn increases the number of platform users. However, during a live broadcast, some users often release barrages whose content does not comply with the barrage rules, with the result that an excessive number of barrages occupy the entire screen. Such users are referred to herein as target users.
The main characteristic of a target user is uncivil language that includes keywords such as profanity. The prior-art method filters barrages for keywords using regular expressions and then forbids any user who released a barrage containing those keywords from releasing barrages again. This method is one-sided: it easily misjudges a normal user as a target user, so its accuracy is low, and the misjudged normal user can no longer use the barrage function to enter barrage messages.
Disclosure of Invention
The main object of the invention is to provide a target user identification method, a target user identification device, electronic equipment and a storage medium that improve the accuracy of target user identification and avoid the problem of normal users being unable to use the barrage function to enter barrage messages.
To achieve the above object, a first aspect of an embodiment of the present invention provides a target user identification method, including:
acquiring text characteristics of all barrages released by a user in the same day;
constructing a logistic regression model according to the text characteristics to obtain the characteristic value of the user on the same day;
acquiring historical barrage data of the user;
and judging whether the user is a target user or not based on the characteristic value of the user on the current day and the historical barrage data.
Further, the determining whether the user is a target user based on the characteristic value of the user on the current day and the historical barrage data includes:
calculating the comprehensive characteristic value of the user based on the characteristic value of the user on the current day and the historical barrage data;
judging whether the comprehensive characteristic value of the user is larger than a preset threshold value or not;
and if the comprehensive characteristic value of the user is larger than the preset threshold value, the user is a target user.
Further, the calculating the comprehensive characteristic value of the user based on the characteristic value of the user on the current day and the historical barrage data comprises:
acquiring the number N of active days of the user, the set $T_{total}$ of all historical barrages of the user, the set $T_{spam}$ of historical target barrages of the user, the set $C_{total}$ of all barrages of the user on the current day, and the set $C_{spam}$ of target barrages of the user on the current day;
letting the comprehensive characteristic value of the user be Y and the characteristic value of the user on the current day be f(x), then:
[formula not reproduced in this text]
wherein σ is a preset parameter, and 1 ≤ σ ≤ 1.5.
Further, the calculating the comprehensive characteristic value of the user based on the characteristic value of the user on the current day and the historical barrage data comprises:
acquiring the number N of active days of the user;
letting the comprehensive characteristic value of the user be Y, the characteristic value of the user on the current day be f(x), and the characteristic value of the user on the previous day be $Y_{\text{prev}}$, then:
$$Y = \frac{N-1}{N+1}\,Y_{\text{prev}} + \frac{2}{N+1}\,f(x)$$
Further, the text features include punctuation features, expression features, negative word features, and the TF-IDF value of each term in all the barrages.
Further, the constructing a logistic regression model according to the text features to obtain the characteristic value of the user on the current day comprises:
acquiring the preset weight coefficient corresponding to each feature among the text features;
letting the i-th of the n text features be $x_i$, the preset weight coefficient corresponding to $x_i$ be $\theta_i$, and the characteristic value of the user on the current day be f(x), then:
$$f(x) = \frac{1}{1 + e^{-\left(\sum_{i=1}^{n} \theta_i x_i + b\right)}}$$
where e is the natural constant and b is a natural number.
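As a minimal sketch of the evaluation step just described, assuming the model has the usual logistic form (a sigmoid applied to the weighted sum of the features plus b), the current-day characteristic value could be computed as follows. The function name `daily_characteristic_value` and the sample inputs are illustrative assumptions and do not come from the patent; b is treated here as a plain numeric offset.

```python
import math


def daily_characteristic_value(features, weights, bias):
    """Return f(x) in (0, 1) for one user's current-day feature vector.

    Assumption: f(x) = sigmoid(sum_i theta_i * x_i + b).
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, features)) + bias
    # Numerically stable sigmoid: avoid exp() overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical feature vector and preset weights, for illustration only.
score = daily_characteristic_value([2.0, 1.0, 0.5], [0.8, 0.4, 1.2], bias=-1.0)
```

The stable branch for negative z is a design choice of this sketch; it yields the same value as the formula while avoiding overflow for strongly negative inputs.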
Further, when the text features include the TF-IDF values of the terms in all the barrages, the acquiring the text features of all the barrages released by the user on the current day comprises:
letting the TF-IDF value of the i-th term in the j-th barrage be $\text{TF-IDF}_{i,j}$, then:
$$\text{TF-IDF}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \times \log\frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $n_{i,j}$ is the number of occurrences of the i-th term in the j-th barrage, $\sum_k n_{k,j}$ is the number of all terms in the j-th barrage, $|D|$ is the total number of barrages, and $|\{j : t_i \in d_j\}|$ is the number of barrages that contain the i-th term.
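The quantities defined above match the standard TF-IDF weighting, so a short sketch may help. It assumes the barrages have already been segmented into terms and that the natural logarithm is used; the function name `tf_idf` and the sample barrages are hypothetical.

```python
import math
from collections import Counter


def tf_idf(barrages):
    """Compute TF-IDF for every term of every barrage.

    barrages: list of token lists (one list per barrage).
    Returns one dict per barrage mapping term -> TF-IDF value.
    """
    num_barrages = len(barrages)
    # Number of barrages that contain each term, i.e. |{j : t_i in d_j}|.
    containing = Counter()
    for tokens in barrages:
        containing.update(set(tokens))

    results = []
    for tokens in barrages:
        counts = Counter(tokens)          # n_{i,j}
        total = sum(counts.values())      # sum_k n_{k,j}
        results.append({
            term: (n / total) * math.log(num_barrages / containing[term])
            for term, n in counts.items()
        })
    return results


example = tf_idf([["good", "game"], ["good", "good", "stream"]])
```

A term that occurs in every barrage, such as "good" here, receives a value of zero, which is the usual behaviour of this weighting.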
A second aspect of an embodiment of the present invention provides a target user identifying apparatus, including:
the first acquisition module is used for acquiring text characteristics of all the barrages released by the user in the same day;
the construction module is used for constructing a logistic regression model according to the text characteristics to obtain the characteristic value of the user on the same day;
the second acquisition module is used for acquiring the historical barrage data of the user;
and the judging module is used for judging whether the user is a target user or not based on the characteristic value of the user on the current day and the historical barrage data.
A third aspect of an embodiment of the present invention provides an electronic device, including:
a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the target user identification method provided by the first aspect of the embodiment of the invention when executing the program.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the target user identification method provided by the first aspect of the embodiments of the present invention.
According to the target user identification method, the target user identification device, the electronic equipment and the storage medium, text characteristics of all barrages released by a user on the current day are acquired; a logistic regression model is constructed according to the text characteristics to obtain the characteristic value of the user on the current day; historical barrage data of the user is acquired; and whether the user is a target user is judged based on the characteristic value of the user on the current day and the historical barrage data. Because the user is identified comprehensively in combination with the historical barrage data released by the user, the accuracy of target user identification is improved, and the problem of normal users being unable to use the barrage function to enter barrage messages is avoided.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are necessary for the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention and that other drawings may be obtained from them without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a target user identification method according to an embodiment of the invention;
FIG. 2 is a flowchart of a target user identification method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a target user identification apparatus according to an embodiment of the present invention;
FIG. 4 shows a hardware configuration diagram of an electronic device.
Detailed Description
In order to make the objects, features and advantages of the present invention more comprehensible, the technical solutions in the embodiments of the present invention will be clearly described in conjunction with the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, in the target user identification method of the present disclosure, some steps may be performed on a terminal and the remaining steps on a server, or all steps may be performed on a terminal (for example, for offline target user identification). The execution by a server described in the following steps is therefore exemplary and is not the only manner of execution.
Referring to fig. 1, fig. 1 is a flowchart of a target user identification method according to an embodiment of the present invention, where the method can be applied to an electronic device with a barrage release function, and the electronic device includes: a cell phone, tablet (Portable Android Device, PAD), notebook, personal digital assistant (Personal Digital Assistant, PDA), etc., the method comprising the steps of:
S101, acquiring text characteristics of all barrages released by a user on the current day;
On a live-streaming website, a user communicates with the host or with other users by releasing barrages. The content characteristics of the barrages a user releases can indicate whether the user is a normal user or a potential target user. In the embodiment of the invention, the content characteristics of a barrage mainly comprise punctuation features, expression features, negative word features, and the TF-IDF values of the terms in all the barrages.
The barrages released by the user may be obtained by a server from a terminal; the terminal includes, but is not limited to, electronic equipment with a barrage release function, such as a smartphone, a tablet computer, or a smart home appliance.
For punctuation features and expression features: language on a live-streaming platform leans towards the spoken register, and barrages differ markedly from written language. They are generally shorter and contain a large number of expressions (emoticons), kaomoji, and the like. Punctuation marks and expressions in the barrages are therefore extracted as a basis for judging whether the current user is a potential target user, for example negative expressions such as "???" and ".......". In addition, for emoji expressions, an emoji vocabulary is constructed, and the emoji expressions are mapped to a single feature for processing.
For negative word features: target users often use words with negative connotations when releasing barrages, for example sarcastic words such as "haha", "sham", and "living in dream". Based on such negative words, it can be preliminarily determined whether the user is a potential target user.
Calculating the TF-IDF values of the terms in all the barrages is a common technical means for determining text features that is well known to those skilled in the art, and is not described here.
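The punctuation, expression and negative word features described in this step could be extracted roughly as follows. This is only a sketch: the regular expressions, the emoji code-point ranges and the function name `extract_features` are assumptions made for illustration, and the negative word list simply reuses the three examples quoted in the text.

```python
import re

# Example negative words taken from the description; a real list would be larger.
NEGATIVE_WORDS = ("haha", "sham", "living in dream")

# Assumed pattern: a run of three or more question marks, exclamation marks or dots.
PUNCT_RUN = re.compile(r"[?？!！.。…]{3,}")

# Assumed, deliberately coarse emoji ranges; every emoji maps to one feature.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def extract_features(barrage):
    """Return simple text features for a single barrage."""
    lowered = barrage.lower()
    return {
        "punct_runs": len(PUNCT_RUN.findall(barrage)),
        "has_emoji": 1 if EMOJI.search(barrage) else 0,
        "negative_words": sum(lowered.count(word) for word in NEGATIVE_WORDS),
    }


features = extract_features("haha ??? \U0001F602")
```

Mapping all emoji to a single `has_emoji` flag follows the idea of treating emoji expressions as one feature; per-emoji features would be an equally plausible variation.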
S102, constructing a logistic regression model according to the text characteristics to obtain the characteristic value of the user on the same day;
Logistic regression is a classification model in machine learning that is very widely used in practice because the algorithm is simple and efficient. A logistic regression model can evaluate multiple characteristics together, which gives a more accurate evaluation result and can effectively improve the accuracy of target user discrimination. At the same time, a logistic regression model is simpler than other artificial intelligence models, which effectively reduces the time required for discrimination. Using the logistic regression model, the server can conveniently generate an accurate discrimination score from the features of the barrages released by the user, for example 30 points or 90 points.
S103, acquiring historical barrage data of the user;
To avoid misjudging a normal user, this embodiment takes the user's historical barrage data into account and judges comprehensively, from the user's current-day barrage data and historical barrage data, whether the user is a real target user.
The historical barrage data of the user may be obtained by the server from the terminal, or may be historical barrage data of the terminal that is stored by the server.
S104, judging whether the user is a target user based on the characteristic value of the user on the current day and the historical barrage data.
The comprehensive characteristic value of the user is calculated based on the characteristic value of the user on the current day and the historical barrage data, and it is judged whether the comprehensive characteristic value of the user is larger than a preset threshold value; if it is larger than the preset threshold value, the user is a target user.
The server uses the characteristic value of the user on the current day together with the historical barrage data to obtain a comprehensive score for the recent period, which makes the evaluation of the target user more personalized and objective. After the user is judged to be a target user, operations such as muting the user or banning the account can be applied according to the platform rules.
In the embodiment of the invention, text characteristics of all barrages released by a user on the current day are acquired; a logistic regression model is constructed according to the text characteristics to obtain the characteristic value of the user on the current day; historical barrage data of the user is acquired; and whether the user is a target user is judged based on the characteristic value of the user on the current day and the historical barrage data. Because the user is identified comprehensively in combination with the historical barrage data released by the user, the accuracy of target user identification is improved, and the problem of normal users being unable to use the barrage function to enter barrage messages is avoided.
Referring to fig. 2, fig. 2 is a flowchart of a target user identification method according to an embodiment of the invention, where the method can be applied to an electronic device, and the electronic device includes: a mobile phone, a tablet personal computer (PAD), a notebook computer, a personal digital assistant (Personal Digital Assistant, PDA), etc., the method mainly comprises the following steps:
S201, acquiring text characteristics of all barrages released by a user on the current day;
Punctuation features, expression features and negative word features can be calculated by regular-expression matching. Feature calculation by regular-expression matching is well known to those skilled in the art and is not described here.
For the TF-IDF values of the terms in all the barrages: after the punctuation features and expression features are extracted, the remaining text is segmented into words, and the TF-IDF value of each term in all the barrages is calculated. Specifically, a preset number of the most frequent words in the barrages, for example the top 10000 or 20000, can be taken as the words to be processed, and the other words are discarded.
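Keeping only the most frequent words can be sketched as below. The sketch assumes the barrages are already segmented into tokens; the function names `build_vocabulary` and `keep_known_terms`, the default of 10000 terms and the toy data are illustrative assumptions.

```python
from collections import Counter


def build_vocabulary(tokenized_barrages, max_terms=10000):
    """Return the set of the max_terms most frequent words across all barrages."""
    frequency = Counter(
        token for tokens in tokenized_barrages for token in tokens
    )
    return {term for term, _ in frequency.most_common(max_terms)}


def keep_known_terms(tokens, vocabulary):
    """Discard every token that is not part of the retained vocabulary."""
    return [token for token in tokens if token in vocabulary]


# Toy corpus: "a" occurs 3 times, "b" twice, "c" once.
vocabulary = build_vocabulary([["a", "b", "a"], ["a", "c", "b"]], max_terms=2)
filtered = keep_known_terms(["a", "c", "b"], vocabulary)
```

TF-IDF values would then be computed only over the filtered token lists, which bounds the size of the feature vector.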
S202, constructing a logistic regression model according to the text characteristics to obtain the characteristic value of the user on the same day;
Representing the obtained text features as $\{x_1, x_2, x_3, \ldots, x_n\}$, let the i-th text feature be $x_i$, the preset weight coefficient corresponding to $x_i$ be $\theta_i$, and the characteristic value of the user on the current day be f(x); the logistic regression model may then be:
$$f(x) = \frac{1}{1 + e^{-\left(\sum_{i=1}^{n} \theta_i x_i + b\right)}}$$
where e is the natural constant and b is a natural number.
Furthermore, the model applies a sigmoid function to the output of a linear regression, so that the final output of the model is a continuous value between 0 and 1. Using maximum likelihood estimation, the loss function for training the model is defined as the likelihood of the model, and the logarithm is taken to make differentiation easier:
$$\log L(\theta) = \sum_{i} \left[ y_i \log f(x_i) + (1 - y_i) \log\left(1 - f(x_i)\right) \right]$$
where $f(x_i)$ denotes the model output for user $x_i$, and $y_i$ denotes the true label of user $x_i$. The model is usually solved by gradient ascent, which can be summarized as follows: substitute f(x) into the loss function, differentiate the loss function, step in the direction of the ascending gradient to update the parameters, and repeat until convergence. The gradient ascent algorithm is well known to those skilled in the art and is not described in detail here.
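The gradient ascent procedure summarized above can be illustrated with a small batch implementation. It assumes the usual log-likelihood of logistic regression, whose gradient with respect to each weight is the sum of (y_i - f(x_i)) times the corresponding feature. The function names, the learning rate, the epoch count and the toy data are assumptions chosen only for this sketch.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(samples, labels, learning_rate=0.1, epochs=1000):
    """Fit weights theta and offset b by batch gradient ascent on the log-likelihood."""
    num_features = len(samples[0])
    theta = [0.0] * num_features
    b = 0.0
    for _ in range(epochs):
        grad_theta = [0.0] * num_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            # Gradient contribution of one sample: (y - f(x)) * x.
            error = y - sigmoid(sum(t * xi for t, xi in zip(theta, x)) + b)
            for k in range(num_features):
                grad_theta[k] += error * x[k]
            grad_b += error
        # Step in the direction of the ascending gradient.
        theta = [t + learning_rate * g for t, g in zip(theta, grad_theta)]
        b += learning_rate * grad_b
    return theta, b


# Toy one-dimensional data: label 1 marks a target user.
theta, b = train_logistic_regression([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
```

A fixed number of epochs stands in for the "repeat until convergence" step; a production implementation would more likely stop once the gradient norm falls below a tolerance.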
S203, acquiring historical barrage data of the user;
In one embodiment of the present invention, the historical barrage data of the user includes the number of active days of the user, the set of all historical barrages of the user, the set of historical target barrages of the user, the set of all barrages of the user on the current day, the set of target barrages of the user on the current day, and the characteristic value of the user on the previous day.
In one embodiment of the present invention, the historical barrage data of the user includes the number of active days of the user and the characteristic value of the user on the previous day.
Furthermore, the characteristic value of the user on the previous day is calculated in the same way as the characteristic value of the user on the current day in steps S201 to S202 above, so the description is not repeated.
S204, judging whether the user is a target user or not based on the characteristic value of the user on the current day and the historical barrage data.
In one embodiment of the present invention, the number N of active days of the user, the set $T_{total}$ of all historical barrages of the user, the set $T_{spam}$ of historical target barrages of the user, the set $C_{total}$ of all barrages of the user on the current day, and the set $C_{spam}$ of target barrages of the user on the current day are acquired.
Let the comprehensive characteristic value of the user be Y; it is computed from the characteristic value f(x) of the user on the current day and the characteristic value of the user on the previous day:
[formula not reproduced in this text]
where σ is a preset parameter and 1 ≤ σ ≤ 1.5. This formula fuses the user's historical comprehensive characteristic value with the characteristic value of the current day. The historical comprehensive characteristic value is scaled by a certain proportion, and the weighted characteristic value of the current day is then added to it. This retains the influence of the historical characteristic value while ensuring that the characteristic value of the current day is reflected in the comprehensive characteristic value. The weights are derived from the content of the user's current-day and historical barrages, $C_{total}$, $C_{spam}$, $T_{total}$ and $T_{spam}$, that is, from the share of target barrages in the barrage content: $\frac{|C_{spam}|}{|C_{total}|}$ is the ratio of current-day target barrages to current-day barrages, $\frac{|T_{spam}|}{|T_{total}|}$ is the ratio of historical target barrages to historical barrages, and $\frac{|T_{spam} + C_{spam}|}{|T_{total} + C_{total}|}$ is the ratio of all target barrages (current-day plus historical) to all barrages (current-day plus historical). One ratio-based weight is applied to the characteristic value of the current day and another to the historical characteristic value. In addition, for flexibility in practice, the preset parameter σ is included in the weight so that the model can be adjusted to the actual scenario.
Further, to expand the influence of the current-day characteristic value on the comprehensive characteristic value in the fusion, the weight of the current-day contribution is increased; correspondingly, to reduce the influence of the historical characteristic values on the comprehensive characteristic value, the weight of the sum of the historical contributions is reduced. The formula therefore guarantees that at least one day's worth of contribution from the current-day characteristic value enters the comprehensive characteristic value, which effectively prevents the comprehensive characteristic value from becoming fixed as historical values accumulate; combined with the current-day characteristic value, the final comprehensive characteristic value better reflects the user's overall history.
The target barrages can be obtained by any method commonly used by those skilled in the art. For example, keywords can be filtered with regular expressions, and a barrage containing such keywords is taken as a target barrage.
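One way to realize such keyword filtering is sketched below. The function names `build_keyword_filter` and `select_target_barrages`, the case-insensitive matching and the sample keyword are assumptions made for illustration; the patent does not prescribe a particular pattern.

```python
import re


def build_keyword_filter(keywords):
    """Compile one regular expression that matches any of the given keywords."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    # re.escape keeps keywords containing special characters literal.
    pattern = "|".join(re.escape(keyword) for keyword in keywords)
    return re.compile(pattern, re.IGNORECASE)


def select_target_barrages(barrages, keyword_filter):
    """Return the barrages that contain at least one keyword."""
    return [text for text in barrages if keyword_filter.search(text)]


keyword_filter = build_keyword_filter(["sham"])
targets = select_target_barrages(["what a SHAM", "nice stream"], keyword_filter)
```

Rejecting an empty keyword list matters because an empty pattern would match every barrage.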
Furthermore, to prevent an overly large N from making the influence of the current-day characteristic value too small, N is truncated, with its maximum value set to 30. At the same time, for some special users, for example users registered in batches, it is desirable for the current-day score to take effect more quickly, so the parameter σ is added to strengthen the influence of the current-day score on the historical score. The parameter σ is generally restricted to 1 ≤ σ ≤ 1.5; the larger the parameter, the greater the influence of the current-day score. The specific value of σ is generally determined by data analysis in actual use.
For example, σ is typically set smaller for newly registered users: they have little historical barrage data, and that data is easily drowned out if the weight of the current-day barrage data is too large. For long-standing users, σ is generally set larger: their historical barrage data is rich, so the current-day barrage data would otherwise have little effect on it, and a larger σ strengthens the influence of the current-day barrage data. The specific value of σ is set according to how long the user has been registered. For example, the registration duration is divided into six levels: level one is within one month, level two is one to three months, level three is three months to one year, level four is one to three years, level five is three to five years, and level six is more than five years. σ is 1 for level-one users, 1.1 for level-two users, 1.2 for level-three users, 1.3 for level-four users, 1.4 for level-five users, and 1.5 for level-six users.
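The six-level example can be written as a small lookup. The function name `sigma_for_registration`, the use of months as the unit, and the treatment of exact boundaries (a user registered for exactly one month falls into level two) are assumptions, since the text does not specify them.

```python
# (upper bound in months, sigma) for levels one to five; level six is open-ended.
SIGMA_LEVELS = (
    (1, 1.0),    # within one month
    (3, 1.1),    # one to three months
    (12, 1.2),   # three months to one year
    (36, 1.3),   # one to three years
    (60, 1.4),   # three to five years
)


def sigma_for_registration(months_registered):
    """Return the preset parameter sigma for a given registration duration."""
    if months_registered < 0:
        raise ValueError("registration duration cannot be negative")
    for upper_bound, sigma in SIGMA_LEVELS:
        if months_registered < upper_bound:
            return sigma
    return 1.5  # more than five years


new_user_sigma = sigma_for_registration(0.5)
veteran_sigma = sigma_for_registration(72)
```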
The following illustrates a specific calculation procedure:
Assume that the comprehensive characteristic value of a certain user on the previous day is 60 points, the characteristic value of the user on the current day is 90 points, and the number of active days is more than 30, so N is truncated to 30. On the current day, the user sends 20 target barrages after deduplication, out of 30 barrages in total after deduplication. Historically, the user has 2000 target barrages after deduplication, out of 5000 barrages in total after deduplication. After the current-day barrages are added to the historical barrages, the user has 2010 target barrages after deduplication and 5020 barrages in total after deduplication. With σ = 1 and a preset threshold of 70 points, the comprehensive characteristic value of the user on the current day is calculated as follows:
[calculation not reproduced in this text]
It can be seen that although the user's characteristic value on the current day is high, at 90 points, the comprehensive characteristic value is lower because the user has historically been a normal user. Since the comprehensive characteristic value is below 70 points, the user is not regarded as a target user on the current day.
The algorithm uses the difference between the user's current-day target barrage ratio and historical target barrage ratio to strengthen the user's current-day characteristic value. If target barrages make up a higher proportion of the user's barrages on the current day, the current-day characteristic value is given a higher weight in the fusion. On the other hand, target barrages are easily sent repeatedly, much like a catchphrase, so the same barrage appears in large numbers and inflates the target barrage ratio. Therefore, in the statistics, the target barrages and the historical barrages are counted after deduplication to obtain |C_spam|, |T_spam| and |T_spam + C_spam|.
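The deduplicated counts can be sketched with sets. This is a hedged illustration only: the function names, the representation of each barrage as a text string, and the reading of |T_spam + C_spam| as the size of the union of the two sets are assumptions. The union reading is consistent with the worked example, in which 2000 historical and 20 current-day target barrages merge into 2010.

```python
from typing import Iterable, Tuple


def dedup_target_counts(
    history_target: Iterable[str], today_target: Iterable[str]
) -> Tuple[int, int, int]:
    """Return (|C_spam|, |T_spam|, |T_spam + C_spam|) after deduplication."""
    t_spam = set(history_target)   # historical target barrages, duplicates removed
    c_spam = set(today_target)     # current-day target barrages, duplicates removed
    return len(c_spam), len(t_spam), len(t_spam | c_spam)


def dedup_target_ratio(target: Iterable[str], total: Iterable[str]) -> float:
    """Share of distinct target barrages among distinct barrages (0.0 if none)."""
    total_set = set(total)
    if not total_set:
        return 0.0
    return len(set(target)) / len(total_set)
```

Counting distinct texts means that a barrage repeated a hundred times contributes once, which is the effect the deduplication step is meant to achieve.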
In an embodiment of the present invention, the number of active days N of the user is obtained. Let Y be the comprehensive characteristic value of the user and f(x) the characteristic value of the user on the current day; using also the characteristic value of the user on the previous day, then:
In this formula, the user has N historical active days, and the weighted fusion also includes the current day, giving N+1 days in total; the idea of exponential averaging is used. For the characteristic values of the N historical days, each day contributes equally to the comprehensive characteristic value, and the contributions of the N days are summed to give the total historical weight. However, to make the current day's characteristic value more prominent in the final result, its weight is multiplied by 2, so that the current day's characteristic value carries twice the weight of a single historical day's characteristic value. Correspondingly, the weight of the summed historical contributions is reduced. This processing enlarges the influence of the current day's characteristic value in the fusion, while the idea of exponential averaging lets the final comprehensive characteristic value still reflect the overall historical information of the user. The multiplier applied to the current day's weight may also be 3, 4, 5 or another number without departing from the inventive concept of the formula, with the weight of the summed historical contributions reduced correspondingly; this is not specifically limited.
Further, in the above formula, the denominator is obtained by adding the coefficient of f(x) and the coefficient of the previous day's characteristic value, which ensures that the calculated comprehensive characteristic value finally lies in [0,1]. This makes it convenient to filter users with a given threshold in practical applications, without having to account for threshold changes caused by distribution shifts after long-term iteration.
Furthermore, to prevent an overly large N from making the influence of the current day's characteristic value too small, N is truncated, with its maximum value set to 30. By combining the historical barrage data issued by the user to identify the user comprehensively, the identification accuracy for target users is improved, and the problem of normal users being unable to use the barrage function to enter barrage information is avoided.
For details, refer to the related description of the embodiment shown in fig. 1, which is not repeated here.
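The fusion described above can be sketched as follows. The formula itself is not reproduced in this text, so the weights used here are an inference from the prose, not a confirmed transcription: a current-day weight of 2/(N+1) and a historical weight of (N-1)/(N+1), which sum to 1 and match the usual smoothing factor of an exponential moving average. The function name, the constant name, and the lower clamp of N to 1 are likewise assumptions; only the cap of 30 comes from the text.

```python
MAX_ACTIVE_DAYS = 30  # cap on N stated in the text


def fuse_feature_value(today: float, previous: float, active_days: int) -> float:
    """Blend the current-day value with the previous day's comprehensive value.

    Assumed weights: 2/(N+1) for today, (N-1)/(N+1) for history.
    """
    # Truncate N to at most 30; the lower bound of 1 is an assumption that
    # keeps the historical weight non-negative for brand-new users.
    n = max(1, min(active_days, MAX_ACTIVE_DAYS))
    weight_today = 2.0 / (n + 1)
    weight_history = (n - 1.0) / (n + 1)
    return weight_today * today + weight_history * previous
```

Because the two weights sum to 1, the result always lies between the two inputs, so inputs in [0,1] give an output in [0,1]. With the cap in place, the current day never carries less than 2/31 of the total weight.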
Referring to fig. 3, fig. 3 is a schematic structural diagram of a target user identifying apparatus according to an embodiment of the present invention, the apparatus may be built into an electronic device, and the apparatus mainly includes:
a first acquisition module 301, a construction module 302, a second acquisition module 303 and a judgment module 304;
the first obtaining module 301 is configured to obtain text features of all the barrages released by the user on the same day;
In a live streaming website, a user communicates with the host or with other users by publishing barrages. From the content characteristics of the barrages a user publishes, it can be judged whether the user is a normal user or a potential target user. In the embodiment of the present invention, the content characteristics of the barrages mainly include punctuation mark features, expression features, negative word features, and the TF-IDF values of the entries in all the barrages.
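As a hedged illustration of the TF-IDF feature mentioned above, the sketch below computes a TF-IDF value for every entry of every barrage, treating each barrage as one document. It assumes the standard definition (term frequency within the barrage multiplied by the logarithm of the total number of barrages over the number of barrages containing the entry), a natural logarithm, and barrages that have already been split into entries; the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tf_idf(barrages: Sequence[Sequence[str]]) -> List[Dict[str, float]]:
    """Return one {entry: tf-idf} mapping per barrage (each barrage is a token list)."""
    n_barrages = len(barrages)
    # Number of barrages that contain each entry at least once.
    barrage_freq: Counter = Counter()
    for tokens in barrages:
        barrage_freq.update(set(tokens))

    scores: List[Dict[str, float]] = []
    for tokens in barrages:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_barrages / barrage_freq[term])
            for term, count in counts.items()
        })
    return scores
```

An entry that appears in every barrage gets a value of 0 under this definition, so only entries that distinguish one barrage from the others contribute to the feature vector.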
The construction module 302 is configured to construct a logistic regression model according to the text feature, so as to obtain a feature value of the user on the same day;
The obtained text features are represented as {x_1, x_2, x_3, …, x_n}. Let the i-th feature in the text features be x_i, the preset weight coefficient corresponding to x_i be θ_i, and the characteristic value of the user on the current day be f(x); the logistic regression model may then be:
where e is a natural constant and b is a natural number.
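The model formula is not reproduced in this text. Assuming it has the standard logistic regression form, f(x) = 1 / (1 + e^-(θ_1·x_1 + … + θ_n·x_n + b)), the current-day characteristic value could be computed as in the sketch below. The function name and the argument layout are hypothetical, and the weights are taken as given (preset) rather than trained here.

```python
import math
from typing import Sequence


def daily_feature_value(x: Sequence[float], theta: Sequence[float], b: float) -> float:
    """Logistic function of the weighted feature sum; the result lies in [0, 1]."""
    if len(x) != len(theta):
        raise ValueError("x and theta must have the same length")
    z = sum(t * xi for t, xi in zip(theta, x)) + b
    # Numerically stable evaluation: avoid exp() overflow for very negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

Since the output is bounded between 0 and 1, it can be compared with a fixed threshold or fused with a historical value on the same scale.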
A second obtaining module 303, configured to obtain historical barrage data of the user;
In an embodiment of the present invention, the historical barrage data of the user include the number of active days of the user, the set of all historical barrages of the user, the set of historical target barrages of the user, the set of all barrages of the user on the current day, the set of target barrages of the user on the current day, and the characteristic value of the user on the previous day.
In an embodiment of the present invention, the historical barrage data of the user include the number of active days of the user and the characteristic value of the user on the previous day.
And the judging module 304 is configured to judge whether the user is a target user based on the characteristic value of the user on the current day and the historical barrage data.
In an embodiment of the present invention, the historical barrage data of the user include the number of active days of the user, the set of all historical barrages of the user, the set of historical target barrages of the user, the set of all barrages of the user on the current day, the set of target barrages of the user on the current day, and the characteristic value of the user on the previous day.
In one embodiment of the present invention, the historical barrage data of the user include the number of active days of the user and the characteristic value of the user on the previous day.
Furthermore, the characteristic value of the user on the previous day is calculated in the same way as the comprehensive characteristic value of the user on the current day, as described in steps S201 to S202 above, so the description is not repeated here.
In one embodiment of the present invention, the number of active days N of the user, the set T_total of all historical barrages of the user, the set T_spam of historical target barrages of the user, the set C_total of all barrages of the user on the current day, and the set C_spam of target barrages of the user on the current day are obtained.
Let Y be the comprehensive characteristic value of the user and f(x) the characteristic value of the user on the current day; using also the characteristic value of the user on the previous day, then:
where σ is a preset parameter and 1 ≤ σ ≤ 1.5.
Furthermore, to prevent an overly large N from making the influence of the current day's characteristic value too small, N is truncated, with its maximum value set to 30. At the same time, for some special users, such as batch-registered users, it is desirable for the current day's score to take effect more quickly, so the parameter σ is added to strengthen the influence of the current day's score on the historical score. The parameter is generally limited to 1 ≤ σ ≤ 1.5; the larger the parameter, the greater the influence of the current day's score. The specific value of σ is generally determined by data analysis in actual use.
The algorithm uses the difference between the user's current-day target barrage ratio and historical target barrage ratio to strengthen the user's current-day characteristic value. If target barrages make up a higher proportion of the user's barrages on the current day, the current-day characteristic value is given a higher weight in the fusion. On the other hand, target barrages are easily sent repeatedly, much like a catchphrase, so the same barrage appears in large numbers and inflates the target barrage ratio. Therefore, in the statistics, the target barrages and the historical barrages are counted after deduplication to obtain |C_spam|, |T_spam| and |T_spam + C_spam|.
In an embodiment of the present invention, the number of active days N of the user is obtained. Let Y be the comprehensive characteristic value of the user and f(x) the characteristic value of the user on the current day; using also the characteristic value of the user on the previous day, then:
Furthermore, to prevent an overly large N from making the influence of the current day's characteristic value too small, N is truncated, with its maximum value set to 30. The algorithm is simple and combines the historical barrage data issued by the user to identify the user comprehensively, so that the identification accuracy for target users is improved and the problem of normal users being unable to use the barrage function to enter barrage information is avoided.
For details, refer to the related descriptions of the embodiments shown in fig. 1 to fig. 2, which are not repeated here.
Referring to fig. 4, fig. 4 shows a hardware configuration diagram of an electronic device.
The electronic device described in the present embodiment includes:
a memory 41, a processor 42, and a computer program stored in the memory 41 and executable on the processor 42; when the processor 42 executes the computer program, the target user identification method described in the embodiments shown in fig. 1 to fig. 2 is implemented.
Further, the electronic device further includes:
at least one input device 43; at least one output device 44.
The memory 41, the processor 42, the input device 43 and the output device 44 are connected by a bus 45.
The input device 43 may be a camera, a touch panel, a physical button, a mouse, or the like. The output device 44 may be in particular a display screen.
The memory 41 may be a high-speed random access memory (RAM) or a non-volatile memory, such as a disk memory. The memory 41 is used to store a set of executable program code, and the processor 42 is coupled to the memory 41.
Further, the embodiment of the present invention also provides a computer readable storage medium, which may be provided in the terminal in each of the above embodiments and may be the memory in the embodiment shown in fig. 4. The computer readable storage medium stores a computer program which, when executed by a processor, implements the target user identification method described in the embodiments shown in fig. 1 to fig. 2. Further, the computer readable storage medium may be any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In the various embodiments provided herein, it should be understood that the disclosed apparatus and methods may be implemented in other ways. For example, the embodiments described above are merely illustrative, e.g., the division of the modules is merely a logical function division, and there may be additional divisions of actual implementation, e.g., multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication links shown or discussed with each other may be indirect coupling or communication links through interfaces, modules, or in electrical, mechanical, or other forms.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
It should be noted that, for the sake of simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the present invention is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the present invention.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The foregoing describes the target user identifying method, apparatus, electronic device and storage medium provided by the present invention, and those skilled in the art will recognize that there are variations in terms of specific embodiments and application scope according to the concepts of the embodiments of the present invention, and in summary, the disclosure should not be construed as limiting the invention.

Claims (8)

1. A method for identifying a target user, comprising:
acquiring text characteristics of all barrages released by a user in the same day;
constructing a logistic regression model according to the text characteristics to obtain the characteristic value of the user on the same day;
acquiring historical barrage data of the user;
judging whether the user is a target user or not based on the characteristic value of the user on the current day and the historical barrage data; the determining whether the user is a target user based on the characteristic value of the user on the current day and the historical barrage data comprises:
calculating the comprehensive characteristic value of the user based on the characteristic value of the user on the current day and the historical barrage data;
judging whether the comprehensive characteristic value of the user is larger than a preset threshold value or not;
if the comprehensive characteristic value of the user is larger than the preset threshold value, the user is a target user;
the calculating the comprehensive characteristic value of the user based on the characteristic value of the user on the current day and the historical barrage data comprises the following steps:
acquiring the number N of active days of the user, the set T_total of all historical barrages of the user, the set T_spam of historical target barrages of the user, the set C_total of all barrages of the user on the current day, and the set C_spam of target barrages of the user on the current day;
letting the comprehensive characteristic value of the user be Y and the characteristic value of the user on the current day be f(x), then:
wherein σ is a preset parameter, and 1 ≤ σ ≤ 1.5.
2. The method of claim 1, wherein calculating the integrated feature value of the user based on the feature value of the user on the current day and the historical barrage data comprises:
acquiring the active days N of the user;
letting Y be the comprehensive characteristic value of the user and f(x) the characteristic value of the user on the current day, and using also the characteristic value of the user on the previous day, then:
3. The method of any one of claims 1 to 2, wherein the text features include punctuation mark features, expression features, negative word features, and TF-IDF values of the respective entries in all of the barrages.
4. The method for identifying a target user according to claim 3, wherein constructing a logistic regression model according to the text feature, and obtaining the feature value of the user on the current day comprises:
acquiring preset weight coefficients corresponding to all the characteristics in the text characteristics;
letting the i-th feature in the text features be x_i, the preset weight coefficient corresponding to x_i be θ_i, and the characteristic value of the user on the current day be f(x), then:
where e is a natural constant and b is a natural number.
5. The method of claim 1, wherein when the text feature includes TF-IDF values of respective terms in all of the barrages, the obtaining text features of all of the barrages issued by the user on the same day comprises:
letting the TF-IDF value of the i-th entry in the j-th barrage be TF-IDF_{i,j}, then:
wherein n_{i,j} denotes the number of times the i-th entry appears in the j-th barrage, Σ_k n_{k,j} denotes the total number of words in the j-th barrage, |D| denotes the total number of barrages, and |{j : t_i ∈ d_j}| denotes the number of barrages, among all the barrages, that contain the i-th entry.
6. A target user identification apparatus, comprising:
the first acquisition module is used for acquiring text characteristics of all the barrages released by the user in the same day;
the construction module is used for constructing a logistic regression model according to the text characteristics to obtain the characteristic value of the user on the same day;
the second acquisition module is used for acquiring the historical barrage data of the user;
the judging module is used for judging whether the user is a target user or not based on the characteristic value of the user on the current day and the historical barrage data; the determining whether the user is a target user based on the characteristic value of the user on the current day and the historical barrage data comprises:
calculating the comprehensive characteristic value of the user based on the characteristic value of the user on the current day and the historical barrage data;
judging whether the comprehensive characteristic value of the user is larger than a preset threshold value or not;
if the comprehensive characteristic value of the user is larger than the preset threshold value, the user is a target user;
the calculating the comprehensive characteristic value of the user based on the characteristic value of the user on the current day and the historical barrage data comprises the following steps:
acquiring the number N of active days of the user, the set T_total of all historical barrages of the user, the set T_spam of historical target barrages of the user, the set C_total of all barrages of the user on the current day, and the set C_spam of target barrages of the user on the current day;
letting the comprehensive characteristic value of the user be Y and the characteristic value of the user on the current day be f(x), then:
wherein σ is a preset parameter, and 1 ≤ σ ≤ 1.5.
7. An electronic device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the target user identification method according to any one of claims 1 to 5 when executing the computer program.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the target user identification method of any of claims 1 to 5.
CN201910649162.0A 2019-07-17 2019-07-17 Target user identification method, device, electronic equipment and storage medium Active CN110362831B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910649162.0A CN110362831B (en) 2019-07-17 2019-07-17 Target user identification method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910649162.0A CN110362831B (en) 2019-07-17 2019-07-17 Target user identification method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110362831A CN110362831A (en) 2019-10-22
CN110362831B CN110362831B (en) 2024-02-23

Family

ID=68220752

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910649162.0A Active CN110362831B (en) 2019-07-17 2019-07-17 Target user identification method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110362831B (en)


Also Published As

Publication number Publication date
CN110362831A (en) 2019-10-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20231227

Address after: Room 205, Room 206, Room J1447, No. 1045 Tianyuan Road, Tianhe District, Guangzhou City, Guangdong Province, 510000

Applicant after: Guangzhou Caimeng Technology Co.,Ltd.

Address before: 430000 room 007, A301, third floor, building B1, software industry phase 4.1, No. 1, Software Park East Road, Donghu New Technology Development Zone, Wuhan City, Hubei Province (Wuhan area of free trade zone)

Applicant before: WUHAN DOUYU YULE NETWORK TECHNOLOGY Co.,Ltd.

GR01 Patent grant