CN110971501A - Method, system, device and storage medium for determining advertisement message - Google Patents

Info

Publication number
CN110971501A
Authority
CN
China
Prior art keywords
message
historical
data
current
advertisement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811158478.1A
Other languages
Chinese (zh)
Other versions
CN110971501B (en)
Inventor
李晨
金姿
林金明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201811158478.1A
Publication of CN110971501A
Application granted
Publication of CN110971501B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21 Monitoring or handling of messages
    • H04L51/212 Monitoring or handling of messages using filtering or selective blocking

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method, a system, a device and a storage medium for determining an advertisement message. The method comprises: acquiring a plurality of received historical messages, and processing the historical messages to obtain an advertisement data set; acquiring a received current message, and calculating a plurality of first similarities between the current message and the historical messages in the advertisement data set; and judging whether any of the first similarities is greater than a first set threshold, and if so, determining that the current message is an advertisement message. The invention can effectively identify whether a received message is an advertisement message and intercept it, which reduces the disturbance caused by advertisement messages and curbs the spread of the grey industry without affecting seller indicators such as the reply rate.

Description

Method, system, device and storage medium for determining advertisement message
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, a system, a device, and a storage medium for determining an advertisement message.
Background
As shopping software becomes part of everyday life, existing shopping software is equipped with instant messaging systems so that users and sellers can communicate directly. However, while the instant messaging system builds a communication bridge between the seller and the user, it is also flooded, especially on the seller side, with large numbers of repetitive and meaningless messages (for example, advertisements from manufacturers recommending products to the seller, advertisements for outsourced store operation or customer service, advertisements for selling or buying shops, advertisements for platform onboarding services, and grey-industry advertisements aimed at sellers, such as fake-order brushing and rank boosting). These messages seriously interfere with normal communication, which not only affects seller indicators such as the reply rate but also promotes the spread of the grey industry.
In the prior art, advertisement messages in instant messaging systems are already intercepted, mainly in the following ways: 1) frequency control, that is, judging whether a message is an advertisement message from the frequency of messages sent within a certain time period, the number of visits to sellers, and the number of repeated messages; 2) keyword matching, that is, judging whether a message is an advertisement message by means of preset keywords; 3) text recognition, that is, identifying the message with a text recognition algorithm to judge whether it is an advertisement message.
However, these interception methods have the following problems: 1) the frequency-based method cannot intercept advertisements sent in numbers below the set threshold. Normal users may also send many messages, so when the volume of normal messages is close to that of advertisement messages the method cannot tell them apart; frequency control can then only use a relatively high threshold, so only a relatively small share of advertisement messages is intercepted and the result is unsatisfactory; 2) the keyword-based method no longer intercepts effectively once the keywords are replaced; 3) the text recognition method uses advertisement messages from a previous time period as its training set, but new advertisement wording and structures appear every two or three days, so the algorithm only captures advertisements with the characteristics of a few days earlier. Timeliness and detection rate are therefore low, advertisers can easily bypass the method, and effective interception cannot be achieved.
Disclosure of Invention
The technical problem to be solved by the invention is that the prior-art methods for intercepting advertisement messages in instant messaging systems cannot intercept them effectively. To overcome this defect, the invention provides a method, a system, a device and a storage medium for determining an advertisement message.
The invention solves the technical problems through the following technical scheme:
the invention provides a method for determining an advertisement message, which comprises the following steps:
acquiring a plurality of received historical messages, and processing the historical messages to acquire an advertisement data set;
obtaining a received current message, and calculating a plurality of first similarities between the current message and historical messages in the advertisement data set;
judging whether a first similarity larger than a first set threshold exists in the plurality of first similarities, and if so, determining that the current message is an advertisement message; otherwise, determining that the current message is a non-advertisement message.
Preferably, the step of processing the history message further comprises:
obtaining a non-advertisement data set;
the step of obtaining the received current message further comprises:
calculating a plurality of second similarities between the current message and historical messages in the non-advertisement data set;
judging whether a second similarity larger than a second set threshold exists in the plurality of second similarities, and if so, determining that the current message is a non-advertisement message; otherwise, determining that the current message is an advertisement message;
when the first similarity is greater than the first set threshold and the second similarity is greater than the second set threshold, or when the first similarity is less than or equal to the first set threshold and the second similarity is less than or equal to the second set threshold, adjusting the first set threshold and/or the second set threshold until the first similarity is greater than the first set threshold and the second similarity is less than or equal to the second set threshold, or the first similarity is less than or equal to the first set threshold and the second similarity is greater than the second set threshold.
Preferably, the first set threshold is equal to the second set threshold.
Preferably, after the step of acquiring the received history messages and before the step of processing the history messages, the method further includes:
acquiring user historical data, wherein the user historical data comprises at least one of historical account information of the user, the number of historical messages sent by the user in different time periods, and the historical number of visits to the seller;
the step of processing the historical messages to obtain an advertisement data set and the step of obtaining a non-advertisement data set include:
extracting first characteristic data corresponding to the historical messages and the user historical data;
wherein the first characteristic data comprises at least one of a length of the historical message, a frequency of repeated occurrences of the historical message within a set time window, whether the historical message contains a keyword, a number of keywords contained in the historical message, a percentage of keywords in the historical message, the user historical account information, a number of historical messages sent by the user over different time periods, and the historical number of visits to the seller;
processing the first characteristic data to obtain second characteristic data;
mapping the historical message into a corresponding first numerical sequence according to the second characteristic data;
and carrying out classification training on the first numerical sequences of the plurality of historical messages by adopting a Bayesian classification algorithm or a logistic regression algorithm to obtain the advertisement data set and the non-advertisement data set.
Preferably, the step of processing the first feature data and acquiring the second feature data includes:
deleting part of the first feature data to obtain the second feature data, and/or merging a plurality of mutually correlated first feature data into one first feature data to obtain the second feature data.
Preferably, after the step of obtaining the received current message and before the step of calculating a plurality of first similarities between the current message and the historical messages in the advertisement data set, the method further comprises:
acquiring user current data, wherein the user current data comprises at least one of current account information of the user, the number of current messages sent by the user in different time periods, and the current number of visits to the seller;
extracting third characteristic data corresponding to the current message and the user current data;
wherein the third characteristic data comprises at least one of a length of the current message, a frequency of repeated occurrences of the current message within a set time window, whether the current message contains a keyword, a number of keywords contained in the current message, a percentage of keywords in the current message, current account information of the user, a number of the current messages sent by the user in different time periods, and the current number of visits to the seller;
mapping the current message into a corresponding second numerical sequence according to the third characteristic data;
the step of calculating a plurality of first similarities between the current message and historical messages in the advertisement data set comprises:
calculating a first similarity between the second numerical sequence and each first numerical sequence in the advertisement data set using a logistic regression algorithm;
the step of calculating a plurality of second similarities between the current message and historical messages in the non-advertisement data set comprises:
calculating a second similarity between the second numerical sequence and each first numerical sequence in the non-advertisement data set using a logistic regression algorithm.
The invention also provides a system for determining the advertising message, which comprises a historical message acquisition module, a data set acquisition module, a current message acquisition module, a calculation module, a judgment module and a determination module;
the history message acquisition module is used for acquiring a plurality of received history messages;
the data set acquisition module is used for processing the historical messages to acquire an advertisement data set;
the current message acquisition module is used for acquiring the received current message;
the computing module is configured to compute a plurality of first similarities between the current message and historical messages in the advertisement data set;
the judging module is used for judging whether a first similarity larger than a first set threshold exists in the plurality of first similarities, and if so, calling the determining module to determine that the current message is an advertisement message; otherwise, calling the determining module to determine that the current message is a non-advertisement message.
Preferably, the data set acquisition module is further configured to obtain a non-advertisement data set;
the computing module is further configured to compute a plurality of second similarities between the current message and historical messages in the non-advertisement data set;
the judging module is further configured to judge whether a second similarity greater than a second set threshold exists in the plurality of second similarities, and if so, call the determining module to determine that the current message is a non-advertisement message; otherwise, call the determining module to determine that the current message is an advertisement message;
the determination system further comprises a threshold adjustment module;
the threshold adjusting module is configured to adjust the first set threshold and/or the second set threshold when the first similarity is greater than the first set threshold and the second similarity is greater than the second set threshold, or when the first similarity is less than or equal to the first set threshold and the second similarity is less than or equal to the second set threshold, until the first similarity is greater than the first set threshold and the second similarity is less than or equal to the second set threshold, or the first similarity is less than or equal to the first set threshold and the second similarity is greater than the second set threshold.
Preferably, the first set threshold is equal to the second set threshold.
Preferably, the determination system comprises a user data acquisition module;
the data set acquisition module comprises a characteristic data extraction unit, a characteristic data processing unit, a mapping unit and a classification training unit;
the user data acquisition module is used for acquiring user historical data, wherein the user historical data comprises at least one of historical account information of the user, the number of historical messages sent by the user in different time periods, and the historical number of visits to the seller;
the characteristic data extraction unit is used for extracting first characteristic data corresponding to the historical messages and the user historical data;
wherein the first characteristic data comprises at least one of a length of the historical message, a frequency of repeated occurrences of the historical message within a set time window, whether the historical message contains a keyword, a number of keywords contained in the historical message, a percentage of keywords in the historical message, the user historical account information, a number of historical messages sent by the user over different time periods, and the historical number of visits to the seller;
the characteristic data processing unit is used for processing the first characteristic data to obtain second characteristic data;
the mapping unit is used for mapping the historical message into a corresponding first numerical sequence according to the second characteristic data;
the classification training unit is used for performing classification training on the first numerical sequences of the plurality of historical messages by adopting a Bayesian classification algorithm or a logistic regression algorithm to obtain the advertisement data set and the non-advertisement data set.
Preferably, the feature data processing unit is configured to delete part of the first feature data to obtain the second feature data, and/or to merge a plurality of mutually correlated first feature data into one first feature data to obtain the second feature data.
Preferably, the user data acquisition module is further configured to acquire user current data, wherein the user current data comprises at least one of current account information of the user, the number of current messages sent by the user in different time periods, and the current number of visits to the seller;
the characteristic data extraction unit is also used for extracting third characteristic data corresponding to the current message and the user current data;
wherein the third characteristic data comprises at least one of a length of the current message, a frequency of repeated occurrences of the current message within a set time window, whether the current message contains a keyword, a number of keywords contained in the current message, a percentage of keywords in the current message, current account information of the user, a number of the current messages sent by the user in different time periods, and the current number of visits to the seller;
the mapping unit is further configured to map the current message into a corresponding second numerical sequence according to the third characteristic data;
the calculation module is configured to calculate a first similarity between the second numerical sequence and each first numerical sequence in the advertisement data set using a logistic regression algorithm;
the calculation module is further configured to calculate a second similarity between the second numerical sequence and each first numerical sequence in the non-advertisement data set using a logistic regression algorithm.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the method for determining the advertisement message when executing the computer program.
The invention also provides a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for determining an advertising message as described above.
The positive progress effects of the invention are as follows:
the method and the system can effectively identify whether a received message is an advertisement message and intercept it, which reduces the disturbance caused by advertisement messages and curbs the spread of the grey industry without affecting seller indicators such as the reply rate.
Drawings
Fig. 1 is a flowchart of a method for determining an advertisement message according to embodiment 1 of the present invention.
Fig. 2 is a flowchart of a method for determining an advertisement message according to embodiment 2 of the present invention.
Fig. 3 is a flowchart of a method for determining an advertisement message according to embodiment 3 of the present invention.
Fig. 4 is a block diagram of an advertisement message determination system according to embodiment 4 of the present invention.
Fig. 5 is a block diagram of an advertisement message determination system according to embodiment 5 of the present invention.
Fig. 6 is a block diagram of an advertisement message determination system according to embodiment 6 of the present invention.
Fig. 7 is a schematic structural diagram of an electronic device implementing the method for determining an advertisement message in embodiment 7 of the present invention.
Detailed Description
The invention is further illustrated by the following examples, which are not intended to limit the scope of the invention.
Example 1
The application environment of this embodiment is an instant messaging system that is configured in online shopping software to facilitate direct communication between users and sellers. The embodiment is mainly used to effectively identify and intercept advertisement messages received in the instant messaging system (in particular, the instant messaging system used by the seller).
As shown in fig. 1, the method for determining an advertisement message of this embodiment includes:
S101, acquiring a plurality of received historical messages;
S102, processing the historical messages to obtain an advertisement data set;
S103, acquiring the received current message;
S1041, calculating a plurality of first similarities between the current message and the historical messages in the advertisement data set;
S1051, judging whether a first similarity greater than a first set threshold exists among the plurality of first similarities; if so, determining that the current message is an advertisement message; otherwise, determining that the current message is a non-advertisement message. The first set threshold can be adjusted directly according to actual needs.
In this embodiment, a plurality of historical messages received in the instant messaging system are acquired and processed to obtain an advertisement data set. The received current message is then acquired, and a plurality of first similarities between the current message and the historical messages in the advertisement data set are calculated. When any first similarity is greater than the first set threshold, the current message is determined to be an advertisement message. In this way, whether a received message is an advertisement message can be effectively identified and the message intercepted, which reduces the disturbance caused by advertisement messages and curbs the spread of the grey industry without affecting seller indicators such as the reply rate.
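Steps S1041 and S1051 can be illustrated with a minimal sketch. Example 1 does not fix a similarity measure, so the cosine similarity used here, the function names and the representation of messages as numeric vectors are all assumptions made for illustration only.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Assumed similarity measure; Example 1 leaves the measure open."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_advertisement(
    current: Vector,
    ad_dataset: Sequence[Vector],
    first_threshold: float,
    similarity: Callable[[Vector, Vector], float] = cosine_similarity,
) -> bool:
    # S1041: one first similarity per historical message in the advertisement data set.
    first_similarities = [similarity(current, h) for h in ad_dataset]
    # S1051: advertisement if any first similarity exceeds the first set threshold.
    return any(s > first_threshold for s in first_similarities)
```

With an empty advertisement data set the function returns False, so every message is treated as a non-advertisement message until historical advertisements are available.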
Example 2
As shown in fig. 2, the method for determining an advertisement message of this embodiment is a further improvement of embodiment 1, and specifically:
step S102 includes:
S1021, processing the historical messages to obtain an advertisement data set and a non-advertisement data set;
step S103 is followed by:
S1042, calculating a plurality of second similarities between the current message and the historical messages in the non-advertisement data set;
step S1041 and step S1042 both belong to step S104.
After the step S1042, the method further includes:
S1052, judging whether a second similarity greater than a second set threshold exists among the plurality of second similarities; if so, determining that the current message is a non-advertisement message; otherwise, determining that the current message is an advertisement message. The second set threshold can be adjusted directly according to actual needs.
Step S1051 and step S1052 both belong to step S105.
In this embodiment, the results of steps S1051 and S1052 may contradict each other, that is, the first similarity is greater than the first set threshold and the second similarity is greater than the second set threshold, or the first similarity is less than or equal to the first set threshold and the second similarity is less than or equal to the second set threshold. In that case the first set threshold and/or the second set threshold is adjusted until the first similarity is greater than the first set threshold and the second similarity is less than or equal to the second set threshold, or the first similarity is less than or equal to the first set threshold and the second similarity is greater than the second set threshold. In addition, the first set threshold and the second set threshold may be equal.
Specifically, around the New Year holiday, both normal-industry users and grey-industry users are on vacation, and relatively few advertisement messages are received in the instant messaging system. In this case the first set threshold and/or the second set threshold can be raised appropriately; otherwise too many normal users would be blocked by mistake, which does not meet requirements. During a particular festival (e.g., Double Eleven), more advertisement messages are received, and the first set threshold and/or the second set threshold can be lowered appropriately to increase the probability of detecting advertisement messages.
In this embodiment, a plurality of historical messages received in the instant messaging system are acquired and classified to obtain an advertisement data set and a non-advertisement data set. The received current message is then acquired, and a plurality of first similarities between the current message and the historical messages in the advertisement data set and a plurality of second similarities between the current message and the historical messages in the non-advertisement data set are calculated. When a first similarity greater than the first set threshold exists among the first similarities, the current message is determined to be an advertisement message; when a second similarity greater than the second set threshold exists among the second similarities, the current message is determined to be a non-advertisement message. In this way, whether a received message is an advertisement message can be effectively identified and the message intercepted, which reduces the disturbance caused by advertisement messages and curbs the spread of the grey industry without affecting seller indicators such as the reply rate.
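The two judgments of steps S1051 and S1052 and the adjustment of contradictory results can be sketched as follows. The embodiment does not say how the thresholds are adjusted, so the rule used here (shift both thresholds by a fixed step, upward when both judgments fire and downward when neither does), the step size, the round limit and the "undecided" fallback are assumptions for illustration only.

```python
from typing import Sequence, Tuple


def classify_with_adjustment(
    first_sims: Sequence[float],
    second_sims: Sequence[float],
    t1: float,
    t2: float,
    step: float = 0.05,      # hypothetical adjustment step
    max_rounds: int = 100,   # hypothetical safety limit
) -> Tuple[str, float, float]:
    """Return (label, adjusted first threshold, adjusted second threshold)."""
    for _ in range(max_rounds):
        says_ad = any(s > t1 for s in first_sims)        # S1051
        says_non_ad = any(s > t2 for s in second_sims)   # S1052
        if says_ad != says_non_ad:
            # The two judgments agree on a single label.
            label = "advertisement" if says_ad else "non-advertisement"
            return label, t1, t2
        # Contradiction: both fired (raise thresholds) or neither fired (lower them).
        delta = step if says_ad else -step
        t1 += delta
        t2 += delta
    return "undecided", t1, t2
```

When both data sets yield the same maximum similarity and the thresholds are equal, this simple rule cannot separate the two judgments, which is why the sketch returns "undecided" after the round limit instead of looping forever.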
Example 3
As shown in fig. 3, the method for determining an advertisement message of this embodiment is a further improvement of embodiment 2, and specifically:
after step S101 and before step S1021, the method further includes:
S1020, acquiring user historical data;
wherein the user historical data comprises at least one of historical account information of the user, the number of historical messages sent by the user in different time periods, and the historical number of visits to the seller;
step S1021 specifically includes:
S10211, extracting first feature data corresponding to the historical messages and the user historical data;
the first feature data comprises at least one of the length of the historical message, the frequency of repeated occurrences of the historical message within a set time window, whether the historical message contains a keyword, the number of keywords contained in the historical message, the proportion of keywords in the historical message, the historical account information of the user, the number of historical messages sent by the user in different time periods, and the historical number of visits to the seller;
S10212, processing the first feature data to obtain second feature data;
specifically, part of the first feature data is deleted to obtain the second feature data, and/or a plurality of mutually correlated first feature data are merged into one first feature data to obtain the second feature data.
For example, because the user's age has little bearing on whether a message sent by the user is an advertisement, the user age information in the user historical account information may be deleted;
when some first feature data are exactly or highly correlated, a plurality of first feature data need to be merged into one. For example, if the number of messages sent by the user is positively correlated with the number of visits to the seller, the two can be represented by just one of them, either the number of messages sent by the user or the number of visits to the seller.
In addition, when deleting one first feature data would cause another first feature data to be overestimated, the number of samples should be increased instead of reducing the dimensionality of the first feature data; in that case the number of messages sent by the user and the number of visits to the seller both need to be kept.
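The two operations of step S10212 (deleting irrelevant features and merging correlated ones) can be sketched as below. The use of the Pearson correlation coefficient, the 0.9 cut-off, and the column names are assumptions; the embodiment only speaks of features that are exactly or highly correlated.

```python
import math
from typing import Dict, List, Sequence, Set


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def reduce_features(
    columns: Dict[str, Sequence[float]],
    drop: Set[str],
    corr_threshold: float = 0.9,  # hypothetical cut-off for "highly correlated"
) -> List[str]:
    """Return the names of the second feature data kept from the first feature data."""
    kept: List[str] = []
    for name, values in columns.items():
        if name in drop:
            continue  # deleted as irrelevant, e.g. user age
        # Merge: a feature highly correlated with one already kept is represented by it.
        if any(abs(pearson(values, columns[k])) >= corr_threshold for k in kept):
            continue
        kept.append(name)
    return kept
```

In this sketch the first feature of a correlated group, in dictionary order, is the one that represents the group; which representative to keep is a design choice the embodiment leaves open.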
S10213, mapping the historical message into a corresponding first numerical sequence according to the second feature data;
specifically, the fields of a historical message are generally character strings composed of digits, Chinese characters, English letters, symbols and the like, so the non-numeric fields of the historical message need to be converted into a form that can be processed numerically, such as a numerical sequence.
For example, suppose the numerical sequence of a historical message consists, in order, of the length of the historical message, the frequency of repeated occurrences of the historical message within a set time window, whether the historical message contains a keyword, the number of keywords contained in the historical message, the proportion of keywords in the historical message, the historical account information of the user, the number of historical messages sent by the user in different time periods, and the historical number of visits to the seller, with the values 10, 4, 1, 3, 0.7, 001, 8 and 12 respectively. The historical message is then represented by the numerical sequence (10, 4, 1, 3, 0.7, 001, 8, 12). Every other historical message likewise has its own numerical sequence.
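The mapping of step S10213 can be sketched as follows, with the features in the same order as in the example above. The `UserData` fields, the exact-match definition of repetition and the character-share definition of the keyword proportion are assumptions, since the embodiment does not define how each feature is computed.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class UserData:
    """Hypothetical container for the user data of step S1020."""
    account_code: int    # numeric code standing in for the account information
    messages_sent: int   # number of messages sent in the time period
    seller_visits: int   # number of visits to the seller


def to_numerical_sequence(
    message: str,
    recent_messages: Sequence[str],  # messages inside the set time window
    keywords: Sequence[str],
    user: UserData,
) -> List[float]:
    length = len(message)
    # Repetition is counted as exact matches inside the window (assumption).
    repeat_frequency = sum(1 for m in recent_messages if m == message)
    keyword_count = sum(message.count(k) for k in keywords)
    contains_keyword = 1 if keyword_count > 0 else 0
    # Proportion is the share of characters covered by keywords (assumption).
    keyword_chars = sum(message.count(k) * len(k) for k in keywords)
    keyword_ratio = keyword_chars / length if length else 0.0
    return [
        length,
        repeat_frequency,
        contains_keyword,
        keyword_count,
        keyword_ratio,
        user.account_code,
        user.messages_sent,
        user.seller_visits,
    ]
```

If keywords overlap one another, the character-share ratio in this sketch can exceed 1, so a real keyword list would need to be free of overlaps or the ratio would need to be capped.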
Some fields of a historical message, such as the network operator, are categorical: their values cannot be compared by size and have no rank or degree. Such fields can be represented by constructed variables that take the values 0 and 1, for example 000 for China Mobile, 100 for China Unicom, 010 for China Telecom and 001 for China Tietong.
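The operator codes above correspond to dummy-variable coding with one reference category that is encoded as all zeros. A minimal sketch follows; the lowercase operator labels and the function name are hypothetical.

```python
from typing import List, Sequence

# Order chosen so that the codes match the example: the first entry is the
# all-zero reference category. The labels themselves are hypothetical.
OPERATORS = ["mobile", "unicom", "telecom", "tietong"]


def dummy_encode(value: str, categories: Sequence[str] = OPERATORS) -> List[int]:
    """Encode a categorical value with len(categories) - 1 variables of 0 and 1."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories[1:]]
```

For instance, `dummy_encode("unicom")` gives `[1, 0, 0]`, matching the code 100 in the example.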
In addition, a bag-of-words model can be used to extract the words of each historical message and convert them into word vectors. The cosine of the angle between the word vectors is then used to calculate the similarity between each historical message and the plain-text advertisement messages preset in the system. This similarity is a value between 0 and 1 and can also serve as one feature dimension of the numerical sequence.
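A minimal sketch of this bag-of-words similarity is given below. Splitting on whitespace is an assumption made to keep the sketch self-contained; Chinese text would need a word segmenter instead, and taking the maximum over the preset advertisement texts is likewise only one possible way to reduce the comparison to a single feature value.

```python
import math
from collections import Counter
from typing import Sequence


def bag_of_words(text: str) -> Counter:
    """Word-count vector; whitespace tokenisation is an assumption."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0 to 1 for counts)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def max_ad_similarity(message: str, ad_texts: Sequence[str]) -> float:
    """Highest similarity between a message and the preset plain-text advertisements."""
    bow = bag_of_words(message)
    return max((cosine(bow, bag_of_words(ad)) for ad in ad_texts), default=0.0)
```

Because word counts are never negative, the result always lies between 0 and 1, as the text requires for this feature dimension.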
S10214, performing classification training on the first numerical sequences of the plurality of historical messages using a Bayesian classification algorithm or a logistic regression algorithm to obtain the advertisement data set and the non-advertisement data set.
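Step S10214 can be illustrated with a small logistic regression trained by batch gradient descent. This is a sketch under assumptions: the learning rate, the number of epochs, the 0.5 cut-off and the availability of labelled sequences (for example from the manual screening mentioned later in this example) are not specified by the embodiment, and a Bayesian classifier could be substituted.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(
    X: Sequence[Vector], y: Sequence[int], lr: float = 0.1, epochs: int = 2000
) -> Tuple[List[float], float]:
    """Fit weights and bias on labelled sequences (1 = advertisement, 0 = not)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def split_datasets(
    sequences: Sequence[Vector], w: Sequence[float], b: float, cutoff: float = 0.5
) -> Tuple[List[Vector], List[Vector]]:
    """Split numerical sequences into (advertisement set, non-advertisement set)."""
    ads: List[Vector] = []
    non_ads: List[Vector] = []
    for s in sequences:
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, s)) + b)
        (ads if p > cutoff else non_ads).append(s)
    return ads, non_ads
```

In practice the feature dimensions would usually be scaled to comparable ranges before training, since plain gradient descent converges slowly when one feature is much larger than the others.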
After step S103 and before step S104, the method further includes:
S1031, obtaining current data of the user;
wherein the current data of the user comprises at least one of the current account information of the user, the number of current messages sent by the user in different time periods, and the current number of visits to the seller;
S1032, extracting third feature data corresponding to the current message and the current data of the user;
wherein the third feature data comprises at least one of the length of the current message, the frequency with which the current message repeats within a set time window, whether the current message contains a keyword, the number of keywords contained in the current message, the proportion of keywords in the current message, the current account information of the user, the number of current messages sent by the user in different time periods, and the current number of visits to the seller;
S1033, mapping the current message into a corresponding second digit sequence according to the third feature data;
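Steps S1031 to S1033 can be sketched as a single function that turns a message and the accompanying user data into a number sequence. The function name, the keyword list, and the definition of the keyword proportion (keyword characters divided by message length) are assumptions made for illustration; the patent does not define how the proportion is computed.

```python
def message_features(message, window_messages, keywords, user_data):
    """Map a message plus user data to a number sequence (hypothetical layout)."""
    length = len(message)
    # How often the identical message appears in the set time window.
    repeats = window_messages.count(message)
    hits = [message.count(keyword) for keyword in keywords]
    keyword_count = sum(hits)
    # Assumed definition: share of the message's characters covered by keywords.
    keyword_chars = sum(n * len(keyword) for n, keyword in zip(hits, keywords))
    proportion = keyword_chars / length if length else 0.0
    return [length, repeats, int(keyword_count > 0), keyword_count, proportion, *user_data]

# Example: user_data holds the number of messages sent and the number of seller visits.
sequence = message_features(
    "coupon coupon",
    ["coupon coupon", "hello", "coupon coupon"],
    ["coupon", "discount"],
    [8, 12],
)
```

The same function could be applied to historical messages to produce the first number sequences, so that both kinds of sequence share one layout.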
in addition, step S1041 specifically includes:
calculating a first similarity of the second digit sequence to each first digit sequence in the advertisement data set using a logistic regression algorithm;
step S1042 specifically includes:
a logistic regression algorithm is used to calculate a second similarity of the second digit sequence to each of the first digit sequences in the non-advertising data set.
Specifically, the binary classification form of the logistic regression algorithm is used to calculate the first similarity and the second similarity. In this embodiment, the first batch of advertisement and non-advertisement data sets may be selected by manual screening, and subsequent advertisement and non-advertisement data sets may be obtained from the classification results of the previous period, so that only a small amount of manual intervention is required while accuracy is maintained.
In this embodiment, a plurality of historical messages received in an instant messaging system are obtained and classified to obtain an advertisement data set and a non-advertisement data set. A received current message is then obtained, and a plurality of first similarities between the current message and the historical messages in the advertisement data set, and a plurality of second similarities between the current message and the historical messages in the non-advertisement data set, are calculated respectively. When a first similarity greater than the first set threshold exists among the plurality of first similarities, the current message is determined to be an advertisement message; when a second similarity greater than the second set threshold exists among the plurality of second similarities, the current message is determined to be a non-advertisement message. In this way, advertisement messages can be effectively identified and intercepted, the harassment they cause is reduced, and the spread of the grey industry is effectively curbed without affecting certain seller metrics (such as the reply rate).
Example 4
The application environment of the embodiment is an instant messaging system configured in online shopping software and used for facilitating direct communication between a user and a seller, and is mainly used for effectively identifying and intercepting advertisement information received in the instant messaging system (particularly, an instant messaging system used by a seller).
As shown in fig. 4, the advertisement message determination system of the present embodiment includes a history message obtaining module 1, a data set obtaining module 2, a current message obtaining module 3, a calculating module 4, a judging module 5, and a determining module 6.
The history message acquisition module 1 is used for acquiring a plurality of received history messages;
the data set acquisition module 2 is used for processing the historical messages to obtain an advertisement data set;
the current message obtaining module 3 is used for obtaining the received current message;
the calculation module 4 is configured to calculate a plurality of first similarities between the current message and the historical messages in the advertisement data set;
the judging module 5 is configured to judge whether a first similarity greater than a first set threshold exists among the plurality of first similarities, and if so, to invoke the determining module 6 to determine that the current message is an advertisement message; otherwise, to invoke the determining module 6 to determine that the current message is a non-advertisement message. The first set threshold can be adjusted directly according to actual needs.
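The interplay of the calculation, judging and determining modules can be sketched as follows. The similarity measure used here, 1 / (1 + Euclidean distance) between number sequences, is purely a stand-in chosen for illustration; the patent does not prescribe it, and any function returning larger values for more similar messages could be passed in instead.

```python
import math

def distance_similarity(seq_a, seq_b):
    """Hypothetical similarity in (0, 1]: identical sequences score 1.0."""
    return 1.0 / (1.0 + math.dist(seq_a, seq_b))

def is_advertisement(current, ad_set, first_threshold, similarity=distance_similarity):
    """True if any first similarity exceeds the first set threshold."""
    first_similarities = [similarity(current, historical) for historical in ad_set]
    return any(s > first_threshold for s in first_similarities)

ad_history = [[10, 4, 1], [2, 0, 0]]
flagged = is_advertisement([10, 4, 1], ad_history, first_threshold=0.8)
passed = is_advertisement([3, 0, 0], ad_history, first_threshold=0.8)
```

Raising `first_threshold` makes the check stricter, which mirrors the statement that the threshold can be adjusted according to actual needs.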
In this embodiment, a plurality of historical messages received in an instant messaging system are obtained and processed to obtain an advertisement data set. A received current message is then obtained, and a plurality of first similarities between the current message and the historical messages in the advertisement data set are calculated. When a first similarity greater than the first set threshold exists among the plurality of first similarities, the current message is determined to be an advertisement message. In this way, advertisement messages can be effectively identified and intercepted, the harassment they cause is reduced, and the spread of the grey industry is effectively curbed without affecting certain seller metrics (such as the reply rate).
Example 5
As shown in fig. 5, the advertisement message determination system of the present embodiment is a further improvement of embodiment 4, specifically:
the system for determining advertisement messages further comprises a threshold adjustment module 7.
The data set acquisition module 2 is also used for acquiring a non-advertisement data set;
the calculation module 4 is further configured to calculate a plurality of second similarities between the current message and the historical messages in the non-advertisement data set;
the judging module 5 is further configured to judge whether a second similarity greater than a second set threshold exists among the plurality of second similarities, and if so, to invoke the determining module 6 to determine that the current message is a non-advertisement message; otherwise, to invoke the determining module 6 to determine that the current message is an advertisement message. The second set threshold can be adjusted directly according to actual needs.
The threshold adjustment module 7 is configured to adjust the first set threshold and/or the second set threshold when the first similarity is greater than the first set threshold and the second similarity is greater than the second set threshold, or when the first similarity is less than or equal to the first set threshold and the second similarity is less than or equal to the second set threshold, until the first similarity is greater than the first set threshold and the second similarity is less than or equal to the second set threshold, or the first similarity is less than or equal to the first set threshold and the second similarity is greater than the second set threshold. In addition, the first set threshold value and the second set threshold value may be equal to each other.
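The conflict handling performed by the threshold adjustment module can be sketched as a small loop. The patent does not say in which direction or by how much the thresholds are moved, so the rule used here (raise both thresholds by a fixed step when both tests fire, lower both when neither fires, give up after a bounded number of rounds) is an assumption, as are the function name and default values.

```python
def resolve_with_adjustment(first_similarity, second_similarity,
                            first_threshold, second_threshold,
                            step=0.05, max_rounds=100):
    """Adjust both thresholds until exactly one of the two tests succeeds."""
    for _ in range(max_rounds):
        looks_like_ad = first_similarity > first_threshold
        looks_like_non_ad = second_similarity > second_threshold
        if looks_like_ad != looks_like_non_ad:
            label = "advertisement" if looks_like_ad else "non-advertisement"
            return label, first_threshold, second_threshold
        # Both fired: tighten. Neither fired: loosen. (Assumed policy.)
        delta = step if looks_like_ad else -step
        first_threshold += delta
        second_threshold += delta
    return "undecided", first_threshold, second_threshold

label_both, _, _ = resolve_with_adjustment(0.9, 0.7, 0.5, 0.5)
label_neither, _, _ = resolve_with_adjustment(0.3, 0.1, 0.5, 0.5)
label_tie, _, _ = resolve_with_adjustment(0.6, 0.6, 0.5, 0.5)
```

The bounded loop matters when the two similarities are equal and the thresholds move together: no adjustment can separate them, so the sketch reports "undecided" instead of looping forever.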
Specifically, when both normal users and grey-industry users are on holiday over the New Year period, relatively few advertisement messages are received in the instant messaging system. In this case the first set threshold and/or the second set threshold can be raised appropriately; otherwise the number of normal users mistakenly blocked would be too large to meet requirements. During particular festivals (e.g., twenty-first, etc.), the volume of received advertisement messages is larger, and the first set threshold and/or the second set threshold can be lowered appropriately to increase the probability of detecting advertisement messages.
In this embodiment, a plurality of historical messages received in an instant messaging system are obtained and classified to obtain an advertisement data set and a non-advertisement data set. A received current message is then obtained, and a plurality of first similarities between the current message and the historical messages in the advertisement data set, and a plurality of second similarities between the current message and the historical messages in the non-advertisement data set, are calculated respectively. When a first similarity greater than the first set threshold exists among the plurality of first similarities, the current message is determined to be an advertisement message; when a second similarity greater than the second set threshold exists among the plurality of second similarities, the current message is determined to be a non-advertisement message. In this way, advertisement messages can be effectively identified and intercepted, the harassment they cause is reduced, and the spread of the grey industry is effectively curbed without affecting certain seller metrics (such as the reply rate).
Example 6
As shown in fig. 6, the advertisement message determination system of the present embodiment is a further improvement of embodiment 5, specifically:
the advertisement message determination system of the present embodiment further includes a user data acquisition module 8.
The data set acquisition module 2 includes a feature data extraction unit 21, a feature data processing unit 22, a mapping unit 23, and a classification training unit 24.
The user data acquisition module 8 is used for acquiring user historical data;
wherein the user history data comprises at least one of user historical account information, the number of historical messages sent by the user in different time periods, and the historical access times of the seller;
the feature data extraction unit 21 is configured to extract first feature data corresponding to the history message and the user history data;
the first characteristic data comprises at least one of the length of the historical messages, the repeated occurrence frequency of the historical messages in a set time window, whether the historical messages contain keywords, the number of the keywords contained in the historical messages, the proportion of the keywords in the historical messages, the historical account information of the user, the number of the historical messages sent by the user in different time periods and the historical access times of the seller;
the feature data processing unit 22 is configured to process the first feature data to obtain second feature data;
specifically, the feature data processing unit 22 is configured to delete part of the first feature data to obtain the second feature data, and/or to merge a plurality of mutually correlated first feature data into a single first feature data to obtain the second feature data.
For example, since the relevance of the user age information to whether the message sent by the user is advertisement information is not large, the user age information in the user history account information may be deleted;
When some first feature data are exactly or highly correlated with one another, the plurality of first feature data need to be merged into a single first feature data. For example, if the number of messages sent by the user and the number of visits to the seller are positively correlated, the two may be treated as equivalent and represented by either the number of messages sent by the user or the number of visits to the seller alone.
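One way to picture the merging of correlated feature data is to compute the Pearson correlation between feature columns and keep only one column from each highly correlated group. The 0.95 limit, the column names and the sample values below are assumptions chosen for the illustration; the patent does not name a correlation measure or a cutoff.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / (spread_x * spread_y)

def drop_correlated(columns, limit=0.95):
    """Keep a column only if it is not highly correlated with one already kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < limit for other in kept.values()):
            kept[name] = values
    return kept

first_feature_data = {
    "messages_sent": [1, 2, 3, 4],
    "seller_visits": [2, 4, 6, 8],   # moves in lockstep with messages_sent
    "message_length": [5, 1, 4, 2],
}
second_feature_data = drop_correlated(first_feature_data)
```

Here `seller_visits` is dropped because it is perfectly correlated with `messages_sent`, while `message_length` is retained.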
In addition, when deleting a certain first feature data would cause another first feature data to be overestimated, the number of samples should be increased instead of reducing the dimensionality of the first feature data; that is, both the number of messages sent by the user and the number of visits to the seller need to be considered at the same time.
The mapping unit 23 is configured to map the historical message into a corresponding first number sequence according to the second feature data;
specifically, the fields of a historical message are generally character strings composed of digits, Chinese characters, English letters, symbols and the like, so the non-numeric fields of the historical message need to be converted into a numerically interpretable form, such as a number sequence.
For example, suppose the number sequence corresponding to a historical message comprises, in order: the length of the historical message, the frequency with which the historical message repeats within a set time window, whether the historical message contains a keyword, the number of keywords contained in the historical message, the proportion of keywords in the historical message, the historical account information of the user, the number of historical messages sent by the user in different time periods, and the historical number of visits to the seller. If these take the values 10, 4, 1, 3, 0.7, 001, 8 and 12, the historical message is represented by the number sequence (10, 4, 1, 3, 0.7, 001, 8, 12). Every other historical message likewise corresponds to its own number sequence.
Some fields included in the historical message, such as the network operator, cannot be directly compared in size and have no rank or degree. They can be represented by constructing dummy variables taking the values 0 and 1, for example 000 for the mobile operator, 100 for the Unicom operator, 010 for the telecom operator and 001 for the Tietong operator.
In addition, a bag-of-words model can be used to extract the words in each historical message and convert them into word vectors. The cosine of the angle between the word vectors is then used to calculate the similarity between each historical message and the plain-text advertisement messages set by the system. This similarity is expressed as a value between 0 and 1 and can also be used as a feature dimension in the number sequence.
The classification training unit 24 is configured to perform classification training on the first digit sequences of the plurality of historical messages by using a bayesian classification algorithm or a logistic regression algorithm, and obtain an advertisement data set and a non-advertisement data set.
The user data obtaining module 8 is further configured to obtain current user data, where the current user data includes at least one of current user account information, the number of current messages sent by the user in different time periods, and the current number of visits to the seller;
the feature data extracting unit 21 is further configured to extract third feature data corresponding to the current message and the current user data;
the third characteristic data also comprises at least one of the length of the current message, the frequency of repeated appearance of the current message in a set time window, whether the current message contains a keyword, the number of the keywords contained in the current message, the proportion of the keywords in the current message, the current account information of the user, the number of the current messages sent by the user in different time periods and the current access times of the seller;
the mapping unit 23 is further configured to map the current message into a corresponding second number sequence according to the third feature data;
the calculation module 4 is configured to calculate a first similarity between the second digit sequence and each first digit sequence in the advertisement data set by using a logistic regression algorithm;
the calculation module 4 is further configured to calculate a second similarity between the second digit sequence and each first digit sequence in the non-advertisement data set by using a logistic regression algorithm.
Specifically, the binary classification form of the logistic regression algorithm is used to calculate the first similarity and the second similarity. In this embodiment, the first batch of advertisement and non-advertisement data sets may be selected by manual screening, and subsequent advertisement and non-advertisement data sets may be obtained from the classification results of the previous period, so that only a small amount of manual intervention is required while accuracy is maintained.
In this embodiment, a plurality of historical messages received in an instant messaging system are obtained and classified to obtain an advertisement data set and a non-advertisement data set. A received current message is then obtained, and a plurality of first similarities between the current message and the historical messages in the advertisement data set, and a plurality of second similarities between the current message and the historical messages in the non-advertisement data set, are calculated respectively. When a first similarity greater than the first set threshold exists among the plurality of first similarities, the current message is determined to be an advertisement message; when a second similarity greater than the second set threshold exists among the plurality of second similarities, the current message is determined to be a non-advertisement message. In this way, advertisement messages can be effectively identified and intercepted, the harassment they cause is reduced, and the spread of the grey industry is effectively curbed without affecting certain seller metrics (such as the reply rate).
Example 7
Fig. 7 is a schematic structural diagram of an electronic device according to embodiment 7 of the present invention. The electronic device comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to implement the method for determining the advertisement message in any one of embodiments 1 to 3. The electronic device 30 shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiment of the present invention.
As shown in fig. 7, the electronic device 30 may be embodied in the form of a general purpose computing device, which may be, for example, a server device. The components of the electronic device 30 may include, but are not limited to: the at least one processor 31, the at least one memory 32, and a bus 33 connecting the various system components (including the memory 32 and the processor 31).
The bus 33 includes a data bus, an address bus, and a control bus.
The memory 32 may include volatile memory, such as Random Access Memory (RAM)321 and/or cache memory 322, and may further include Read Only Memory (ROM) 323.
Memory 32 may also include a program/utility 325 having a set (at least one) of program modules 324, such program modules 324 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The processor 31 executes various functional applications and data processing, such as a method for determining an advertisement message in any one of embodiments 1 to 3 of the present invention, by running a computer program stored in the memory 32.
The electronic device 30 may also communicate with one or more external devices 34 (e.g., keyboard, pointing device, etc.). Such communication may be through input/output (I/O) interfaces 35. The electronic device 30 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) via network adapter 36. As shown in FIG. 7, network adapter 36 communicates with the other modules of the electronic device 30 via bus 33. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 30, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, and data backup storage systems, etc.
It should be noted that although several units/modules or sub-units/modules of the electronic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the units/modules described above may be embodied in one unit/module. Conversely, the features and functions of one unit/module described above may be further divided and embodied by a plurality of units/modules.
Example 8
The present embodiment provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the steps in the method for determining an advertisement message in any one of embodiments 1 to 3.
More specific examples of the readable storage medium may include, but are not limited to: a portable disk, a hard disk, random access memory, read only memory, erasable programmable read only memory, optical storage device, magnetic storage device, or any suitable combination of the foregoing.
In a possible implementation, the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps of the method for determining an advertisement message as in any of the embodiments 1 to 3 when the program product is run on the terminal device.
The program code for carrying out the invention may be written in any combination of one or more programming languages, and may execute entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on the remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that these are by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.

Claims (14)

1. A method for determining an advertisement message, the method comprising:
acquiring a plurality of received historical messages, and processing the historical messages to acquire an advertisement data set;
obtaining a received current message, and calculating a plurality of first similarities between the current message and historical messages in the advertisement data set;
judging whether a first similarity larger than a first set threshold exists in the plurality of first similarities, and if so, determining that the current message is an advertisement message; otherwise, determining that the current message is a non-advertisement message.
2. The method for determining advertising messages as recited in claim 1, wherein, after the step of processing the historical messages, the method further comprises:
obtaining a non-advertising data set;
after the step of obtaining the received current message, the method further comprises:
calculating a plurality of second similarities for the current message to historical messages in the non-advertising dataset;
judging whether a second similarity larger than a second set threshold exists in the plurality of second similarities, and if so, determining that the current message is a non-advertisement message; otherwise, determining the current message as an advertisement message;
when the first similarity is greater than the first set threshold and the second similarity is greater than the second set threshold, or when the first similarity is less than or equal to the first set threshold and the second similarity is less than or equal to the second set threshold, adjusting the first set threshold and/or the second set threshold until the first similarity is greater than the first set threshold and the second similarity is less than or equal to the second set threshold, or the first similarity is less than or equal to the first set threshold and the second similarity is greater than the second set threshold.
3. The method of determining an advertisement message according to claim 2, wherein the first set threshold value is equal to the second set threshold value.
4. The method for determining advertising messages according to claim 2, wherein, after the step of acquiring the plurality of received historical messages and before the step of processing the historical messages, the method further comprises:
acquiring user history data, wherein the user history data comprises at least one of user historical account information, the number of historical messages sent by a user in different time periods and the historical access times of a seller;
the step of processing the historical messages to obtain an advertising data set and the step of obtaining a non-advertising data set include:
extracting first characteristic data corresponding to the historical messages and the user historical data;
wherein the first characteristic data comprises at least one of a length of the historical message, a frequency of repeated occurrences of the historical message within a set time window, whether the historical message contains a keyword, a number of keywords contained in the historical message, a percentage of keywords in the historical message, the user historical account information, a number of historical messages sent by the user over different time periods, and the historical number of visits to the seller;
processing the first characteristic data to obtain second characteristic data;
mapping the historical message into a corresponding first numerical sequence according to the second characteristic data;
and carrying out classification training on the first digital sequences of the plurality of historical messages by adopting a Bayesian classification algorithm or a logistic regression algorithm to obtain the advertisement data set and the non-advertisement data set.
5. The method for determining advertisement messages according to claim 4, wherein the step of processing the first characteristic data and obtaining the second characteristic data comprises:
deleting part of the first feature data to obtain the second feature data, and/or merging a plurality of mutually correlated first feature data into a single first feature data to obtain the second feature data.
6. The method of determining advertising messages as recited in claim 4, wherein, after the step of obtaining the received current message and before the step of calculating a plurality of first similarities between the current message and historical messages in the advertising data set, the method further comprises:
acquiring current data of a user, wherein the current data of the user comprises at least one of current account information of the user, the number of current messages sent by the user in different time periods, and the current number of visits to a seller;
extracting third characteristic data corresponding to the current message and the current user data;
wherein the third characteristic data further comprises at least one of a length of the current message, a frequency of repeated occurrences of the current message within a set time window, whether the current message contains a keyword, a number of keywords contained in the current message, a percentage of keywords in the current message, current account information of the user, a number of the current messages sent by the user in different time periods, and the current number of visits to the seller;
mapping the current message into a corresponding second digital sequence according to the third characteristic data;
the step of calculating a plurality of first similarities for the current message to historical messages in the advertisement dataset comprises:
calculating a first similarity of the second digit sequence to each first digit sequence in the advertisement data set using a logistic regression algorithm;
said step of calculating a plurality of second similarities for said current message to historical messages in said non-advertising dataset comprises:
calculating a second similarity of the second digit sequence to each first digit sequence in the non-advertising data set using a logistic regression algorithm.
7. A system for determining an advertisement message, characterized by comprising a historical message acquisition module, a data set acquisition module, a current message acquisition module, a calculation module, a judgment module and a determination module;
the history message acquisition module is used for acquiring a plurality of received history messages;
the data set acquisition module is used for processing the historical messages to acquire an advertisement data set;
the current message acquisition module is used for acquiring the received current message;
the computing module is configured to compute a plurality of first similarities between the current message and historical messages in the advertisement dataset;
the judging module is used for judging whether a first similarity larger than a first set threshold exists in the plurality of first similarities, and if so, the determining module is called to determine that the current message is an advertisement message; otherwise, determining that the current message is a non-advertisement message.
8. The advertisement message determination system of claim 7, wherein the data set acquisition module is further configured to obtain a non-advertisement data set;
the computing module is further configured to compute a plurality of second similarities between the current message and historical messages in the non-advertising dataset;
the judging module is further configured to judge whether a second similarity greater than a second set threshold exists in the plurality of second similarities, and if so, invoke the determining module to determine that the current message is a non-advertisement message; otherwise, determining the current message as an advertisement message;
the determination system further comprises a threshold adjustment module;
the threshold adjusting module is configured to adjust the first set threshold and/or the second set threshold when the first similarity is greater than the first set threshold and the second similarity is greater than the second set threshold, or when the first similarity is less than or equal to the first set threshold and the second similarity is less than or equal to the second set threshold, until the first similarity is greater than the first set threshold and the second similarity is less than or equal to the second set threshold, or the first similarity is less than or equal to the first set threshold and the second similarity is greater than the second set threshold.
9. The advertisement message determination system of claim 8, wherein the first set threshold is equal to the second set threshold.
10. The advertisement message determination system of claim 8, wherein the determination system comprises a user data acquisition module;
the data set acquisition module comprises a characteristic data extraction unit, a characteristic data processing unit, a mapping unit and a classification training unit;
the user data acquisition module is used for acquiring user historical data, wherein the user historical data comprises at least one of user historical account information, the number of historical messages sent by a user in different time periods, and the historical number of visits to a seller;
the characteristic data extraction unit is used for extracting first characteristic data corresponding to the historical message and the user historical data;
wherein the first characteristic data comprises at least one of a length of the historical message, a frequency of repeated occurrences of the historical message within a set time window, whether the historical message contains a keyword, a number of keywords contained in the historical message, a percentage of keywords in the historical message, the user historical account information, a number of historical messages sent by the user over different time periods, and the historical number of visits to the seller;
the characteristic data processing unit is used for processing the first characteristic data to obtain second characteristic data;
the mapping unit is used for mapping the historical message into a corresponding first numerical sequence according to the second characteristic data;
the classification training unit is used for performing classification training on the first numerical sequences of the plurality of historical messages by adopting a Bayesian classification algorithm or a logistic regression algorithm to obtain the advertisement data set and the non-advertisement data set.
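The mapping and classification-training units of claim 10 can be illustrated with a minimal sketch, assuming a small hypothetical keyword list, hand-picked scaling constants, and a plain stochastic-gradient logistic regression written from scratch. The names `KEYWORDS`, `to_sequence`, `train_logistic`, `predict` and `split_datasets`, and the toy messages, are not taken from the patent.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical keyword list; the claims do not enumerate keywords.
KEYWORDS = {"discount", "promo", "free", "coupon"}


def to_sequence(message: str, repeat_count: int, msgs_last_hour: int) -> List[float]:
    """Map one message plus user data to a fixed-length numerical sequence."""
    words = [w.strip(".,!?").lower() for w in message.split()]
    hits = sum(1 for w in words if w in KEYWORDS)
    return [
        min(len(message) / 100.0, 1.0),       # message length, scaled
        min(repeat_count / 10.0, 1.0),        # repeats within the time window
        1.0 if hits else 0.0,                 # whether a keyword is present
        min(hits / 5.0, 1.0),                 # number of keywords, scaled
        hits / len(words) if words else 0.0,  # share of keywords
        min(msgs_last_hour / 50.0, 1.0),      # messages sent by the user
    ]


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Probability that sequence x belongs to an advertisement."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(
    xs: Sequence[Sequence[float]], ys: Sequence[int],
    lr: float = 0.5, epochs: int = 500,
) -> Tuple[List[float], float]:
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            g = predict(w, b, x) - y
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def split_datasets(seqs, w, b):
    """Partition sequences into advertisement and non-advertisement sets."""
    ads = [s for s in seqs if predict(w, b, s) > 0.5]
    non_ads = [s for s in seqs if predict(w, b, s) <= 0.5]
    return ads, non_ads


# Toy labelled history: (message, repeats, messages last hour, is_ad).
history = [
    ("Free coupon! Huge discount promo today", 8, 40, 1),
    ("promo promo free discount", 9, 45, 1),
    ("When will my order ship?", 0, 2, 0),
    ("Is this jacket available in size M", 1, 3, 0),
]
seqs = [to_sequence(m, r, n) for m, r, n, _ in history]
labels = [y for *_, y in history]
w, b = train_logistic(seqs, labels)
ad_set, non_ad_set = split_datasets(seqs, w, b)
```

Scaling every feature into the range 0 to 1 is a design choice made here so that the simple gradient descent stays stable. A Bayesian classifier, which the claim also allows, could be substituted for `train_logistic` without changing the mapping step.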
11. The system for determining an advertisement message according to claim 10, wherein the characteristic data processing unit is configured to obtain the second characteristic data by deleting part of the first characteristic data, and/or by merging a plurality of mutually correlated items of the first characteristic data into a single item of first characteristic data.
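One way to picture the processing in claim 11 is the sketch below. It assumes that "correlated" means a high absolute Pearson correlation between feature columns and that merged features are replaced by their mean. The claim fixes neither choice, and `pearson`, `reduce_features`, `drop` and `corr_threshold` are illustrative names.

```python
import math
from typing import List, Sequence, Tuple


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length columns (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def reduce_features(
    rows: Sequence[Sequence[float]],
    drop: Sequence[int] = (),
    corr_threshold: float = 0.95,   # assumed cut-off for "correlated"
) -> Tuple[List[List[float]], List[List[int]]]:
    """Delete the columns in `drop`, then merge highly correlated columns."""
    cols = [list(c) for c in zip(*rows)]
    groups: List[List[int]] = []    # each group becomes one output feature
    for i in range(len(cols)):
        if i in drop:
            continue
        for g in groups:
            if abs(pearson(cols[g[0]], cols[i])) > corr_threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    # Assumption: a merged feature is the mean of the columns in its group.
    reduced = [[sum(row[i] for i in g) / len(g) for g in groups] for row in rows]
    return reduced, groups


first_data = [
    [1, 2, 5, 0],
    [2, 4, 3, 0],
    [3, 6, 8, 0],
    [4, 8, 1, 0],
]
second_data, groups = reduce_features(first_data, drop=[3])
```

Here column 3 is deleted outright, columns 0 and 1 move together and are merged into one feature, and column 2 is kept on its own.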
12. The system for determining an advertisement message according to claim 10, wherein the user data acquisition module is further configured to acquire user current data, the user current data including at least one of user current account information, the number of current messages sent by the user in different time periods, and the current number of visits to the seller;
the characteristic data extraction unit is also used for extracting third characteristic data corresponding to the current message and the current user data;
wherein the third characteristic data comprises at least one of a length of the current message, a frequency of repeated occurrences of the current message within a set time window, whether the current message contains a keyword, a number of keywords contained in the current message, a percentage of keywords in the current message, the user current account information, the number of current messages sent by the user in different time periods, and the current number of visits to the seller;
the mapping unit is further configured to map the current message into a corresponding second numerical sequence according to the third characteristic data;
the computing module is configured to calculate a first similarity between the second numerical sequence and each first numerical sequence in the advertisement data set using a logistic regression algorithm;
the computing module is further configured to calculate a second similarity between the second numerical sequence and each first numerical sequence in the non-advertisement data set using a logistic regression algorithm.
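The claims do not spell out how a logistic regression algorithm yields a similarity between two numerical sequences. One possible reading, sketched below purely as an assumption, passes a weighted sum of element-wise absolute differences through the logistic function, so that identical sequences score close to 1 and distant ones close to 0. The `weights` and `bias` values are hypothetical and would in practice have to be learned from labelled pairs.

```python
import math
from typing import List, Sequence


def logistic_similarity(
    a: Sequence[float], b: Sequence[float],
    weights: Sequence[float], bias: float,
) -> float:
    """Similarity in (0, 1): logistic function of a weighted distance score."""
    z = bias - sum(w * abs(x - y) for w, x, y in zip(weights, a, b))
    return 1.0 / (1.0 + math.exp(-z))


def similarities(
    current: Sequence[float], dataset: Sequence[Sequence[float]],
    weights: Sequence[float], bias: float,
) -> List[float]:
    """Similarity of the current sequence to every sequence in a data set."""
    return [logistic_similarity(current, seq, weights, bias) for seq in dataset]


# Hypothetical parameters and toy sequences for illustration only.
weights, bias = [4.0, 4.0, 4.0], 3.0
current = [0.4, 0.8, 1.0]
ad_set = [[0.4, 0.8, 1.0], [0.3, 0.9, 1.0]]
non_ad_set = [[0.2, 0.0, 0.0]]

first_sims = similarities(current, ad_set, weights, bias)
second_sims = similarities(current, non_ad_set, weights, bias)
```

The resulting `first_sims` and `second_sims` lists are what the judging module would compare against the first and second set thresholds.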
13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of determining an advertisement message according to any of claims 1-6 when executing the computer program.
14. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the method of determining an advertisement message according to any one of claims 1 to 6.
CN201811158478.1A 2018-09-30 2018-09-30 Method, system, device and storage medium for determining advertisement message Active CN110971501B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811158478.1A CN110971501B (en) 2018-09-30 2018-09-30 Method, system, device and storage medium for determining advertisement message

Publications (2)

Publication Number Publication Date
CN110971501A true CN110971501A (en) 2020-04-07
CN110971501B CN110971501B (en) 2022-11-08

Family

ID=70029034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811158478.1A Active CN110971501B (en) 2018-09-30 2018-09-30 Method, system, device and storage medium for determining advertisement message

Country Status (1)

Country Link
CN (1) CN110971501B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929872A (en) * 2011-08-08 2013-02-13 阿里巴巴集团控股有限公司 Computer-implemented message filtering method, message filtering device and system
CN106844430A (en) * 2016-12-12 2017-06-13 天格科技(杭州)有限公司 A kind of improved real-time social platform advertisement and sensitive information quickly know method for distinguishing
CN107657286A (en) * 2017-10-19 2018-02-02 北京深极智能科技有限公司 A kind of advertisement recognition method and computer-readable recording medium
CN108470290A (en) * 2018-03-28 2018-08-31 百度在线网络技术(北京)有限公司 Commercial detection method, device and server

Also Published As

Publication number Publication date
CN110971501B (en) 2022-11-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant