CN108230037B

CN108230037B - Advertisement library establishing method, advertisement data identification method and storage medium

Info

Publication number: CN108230037B
Application number: CN201810031871.8A
Authority: CN
Inventors: 马恒
Original assignee: Beijing ByteDance Network Technology Co Ltd
Current assignee: Beijing ByteDance Network Technology Co Ltd
Priority date: 2018-01-12
Filing date: 2018-01-12
Publication date: 2022-10-11
Anticipated expiration: 2038-01-12
Also published as: CN108230037A

Abstract

The invention provides an advertisement library establishing method, an advertisement data identification method and a storage medium, wherein the advertisement library establishing method comprises the following steps: receiving corpora, and storing a first amount of corpora according to user granularity; calculating the information entropy of the stored first amount of corpora; and when the numerical value of the calculated information entropy is lower than a preset threshold value, putting the corpus or the trunk of the corpus into an advertisement library. The method provided by the invention classifies the materials by adopting the entropy model, can capture the advertisement messages from the messages sent by each user, can quickly and accurately mine a large amount of data without manually marking and screening the data, screens out the required data and adds the data into the advertisement library.

Description

Advertisement library establishing method, advertisement data identification method and storage medium

Technical Field

The invention relates to the field of network data processing, in particular to an advertisement library establishing method, an advertisement data identification method and a storage medium.

Background

Existing online game and live broadcast platforms have no ready-made, pre-labeled advertisement data set (advertisement library). If a conventional text classification method is to be used to classify received user utterances as advertisements or non-advertisements in order to create such a data set, the data must first be labeled. However, manually labeling millions of pieces of data is labor intensive and inefficient, and for data volumes approaching tens of millions of messages per day, manual labeling of advertisements is time-consuming and costly. Moreover, when a new type of advertisement appears, it is manually identified and added to the advertisement library only after a long delay, which greatly reduces the effectiveness of advertisement classification.

How to classify data by a method without manual supervision is an urgent problem to be solved.

Disclosure of Invention

In view of the above, the present invention provides an advertisement library establishing method and an advertisement data identifying method without manual supervision, so as to solve at least one defect in the prior art.

In one aspect, an embodiment of the present invention provides a method for establishing an advertisement library, where the method includes the following steps:

receiving corpora, and storing a first amount of corpora according to user granularity;

calculating the information entropy of the stored corpora of the first quantity; and

when the calculated numerical value of the information entropy is lower than a preset threshold value, putting the corpus or the trunk of the corpus into an advertisement library.

Preferably, the information entropy of the first number of corpora is calculated by the following formula:

$$H = -\sum_i p_i \log p_i$$

wherein i denotes the ith character in the corpus, and p_i is the probability of occurrence of the ith character.

Preferably, the method further comprises: and determining the probability that the user is the advertisement type user based on the calculated information entropy, and updating the advertisement user probability of the corresponding user in the advertisement user probability library based on the currently determined probability of the advertisement type user.

Preferably, the method further comprises: and for each received corpus, performing similarity matching with the corpus in the advertisement library, and identifying whether the corpus is the advertisement or not based on a matching result.

Preferably, the similarity matching step includes: calculating the similarity between the received corpus and the corpora in the advertisement library based on fuzzy matching with fuzzywuzzy.

Preferably, the method further comprises: and filtering out the linguistic data determined as the advertisements in the chat window.

Preferably, the step of updating the advertisement user probability of the corresponding user in the advertisement user probability library based on the currently determined probability of the advertisement type user comprises: if the user ID is not in the advertisement user probability library, or the probability recorded for the user ID in the advertisement user probability library is 0, recording the currently determined probability as the advertisement user probability of the user in the advertisement user probability library; if the user ID exists in the advertisement user probability library and the probability recorded for the user ID is not 0, updating the advertisement user probability of the user ID based on the following formula: updated advertisement user probability = (original advertisement user probability × original update times + currently determined advertisement user probability)/(original update times + 1).

On the other hand, an embodiment of the present invention provides an advertisement data identification method, including the following steps:

calculating the information entropy of the stored first amount of corpora; and

identifying whether the corpora are advertisement type corpora and/or identifying whether the user is an advertisement type user based on the calculated numerical value of the information entropy.

Preferably, the advertisement data identification method further includes: and when the calculated numerical value of the information entropy is lower than a preset threshold value, putting the corpus or the trunk of the corpus into an advertisement library.

In another aspect, the present invention further provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the method in any one of the above advertisement library establishing method and advertisement data identifying method.

The invention adopts the information entropy to identify the advertisement data, thereby rapidly and accurately mining a large amount of data without manually marking and screening the data, screening the advertisement data needing to be identified, and automatically and rapidly establishing the advertisement library.

It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to what has been particularly described hereinabove, and that the above and other objects that can be achieved with the present invention will be more clearly understood from the following detailed description.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

Drawings

Further objects, features and advantages of the present invention will become apparent from the following description of embodiments of the invention, with reference to the accompanying drawings, in which:

Fig. 1 schematically shows a flow chart of an advertisement library establishment and advertisement data identification method in an embodiment of the present invention.

Fig. 2 schematically shows a flow chart of an advertisement repository establishment and advertisement identification method in another embodiment of the present invention.

Detailed Description

The objects and functions of the present invention and methods for accomplishing the same will be apparent by reference to the exemplary embodiments. However, the present invention is not limited to the exemplary embodiments disclosed below; it can be implemented in different forms. The nature of the description is merely to assist those skilled in the relevant art in a comprehensive understanding of the specific details of the invention.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the drawings, the same reference numerals denote the same or similar parts, or the same or similar steps.

It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so relevant to the present invention are omitted.

Among the advertisement library establishment methods currently in use, one builds the library from the similarity between texts using a sliding window, and the time needed to establish the library increases quadratically with the size of the sliding window. This method is also insensitive to advertising users: for example, if a user sends a message every few seconds, those messages are diluted by the messages sent by other users, so that the advertising user, and hence the advertisements that user sends, are not easily identified.

In the embodiment of the invention, information entropy is used to classify and identify advertisement users and advertisement data at user granularity, and an advertisement library can be quickly established based on the identification result. Information entropy is a concept used in information theory to measure the amount of information; C. E. Shannon borrowed it from thermodynamics to represent the uncertainty of a source. Drawing on the thermodynamic concept of entropy, Shannon called the average amount of information remaining after redundancy is removed the information entropy, and gave a mathematical expression for calculating it.

According to the mathematical expression given by Shannon, the information entropy of any random variable X is defined as follows, in units of bits:

$$H(X) = -\sum_{x} p(x) \log p(x) \tag{1}$$

where H(X) is the value of the information entropy, called the entropy value for short, x is a value output by the random variable X, and p(x) is the probability of that value.

Here the logarithm is typically taken to base 2, giving units of bits. Other logarithm bases, with their corresponding units, may also be used, and values can be converted between them with the change-of-base formula.
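The conversion between logarithm bases can be written out explicitly. As a sketch, with H_b denoting the entropy computed with logarithm base b (a notation introduced here for illustration and not used elsewhere in this text):

```latex
H_b(X) = -\sum_{x} p(x) \log_b p(x)
       = -\sum_{x} p(x) \frac{\log_2 p(x)}{\log_2 b}
       = \frac{H_2(X)}{\log_2 b}
```

For example, taking b = e gives the entropy in nats, H_e(X) = H_2(X) ln 2, so 1 bit corresponds to about 0.693 nats.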

The more ordered the information in a source, the lower the entropy of the information.

In the embodiment of the invention, an entropy model is provided, which uses information entropy to distinguish whether a message sent by a user in a game or a live platform is an advertisement, so that a normal player (i.e. a normal user) and an abnormal player (i.e. an abnormal user, such as an advertisement player or an advertisement user) can be distinguished.

The messages sent by an abnormal player, namely an advertising player, are similar to one another, so the information entropy does not rise steeply with the number of advertisements sent and instead levels off. Therefore, as the number of messages from an abnormal player grows, the calculated entropy of that player increases only slightly. In addition, an abnormal player may add some random words to the messages so that the number of distinct words increases; but because the backbone of the messages remains the same, the entropy value of the abnormal player is still small and tends toward a certain fixed value.

By contrast, the messages sent by normal players are varied, and each new message is likely to contain new information. Therefore, the larger the number of messages sent by a normal player, the larger the entropy of the player calculated from the word frequency of the player.
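The contrast between the two kinds of player can be illustrated with a small sketch. It assumes character-level frequencies and base-2 logarithms; the function names and the sample messages are hypothetical and are not taken from the patent.

```python
import math
from collections import Counter


def char_entropy(text):
    """Base-2 Shannon entropy of the character frequencies in `text`."""
    total = len(text)
    if total == 0:
        return 0.0
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_growth(messages):
    """Entropy of the first k messages taken together, for k = 1..len(messages)."""
    return [char_entropy("".join(messages[:k])) for k in range(1, len(messages) + 1)]


# Hypothetical sample data: an advertiser repeating one message,
# and a normal player whose messages differ from each other.
ad_like = ["buy gold now"] * 5
varied = ["hello", "anyone up for a raid?", "nice shot!", "brb 5 min", "gg wp"]

ad_curve = entropy_growth(ad_like)
varied_curve = entropy_growth(varied)
```

Repeating the same message leaves the character distribution unchanged, so `ad_curve` stays flat, while `varied_curve` ends higher than it starts because new characters keep appearing.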

The following describes an advertisement library establishment method and an advertisement data identification method using entropy in accordance with an embodiment of the present invention, taking classification of game player data as an example.

Example 1

The embodiment provides a method for establishing an advertisement library and identifying advertisement information based on information entropy. The method may be implemented by advertisement recognition software running on a computer processor, such as a game plug-in, but is not limited thereto. The method can be implemented on the user terminal side or on the server side. In this embodiment, for each game player, each time a predetermined number (e.g., 10) of messages (also referred to as corpora or sentences) has been collected, the information entropy (entropy value for short) of the player is calculated. If the entropy value of the player is smaller than a predetermined value (such as 3.5), the messages of the player are preliminarily judged to be advertisement messages, that is, the player is an abnormal player (such as an advertisement player); an advertisement trunk is extracted from the sentences of the player and added to the online advertisement library, thereby building up the advertisement library; once the predetermined number of chat sentences has been judged, the chat sentences of the player are cleared and collection of 10 sentences starts again. If the entropy value of the player is larger than 3.5, no processing is performed; the chat sentences of the player are cleared and 10 sentences are collected again. This process repeats. As shown in Fig. 1, the method for establishing an advertisement library and identifying an advertisement of the present embodiment includes the following steps:

Steps S110-S120: receive the corpora, and store a first number (N) of corpora according to the user granularity.

The stored corpus typically includes information such as message content and user ID.

In the embodiment of the invention, N corpora are received at user granularity, that is, the N corpora are stored with the user as the unit. As an example, N may be set to 10, but is not limited thereto, and may be another larger or smaller integer.

The text-similarity-based method for establishing an advertisement library currently adopted in the industry is limited by its sliding window, and is also limited when messages from multiple servers enter the same chat monitor: the messages of the multiple servers enter one sliding window, so the original per-server chat order is disturbed.

In the embodiment of the invention, the information entropy is computed over chat corpora collected per user, so the order is not disturbed.

In addition, other current advertisement identification methods do not classify advertisements at user granularity, so the messages of a user who sends one message every few seconds are diluted by messages sent by other users. In the embodiment of the invention, the chat corpora are counted per user and are therefore not diluted.

In step S130, after N (e.g., 10) corpora of the same user have been stored, the entropy of these corpora can be calculated by the following formula:

$$H = -\sum_i p_i \log p_i \tag{2}$$

In this formula, i denotes the ith character in the corpus, and p_i is the probability of occurrence of the ith character.

In a preferred embodiment, the logarithm in the above formula is taken to base 2, so that the formula becomes:

$$H = -\sum_i p_i \log_2 p_i \tag{3}$$

Here, base 2 is merely an example; other logarithm bases, with their corresponding units, may be used, and the results can be converted with the change-of-base formula.
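The entropy calculation of step S130 can be sketched as follows. This is a minimal illustration, assuming that p_i is estimated as the relative frequency of each distinct character across the stored corpora of one user; the function name `char_entropy` and its `base` parameter are hypothetical.

```python
import math
from collections import Counter


def char_entropy(messages, base=2.0):
    """Shannon entropy of the character distribution over a user's stored messages.

    p_i is estimated as count(character i) / total number of characters,
    which is an assumption about how the probabilities are obtained.
    """
    counts = Counter(ch for msg in messages for ch in msg)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total, base) for n in counts.values())
```

Passing `base=math.e` instead of the default yields the same quantity in nats rather than bits.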

The following describes the calculation of the information entropy by taking only 4 corpora as an example.

Example:

Player 1 sends the following messages:

the first sentence: yime Wenxuede first Chonghao gift

The second sentence is: add my Wenxideo rechargeable good gift

The third sentence: add my little letter and get will gift bag

The fourth sentence: add my Wenxin head-filling gift

The probabilities of the characters in the corpus of player 1 are shown in table 1.

TABLE 1 probability of character occurrence in Player 1's corpus:

H = 0.0845 can be calculated from equation (3). For more corpora, the same method can be used for calculation.

In step S140, the entropy value H obtained by the above calculation is evaluated. If the entropy value is smaller than the predetermined first threshold, the stored corpora of the user are determined to be advertisements and the method proceeds to step S150; otherwise step S180 is performed.

For a normal player, the more messages are sent, the larger the entropy calculated from the word frequency of the player. For an abnormal player, the entropy grows only slightly as the number of messages increases and tends toward a certain fixed value.

In the case where the logarithm is base 2, N =10, the first threshold may be set to 3.5, where 3.5 is a reasonable value selected based on actual test results. If the base of the logarithm and/or the number of corpus entries N vary, the first threshold value is adjusted to a different value. Accordingly, the value of 3.5 is merely exemplary and may be adjusted to other reasonable values as desired.

In step S150, the stored corpus of the current user or the corpus stem extracted from the corpus is put into the advertisement library.

Preferably, the embodiment of the invention extracts the main stem from the corpus of the current user and puts the extracted main stem into the advertisement library, thereby quickly and automatically establishing the advertisement library without manually marking and screening data.

Placing corpus trunks rather than whole corpora in the advertisement library reduces the storage the library occupies. A corpus trunk can be extracted as follows: the characters common to all corpora of the user stored in the current batch are extracted and used as the trunk; or characters whose frequency is lower than a preset threshold are deleted from the N corpora to obtain the sentence trunk.
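Both extraction options can be sketched in a few lines. The sketch below assumes character-level processing; the function names, the `min_count` parameter, and the choice to keep characters in the order of the first message are illustrative assumptions, not details given in the text.

```python
from collections import Counter


def stem_common_chars(messages):
    """Option 1: keep only characters that occur in every stored message.

    Characters are emitted in the order they appear in the first message.
    """
    if not messages:
        return ""
    common = set(messages[0]).intersection(*map(set, messages[1:]))
    return "".join(ch for ch in messages[0] if ch in common)


def stem_drop_rare(message, messages, min_count):
    """Option 2: delete characters whose total count across the batch is below min_count."""
    counts = Counter(ch for m in messages for ch in m)
    return "".join(ch for ch in message if counts[ch] >= min_count)
```

With a batch such as `["add me 123 for gold", "add me 456 for gold", "add me 789 for gold"]`, both functions drop the varying digits and keep the shared backbone.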

N trunks can be obtained from the N corpora, and all N trunks can be placed in the advertisement library. Preferably, however, to prevent excessive redundant data from appearing in the advertisement library, the similarity between every two of the N trunks may be calculated, and one sentence trunk may be selected and placed into the advertisement library on the basis of the similarity and the frequency of occurrence of the sentence trunks; alternatively, the corpus trunk that occurs most often among all corpus trunks of the user stored in the current batch may be selected. A method of obtaining a sentence trunk is also disclosed in Chinese patent application No. 201710980185.0, entitled "an advertisement recognition method and computer readable storage medium", filed by the present applicant on October 19, 2017, the contents of which are hereby incorporated by reference in their entirety as if fully set forth herein. Other ways of obtaining the corpus trunk may also be chosen and are not listed here.
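One possible way to pick a single trunk from the N candidates is sketched below. It ranks candidates first by how often they occur and then by their average pairwise similarity to the other trunks. The ranking rule, the function name, and the use of the standard-library `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration; the text does not fix these details.

```python
from collections import Counter
from difflib import SequenceMatcher


def pick_representative(stems):
    """Return one trunk from a non-empty list of candidate trunks.

    Ranking: occurrence count first, then average similarity to all candidates.
    """
    counts = Counter(stems)

    def score(stem):
        avg_sim = sum(SequenceMatcher(None, stem, other).ratio() for other in stems) / len(stems)
        return (counts[stem], avg_sim)

    return max(counts, key=score)
```

Storing only the selected trunk keeps near-duplicate entries out of the library, at the cost of N×N similarity computations per batch, which is small for N = 10.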

In step S180, the stored corpus is emptied, and then the process returns to step S110 again.
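Steps S110 to S180 can be put together in a compact sketch. It assumes character-level base-2 entropy and, for brevity, stores whole corpora rather than extracted trunks in the library; the class name `EntropyModel`, its attributes, and the return convention of `receive` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_entropy(messages):
    """Base-2 entropy of the character frequencies across `messages`."""
    counts = Counter(ch for msg in messages for ch in msg)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


class EntropyModel:
    def __init__(self, n=10, threshold=3.5):
        self.n = n                          # corpora buffered per user (N)
        self.threshold = threshold          # first threshold on the entropy value
        self.buffers = defaultdict(list)    # user ID -> stored corpora
        self.ad_library = []                # corpora judged to be advertisements

    def receive(self, user_id, message):
        """Buffer one message; evaluate the batch once n messages are stored.

        Returns the entropy of the batch when one was evaluated, otherwise None.
        """
        buf = self.buffers[user_id]         # S110-S120: store per user
        buf.append(message)
        if len(buf) < self.n:
            return None
        h = char_entropy(buf)               # S130: entropy of the N corpora
        if h < self.threshold:              # S140: compare with the threshold
            self.ad_library.extend(buf)     # S150: add to the advertisement library
        buf.clear()                         # S180: empty the stored corpora
        return h
```

Because each user has an independent buffer, calls for different users do not interact, which is what allows the per-user processing to run in parallel.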

In the embodiment of the present invention, the process of establishing an advertisement library and identifying advertisement information based on information entropy as shown in fig. 1 is also referred to as an entropy model in the present invention.

In the embodiment of the present invention, since the processing of the entropy model shown in fig. 1 is performed according to the user granularity, the above steps can be performed in parallel for different users.

In the embodiment of the present invention, when it is determined that the entropy H is smaller than the first threshold, the terminal or the server implementing the method shown in fig. 1 may also identify that the received corpus is an advertisement, and may also identify that the user is an advertisement user correspondingly.

At this time, the terminal or the server may also directly mask the advertisement based on the recognition result, or mask the message of the advertisement user.

Alternatively, in an embodiment of the present invention, the method may further include: the terminal or server may determine a probability that the user is an advertising user based on the entropy value H and update the probability that the current user is an advertising user in the advertising probability library based on the currently determined probability. The probability that the user to which the entropy value H corresponds is an advertising user may be determined based on the test result.

After corpus experiment tests of a large number of players, entropy values of normal users are found to be above 3.5, entropy values of advertising users are below 3.5, and if entropy of a user is below 3.5, the probability that the user is an advertising user can be set to be 0.85 (wherein 3.5 is obtained by experiment data, but is not limited to 3.5). Different entropy values may correspond to a probability value, and the invention is not limited in this respect.

For example, if the user ID is not in the advertisement user probability library, or the probability recorded for the user ID in the advertisement user probability library is 0, the currently determined probability is recorded as the advertisement user probability of the user in the advertisement user probability library; if the user ID exists in the advertisement user probability library and the probability recorded for the user ID is not 0, the advertisement user probability of the user ID is updated based on the following formula: updated advertisement user probability = (original advertisement user probability × original update times + currently determined advertisement user probability)/(original update times + 1).

In this way, when the updated advertiser probability for a user is above a predetermined value, the server may add the user to a blacklist, i.e., block all messages for the user.
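The update rule amounts to a running mean of the per-batch probabilities, read as (original probability × original update times + current probability)/(original update times + 1). A minimal sketch follows, assuming the probability library is a dictionary mapping user ID to a (probability, update count) pair and that the first recorded value counts as one update; the function names and the `limit` parameter are hypothetical.

```python
def update_ad_probability(library, user_id, current_prob):
    """Update and return the advertisement user probability for `user_id`.

    `library` maps user ID -> (probability, update_count).
    """
    prob, count = library.get(user_id, (0.0, 0))
    if prob == 0:
        # Unknown user, or recorded probability is 0: record the current value.
        library[user_id] = (current_prob, 1)
    else:
        # Running mean over all batches seen so far.
        library[user_id] = ((prob * count + current_prob) / (count + 1), count + 1)
    return library[user_id][0]


def should_blacklist(library, user_id, limit):
    """True when the stored probability is above the predetermined value `limit`."""
    return library.get(user_id, (0.0, 0))[0] > limit
```

Averaging over batches means that a single low-entropy batch from an otherwise normal user does not immediately push that user over the blacklist limit.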

Based on the method, the entropy model is adopted to establish the advertisement library and identify the advertisement data, so that a large amount of data can be mined quickly and accurately without manually marking and screening the data, the advertisement data needing to be identified is screened, and the advertisement library can be established automatically and quickly. In addition, since the language material of the chat is counted in units of users, the language material is not diluted, so that the advertisement sent every few seconds can be identified.

Example 2

This embodiment provides another method for establishing an advertisement library and identifying advertisements. As before, an entropy model is used to identify advertisement users and to establish the advertisement library; in addition, a thread that performs finer advertisement identification based on the advertisement library (hereinafter referred to as the advertisement identification thread) runs in parallel with the thread that executes the entropy model. As shown in fig. 2, the method comprises the steps of:

Steps S110 to S120: receive the corpora and store a first number (N) of corpora according to the user granularity, that is, store the N corpora with the user as the unit.

Step S130: calculate the entropy of the N corpora.

Step S140: if the entropy is smaller than the predetermined first threshold, determine that the stored corpora of the user are advertisements and proceed to step S150; otherwise proceed to step S180.

Step S150: put the corpus trunks extracted from the stored corpora into the advertisement library.

As above, steps S110 to S150 may be the same as steps S110 to S150 in embodiment 1, and are not described again here.

Further, in step S160, a probability that the user is an advertising user may be determined based on the entropy value H.

The probability that the user corresponding to the entropy value H is an advertising user may be determined based on the test result.

Step S170, updating the probability that the current user is the advertising user in the advertising probability base based on the currently determined probability.

For example, if the user ID is not in the advertisement user probability library, or the probability recorded for the user ID in the advertisement user probability library is 0, the currently determined probability is recorded in the advertisement user probability library as the advertisement user probability of the user; if the user ID exists in the advertisement user probability library and the probability recorded for the user ID is not 0, the advertisement user probability of the user ID is updated based on the following formula: updated advertisement user probability = (original advertisement user probability × original update times + currently determined advertisement user probability)/(original update times + 1).

In addition to the above steps of the entropy model, the method of the present embodiment further includes a fine-grained advertisement identification thread which, as shown in fig. 2, includes:

Step S210: perform similarity matching between each received corpus and the corpora in the online advertisement library.

Here, the online advertisement library may be an advertisement library established based on the aforementioned entropy model. In the initial case, the online advertisement library is empty. At this time, if the advertisement library is empty, it is determined that the received corpus is a non-advertisement.

If the ad library is not empty, the similarity between the received corpus and the corpus in the ad library may be calculated based on fuzzy matching, for example. For example, if the similarity is greater than 0.8, the corpus is considered as an advertisement, otherwise, the corpus is considered as a non-advertisement. Here, the numerical value of 0.8 is merely an example, and the matching judgment criterion may also be set lower or higher based on the desired recognition accuracy.
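The matching rule of step S210 can be sketched as below. To keep the example free of third-party dependencies, it uses the standard-library `difflib.SequenceMatcher` as a stand-in similarity measure rather than fuzzywuzzy itself; the function name `is_advertisement` and the sample strings in the usage note are hypothetical.

```python
from difflib import SequenceMatcher


def is_advertisement(message, ad_library, threshold=0.8):
    """Return True when `message` is similar enough to any library entry.

    An empty library always yields False, matching the initial case in the text.
    Similarity is a ratio in [0, 1]; 0.8 is the example threshold.
    """
    if not ad_library:
        return False
    best = max(SequenceMatcher(None, message, ad).ratio() for ad in ad_library)
    return best > threshold
```

For instance, `is_advertisement("add me 123 for gold", ["add me 456 for gold"])` is true because only the digits differ, whereas an unrelated greeting falls well below the threshold.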

Preferably, before the similarity matching calculation, the emoticons in the corpus content can be removed, so that the interference of the emoticons on the corpus identification is eliminated.

Step S220: identify whether the received corpora are advertisements based on the similarity, and label them accordingly.

For corpora determined to be advertisements, they may be filtered (masked) out of the chat window, i.e., not displayed in the chat window.

In another embodiment of the present invention, the similarity matching calculation between the corpus and the corpus in the advertisement library may be performed as follows:

The received corpus and each corpus in the advertisement library are converted into one-hot vectors represented by word frequencies; the cosine distances between the vector of the received corpus (the user data sample) and the vectors of all data in the advertisement library are calculated to obtain a cosine list, and the values in the cosine list are taken as the similarities between the user data sample and the data in the advertisement library. This way of corpus similarity matching is described in detail in Chinese patent application No. 201710980185.0, entitled "an advertisement recognition method and computer readable storage medium", filed by the present applicant on October 19, 2017, the content of which is hereby incorporated by reference in its entirety as if fully set forth herein.
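A frequency-vector cosine comparison of this kind can be sketched as follows. The sketch treats each character as a token, which is an assumption (the referenced application may tokenize differently), and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between the character-frequency vectors of `a` and `b`."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[ch] * vb[ch] for ch in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def cosine_list(message, ad_library):
    """Similarity of `message` to every entry in the advertisement library."""
    return [cosine_similarity(message, ad) for ad in ad_library]
```

The largest value in the returned list can then be compared with a matching threshold in the same way as in the fuzzy-matching variant.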

The invention can capture the advertisement information from the information sent by each user by adopting an unsupervised mode of information entropy, thereby rapidly and accurately mining a large amount of data without manually marking and screening the data, screening the advertisement data needing to be identified and automatically and rapidly establishing an advertisement library.

The advertisement library establishing and advertisement identifying method is not limited by a sliding window and is not limited by a plurality of servers entering one chat monitor. In addition, the method can better identify whether the user sending the message is an advertising user or not by counting the chat information by taking the user as a unit and not diluting the chat information by the information sent by other users, and can effectively capture the advertising message no matter whether the user sends the message frequently or sends the message according to a longer time interval.

The method steps of the invention are not limited to the order of execution shown in the figures, some steps may be permuted or even performed in parallel.

Portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following technologies, which are well known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

The logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.

Features that are described and/or illustrated above with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

Claims

1. An advertisement library establishing method is characterized by comprising the following steps:

calculating the information entropy of the stored first amount of corpora; and

when the numerical value of the calculated information entropy is lower than a preset threshold value, putting the corpus or the trunk of the corpus into an advertisement library;

and, the method further comprises:

determining the probability that the user is an advertisement type user based on the calculated information entropy;

updating the advertisement user probability of the corresponding user in the advertisement user probability library based on the probability of the currently determined advertisement type user;

wherein the step of updating the advertisement user probabilities for respective users in the user advertisement probability library based on the currently determined probabilities for users of the advertisement types comprises:

if the probability of the user ID recorded in the advertisement user probability library is not 0, updating the advertisement probability of the user ID based on a predefined formula;

if the user ID is not in the advertiser probability database or the probability of the user ID recorded in the advertiser probability database is 0, then the currently determined probability is recorded in the advertiser probability database as the advertiser probability of the user.

2. The method of claim 1, wherein the information entropy of the first number of corpora is calculated using the following formula:

$$H = -\sum_i p_i \log p_i;$$

wherein i is the ith character in the corpus, and p_i is the probability of the ith character occurring.

3. The method according to claim 1 or 2, characterized in that the method further comprises:

and for each received corpus, performing similarity matching with the corpus in the advertisement library, and identifying whether the corpus is the advertisement or not based on a matching result.

4. The method of claim 3, wherein the similarity matching step comprises:

similarity of the received corpus and the corpus in the advertisement library is calculated based on fuzzy matching of fuzzywuzzy.

5. The method of claim 3, further comprising: and filtering out the linguistic data determined as the advertisements in the chat window.

6. The method of claim 1, wherein the predefined formula further indicates an updated advertiser probability = (original advertiser probability × original update times + currently determined advertiser probability)/(original update times + 1).

7. An advertisement data identification method, characterized in that the method comprises the steps of:

calculating the information entropy of the stored first amount of corpora; and

identifying whether the corpus is an advertisement type corpus and/or identifying whether the user is an advertisement type user based on the calculated numerical value of the information entropy;

and, the method further comprises:

if the probability recorded by the user ID in the advertisement user probability library is not 0, updating the user ID advertisement probability based on a predefined formula;

8. The method of claim 7, further comprising: and when the calculated numerical value of the information entropy is lower than a preset threshold value, putting the corpus or the trunk of the corpus into an advertisement library.

9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.