CN109670108B - Information filtering method and device - Google Patents


Publication number
CN109670108B
Authority
CN
China
Prior art keywords
registered
account information
account
determining
likelihood
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811523727.2A
Other languages
Chinese (zh)
Other versions
CN109670108A (en
Inventor
林述民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Advanced New Technologies Co Ltd
Application filed by Advanced New Technologies Co Ltd filed Critical Advanced New Technologies Co Ltd
Priority to CN201811523727.2A priority Critical patent/CN109670108B/en
Publication of CN109670108A publication Critical patent/CN109670108A/en
Publication of CN109670108B publication Critical patent/CN109670108B/en

Abstract

The embodiment of the specification discloses an information filtering method and device. The scheme comprises the following steps: acquiring a registration request; the registration request comprises account information to be registered; determining characters contained in the account information to be registered; judging whether the account information to be registered is a junk account or not according to the characters; and refusing to register the account information to be registered when the account information to be registered is the junk account.

Description

Information filtering method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an information filtering method and apparatus.
Background
With the development of information technology, websites have been able to provide users with very rich network services, and users often need to register their account information on websites in order to obtain more comprehensive services.
Currently, large numbers of junk accounts are generated automatically by machines. These accounts are not used normally; they are typically exploited to obtain additional network service resources. The existence of a large number of junk accounts not only occupies the resources of network service providers; when too many junk accounts are maliciously exploited, it also seriously degrades the network services available to other users and causes network service resources to be unevenly distributed.
In the prior art, junk accounts are generally filtered in one of two ways: address information filtering and network behavior filtering. Address information filtering mainly works as follows: when a large number of accounts are registered within a short time from the same media access control (MAC) address or internet protocol (IP) address, those accounts are determined to be automatically generated junk accounts and are filtered out. Network behavior filtering is mainly implemented as follows: after a registered account logs in, the network behavior of the account is monitored, whether the account is a junk account is judged from the monitored behavior, and corresponding filtering is performed.
However, once the MAC address or IP address of the device registering the account information is modified, address information filtering fails, so this method has a high miss rate. Network behavior filtering, for its part, runs only after the junk account has already been registered, and monitoring the network behavior of accounts consumes considerable resources, so this method is inefficient.
Disclosure of Invention
The embodiment of the application provides an information filtering method and device, which are used for solving the problems of poor filtering accuracy and low efficiency of account information.
The information filtering method provided by the embodiment of the application comprises the following steps:
receiving account information to be registered;
determining a likelihood characterization value of the account information to be registered as a junk account according to characters contained in the account information to be registered;
and refusing to register the account information to be registered when the likelihood representation value is larger than a preset threshold value.
An information filtering device provided in an embodiment of the present application includes: the device comprises a receiving module, a characterization value module and a filtering processing module, wherein,
the receiving module is used for receiving account information to be registered;
the characterization value module is used for determining the possibility characterization value of the account information to be registered as a garbage account according to the characters contained in the account information to be registered;
and the filtering processing module is used for refusing to register the account information to be registered when the likelihood representation value is larger than a preset threshold value.
The embodiments of the application provide an information filtering method and device that receive account information to be registered, determine, according to the characters contained in the account information to be registered, a likelihood characterization value of the account information being a junk account, and refuse to register the account information when the likelihood characterization value is larger than a preset threshold. With this method, the likelihood characterization value intuitively reflects how likely the account information to be registered is to be a junk account, so comparing the likelihood characterization value with the preset threshold makes it possible to accurately judge whether the account information is a junk account or a normal account.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
Fig. 1 is a schematic diagram of an information filtering process provided in an embodiment of the present application;
Fig. 2 is a flowchart of the information filtering process provided in an embodiment of the present application in a specific application scenario;
Fig. 3 is a schematic structural diagram of an information filtering device provided in an embodiment of the present application.
Detailed Description
To make the purposes, technical solutions and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to specific embodiments of the present application and the corresponding drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art from the embodiments of the present application without creative effort fall within the scope of protection of the present application.
Fig. 1 is a flowchart of an information filtering process provided in an embodiment of the present application, where the process specifically includes the following steps:
S101, receiving account information to be registered.
The account information to be registered described in the embodiment of the present application includes, but is not limited to: user names containing English letters, such as e-mail address information.
In the prior art, account information can generally be judged only after it has been registered. Even if the server adopts address information filtering, junk accounts cannot be filtered in time: when a device registering account information uses a new MAC address or IP address, a certain number of junk accounts can still be registered. Only after the server observes that a large number of successfully registered accounts come from the same MAC address or IP address does it prohibit that address from registering again, and by then a certain number of junk accounts have already been registered successfully.
Therefore, in the embodiment of the present application, to prevent junk accounts from being registered successfully, the server filters the account information to be registered during the registration process. After receiving the account information to be registered in step S101, the server determines the likelihood that it is a junk account, as described in step S102.
S102, determining a likelihood characterization value of the account information to be registered as a garbage account according to characters contained in the account information to be registered.
Normal account information to be registered serves as a unique user identifier. The characters it contains can be combined in a large number of ways (for example, combinations of upper- and lower-case letters, digits, symbols and other characters), and these combinations form character strings of varying length. Some of the character strings may look irregular, yet they may still be unique combinations that a user designed to avoid duplicating an existing account name. For example: in account information with the character string "LXF1989", the three English letters "LXF" are likely the pinyin initials of the user's name, and the number "1989" is likely the user's year of birth; in account information with the character string "Sylvia11", the English word "Sylvia" is likely the user's English name, and the number "11" may have been added by the user to avoid colliding with the account information of other users named Sylvia. It can be seen that, for normal account information to be registered, the character strings have corresponding meanings.
However, for account information to be registered that is generated automatically by a device (a junk account), the device generally sets the account information to a longer, randomly combined character string to ensure that registration succeeds (that is, to ensure that the account information is unique), for example: "jvhjvhb", "zjbvb", etc. The character strings of these junk accounts are neither pinyin abbreviations of a user's name nor English words; that is, they are meaningless. Account information consisting of such a character string is therefore more likely to be a junk account. To represent this likelihood intuitively, the embodiment of the present application quantifies it with a likelihood characterization value: the higher the likelihood characterization value, the more likely the account information to be registered is a junk account; conversely, the lower the value, the less likely.
S103, judging whether the determined likelihood characterization value is larger than a preset threshold; if so, executing step S104, otherwise executing step S105.
S104, refusing to register the account information to be registered.
S105, registering the account information to be registered.
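Steps S102 to S105 amount to a single comparison against the preset threshold. The following is a minimal sketch of that control flow only; the function name `handle_registration` and the `score` callback are hypothetical, and how the likelihood characterization value is actually computed is left to the caller.

```python
from typing import Callable


def handle_registration(account: str,
                        score: Callable[[str], float],
                        threshold: float) -> bool:
    """Return True if the account is registered (S105), False if refused (S104)."""
    likelihood = score(account)   # S102: likelihood characterization value
    if likelihood > threshold:    # S103: strictly greater than the threshold
        return False              # S104: refuse to register
    return True                   # S105: register
```

Note that an account whose value equals the threshold is registered, because step S103 only refuses values larger than the threshold.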
In this embodiment of the present application, the preset threshold may be set as required. Specifically, the likelihood characterization value of each junk account may be determined in advance from the characters contained in registered account information that has been confirmed to be a junk account, and the minimum of these likelihood characterization values may then be set as the preset threshold. The server may determine whether registered account information is a junk account in various ways, such as the network behavior filtering and address information filtering of the prior art, which is not limited in the present application.
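The threshold rule described above (take the minimum likelihood characterization value over accounts already confirmed as junk) can be sketched as follows. The function name `preset_threshold` and the `score` callback are hypothetical; any scoring function could be passed in.

```python
from typing import Callable, Iterable


def preset_threshold(confirmed_junk: Iterable[str],
                     score: Callable[[str], float]) -> float:
    """Minimum likelihood characterization value among confirmed junk accounts."""
    values = [score(account) for account in confirmed_junk]
    if not values:
        raise ValueError("at least one confirmed junk account is required")
    return min(values)
```

Because the comparison in step S103 is strict, a new account scoring exactly this minimum would still be registered; the text does not say whether that boundary case matters in practice.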
If the likelihood characterization value of the account information to be registered determined in step S102 is greater than the preset threshold, it indicates that the account information to be registered is likely to be a spam account, so the server refuses to register the account information to be registered, and if the likelihood characterization value determined in step S102 is not greater than the preset threshold, it indicates that the account information to be registered is not a spam account, and the server may directly register the account information to be registered.
With this method, account information is filtered during registration: before the account information is registered, the server can judge whether it is a junk account and promptly refuse to register account information judged to be a junk account. There is therefore no need to register the account first and then spend a large amount of resources monitoring its network behavior, which greatly saves server resources and improves filtering efficiency. Moreover, the method shown in Fig. 1 judges whether the account information to be registered is a junk account from its likelihood characterization value and does not depend on the address of the device initiating the registration, so it can still accurately filter junk accounts even if that address is modified.
As can be seen from the method shown in fig. 1, the basis for determining whether the account information to be registered is the garbage account in the present application is: judging whether character strings formed by the characters are character strings with certain meanings according to the characters contained in the account information to be registered, if so, the character strings can be called ideographic character strings so as to determine that the account information to be registered is not a junk account, otherwise, the character strings can be called random character strings so as to determine that the account information to be registered is a junk account. Therefore, in step S102 shown in fig. 1, when the server determines the likelihood representation value according to the characters included in the account information to be registered, the likelihood that the character string formed by the characters is an ideographic character string may be analyzed according to the characters included in the account information to be registered, so as to determine the likelihood representation value of the account information to be registered as a spam account. The higher the likelihood that the string is an ideographic string, the smaller the likelihood that the account information to be registered is a spam account, whereas the lower the likelihood that the string is an ideographic string, the greater the likelihood that the account information to be registered is a spam account. That is, the likelihood that the account information to be registered is a spam account has a value that is inversely proportional to the likelihood that the string is an ideographic string.
However, in the practical application scenario, the character strings corresponding to the account information to be registered generally have uniqueness, so when the possibility that the character strings formed by the characters contained in the account information to be registered are ideographic character strings is analyzed, the possibility cannot be accurately analyzed directly according to the complete character strings in the account information to be registered, and the possibility representation value cannot be accurately determined. In order to accurately determine the likelihood representation value, in step S102 shown in fig. 1, the server may first segment the characters included in the account information to be registered to obtain each judgment word, and then determine, according to each judgment word, the likelihood representation value of the account information to be registered as a garbage account. That is, the possibility that the judgment words are ideographic character strings can be determined according to the judgment words obtained after word segmentation, so that the possibility representation value of the account information to be registered is determined.
Specifically, when the characters contained in the account information to be registered are segmented, the segmentation can be performed according to the N-gram language model, that is, the server can select continuous characters with preset number from the characters contained in the account information to be registered according to preset number, and the character string formed by the selected characters is used as the obtained judgment word.
The N-gram language model divides continuous N characters contained in a certain message into a character string, N is the number of characters contained in the character string to be divided, namely the preset number, and the divided character string is the judgment word.
For example: in the case of a 3-gram (i.e. the preset number is 3), assuming that the characters contained in the account information to be registered are "acbed", the server may select 3 consecutive characters from "acbed" to form a character string. There are three ways to make this selection, and the resulting character strings are "acb", "cbe" and "bed". These 3 character strings are the 3 judgment words obtained after word segmentation.
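The sliding-window selection in this example can be written in a few lines. This is only an illustrative sketch; the function name `judgment_words` is hypothetical. It uses the five-character string "acbed", whose 3-character windows are the three judgment words listed above.

```python
from typing import List


def judgment_words(chars: str, n: int) -> List[str]:
    """Return every run of n consecutive characters in chars, left to right."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # A string shorter than n yields no judgment words.
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


words = judgment_words("acbed", 3)
```

A string of m characters yields m - N + 1 judgment words, so "acbed" with N = 3 yields 3.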
The preset number may be set as needed, for example, according to an average length of the ideographic character strings included in the account information that has been determined to be the normal account.
In addition, in practical application scenarios, symbol characters in account information generally only act as separators or have no meaning at all, and digit characters generally represent the user's date of birth or some other code, whereas letter characters can represent various meanings such as the user's name, the initials of the name, or an English name. The meaning represented by letter characters is thus finer and more accurate than that represented by symbol and digit characters; that is, the likelihood that a character string is an ideographic character string can be analyzed more accurately from its letter characters. Therefore, in the embodiment of the application, when the characters contained in the account information to be registered are segmented, the characters of a specified type may first be extracted from the account information, and the extracted characters are then segmented. The specified type includes the letter type.
That is, the server may extract the letter type characters in the account information to be registered, then select, according to the preset number, a continuous, preset number of characters from the extracted letter type characters, and use a character string formed by the selected characters as the obtained judgment word. Thus, each obtained judgment word is a character string formed by the characters of the letter type, and the possibility that each judgment word is an ideographic character string can be accurately determined later, so that the possibility representation value can be accurately determined.
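A minimal sketch of letter extraction followed by segmentation is given below. The function names are hypothetical, and two choices are assumptions rather than statements from the text: only ASCII letters are treated as the letter type, and case is preserved.

```python
from typing import List


def extract_letters(account: str) -> str:
    """Keep only ASCII letters, dropping digits and symbols (assumed letter type)."""
    return "".join(ch for ch in account if ch.isascii() and ch.isalpha())


def letter_judgment_words(account: str, n: int) -> List[str]:
    """Segment the extracted letters into runs of n consecutive characters."""
    letters = extract_letters(account)
    return [letters[i:i + n] for i in range(len(letters) - n + 1)]
```

For instance, `extract_letters("LXF1989")` keeps only "LXF", and `letter_judgment_words("Sylvia11", 3)` segments "Sylvia" while ignoring the trailing digits.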
Further, after the account information to be registered is segmented to obtain judgment words, the possibility that each judgment word is an ideographic character string can be analyzed according to a large amount of account information which is stored in the server and is determined to be a normal account, so that the possibility representation value of the account information to be registered is determined.
Specifically, the more times a judgment word appears in account information that has been determined to be normal, the more likely the judgment word is an ideographic character string, and the smaller the likelihood characterization value of the account information to be registered being a junk account. Therefore, in this embodiment of the application, after the server segments the account information to be registered and obtains the judgment words, it may determine the likelihood characterization value as follows: for each obtained judgment word, determine the number of times the judgment word appears in the predetermined normal account information, and then determine the likelihood characterization value of the account information to be registered being a junk account according to the numbers of times determined for the judgment words, wherein the likelihood characterization value is inversely proportional to the number of times determined for each judgment word.
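The counting step can be sketched with a frequency table built once from the normal accounts. The names below are hypothetical, and the sketch assumes that "number of times" means total occurrences across all normal accounts, counted with multiplicity; the text does not state whether repeated occurrences inside one account count once or several times.

```python
from collections import Counter
from typing import Iterable, List


def build_counts(normal_accounts: Iterable[str], n: int) -> Counter:
    """Count every run of n consecutive characters across the normal accounts."""
    counts: Counter = Counter()
    for account in normal_accounts:
        counts.update(account[i:i + n] for i in range(len(account) - n + 1))
    return counts


def occurrence_counts(words: List[str], counts: Counter) -> List[int]:
    """Look up how often each judgment word occurs; unseen words count as 0."""
    return [counts[w] for w in words]


counts = build_counts(["sylvia", "sylvan", "olivia"], 3)
tf = occurrence_counts(["syl", "via", "zjb"], counts)
```

Here "syl" and "via" each occur in two of the sample accounts, while the random-looking "zjb" never occurs, which is the signal the method relies on.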
Continuing the above example, after the characters "acbed" contained in the account information to be registered are segmented in the 3-gram case, assume that the numbers of times the 3 judgment words appear in the normal account information are $tf_1$, $tf_2$ and $tf_3$ respectively. The higher the values of $tf_1$ to $tf_3$, the more likely "acbed" is an ideographic character string, that is, the less likely the account information to be registered is a junk account. Therefore, in the embodiment of the present application, the number of times each judgment word appears in the normal account information is used to reflect the likelihood that the judgment word is an ideographic character string, and from this the likelihood characterization value of the account information to be registered being a junk account is determined.
Further, the number of times the judgment words appear in the normal account information is not sufficient by itself to accurately determine the likelihood characterization value, for the following reason: in practical application scenarios, a large amount of registered account information already exists in the server, and almost all account information with a small number of characters has already been registered. A device that automatically registers account information therefore tends to generate account information with a large number of characters so that it is unique. In other words, the more characters the account information to be registered contains, the more likely it is a junk account, so the likelihood characterization value is also related to the number of characters contained in the account information to be registered.
Therefore, in the embodiment of the present application, the method for determining, according to the number of times determined for each judgment word, the likelihood characterization value of the account information to be registered as the spam account may specifically be: and determining a likelihood representation value of the account information to be registered as a garbage account according to the times determined for each judgment word and the number of characters contained in the account information to be registered, wherein the likelihood representation value is in direct proportion to the number of characters contained in the account information to be registered.
In combination with the above methods, in the embodiment of the present application, a formula (not reproduced in this text) may be used to accurately determine the likelihood characterization value of the account information to be registered being a junk account.
S is a possibility representation value of the account information to be registered as the garbage account.
$tf_i$ is the number of times that the i-th judgment word, obtained by segmenting the characters contained in the account information to be registered, appears in the predetermined normal account information, where i = 1, 2, ..., k, and k is the number of judgment words obtained after the characters contained in the account information to be registered are segmented.
a is a preset length penalty coefficient, b is a preset short compensation value, and a and b are constants larger than 0. The length penalty factor a is typically less than 1, for example: a=0.2.
x and y are preset constants greater than 0, for example, x takes on a value of 10 and y takes on a value of 0.2.
n is the number of characters contained in the account information to be registered.
N is the number of characters contained in each judgment word, wherein the number of characters contained in each judgment word is the same. For example: in the case of a 3-gram, N = 3.
h is a preset integer with N > h > 0. h may be N - 1; for example, where N = 3, h = 2.
The short degree compensation value b can play a role in compensating the number of judgment words, so that the whole calculation result is maintained at a relatively balanced numerical level. For the short compensation value b, generally, according to all the characters in each registered account information, all the character strings formed by N characters may be traversed, the average value of the number of times of occurrence of the character strings in each predetermined normal account information is determined, and the value of the short compensation value b is set to be 5-10 times of the average value, so as to maintain the overall calculation result at a relatively uniform numerical level, for example, the value of b may be 50.
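One way to read the rule for b is sketched below: average the occurrence counts over all distinct N-character strings, then scale the average by a factor between 5 and 10. The function name and the `multiplier` parameter are hypothetical, and the sketch assumes the set of N-character strings is the set of keys of a frequency table built from the normal accounts, which is an interpretation rather than something the text states.

```python
from typing import Mapping


def short_compensation(counts: Mapping[str, int], multiplier: float = 5.0) -> float:
    """Return multiplier times the average occurrence count per distinct string."""
    if not counts:
        raise ValueError("counts must contain at least one character string")
    if not 5.0 <= multiplier <= 10.0:
        raise ValueError("the text suggests a multiplier between 5 and 10")
    average = sum(counts.values()) / len(counts)
    return multiplier * average


b = short_compensation({"syl": 2, "ylv": 2, "via": 2}, multiplier=5.0)
```

With an average count of 2 and a multiplier of 5, the sketch gives b = 10; on real data the average would be taken over a far larger table.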
Under the condition that the parameters are given, the formulas in the above examples are directly adopted to carry out actual measurement on the account information, and the possibility characterization values shown in the table 1 are obtained:
TABLE 1
In Table 1, the account information to be registered with serial numbers 1 to 5 closely resembles junk accounts, while that with serial numbers 6 to 10 resembles normal accounts; the likelihood characterization value of each entry is obtained by the above formula. The likelihood characterization values of the entries with serial numbers 6 to 10 do not exceed 0.5161457, and the minimum likelihood characterization value among the entries with serial numbers 1 to 5 is 1.3994009. Assuming that the preset threshold is 1, the likelihood characterization values of all 5 entries with serial numbers 1 to 5 are larger than the threshold, so these entries are judged to be junk accounts, while the entries with serial numbers 6 to 10 are not. Therefore, the above formula in the embodiment of the application can accurately determine the likelihood characterization value of the account information to be registered being a junk account, so that the account information to be registered can be accurately filtered.
As shown in fig. 2, the application of the information filtering method in the embodiment of the present application is as follows:
S201, the server receives the account information to be registered.
S202, the server extracts alphabetical characters in the account information to be registered.
S203, the server selects continuous characters with preset numbers from the extracted characters according to the preset numbers, and obtains each judgment word of the account information to be registered.
S204, the server determines the occurrence times of each judgment word in all the predetermined normal account information according to each obtained judgment word.
S205, determining the likelihood characterization value of the account information to be registered as a junk account according to the occurrence times of each judgment word in all the predetermined normal account information and the number of characters contained in the account information to be registered.
S206, the server judges whether the likelihood representation value of the account information to be registered is larger than a preset threshold value, if yes, step S207 is executed, and if not, step S208 is executed.
S207, the server refuses to register the account information to be registered.
S208, the server registers account information to be registered.
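Steps S201 to S208 can be strung together as in the sketch below. All names are hypothetical. Importantly, `stand_in_score` is not the formula of this application: it is a deliberately simple placeholder that only respects the two stated relations (the value falls as the occurrence counts rise and grows with the number of characters), and the threshold of 2.0 in the usage lines is arbitrary.

```python
from collections import Counter
from typing import Iterable, List


def letters(account: str) -> str:
    """S202: extract the letter characters (ASCII letters assumed)."""
    return "".join(ch for ch in account if ch.isascii() and ch.isalpha())


def ngrams(chars: str, n: int) -> List[str]:
    """S203: select every run of n consecutive characters."""
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]


def build_counts(normal_accounts: Iterable[str], n: int) -> Counter:
    """Frequency table of judgment words over the predetermined normal accounts."""
    counts: Counter = Counter()
    for account in normal_accounts:
        counts.update(ngrams(letters(account), n))
    return counts


def stand_in_score(account: str, counts: Counter, n: int) -> float:
    """S204-S205 with a placeholder score, NOT the application's formula."""
    total = sum(counts[w] for w in ngrams(letters(account), n))  # S204
    return len(account) / (1.0 + total)                          # S205 (stand-in)


def try_register(account: str, counts: Counter, n: int, threshold: float) -> bool:
    """S206-S208: refuse when the value exceeds the threshold, else register."""
    return not stand_in_score(account, counts, n) > threshold


counts = build_counts(["sylvia11", "sylvan", "olivia"], 3)
accepted = try_register("sylvia99", counts, 3, threshold=2.0)
refused = not try_register("zjbvbqk", counts, 3, threshold=2.0)
```

In this toy data the name-like account shares several judgment words with the normal accounts and stays under the threshold, while the random-looking one shares none and is refused.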
The above is the information filtering method provided in the embodiments of the present application. Based on the same concept, the embodiments of the present application further provide an information filtering device, as shown in Fig. 3.
The information filtering apparatus in fig. 3, provided in a terminal, includes: a receiving module 301, a characterization value module 302, and a filtering processing module 303, wherein,
the receiving module 301 is configured to receive account information to be registered.
The characterization value module 302 is configured to determine, according to characters included in the account information to be registered, a likelihood characterization value of the account information to be registered as a spam account.
The filtering processing module 303 is configured to reject registration of the account information to be registered when the likelihood characterization value is greater than a preset threshold.
The characterization value module 302 is specifically configured to segment the characters contained in the account information to be registered to obtain judgment words, and to determine, according to the judgment words, the likelihood characterization value of the account information to be registered as a spam account.
To obtain the judgment words, the characterization value module 302 is specifically configured to select, according to a preset number, that number of consecutive characters from the characters contained in the account information to be registered, and to use the character string formed by the selected characters as an obtained judgment word.
The characterization value module 302 is specifically configured to extract the characters of a specified type from the account information to be registered, and to segment the extracted characters.
The characterization value module 302 is specifically configured to determine, for each obtained judgment word, the number of times that the judgment word appears in each piece of predetermined normal account information, and to determine, according to the number of times determined for each judgment word, the likelihood characterization value of the account information to be registered as a spam account, where the likelihood characterization value is inversely proportional to the number of times determined for each judgment word.
The characterization value module 302 is specifically configured to determine, according to the number of times determined for each judgment word and the number of characters included in the account information to be registered, a likelihood characterization value of the account information to be registered as a spam account, where the likelihood characterization value is proportional to the number of characters included in the account information to be registered.
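The extraction and segmentation performed by the characterization value module 302 can be sketched as below. Two points are assumptions rather than statements from the text: the "specified type" is taken here to be ASCII letters, and the selection of consecutive characters is read as an overlapping sliding window (a non-overlapping split would also fit the wording). The function names are hypothetical.

```python
from typing import List


def extract_specified_chars(account: str) -> str:
    """Keep only characters of the specified type.

    Assumption: the specified type is ASCII letters, lowercased for matching.
    """
    return "".join(ch.lower() for ch in account if ch.isascii() and ch.isalpha())


def segment(chars: str, word_len: int) -> List[str]:
    """Slide a window of word_len consecutive characters over the string.

    Each window is one judgment word; a string shorter than word_len
    yields no judgment words.
    """
    if word_len <= 0:
        raise ValueError("word_len must be positive")
    return [chars[i:i + word_len] for i in range(len(chars) - word_len + 1)]


if __name__ == "__main__":
    cleaned = extract_specified_chars("Ab_c9d")
    print(cleaned, segment(cleaned, 3))
```

With a sliding window, a string of N characters and a word length of n produces N - n + 1 judgment words, all of the same length.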
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims (18)

1. An information filtering method, comprising:
acquiring a registration request; the registration request comprises account information to be registered;
determining characters contained in the account information to be registered;
judging, according to the characters, whether the account information to be registered is a garbage account, which specifically comprises:
judging whether the account information to be registered is a garbage account according to a likelihood characterization value of the account information to be registered being a garbage account;
when the account information to be registered is a garbage account, refusing to register the account information to be registered;
wherein the judging whether the account information to be registered is a garbage account according to the likelihood characterization value of the account information to be registered being a garbage account specifically comprises:
for any judgment word, determining the number of times that the judgment word appears in each piece of predetermined normal account information, wherein the judgment word is used for determining the likelihood characterization value of the account information to be registered being a garbage account;
and determining, according to the number of times, the likelihood characterization value of the account information to be registered being a garbage account, wherein the likelihood characterization value is inversely proportional to the number of times.
2. The method of claim 1, wherein the determining whether the account information to be registered is a garbage account according to the character specifically includes:
determining a specified type character from the characters;
determining the likelihood characterization value of the account information to be registered being a garbage account according to the specified type character;
determining a minimum threshold value of a likelihood characterization value of the garbage account;
and when the likelihood characterization value is larger than the minimum threshold value, determining that the account information to be registered is a garbage account.
3. The method of claim 2, wherein the determining the likelihood representation value of the account information to be registered as the garbage account according to the specified type character specifically comprises:
word segmentation is carried out on the specified type characters according to a preset rule, and a word segmentation result is obtained; the word segmentation result comprises at least one judgment word;
and determining the likelihood characterization value of the account information to be registered as the garbage account according to the judgment word.
4. The method of claim 2, wherein determining the minimum threshold value of the likelihood-characterizing value of the garbage account comprises:
acquiring the likelihood characterization values of a preset number of garbage accounts that have been successfully registered;
determining the minimum value among the likelihood characterization values of the successfully registered garbage accounts;
and taking the minimum value as the minimum threshold value of the likelihood characterization value of the garbage account.
5. The method of claim 3, wherein the word segmentation of the specified type of character according to a preset rule specifically comprises:
determining the preset character string length of each judgment word;
and selecting continuous characters meeting the preset character string length from the characters contained in the account information to be registered according to the preset character string length, and taking the character string formed by the selected characters as an obtained judgment word.
6. The method of claim 5, wherein the determining the preset string length of each judgment word specifically comprises:
determining the average length of character strings contained in account information of a preset number of normal accounts which are successfully registered;
and taking the average length as the preset character string length.
7. The method of claim 3, wherein the determining, according to the judgment word, the likelihood representation value of the account information to be registered as the garbage account specifically includes:
determining the number of characters contained in the account information to be registered and the occurrence times of each judgment word in the account information of the normal user;
determining a likelihood characterization value of the account information to be registered as a garbage account according to the number of characters and the occurrence frequency; wherein the likelihood representation value is proportional to the number of characters contained in the account information to be registered.
8. The method of claim 1, wherein the determining, according to the number of times, the likelihood representation value of the account information to be registered as a spam account specifically includes:
using the formula
determining the likelihood characterization value of the account information to be registered being a garbage account;
wherein S is the likelihood characterization value; tf_i is the number of times that the i-th judgment word, obtained by segmenting the characters contained in the account information to be registered, appears in each piece of predetermined normal account information; i = 1, 2, ..., k; and k is the number of judgment words obtained by segmenting the characters contained in the account information to be registered.
9. The method of claim 7, wherein the determining the likelihood representation value of the account information to be registered as a spam account according to the number of characters and the occurrence number specifically comprises:
using the formula
wherein S is the likelihood characterization value;
tf_i is the number of times that the i-th judgment word, obtained by segmenting the characters contained in the account information to be registered, appears in each piece of predetermined normal account information; i = 1, 2, ..., k; and k is the number of judgment words obtained by segmenting the characters contained in the account information to be registered;
a is a preset length penalty coefficient, b is a preset short compensation value, and a and b are constants greater than 0;
x and y are preset constants greater than 0;
N is the number of characters contained in the account information to be registered;
n is the number of characters contained in each judgment word, wherein every judgment word contains the same number of characters;
h is a preset integer, and N > h > 0.
10. An information filtering apparatus comprising:
the acquisition module is used for acquiring the registration request; the registration request comprises account information to be registered;
the character determining module is used for determining characters contained in the account information to be registered;
the judging module is used for judging whether the account information to be registered is a garbage account or not according to the characters; the judging module is specifically configured to judge whether the account information to be registered is a junk account according to a likelihood characterization value of the account information to be registered being the junk account;
the filtering processing module is used for refusing to register the account information to be registered when the account information to be registered is a junk account;
the judging module is specifically configured to determine, for any judgment word, the number of times that the judgment word appears in each piece of predetermined normal account information, wherein the judgment word is used for determining the likelihood characterization value of the account information to be registered being a garbage account; and to determine, according to the number of times, the likelihood characterization value of the account information to be registered being a garbage account, wherein the likelihood characterization value is inversely proportional to the number of times.
11. The apparatus of claim 10, wherein the judging module is specifically configured to:
determining a specified type character from the characters;
determining the likelihood characterization value of the account information to be registered being a garbage account according to the specified type character;
determining a minimum threshold value of a likelihood characterization value of the garbage account;
and when the likelihood characterization value is larger than the minimum threshold value, determining that the account information to be registered is a garbage account.
12. The apparatus of claim 11, wherein the judging module is specifically configured to:
word segmentation is carried out on the specified type characters according to a preset rule, and a word segmentation result is obtained; the word segmentation result comprises at least one judgment word;
and determining the likelihood characterization value of the account information to be registered as the garbage account according to the judgment word.
13. The apparatus of claim 11, wherein the judging module is specifically configured to:
acquiring the likelihood characterization values of a preset number of garbage accounts that have been successfully registered;
determining the minimum value among the likelihood characterization values of the successfully registered garbage accounts;
and taking the minimum value as the minimum threshold value of the likelihood characterization value of the garbage account.
14. The apparatus of claim 12, wherein the judging module is specifically configured to:
determining the preset character string length of each judgment word;
and selecting continuous characters meeting the preset character string length from the characters contained in the account information to be registered according to the preset character string length, and taking the character string formed by the selected characters as an obtained judgment word.
15. The apparatus of claim 14, wherein the judging module is specifically configured to:
determining the average length of character strings contained in account information of a preset number of normal accounts which are successfully registered;
and taking the average length as the preset character string length.
16. The apparatus of claim 12, wherein the judging module is specifically configured to:
determining the number of characters contained in the account information to be registered and the occurrence times of each judgment word in the account information of the normal user;
determining a likelihood characterization value of the account information to be registered as a garbage account according to the number of characters and the occurrence frequency; wherein the likelihood representation value is proportional to the number of characters contained in the account information to be registered.
17. The apparatus of claim 11, wherein the judging module is specifically configured to:
using the formula
determining the likelihood characterization value of the account information to be registered being a garbage account;
wherein S is the likelihood characterization value; tf_i is the number of times that the i-th judgment word, obtained by segmenting the characters contained in the account information to be registered, appears in each piece of predetermined normal account information; i = 1, 2, ..., k; and k is the number of judgment words obtained by segmenting the characters contained in the account information to be registered.
18. The apparatus of claim 16, wherein the judging module is specifically configured to:
using the formula
determining the likelihood characterization value of the account information to be registered being a garbage account;
wherein S is the likelihood characterization value;
tf_i is the number of times that the i-th judgment word, obtained by segmenting the characters contained in the account information to be registered, appears in each piece of predetermined normal account information; i = 1, 2, ..., k; and k is the number of judgment words obtained by segmenting the characters contained in the account information to be registered;
a is a preset length penalty coefficient, b is a preset short compensation value, and a and b are constants greater than 0; x and y are preset constants greater than 0; N is the number of characters contained in the account information to be registered; n is the number of characters contained in each judgment word, wherein every judgment word contains the same number of characters; h is a preset integer, and N > h > 0.
CN201811523727.2A 2014-10-14 2014-10-14 Information filtering method and device Active CN109670108B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811523727.2A CN109670108B (en) 2014-10-14 2014-10-14 Information filtering method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811523727.2A CN109670108B (en) 2014-10-14 2014-10-14 Information filtering method and device
CN201410542510.1A CN105574023B (en) 2014-10-14 2014-10-14 A kind of information filtering method and device

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201410542510.1A Division CN105574023B (en) 2014-10-14 2014-10-14 A kind of information filtering method and device

Publications (2)

Publication Number Publication Date
CN109670108A CN109670108A (en) 2019-04-23
CN109670108B true CN109670108B (en) 2023-08-01

Family

ID=55884169

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201410542510.1A Active CN105574023B (en) 2014-10-14 2014-10-14 A kind of information filtering method and device
CN201811523727.2A Active CN109670108B (en) 2014-10-14 2014-10-14 Information filtering method and device

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201410542510.1A Active CN105574023B (en) 2014-10-14 2014-10-14 A kind of information filtering method and device

Country Status (1)

Country Link
CN (2) CN105574023B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109255458A (en) * 2018-09-26 2019-01-22 蜜小蜂智慧(北京)科技有限公司 A kind of method and apparatus of identification registration
CN110430245B (en) * 2019-07-17 2022-06-10 北京达佳互联信息技术有限公司 Control method, device, equipment and medium for abnormal account identification

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102185788A (en) * 2011-01-31 2011-09-14 北京开心人信息技术有限公司 Method and system for searching vice accounts on basis of temporary mailbox
CN102790752A (en) * 2011-05-20 2012-11-21 盛乐信息技术(上海)有限公司 Fraud information filtering system and method on basis of feature identification
CN103118043B (en) * 2011-11-16 2015-12-02 阿里巴巴集团控股有限公司 A kind of recognition methods of user account and equipment
US20130311283A1 (en) * 2012-05-18 2013-11-21 Huawei Technologies Co., Ltd. Data mining method for social network of terminal user and related methods, apparatuses and systems

Also Published As

Publication number Publication date
CN105574023B (en) 2019-01-04
CN105574023A (en) 2016-05-11
CN109670108A (en) 2019-04-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201013

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201013

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant