CN105718767B

CN105718767B - information processing method and device based on risk identification

Info

Publication number: CN105718767B
Application number: CN201410734967.2A
Authority: CN
Inventors: 郑丹丹; 林述民
Original assignee: Alibaba Group Holding Ltd
Current assignee: Advanced New Technologies Co Ltd; Advantageous New Technologies Co Ltd
Priority date: 2014-12-04
Filing date: 2014-12-04
Publication date: 2020-01-31
Anticipated expiration: 2034-12-04
Also published as: CN111371761A; CN105718767A; CN111371761B

Abstract

The application discloses an information processing method and device based on risk identification. The method comprises: dividing the characters contained in information to be identified into different character sets; determining the component risk value corresponding to each character set; determining a comprehensive risk value of the information to be identified according to the component risk values corresponding to the character sets; and processing the information to be identified according to the comprehensive risk value.

Description

Information processing method and device based on risk identification

Technical Field

The present application relates to the field of computer technologies, and in particular, to an information processing method and apparatus based on risk identification.

Background

With the development of information technology, the Mobile Directory Number (MDN), that is, the mobile phone number of the communication device used by a user, has become important user identification information. Users can use the number not only for operations such as registration and login, but also bind it to a corresponding network account to perform important network operations such as authentication.

At present, a mobile phone number used by a user is at risk of being stolen. A stolen mobile phone number poses a serious threat to the user's network operations and can easily cause losses to the user.

In the prior art, for a mobile phone number registered or bound in a website, a server performs risk identification on the user's mobile phone number to determine its risk of being stolen, so that corresponding risk prevention and control measures can be taken.

One approach identifies the value degree of the mobile phone number: the value degree is inferred from the order and meaning of the digits contained in the number. Usually, the more consecutive digits or repeated identical digits appear in a mobile phone number, the higher its value degree. For example, the mobile phone numbers 13912345678 and 13888886666 have a higher value degree than ordinary mobile phone numbers. A mobile phone number with a higher value degree is more likely to become a target of theft, so corresponding risk control operations are performed on such numbers, for example, raising the security monitoring level.

Another approach identifies the danger degree of the mobile phone number: generally, the server monitors whether an account bound to a mobile phone number performs illegal operations (such as stealing other accounts or other malicious network behaviors). If so, the mobile phone number is marked as a high-danger mobile phone number, and corresponding risk control operations are performed on it, for example, the number is added to a blacklist to prevent it from being bound or registered.

However, the above methods for identifying a mobile phone number still have defects. Specifically:

Value degree identification of a mobile phone number generally depends on subjective judgment. The value degree is judged from the meaning of the digits in the number, there is no uniform judgment standard, and the actual value degree of the mobile phone number cannot be fully and accurately reflected.

With danger degree identification, a mobile phone number marked as high-danger may be abandoned by its user, reclaimed by the telecom operator after a period of time, and reassigned to another user for continued use. In that case, the high-danger mark may no longer reflect the actual risk of the number.

Disclosure of Invention

The embodiment of the application provides an information processing method and device based on risk identification, which are used for solving the problem of poor accuracy in risk identification of information.

The information processing method based on risk identification provided by the embodiment of the application comprises: dividing the characters contained in information to be identified into different character sets;

respectively determining component risk values corresponding to the character sets;

determining a comprehensive risk value of the information to be identified according to the component risk value corresponding to each character set;

and processing the information to be identified according to the comprehensive risk value.

The information processing device based on risk identification provided by the embodiment of the application comprises: a character dividing module, used for dividing the characters contained in information to be identified into different character sets;

the component risk value module is used for respectively determining component risk values corresponding to the character sets;

the comprehensive risk value module is used for determining a comprehensive risk value of the information to be identified according to the component risk value corresponding to each character set;

and the processing module is used for processing the information to be identified according to the comprehensive risk value.

The embodiment of the application provides an information processing method and device based on risk identification. Characters with corresponding meanings in the information to be identified are divided into different character sets, and after the component risk value corresponding to each character set is determined, the comprehensive risk value corresponding to the information to be identified can be determined accurately without depending on subjective judgment.

Drawings

The accompanying drawings described herein are included to provide a further understanding of the application and constitute a part of the application. The illustrative embodiments of the application and their description serve to explain the application and do not limit it.

Fig. 1 is a schematic diagram of an information processing process based on risk identification according to an embodiment of the present application;

Fig. 2 is a schematic process diagram of a first method for determining the component risk value corresponding to each character set according to an embodiment of the present application;

Fig. 3 is a schematic process diagram of a second method for determining the component risk value corresponding to each character set according to an embodiment of the present application;

Fig. 4 is a schematic process diagram of a third method for determining the component risk value corresponding to each character set according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of an information processing apparatus based on risk identification according to an embodiment of the present application;

Fig. 6 is a schematic structural diagram of a component risk value module when determining a first component risk value according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of a component risk value module when determining a second component risk value according to an embodiment of the present application;

Fig. 8 is a schematic structural diagram of a component risk value module when determining a third component risk value according to an embodiment of the present application.

Detailed Description

To make the objectives, technical solutions, and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to specific embodiments of the present application and the accompanying drawings.

Fig. 1 shows an information processing process based on risk identification according to an embodiment of the present application. The process specifically includes the following steps:

S101: The characters contained in the information to be identified are divided into different character sets.

In the scenario of the embodiment of the application, after a user registers account information (e.g., a network account), the user's user information is bound to the account information so that identification and authentication can be performed during corresponding operations. Therefore, the information to be identified in the embodiment of the present application is specifically user information that is bound to account information and used for authentication and identification. The information to be identified includes but is not limited to the user's mobile phone number, certificate number, and the like.

Taking the 11-digit mobile phone number 13812348888 as an example: the first three digits "138" represent the attribute type of the mobile phone number, from which the telecom operator and the corresponding service type of the number can be determined. The fourth to seventh digits "1234" are the Home Location Register (HLR) identification code, from which the user information corresponding to the number (such as home location information and call priority information) can be determined. The last four digits "8888" are the user number, from which a specific user can be determined.

Therefore, in step S101, the characters in the information to be identified that carry a certain meaning may be divided into different character sets.

It should be noted that, in step S101, dividing the characters into character sets specifically means dividing the characters at a designated position in the information to be identified into one character set, and dividing the characters at different designated positions into different character sets, so as to obtain a plurality of different character sets. The union of the character sets includes all the characters in the information to be identified, and at least two character sets have an intersection.
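The division in step S101 can be sketched as follows. This is a minimal illustration only, assuming an 11-digit mobile phone number and three position ranges (first three digits, first seven digits, last four digits); the names `POSITION_SPECS` and `split_into_character_sets` are hypothetical and do not come from the application.

```python
# Hypothetical designated positions (0-based, end-exclusive) for an
# 11-digit mobile phone number. These ranges are assumptions chosen to
# match the digit groups discussed in the description.
POSITION_SPECS = {
    "attribute_type": (0, 3),   # first three digits, e.g. "138"
    "first_seven": (0, 7),      # attribute type plus HLR identification code
    "user_number": (7, 11),     # last four digits, e.g. "8888"
}


def split_into_character_sets(info, specs=POSITION_SPECS):
    """Return the characters found at each designated position range.

    Slicing keeps the characters in the order in which they appear in
    the information to be identified.
    """
    return {name: info[start:end] for name, (start, end) in specs.items()}


sets = split_into_character_sets("13812348888")
```

With these assumed ranges, the union of the three sets covers all eleven digits, and the first two sets intersect on the first three digits.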

S102: The component risk value corresponding to each character set is determined respectively.

After the characters with certain meanings are divided into different character sets, the component risk value of each character set is determined one by one. A component risk value is a quantized value of the risk degree corresponding to a character set.

It should be noted that the component risk value in the embodiment of the present application reflects the value degree of the characters in the character set, and reflects the risk degree through the value degree.

Specifically, still taking the mobile phone number 13812348888 as an example, if the last four digits "8888" are divided into one character set, the probability that four digits are all identical is obviously very small, so the value degree corresponding to the character set containing these four digits is very high. In a practical application scenario, information to be identified that contains this character set is therefore more likely to be stolen; that is, the risk of theft associated with the character set is higher.

S103: The comprehensive risk value of the information to be identified is determined according to the component risk value corresponding to each character set.

Because the characters contained in each character set are all characters in the information to be identified, the risk degree of the information to be identified as a whole can be reflected through the risk degrees corresponding to the character sets. That is, the comprehensive risk value of the information to be identified can be determined according to the component risk values corresponding to the character sets. Of course, in the embodiment of the present application, the comprehensive risk value may be determined from the component risk values in a number of ways, such as accumulation or averaging, which is not limited in this application.
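The two combination modes named above, accumulation and averaging, can be sketched as follows. The function name `comprehensive_risk` and the `mode` parameter are hypothetical; the application does not prescribe a particular combination rule.

```python
from statistics import mean


def comprehensive_risk(component_values, mode="sum"):
    """Combine component risk values by accumulation or by averaging."""
    values = list(component_values)
    if not values:
        raise ValueError("at least one component risk value is required")
    if mode == "sum":
        return sum(values)
    if mode == "mean":
        return mean(values)
    raise ValueError(f"unknown mode: {mode}")


total = comprehensive_risk([1.0, 2.0, 3.0])                 # accumulation
average = comprehensive_risk([1.0, 2.0, 3.0], mode="mean")  # averaging
```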

S104: The information to be identified is processed according to the comprehensive risk value.

In this embodiment of the application, the comprehensive risk value reflects the risk degree of the information to be identified. Specifically, the larger the comprehensive risk value, the higher the risk degree of the information to be identified, and the greater the security threat it faces. For example, information to be identified whose comprehensive risk value is too high needs to be processed together with a corresponding risk control system; the processing may be to raise the security monitoring level, add security protection measures, and the like. In practical applications, a corresponding risk threshold may be preset, and when the determined comprehensive risk value of the information to be identified is higher than the risk threshold, corresponding risk control processing is performed on the information to be identified.
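The threshold comparison described above might look like the following sketch. The function name, the returned labels, and the threshold value used in the example call are all assumptions made for illustration.

```python
def process_information(comprehensive_risk_value, risk_threshold):
    """Decide how to handle one piece of information to be identified."""
    if comprehensive_risk_value > risk_threshold:
        # For example: raise the security monitoring level or add
        # security protection measures.
        return "apply_risk_control"
    return "no_action"


decision = process_information(comprehensive_risk_value=12.5, risk_threshold=10.0)
```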

Through the above steps, the characters with corresponding meanings in the information to be identified are divided into different character sets, and after the component risk value corresponding to each character set is determined, the comprehensive risk value corresponding to the information to be identified can be determined accurately without depending on subjective judgment. Moreover, because the pre-stored identified information is used as the basis when the component risk values are determined, the actual value degree of the information to be identified can be reflected more accurately.

In the embodiment of the present application, since the characters in different character sets have different meanings, different methods are adopted to determine the component risk values corresponding to different character sets. Specifically:

Method one:

As shown in Fig. 2, the process of determining the component risk value corresponding to each character set in method one specifically includes:

S201: The characters in the character set are arranged according to their order in the information to be identified, to obtain the character sequence corresponding to the character set.

For example, the first to third digits of the mobile phone number are 1, 3, and 8 respectively. If the first, second, and third digits are the designated positions, then when the first to third characters of the mobile phone number are divided into a character set, they might end up in an order such as 381 or 813. In that case, the three digits in the character set no longer carry the meaning of the attribute type of the mobile phone number, so the component risk value corresponding to the character set cannot be determined accurately.

Therefore, in the embodiment of the present application, after the characters in the information to be identified are divided into different character sets, the characters in each character set are arranged so that they follow their order in the information to be identified. The character sequence corresponding to the character set is obtained in this way, and the meaning of the characters is not changed.

S202: Among the pre-stored pieces of information identified as normal, the proportion of information having the same character sequence is determined as the first ratio.

In a practical application scenario, account information and the information bound to it are both stored in corresponding devices (e.g., servers). A user may use account information to perform illegal operations such as account stealing, so the corresponding device determines whether the information bound to the account information is identified as normal information or abnormal information by monitoring whether illegal operations occur on the account information. Of course, in practical applications, whether each piece of identified information is normal may be determined by means such as the network behavior monitoring and analysis of the prior art, which does not limit the present application.

Therefore, in the embodiment of the present application, the pre-stored identified normal information may be information that is stored in advance in the corresponding device and has been identified as normal. For example, in a certain website, the different mobile phone numbers bound to different account information that are identified as normal after the corresponding identification processing constitute the pre-stored identified normal information.

Information containing the above character sequence may appear in the identified normal information or in the identified abnormal information. Therefore, the proportion of the identified normal information that contains the character sequence is counted (the first ratio).

S203: Among the pre-stored pieces of information identified as abnormal, the proportion of information having the same character sequence is determined as the second ratio.

As with the first ratio, the pre-stored abnormal information may be information that is stored in advance in the corresponding device and is considered abnormal, such as a blacklisted mobile phone number obtained after the corresponding identification processing.

S204: The ratio of the first ratio to the second ratio is determined.

The ratio of the first ratio to the second ratio indicates how likely information containing the character sequence is to be normal information or abnormal information. Specifically, if the ratio of the first ratio to the second ratio is much greater than 1, that is, the first ratio is much greater than the second ratio, then the proportion of information containing the character sequence in the identified normal information is much greater than its proportion in the identified abnormal information, so it can be determined that information containing the character sequence is very likely to be normal information.

S205: The first component risk value corresponding to the character set is determined according to the ratio.

It should be noted that, in a practical application scenario, because the amount of pre-stored identified information is huge, the ratio of the first ratio to the second ratio may be large, which increases the computation of subsequent processing. To simplify the computation, in this embodiment of the present application a logarithm may be applied to the ratio. That is, in step S205, determining the first component risk value corresponding to the character set according to the ratio specifically comprises: determining the logarithm value of the ratio, and determining the first component risk value corresponding to the character set according to the logarithm value. If the logarithm value of the ratio were taken directly as the first component risk value, then, because the logarithm value may be less than zero (the logarithm of a number less than 1 is less than zero), a certain error could be introduced into the comprehensive risk value when it is determined from the first component risk value.

Therefore, more specifically, determining the first component risk value corresponding to the character set according to the logarithm value comprises: taking the sum of the logarithm value and a preset adjusting constant as the first component risk value corresponding to the character set. In this way, the preset adjusting constant offsets the error caused when the logarithm value is less than zero.

Thus, for every character set, the sum of the logarithm of the ratio of the first ratio to the second ratio and the preset adjusting constant is a value greater than zero, and a value less than zero cannot occur.

In one scenario provided in the embodiment of the present application, if the information to be identified is a mobile phone number to be identified, the character set is a number set formed by several digits contained in the mobile phone number to be identified. In this case, when the first three digits contained in the mobile phone number to be identified are divided into a first character set, the digits in the first character set are arranged according to their order in the mobile phone number to be identified, to obtain a first digit sequence corresponding to the first character set. Then, in combination with method one, the formula

S₁＝log(p₁/p₂)＋C

may be used to determine the first component risk value corresponding to the first character set.

Where S₁ is the first component risk value corresponding to the first character set.

p₁ is the proportion of mobile phone numbers containing the first digit sequence among the pre-stored mobile phone numbers identified as normal.

p₂ is the proportion of mobile phone numbers containing the first digit sequence among the pre-stored mobile phone numbers identified as abnormal.

C is a preset constant value.
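Steps S202 to S205 can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of the application: the base of the logarithm is not stated in the description, so the natural logarithm is assumed here; "containing" a digit sequence is interpreted as matching at the same designated position; the error raised when a proportion is zero is an added safeguard; and all function and parameter names are hypothetical.

```python
import math


def proportion_with_sequence(numbers, sequence, start):
    """Share of `numbers` whose characters beginning at `start` equal `sequence`."""
    if not numbers:
        raise ValueError("empty sample")
    end = start + len(sequence)
    hits = sum(1 for number in numbers if number[start:end] == sequence)
    return hits / len(numbers)


def first_component_risk(normal, abnormal, sequence, c, start=0):
    """Logarithm of p1/p2 plus the adjusting constant c (natural log assumed)."""
    p1 = proportion_with_sequence(normal, sequence, start)    # S202: first ratio
    p2 = proportion_with_sequence(abnormal, sequence, start)  # S203: second ratio
    if p1 == 0 or p2 == 0:
        # Assumption: the description does not say how to treat a sequence
        # that is absent from one of the samples.
        raise ValueError("sequence must occur in both samples")
    return math.log(p1 / p2) + c                              # S204 and S205


# Tiny made-up samples: 5 of 10 normal and 1 of 50 abnormal numbers start with "138".
normal_numbers = ["13800000000"] * 5 + ["13900000000"] * 5
abnormal_numbers = ["13800000000"] * 1 + ["13900000000"] * 49
s1 = first_component_risk(normal_numbers, abnormal_numbers, "138", c=8.0)
```

Taking the logarithm compresses large ratios, and the constant `c` shifts the result so that ratios below 1 do not produce a negative component value, provided `c` is chosen large enough for the data at hand.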

Method one is described in detail below using an application example:

Assume that the mobile phone number is still 13812348888 and that it is bound to account A. After the server receives the registration of account A, the server identifies the mobile phone number 13812348888 bound to account A. The server divides the first three digits of the mobile phone number into a first character set, and arranges the digits in the character set according to the order of the first three digits in the mobile phone number to obtain a first character sequence "138".

Assume that the number of mobile phone numbers previously stored in the server and identified as normal is 10000 (in practical applications, the number of accounts stored in the server is huge; 10000 is used here only for convenience of description), and that 5000 of these 10000 normal mobile phone numbers contain the first character sequence "138". The first ratio p₁ of mobile phone numbers containing the first character sequence "138" among the normal mobile phone numbers can therefore be determined, i.e. p₁＝5000/10000＝0.5.

Assume that the number of mobile phone numbers previously stored in the server and identified as abnormal is 100, and that 2 of these 100 abnormal mobile phone numbers contain the first character sequence "138". The second ratio p₂ of mobile phone numbers containing the first character sequence "138" among the pre-stored abnormal mobile phone numbers can therefore be determined, i.e. p₂＝2/100＝0.02.

After the first ratio p₁ and the second ratio p₂ are obtained, the ratio of the first ratio p₁ to the second ratio p₂ can be determined, i.e. p₁/p₂＝0.5/0.02＝25. This ratio is much greater than 1, which indicates that a mobile phone number containing the character sequence "138" is very likely to be a normal mobile phone number.

Meanwhile, assuming that the adjusting constant C is 8, the first component risk value of the first character set is, according to the above formula, S₁＝log(25)＋8.

As can be seen from the above example, determining the component risk value of the character set from the first ratio and the second ratio makes it possible to quantify more accurately how likely the information to be identified is to be normal information or abnormal information. The higher the first component risk value, the more likely the information to be identified is normal information and the higher its risk of being stolen; conversely, the lower the first component risk value, the more likely the information is abnormal information and the lower its risk of being stolen.

Method two:

As shown in Fig. 3, the process of determining the component risk value corresponding to each character set in method two specifically includes:

S301: The characters in the character set are arranged according to their order in the information to be identified, to obtain the character sequence corresponding to the character set.

As in method one, when the characters at the designated positions in the information to be identified are divided into a character set, they may not be placed in the character set in their original order. Therefore, the characters in the character set are arranged to obtain the character sequence corresponding to the character set.

S302: Among the pre-stored identified information, the account information corresponding to each piece of identified information containing the character sequence is determined.

Since each piece of account information is bound to corresponding information in the embodiment of the present application, the account information bound to any piece of identified information can be uniquely determined.

In method two, "each piece of account information" below refers to the account information corresponding to the identified information containing the character sequence.

S303: The service level of each piece of account information is determined.

In a practical application scenario, a user can use his or her account information to obtain various business services. The more business services are obtained through a piece of account information, the more frequently the user uses that account information, and the more likely it is to be normal account information. Accordingly, a level can be preset for each business service.

For example, if the level of the business service associated with a bank card is preset to 5, and a piece of account information has a bank card bound and has opened the business associated with the bank card, the service level corresponding to that account information is 5.

Of course, if a piece of account information uses several business services, its service level is the sum of the levels of those business services. For example, if two business services are opened under a piece of account information and their levels are 3 and 4 respectively, the service level of the account information is 7.

In practical applications, the determination of the service level of account information is not limited to the above manner. The service level may also be determined according to the activity of the account information, the frequency with which the account information uses business services, and the like, which does not limit the present application.
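The summation rule of step S303 can be sketched as follows. The service names and the `SERVICE_LEVELS` table are hypothetical; only the level values 5, 3, and 4 are taken from the examples in the text.

```python
# Hypothetical preset level for each business service.
SERVICE_LEVELS = {"bank_card": 5, "service_a": 3, "service_b": 4}


def account_service_level(opened_services, levels=SERVICE_LEVELS):
    """Service level of an account: the sum of the levels of its opened services."""
    return sum(levels[service] for service in opened_services)


single = account_service_level(["bank_card"])                # one service opened
combined = account_service_level(["service_a", "service_b"])  # two services opened
```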

S304: According to the service level of each piece of account information, the number of pieces of account information at each service level is counted.

Generally, the types of business services are limited, and many pieces of account information use the same business services, that is, have the same service level. In the embodiment of the application, the number of pieces of account information with the same service level needs to be determined, so after the service level of each piece of account information is determined, the number of pieces of account information corresponding to each service level is counted.

S305: Among all the account information, the proportion of account information at each service level is determined.

Once the number of pieces of account information corresponding to each service level is known, the proportion that the account information at each service level makes up of all the account information corresponding to the identified information containing the character sequence can be determined. This proportion directly reflects the degree to which the account information uses business services.

S306: The second component risk value corresponding to the character set is determined according to the service level of each piece of account information and the proportion of account information at each service level.

After the service level of each piece of account information and the proportion of account information at each service level are determined, the service level distribution of all the identified information containing the character sequence is known.

In one scenario provided in this embodiment of the present application, if the information to be identified is a mobile phone number to be identified, the character set is a number set formed by several digits contained in the mobile phone number to be identified. In this case, when the first seven digits contained in the mobile phone number to be identified are divided into a second character set, the digits in the second character set are arranged according to their order in the mobile phone number to be identified, to obtain a second digit sequence corresponding to the second character set. Then, in combination with method two, the formula

S₂＝Σ(w(i)*Prob(i))

may be used to determine the second component risk value corresponding to the second character set.

Where S₂ is the second component risk value corresponding to the second character set.

w(i) is the i-th service level among the determined service levels.

Prob(i) is the proportion of account information of the i-th service level among all the account information.
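Steps S304 to S306 and the formula for S₂ can be sketched as follows. The function name and the sample data (900 accounts at level 5 and 100 accounts at level 4) are illustrative assumptions; the input is taken to be a list holding the already determined service level of each account bound to a number that contains the digit sequence.

```python
from collections import Counter


def second_component_risk(service_levels):
    """Sum of w(i) * Prob(i) over the distinct service levels w(i)."""
    levels = list(service_levels)
    if not levels:
        raise ValueError("no account information for this character sequence")
    counts = Counter(levels)  # S304: number of accounts at each service level
    total = len(levels)
    # S305: Prob(i) = count / total; S306: weight each level by its proportion.
    return sum(level * count / total for level, count in counts.items())


# Illustrative sample: 900 accounts at level 5 and 100 accounts at level 4.
s2 = second_component_risk([5] * 900 + [4] * 100)  # approximately 4.9
```

Under this reading, S₂ equals the arithmetic mean of the service levels of the accounts, so it always lies between the lowest and the highest level present.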

It should be noted that, in the embodiment of the present application, the first seven digits of the eleven-digit mobile phone number are divided into the second character set because the first three digits together with the fourth to seventh digits identify the mobile phone numbers that have the same call priority under a certain attribute type (e.g., the same carrier), or the same home location under a certain attribute type. That is, mobile phone numbers sharing the same features can be determined by the first seven digits.

Method two is described in detail below using an application example:

Assume that the mobile phone number is still 13812348888 and that it is bound to account A. When the server receives the registration of account A, the server identifies the mobile phone number 13812348888 bound to account A. The server divides the first seven digits of the mobile phone number into a second character set, and arranges the digits in the character set according to the order of the first seven digits in the mobile phone number to obtain a second character sequence "1381234".

The server will determine all cell phone numbers containing the second character sequence "1381234" among the pre-stored recognized cell phone numbers. It is assumed that the number of mobile phone numbers containing the second character sequence "1381234" is 1000 in total. Then, the server will determine the account information bound to the 1000 mobile phone numbers respectively, and correspondingly, the server will determine 1000 account information.

Then, the server determines the service levels of the 1000 pieces of account information according to a preset service level standard. The server may determine the service level according to the services used by the account information or, of course, in other ways such as a preset level standard for each service. In actual applications, the server may adjust these settings according to the application requirements; this does not limit the present application.

Assume that two service levels are present among the 1000 pieces of account information: 900 pieces have the 1st service level w(1), with w(1) = 5, and 100 pieces have the 2nd service level w(2), with w(2) = 4. Then the proportion of account information with service level 5 among the 1000 pieces is Prob(1) = 900/1000 = 0.9, and the proportion of account information with service level 4 is Prob(2) = 100/1000 = 0.1.

Thus, according to the above formula, the server can determine the second component risk value corresponding to the second character set containing the second character sequence "1381234": S₂ = 0.9 × 5 + 0.1 × 4 = 4.9. The second component risk value is close to the service level w(1); that is, the service level of the account information corresponding to the mobile phone numbers containing the second character sequence "1381234" stays at about the level of w(1).

As can be seen from the above example, determining the account information corresponding to the identified information that contains the character sequence, and determining the service level of that account information, reflects the level of the services used by the account information. Combined with the counted number of pieces of account information at each service level, the service level of the account information corresponding to the identified information containing the character sequence can be quantified as a whole. The larger the second component risk value, the more likely the information to be identified is normal information and the higher its risk of being stolen; conversely, the smaller the value, the more likely the information is abnormal information and the lower its risk of being stolen.
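The weighted sum used in the second method can be illustrated with a minimal Python sketch. It assumes that the formula is the sum of w(i) × Prob(i) over the service levels, as the worked example indicates; the function name `second_component_risk` and its list-based input are hypothetical and are not taken from the original disclosure.

```python
from collections import Counter


def second_component_risk(service_levels):
    """Return the sum over service levels of w(i) * Prob(i).

    service_levels: one service level per piece of account information
    (hypothetical input format chosen for this sketch).
    """
    if not service_levels:
        raise ValueError("at least one piece of account information is required")
    total = len(service_levels)
    counts = Counter(service_levels)  # pieces of account information per level
    return sum(level * count / total for level, count in counts.items())


# 900 accounts at service level 5 and 100 accounts at service level 4,
# matching the application example above.
levels = [5] * 900 + [4] * 100
s2 = second_component_risk(levels)
print(round(s2, 6))  # 4.9
```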

The third method comprises the following steps:

As shown in Fig. 4, the process of determining the component risk value corresponding to each character set in the third method specifically includes:

S401, arranging the characters in the character set according to the sequence of the characters in the information to be recognized to obtain a character sequence corresponding to the character set.

As in the first method and the second method, the characters are arranged after the corresponding characters have been divided into the character set.

S402, identifying characteristic characters in the character sequence.

In the embodiment of the application, the characteristic characters comprise repeated characters and/or sequential characters. Repeated characters are at least two consecutive identical characters, such as aaa, bb and cccc. Sequential characters are at least three consecutive characters arranged in character order, such as abcd, 789, 321 and 1234.
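A minimal Python sketch of one possible way to recognize such characteristic characters is given below. The greedy, non-overlapping left-to-right scan and the function name `find_feature_runs` are assumptions made for illustration; the disclosure leaves the recognition algorithm open. Descending runs such as 321 are treated as sequential, in line with the examples above.

```python
def find_feature_runs(seq):
    """Return (repeated_runs, sequential_runs) found in a character string.

    Repeated run: at least two consecutive identical characters.
    Sequential run: at least three consecutive characters whose codes
    differ by a constant step of +1 or -1.
    """
    repeated, sequential = [], []
    i = 0
    while i < len(seq):
        # Try to extend a run of identical characters starting at i.
        j = i + 1
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        if j - i >= 2:
            repeated.append(seq[i:j])
            i = j
            continue
        # Otherwise try an ascending or descending run starting at i.
        j = i + 1
        step = None
        if j < len(seq) and ord(seq[j]) - ord(seq[i]) in (1, -1):
            step = ord(seq[j]) - ord(seq[i])
            while j < len(seq) and ord(seq[j]) - ord(seq[j - 1]) == step:
                j += 1
        if step is not None and j - i >= 3:
            sequential.append(seq[i:j])
            i = j
        else:
            i += 1
    return repeated, sequential


print(find_feature_runs("12348888"))  # (['8888'], ['1234'])
```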

In addition, for the recognition of the characteristic character, a character recognition algorithm in the prior art can be adopted, and the method does not constitute a limitation to the application.

S403, when the characteristic character is recognized, determining the weight value and the characteristic value of the characteristic character.

In a character sequence there are a great many ways to arrange and combine different characters. Most characters are arranged randomly and without order, and feature characters arise in only a few of these arrangements; that is, feature characters occur with a definite probability.

Therefore, in the embodiment of the present application, the weight value of the feature character is quantized according to the probability of occurrence of the feature character, and the feature value of the feature character is quantized according to the number of characters included in the feature character. That is, the step S403 specifically includes: determining the probability of the characteristic character appearing in the character sequence; determining the weight value of the characteristic character according to the probability; performing word segmentation on the characteristic characters to obtain character units; and determining the characteristic value of the characteristic character according to the obtained number of the character units.

In this embodiment, an N-gram language model is used to segment the characteristic character: the characteristic character is first divided into the smallest character units (N = 1 in this case), and the number of characters per character unit is then increased step by step until the characteristic character is divided into a single character unit (in this case, N equals the number of characters contained in the characteristic character).

For example, the characteristic character 8888 is segmented with an N-gram language model as follows. Under 1-gram segmentation, it is divided into 4 character units: 8, 8, 8 and 8. Under 2-gram segmentation, it is divided into 3 character units: 88, 88 and 88. Under 3-gram segmentation, it is divided into 2 character units: 888 and 888. Under 4-gram segmentation, it is divided into 1 character unit: 8888.
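The segmentation in this example corresponds to taking every window of N consecutive characters. A minimal Python sketch follows; the function name `ngram_units` is hypothetical.

```python
def ngram_units(text, n):
    """Split text into all overlapping character units of length n."""
    if not 1 <= n <= len(text):
        raise ValueError("n must be between 1 and the length of the text")
    return [text[k:k + n] for k in range(len(text) - n + 1)]


for n in range(1, 5):
    print(n, ngram_units("8888", n))
# 1 ['8', '8', '8', '8']
# 2 ['88', '88', '88']
# 3 ['888', '888']
# 4 ['8888']
```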

S404, determining a third component risk value corresponding to the character set according to the weight value and the characteristic value of the characteristic character.

For the third method, in one scenario provided in this embodiment of the application, the information to be identified is a mobile phone number to be identified, and the last eight digits of the mobile phone number are divided into a third character set. For the third character set, the digits are arranged according to their sequence in the mobile phone number to be identified, so as to obtain a third digit sequence corresponding to the third character set.

When a repeated number is identified, word segmentation is performed on it to obtain different digit units, and the formula

S_c(n) = Σ_{j=1}^{n} j * (tf_j - 1)

can be used to determine the feature value of the repeated number,

where S_c(n) is the feature value of the repeated number, and the argument n is the number of digits contained in the repeated number;

tf_j is the number of character units obtained after the repeated number is segmented by the j-th word segmentation method;

j denotes the j-th word segmentation method, in which each digit unit obtained contains j characters. That is, j is the value of N when the N-gram language model is used for word segmentation.

For example, building on the N-gram segmentation of the characteristic character "8888" described above, the formula gives the feature value of the repeated number "8888" as:

S_c(4) = 1*(4-1) + 2*(3-1) + 3*(2-1) + 4*(1-1) = 10.

Here, in the term 2*(3-1), the characteristic character "8888" is divided into 3 character units 88, 88 and 88 by the 2-gram word segmentation method; the number "2" is the number of characters contained in each character unit, and the number "3" is the number of character units. The other terms in the formula are obtained in the same way.
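The calculation can be sketched in Python as follows. The sketch assumes, in line with the "8888" example, that the j-th segmentation of an n-digit repeated number yields tf_j = n - j + 1 character units; the function name `repeated_feature_value` is hypothetical.

```python
def repeated_feature_value(n):
    """S_c(n): sum over j = 1..n of j * (tf_j - 1), with tf_j = n - j + 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    total = 0
    for j in range(1, n + 1):
        tf_j = n - j + 1  # number of character units under j-gram segmentation
        total += j * (tf_j - 1)
    return total


print(repeated_feature_value(4))  # 10
```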

In practical application scenarios, a sequential number usually contains at least three digits, whereas a repeated number contains at least two digits. That is, word segmentation is applied to sequential numbers of at least three digits and to repeated numbers of at least two digits. It follows that, when the feature value is determined, a sequential number counts as having one digit fewer than a repeated number.

Thus, when a sequential number is identified, the number of characters contained in the sequential number is determined, and the formula

S_s(n′) = S_c(n′-1)

can be used to determine the feature value of the sequential number,

where S_s is the feature value of the sequential number;

the argument n′ is the number of characters contained in the sequential number.

For example, the feature value of the five-digit sequential number "12345" is the same as that of a four-digit repeated number such as "8888". Using the above formula, the feature value of the sequential number "12345" is determined as:

S_s(5) = S_c(4) = 1*(4-1) + 2*(3-1) + 3*(2-1) + 4*(1-1) = 10.
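The relation between the two feature values can be sketched in Python as below. The helper `s_c` restates the repeated-number formula under the assumption tf_j = n - j + 1; both function names are hypothetical.

```python
def s_c(n):
    """Feature value of an n-digit repeated number (assumes tf_j = n - j + 1)."""
    return sum(j * (n - j) for j in range(1, n + 1))


def s_s(n_prime):
    """Feature value of a sequential number with n_prime digits: S_c(n_prime - 1)."""
    if n_prime < 3:
        raise ValueError("a sequential number has at least three digits")
    return s_c(n_prime - 1)


print(s_s(5))  # 10, same as s_c(4) for "8888"
print(s_s(4))  # 4, the value for "1234"
```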

After the feature values of the repeated digits and/or the sequential digits are determined, the formula

S₃ = w(S_c + S_s + 1)

can be used to determine the third component risk value corresponding to the third character set,

where S₃ is the third component risk value corresponding to the third character set;

w is the reciprocal of the probability that the identified repeated digits and/or sequential digits appear in the third digit sequence.

If only repeated digits or only sequential digits appear in the third digit sequence, the probability value of the repeated digits (or sequential digits) appearing in the third digit sequence is determined, and the reciprocal of the probability value is used as the weight value w of the feature character. If the repeated number and the sequential number appear in the third number sequence at the same time, determining the probability value of the repeated number and the sequential number appearing in the third number sequence at the same time, and taking the reciprocal of the probability value as the weight value of the characteristic character when the repeated number and the sequential number appear at the same time.

As an application example of the third method, assume that the mobile phone number is still 13812348888 and that it is bound to account a. When the server receives the registration of account a, the server identifies the mobile phone number 13812348888 bound to account a. The server divides the last eight digits of the mobile phone number into a third character set and arranges the digits in the character set according to their sequence in the mobile phone number, obtaining the third character sequence "12348888".

Obviously, the third character sequence "12348888" contains characteristic characters, namely both the sequential number "1234" and the repeated number "8888". To determine the weight value w of the characteristic characters, the probability that this sequential number and this repeated number occur together in an eight-digit sequence such as the third character sequence is determined.

Specifically, each position of the third character sequence can take 10 possible values, the digits 0 to 9, so the total number of arrangements of the digits at the eight positions of the third character sequence is 10⁸. Among these arrangements, the sequential number "1234" and the repeated number "8888" occur together in only two cases, "12348888" and "88881234", so the probability that the sequential number and the repeated number occur together in the third character sequence is 2/10⁸. According to the above definition, w is therefore 10⁸/2. Obviously, this value of w is large and inconvenient for subsequent calculation, so in practical applications the value of w may be simplified, for example by root extraction or by taking a logarithm. In this application example it is assumed that root extraction (7 times) is applied and that the simplified value is w ≈ 22.4.

Then, the server determines the feature values of the repeated number "8888" and the sequential number "1234" respectively. For the repeated number "8888", the feature value is S_c(4) = 10; for the sequential number "1234", the feature value is S_s(4) = S_c(3) = 4.

Thus, according to the above formula, the third component risk value of the third character sequence is S₃ = 22.4*(10+4+1) = 336.
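The arithmetic of this application example can be reproduced with a short Python sketch. The simplified weight 22.4 is taken as given from the example rather than derived, and the function name `third_component_risk` is hypothetical.

```python
def third_component_risk(w, s_c=0, s_s=0):
    """S3 = w * (S_c + S_s + 1); a feature value is 0 when that feature is absent."""
    return w * (s_c + s_s + 1)


# Unsimplified weight: reciprocal of the probability 2 / 10**8.
raw_w = 10**8 / 2

# Simplified weight assumed in the application example.
w = 22.4
s3 = third_component_risk(w, s_c=10, s_s=4)
print(raw_w)         # 50000000.0
print(round(s3, 6))  # 336.0
```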

As can be seen from the above example, when the third component risk value of the third character set is determined by the third method, the more digits the characteristic characters in the third character set contain, the larger the weight value and the feature value of the characteristic characters are, which indicates that the information to be recognized has a higher value. The larger the third component risk value, the more likely the information to be identified is normal information and the higher its risk of being stolen; conversely, the smaller the value, the more likely the information is abnormal information and the lower its risk of being stolen.

At this point, the three methods have respectively determined three component risk values of the information to be identified, so that an overall comprehensive risk value of the information to be identified can be determined from the component risk values. In the embodiment of the present application, determining the comprehensive risk value of the information to be identified specifically includes: taking the geometric mean of the component risk values corresponding to the character sets to obtain the comprehensive risk value of the information to be identified.

For example, following the examples of the first to third methods, the comprehensive risk value of the mobile phone number "13812348888" is the geometric mean of its three component risk values, that is, the cube root of the product S₁ × S₂ × S₃.

The larger the comprehensive risk value of the information to be identified, the higher its value and the greater the risk of its being stolen. Therefore, in practical applications, when the determined comprehensive risk value is larger than a preset risk value, the monitoring level of the information to be identified and of the account information bound to it can be adjusted, so as to prevent the information to be identified from being stolen.
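The geometric averaging step can be sketched in Python as follows. The function name `composite_risk` is hypothetical, and the sketch assumes that all component risk values are positive. The first component value used in the usage lines is a placeholder, not the value from the first method's example.

```python
import math


def composite_risk(component_values):
    """Geometric mean of the component risk values (all assumed positive)."""
    values = list(component_values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("component risk values must be positive")
    return math.prod(values) ** (1.0 / len(values))


s1 = 2.0  # placeholder only; the real value comes from the first method
composite = composite_risk([s1, 4.9, 336])
print(composite)
```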

In addition, suppose that the comprehensive risk value of information to be identified that is bound to certain account information has been determined by the above method, and that at some later time new information to be identified is bound to the same account information. If the comprehensive risk value of the new information is far lower than that of the original information, the account information is likely to have been stolen, and the monitoring level of the account information can be raised.

Of course, the mobile phone number is used above only as an example of the information to be identified. The information processing method based on risk identification provided in the embodiment of the present application may also be used to identify the risk of other information to be identified and to perform processing based on that risk; for example, the information to be identified may also be an email address, a certificate number, and the like.

Based on the same idea as the information processing method based on risk identification provided above, the embodiment of the present application further provides an information processing apparatus based on risk identification, as shown in Fig. 5.
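The two monitoring conditions described above can be sketched as a simple decision function in Python. The threshold of 50.0, the drop ratio of 0.1 used to model "far lower", and the function name `needs_closer_monitoring` are all hypothetical values chosen for illustration; the disclosure does not specify them.

```python
def needs_closer_monitoring(new_value, old_value=None,
                            threshold=50.0, drop_ratio=0.1):
    """Decide whether the monitoring level should be raised.

    threshold and drop_ratio are illustrative assumptions, not values
    taken from the disclosure.
    """
    # High-value information is at greater risk of being stolen.
    if new_value > threshold:
        return True
    # Newly bound information with a far lower value than the original
    # suggests that the account information may have been stolen.
    if old_value is not None and new_value < old_value * drop_ratio:
        return True
    return False


print(needs_closer_monitoring(120.0))                # True
print(needs_closer_monitoring(3.0, old_value=120.0)) # True
print(needs_closer_monitoring(30.0, old_value=40.0)) # False
```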

The information processing apparatus based on risk identification in Fig. 5 includes a character dividing module 501, a component risk value module 502, a comprehensive risk value module 503, and a processing module 504, wherein:

The character dividing module 501 is configured to divide characters included in the information to be recognized into different character sets.

The component risk value module 502 is configured to determine component risk values corresponding to the character sets respectively.

The comprehensive risk value module 503 is configured to determine a comprehensive risk value of the information to be identified according to the component risk value corresponding to each character set.

The processing module 504 is configured to process the information to be identified according to the comprehensive risk value.

The character dividing module 501 is specifically configured to divide the characters at specified positions in the information to be recognized into character sets, where the union of the character sets includes all the characters in the information to be recognized, and at least two character sets have an intersection.

As shown in Fig. 6, when determining the first component risk value, the component risk value module specifically includes:

The character arrangement sub-module 601 is configured to arrange the characters in the character set according to the sequence of the characters in the information to be recognized, so as to obtain a character sequence corresponding to the character set.

The first proportion sub-module 602 is configured to determine, as a first proportion, the proportion of information having the same character sequence among the pre-stored identified normal information.

The second proportion sub-module 603 is configured to determine, as a second proportion, the proportion of information having the same character sequence among the pre-stored identified abnormal information.

The ratio sub-module 604 is configured to determine the ratio of the first proportion to the second proportion.

The first component risk value sub-module 605 is configured to determine the first component risk value corresponding to the character set according to the ratio.

When the first component risk value is too large, in order to simplify subsequent operations, the first component risk value sub-module 605 is specifically configured to determine the logarithm of the ratio and to determine the first component risk value corresponding to the character set according to the logarithm.

In another manner of the embodiment of the present application, the first component risk value sub-module 605 is specifically configured to use the sum of the logarithm and a preset adjustment constant as the first component risk value corresponding to the character set.

As shown in Fig. 7, when determining the second component risk value, the component risk value module specifically includes:

The character arrangement sub-module 701 is configured to arrange the characters in the character set according to the sequence of the characters in the information to be recognized, so as to obtain a character sequence corresponding to the character set.

The account information sub-module 702 is configured to determine, among the pre-stored identified information, the account information corresponding to each piece of identified information that contains the character sequence.

The service level sub-module 703 is configured to determine the service level of each piece of account information and to count the number of pieces of account information at each service level.

The proportion sub-module 704 is configured to determine, for each service level, the proportion of account information at that service level among all the account information.

The second component risk value sub-module 705 is configured to determine the second component risk value corresponding to the character set according to the service level of each piece of account information and the proportion of account information at each service level.

As shown in Fig. 8, when determining the third component risk value, the component risk value module specifically includes:

The character arrangement sub-module 801 is configured to arrange the characters in the character set according to the sequence of the characters in the information to be identified, so as to obtain a character sequence corresponding to the character set.

The recognition sub-module 802 is configured to recognize the characteristic characters in the character sequence.

The feature character sub-module 803 is configured to determine a weight value and a feature value of the feature character when the feature character is recognized.

The third component risk value sub-module 804 is configured to determine the third component risk value corresponding to the character set according to the weight value and the feature value of the feature character.

Wherein the characteristic characters comprise repeated characters and/or sequential characters.

The feature character sub-module 803 is specifically configured to: determining the probability of the characteristic character appearing in the character sequence, determining the weight value of the characteristic character according to the probability, performing word segmentation on the characteristic character to obtain character units, and determining the characteristic value of the characteristic character according to the number of the obtained character units.

In one scenario of the embodiment of the application, the information to be recognized is specifically a mobile phone number to be recognized, and the character set is specifically a number set formed by a plurality of digits contained in the mobile phone number to be recognized. The character dividing module 501 is specifically configured to divide the first three digits of the mobile phone number to be recognized into a first character set, the first seven digits into a second character set, and the last eight digits into a third character set.
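The division performed in this scenario can be sketched in Python as follows. The function name `split_mobile_number` and the check for an eleven-digit number are assumptions made for illustration. Note that the resulting sets overlap and together cover every digit of the number.

```python
def split_mobile_number(number):
    """Return the first, second and third digit sequences of a mobile number.

    Assumes an eleven-digit number, as in the application examples.
    """
    if len(number) != 11 or not number.isdigit():
        raise ValueError("expected an eleven-digit mobile phone number")
    first = number[:3]    # first three digits
    second = number[:7]   # first seven digits
    third = number[-8:]   # last eight digits
    return first, second, third


print(split_mobile_number("13812348888"))
# ('138', '1381234', '12348888')
```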

In this scenario, when determining the first component risk value, the component risk value module is specifically configured to: for the first character set, arrange the digits in the first character set according to the sequence of the digits in the mobile phone number to be recognized, so as to obtain a first digit sequence corresponding to the first character set;

use the formula S₁ = log(p₁/p₂) + c to determine the first component risk value corresponding to the first character set;

where S₁ is the first component risk value corresponding to the first character set;

p₁ is the proportion of mobile phone numbers containing the first digit sequence among the pre-stored identified normal mobile phone numbers;

p₂ is the proportion of mobile phone numbers containing the first digit sequence among the pre-stored identified abnormal mobile phone numbers;

c is a preset constant value.
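A minimal Python sketch of the first component risk value follows. It assumes the form described for sub-module 605, namely the logarithm of the ratio of the two proportions plus a preset adjustment constant. The use of the natural logarithm, the default c = 0.0 and the function name `first_component_risk` are assumptions, since the disclosure does not fix the base of the logarithm or the value of c.

```python
import math


def first_component_risk(p1, p2, c=0.0):
    """S1 = log(p1 / p2) + c, using the natural logarithm (an assumption).

    p1: proportion among identified normal numbers
    p2: proportion among identified abnormal numbers
    c:  preset adjustment constant (illustrative default)
    """
    if p1 <= 0 or p2 <= 0:
        raise ValueError("both proportions must be positive")
    return math.log(p1 / p2) + c


# Equal proportions give a ratio of 1, so the result is just the constant c.
print(first_component_risk(0.02, 0.02, c=3.0))  # 3.0
```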

When determining the second component risk value, the component risk value module is specifically configured to: for the second character set, arrange the digits in the second character set according to the sequence of the digits in the mobile phone number to be recognized, so as to obtain a second digit sequence corresponding to the second character set;

determine, among the pre-stored identified information, the account information corresponding to each identified mobile phone number that contains the second digit sequence;

determine the service level of each piece of account information;

use the formula S₂ = Σ_i w(i) × Prob(i) to determine the second component risk value corresponding to the second character set;

where S₂ is the second component risk value corresponding to the second character set;

w(i) is the i-th service level among the determined service levels;

Prob(i) is the proportion of the account information having the i-th service level among all the determined account information.

When determining the third component risk value, the component risk value module is specifically configured to: for the third character set, arrange the digits in the third character set according to the sequence of the digits in the mobile phone number to be recognized, so as to obtain a third digit sequence corresponding to the third character set;

identify repeated digits and/or sequential digits in the third digit sequence;

when a repeated number is identified, perform word segmentation on the repeated number to obtain different digit units, and use the formula S_c(n) = Σ_{j=1}^{n} j * (tf_j - 1) to determine the feature value of the repeated number;

where S_c is the feature value of the repeated number, and n is the number of digits contained in the repeated number;

tf_j is the number of character units obtained after the repeated number is segmented by the j-th word segmentation method;

j denotes the j-th word segmentation method, in which each digit unit obtained contains j characters;

when a sequential number is identified, determine the number of characters contained in the sequential number, and use the formula S_s(n′) = S_c(n′-1) to determine the feature value of the sequential number;

where S_s is the feature value of the sequential number;

n′ is the number of characters contained in the sequential number;

use the formula S₃ = w(S_c + S_s + 1) to determine the third component risk value corresponding to the third character set;

where S₃ is the third component risk value corresponding to the third character set.

After the first to third component risk values are determined, the comprehensive risk value module is specifically configured to take the geometric mean of the component risk values corresponding to the character sets to obtain the comprehensive risk value of the information to be identified.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may take the form of volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a series of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Moreover, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. An information processing method based on risk identification, characterized by comprising:

dividing characters contained in information to be identified into different character sets, wherein the information to be identified is user information which is bound with account information and used for authentication and identification, the union set of the different character sets contains all characters in the information to be identified, and at least two character sets have intersection;

determining a first component risk value, a second component risk value and a third component risk value corresponding to the different character sets respectively, based on the probability of occurrence of the characters in the different character sets, the proportion under specific conditions and the weights of the characters, wherein the first component risk value, the second component risk value and the third component risk value are quantized values of the risk degrees corresponding to the different character sets respectively;

determining a comprehensive risk value of the information to be identified according to the first component risk value, the second component risk value and the third component risk value corresponding to the different character sets, wherein the comprehensive risk value reflects the overall risk of the information to be identified as a whole;

2. The method of claim 1, wherein dividing the characters included in the information to be recognized into different character sets specifically comprises:

dividing the characters at specified positions in the information to be recognized into character sets.

3. The method of claim 1, wherein determining the first component risk values corresponding to the different character sets respectively comprises:

arranging the characters in the character set according to the sequence of the characters in the information to be recognized to obtain a character sequence corresponding to the character set;

determining, as a first proportion, the proportion of information having the same character sequence among the pre-stored identified normal information;

determining, as a second proportion, the proportion of information having the same character sequence among the pre-stored identified abnormal information;

determining the ratio of the first proportion to the second proportion;

and determining the first component risk value corresponding to the character set according to the ratio.

4. The method of claim 3, wherein determining the first component risk value corresponding to the character set according to the ratio comprises:

determining a logarithmic value of the ratio;

and determining the first component risk value corresponding to the character set according to the logarithmic value.

5. The method of claim 4, wherein determining the first component risk value corresponding to the character set according to the logarithmic value specifically comprises:

taking the sum of the logarithmic value and a preset adjustment constant as the first component risk value corresponding to the character set.

6. The method of claim 1, wherein determining the second component risk values corresponding to the different character sets respectively comprises:

determining account information corresponding to the recognized information containing the character sequence from the pre-stored recognized information;

determining the service level of each account information;

counting the number of account information of different service levels according to the service level of each account information;

determining, among the account information, the respective proportions of account information at the different service levels;

and determining a second component risk value corresponding to the character set according to the service level of each account information and the ratio of account information of different service levels.
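One way to read claim 6 is as a proportion-weighted combination of service levels. The sketch below assumes a weighted sum of a per-level weight and the share of accounts at that level; the claim itself only says the value is determined "according to" both quantities. The function name and the weight table are hypothetical.

```python
from collections import Counter


def second_component_risk(levels, level_weight):
    """Combine service levels and their shares (assumed weighted sum).

    levels: service level of each account tied to the character sequence.
    level_weight: hypothetical mapping from service level to a numeric weight.
    """
    if not levels:
        raise ValueError("no account information for this character sequence")
    counts = Counter(levels)   # number of accounts per service level
    total = len(levels)
    # Share of each level multiplied by that level's weight, then summed.
    return sum(level_weight[lvl] * n / total for lvl, n in counts.items())
```

For example, with the illustrative weights `{"low": 1.0, "mid": 2.0, "high": 4.0}` and accounts at levels `["high", "high", "low", "mid"]`, the shares are 0.5, 0.25 and 0.25.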

7. The method of claim 1, wherein determining the third component risk values corresponding to the different character sets respectively comprises:

identifying characteristic characters in the character sequence;

when the characteristic character is identified, determining a weight value and a characteristic value of the characteristic character;

and determining a third component risk value corresponding to the character set according to the weight value and the characteristic value of the characteristic character.

8. The method of claim 7, wherein determining the weight value and the characteristic value of the characteristic character specifically comprises:

determining the probability of the characteristic character appearing in the character sequence;

determining the weight value of the characteristic character according to the probability;

performing word segmentation on the characteristic characters to obtain character units;

and determining the characteristic value of the characteristic character according to the obtained number of the character units.

9. The method according to claim 1, wherein the information to be identified is specifically: a mobile phone number to be identified;

the character set is specifically: a digit set composed of a plurality of digits contained in the mobile phone number to be identified.

10. The method of claim 9, wherein dividing characters included in the identity information to be recognized into different character sets specifically comprises:

dividing the first three digits contained in the mobile phone number to be identified into a first character set;

dividing the first seven digits contained in the mobile phone number to be identified into a second character set;

and dividing the last eight digits contained in the mobile phone number to be identified into a third character set.
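The division in claim 10 can be sketched as simple slicing. The function name and the 11-digit length check are assumptions made for the example; the claim itself fixes only the three slices.

```python
def split_number(mdn):
    """Split a mobile number into the three overlapping digit sets of claim 10."""
    # Assumption: an 11-digit number written with ASCII digits only.
    if len(mdn) != 11 or not (mdn.isascii() and mdn.isdigit()):
        raise ValueError("expected an 11-digit number")
    first = mdn[:3]    # first three digits
    second = mdn[:7]   # first seven digits
    third = mdn[-8:]   # last eight digits
    return first, second, third
```

Under the 11-digit assumption, the first seven and last eight digits overlap in four positions, and the three sets together cover every digit of the number.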

11. The method of claim 10, wherein determining the first component risk values corresponding to the different character sets respectively comprises:

for the first character set, arranging the digits in the first character set according to the order of the digits in the mobile phone number to be identified, to obtain a first digit sequence corresponding to the first character set;

using the formula S₁ = log(p₁ / p₂) + c to determine the first component risk value corresponding to the first character set;

wherein S₁ is the first component risk value corresponding to the first character set;

p₁ is the proportion of mobile phone numbers containing the first digit sequence among the pre-stored mobile phone numbers identified as normal;

p₂ is the proportion of mobile phone numbers containing the first digit sequence among the pre-stored mobile phone numbers identified as abnormal;

c is a preset constant.

12. The method according to claim 10, wherein determining the second component risk values corresponding to the different character sets respectively comprises:

aiming at a second character set, arranging the digits in the second character set according to the sequence of the digits in the mobile phone number to be recognized to obtain a second digit sequence corresponding to the second character set;

determining the service level of each account information;

using a formula to determine a second component risk value corresponding to the second character set;

wherein S₂ is the second component risk value corresponding to the second character set;

w(i) is the i-th of the determined service levels;

Prob(i) is the proportion of account information at the i-th service level among the account information.

13. The method of claim 10, wherein determining the third component risk values corresponding to the different character sets respectively comprises:

aiming at a third character set, arranging the digits in the third character set according to the sequence of the digits in the mobile phone number to be recognized to obtain a third digit sequence corresponding to the third character set;

identifying repeating and/or sequential digits in the third digit sequence;

when repeated digits are identified, segmenting the repeated digits to obtain different digit units, and using a formula to determine a characteristic value of the repeated digits;

wherein S_c is the characteristic value of the repeated digits;

tf_j is the number of digit units obtained after the repeated digits are segmented;

j denotes the j-th segmentation method, in which each digit unit obtained contains j characters;

n is the number of digits contained in the repeated digits;

when sequential digits are identified, determining the number of characters contained in the sequential digits, and using a formula to determine a characteristic value of the sequential digits;

wherein S_s is the characteristic value of the sequential digits;

n′ is the number of characters contained in the sequential digits;

using a formula to determine a third component risk value corresponding to the third character set;

wherein S₃ is the third component risk value corresponding to the third character set;

w is the inverse of the probability that the identified repeated and sequential digits appear in the third digit sequence.
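The identification step of claim 13 (finding repeated and sequential digits in the third digit sequence) can be sketched as below. Only the detection is shown; the scoring formulas for S_c, S_s and S₃ are not reproduced here. The function name, the minimum run length of three, and the treatment of both ascending and descending runs as "sequential" are assumptions.

```python
def find_runs(digits, min_len=3):
    """Return (kind, run) pairs for maximal repeated or sequential digit runs."""
    runs = []
    # step 0 finds repeats such as "888"; steps +1 and -1 find "1234" and "4321".
    for kind, steps in (("repeat", (0,)), ("sequence", (1, -1))):
        for step in steps:
            start = 0
            for i in range(1, len(digits) + 1):
                # The run continues while each digit differs from the
                # previous digit by exactly `step`.
                if i < len(digits) and int(digits[i]) - int(digits[i - 1]) == step:
                    continue
                if i - start >= min_len:
                    runs.append((kind, digits[start:i]))
                start = i
    return runs
```

Each returned run gives the digit count (n for repeated digits, n′ for sequential digits) that the characteristic-value formulas take as input.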

14. The method of any one of claims 1 to 13, wherein determining the comprehensive risk value of the information to be identified according to the first, second and third component risk values corresponding to the different character sets comprises:

geometrically averaging the first component risk value, the second component risk value and the third component risk value corresponding to the different character sets to obtain the comprehensive risk value of the information to be identified.
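The geometric averaging in claim 14 can be sketched as the cube root of the product of the three component values. The function name is hypothetical, and the requirement that all three values be positive is an assumption of this sketch, since a real-valued geometric mean is not defined otherwise.

```python
def comprehensive_risk(s1, s2, s3):
    """Geometric mean of the three component risk values (illustrative)."""
    if min(s1, s2, s3) <= 0:
        # Assumption: component values have been shifted to be positive,
        # for example by the adjustment constant in the first component.
        raise ValueError("geometric mean requires positive component values")
    return (s1 * s2 * s3) ** (1.0 / 3.0)
```

Unlike an arithmetic mean, a geometric mean is pulled strongly toward the smallest component, so a single very low component value lowers the combined value noticeably.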

15. A risk identification-based information processing apparatus, characterized by comprising:

the character dividing module is used for dividing characters contained in information to be identified into different character sets, wherein the information to be identified is user information which is bound with account information and used for authentication and identification, the union set of the different character sets contains all characters in the information to be identified, and at least two character sets have intersection;

a component risk value module, configured to determine first, second and third component risk values corresponding to the different character sets respectively, based on probabilities of occurrence of characters in the different character sets, ratios under specific conditions, and weights of the characters, where the first, second and third component risk values are quantized values of the risk degrees corresponding to the different character sets respectively;

a comprehensive risk value module, configured to determine a comprehensive risk value of the information to be identified according to the first component risk value, the second component risk value, and the third component risk value corresponding to the different character sets, where the comprehensive risk value reflects the overall risk of the information to be identified as a whole;

16. The apparatus of claim 15, wherein the character dividing module is specifically configured to divide the characters at the designated positions in the information to be recognized into character sets.

17. The apparatus of claim 15, wherein the component risk value module specifically comprises:

the character arrangement submodule is used for arranging the characters in the character set according to the sequence of the characters in the information to be recognized to obtain a character sequence corresponding to the character set;

a first ratio submodule, configured to determine, as a first ratio, the proportion of information having the same character sequence among the pre-stored information identified as normal;

a second ratio submodule, configured to determine, as a second ratio, the proportion of information having the same character sequence among the pre-stored information identified as abnormal;

a ratio submodule, configured to determine the ratio of the first ratio to the second ratio;

and a first component risk value submodule, configured to determine the first component risk value corresponding to the character set according to the ratio.

18. The apparatus of claim 17, wherein the first component risk value submodule is further configured to determine a logarithmic value of the ratio, and determine the first component risk value corresponding to the character set according to the logarithmic value.

19. The apparatus of claim 18, wherein the first component risk value submodule is configured to use the sum of the logarithmic value and a preset adjustment constant as the first component risk value corresponding to the character set.

20. The apparatus of claim 15, wherein the component risk value module specifically comprises:

the account information submodule is used for determining each account information corresponding to the identified information containing the character sequence in each pre-stored identified information;

the business grade submodule is used for determining the business grade of each account information and counting the quantity of the account information with different business grades according to the business grade of each account information;

the proportion submodule is used for respectively determining the proportion of the account information of different service levels in each account information;

and the second component risk value submodule is used for determining a second component risk value corresponding to the character set according to the service level of each account information and the proportion of the account information of different service levels.

21. The apparatus of claim 15, wherein the component risk value module specifically comprises:

the recognition submodule is used for recognizing the characteristic characters in the character sequence;

the characteristic character submodule is used for determining a weight value and a characteristic value of the characteristic character when the characteristic character is identified;

and the third component risk value submodule is used for determining a third component risk value corresponding to the character set according to the weight value and the characteristic value of the characteristic character.

22. The apparatus of claim 21, wherein the characteristic character submodule is specifically configured to: determining the probability of the characteristic character appearing in the character sequence; determining the weight value of the characteristic character according to the probability; performing word segmentation on the characteristic characters to obtain character units; and determining the characteristic value of the characteristic character according to the obtained number of the character units.

23. The apparatus according to claim 15, wherein the information to be identified is specifically: a mobile phone number to be identified;

24. The apparatus of claim 23, wherein the character division module is specifically configured to:

25. The apparatus of claim 24, wherein the component risk value module is specifically configured to: for the first character set, arrange the digits in the first character set according to the order of the digits in the mobile phone number to be identified, to obtain a first digit sequence corresponding to the first character set;

use a formula to determine the first component risk value corresponding to the first character set, wherein

c is a preset constant value.

26. The apparatus of claim 24, wherein the component risk value module is specifically configured to: aiming at a second character set, arranging the digits in the second character set according to the sequence of the digits in the mobile phone number to be recognized to obtain a second digit sequence corresponding to the second character set;

determining the service level of each account information;

using a formula

27. The apparatus of claim 24, wherein the component risk value module is specifically configured to: aiming at a third character set, arranging the digits in the third character set according to the sequence of the digits in the mobile phone number to be recognized to obtain a third digit sequence corresponding to the third character set;

identifying repeating and/or sequential digits in the third digit sequence;

determining a characteristic value of the repeated digits;

n is the number of digits contained in the repeated digits;

when sequential digits are identified, determining the number of characters contained in the sequential digits, and using a formula to determine a characteristic value of the sequential digits;

n′ is the number of characters contained in the sequential digits;

using a formula

28. The apparatus according to any one of claims 15 to 27, wherein the comprehensive risk value module is specifically configured to geometrically average the first component risk value, the second component risk value, and the third component risk value corresponding to the different character sets to obtain the comprehensive risk value of the information to be identified.