Detailed Description
For purposes of clarity, technical solutions and advantages of the present application, the present application will be described in detail and in full with reference to specific embodiments of the present application and accompanying drawings.
Fig. 1 is an information processing process based on risk identification according to an embodiment of the present application, where the process specifically includes the following steps:
S101, dividing the characters contained in the information to be recognized into different character sets.
In the scenario of the embodiment of the application, after a user registers account information (e.g., a network account), the user information of the user is bound to the account information so that identification and authentication can be performed during corresponding operations. Therefore, the information to be identified in the embodiment of the present application is specifically the user information that is bound to the account information and used for authentication and identification. The information to be identified includes but is not limited to: the user's mobile phone number, certificate number, etc.
Take the 11-digit mobile phone number 13812348888 as an example. The first three digits "138" represent the attribute type of the mobile phone number; from these three digits, the telecom operator and the corresponding service type to which the mobile phone number belongs can be determined. The fourth to seventh digits "1234" are the Home Location Register (HLR) identification code; from these four digits, the user information corresponding to the mobile phone number (such as its home location information and call priority information) can be determined. The last four digits "8888" represent the user number, from which a specific user can be determined.
Therefore, in the above step S101, the characters that carry a certain meaning in the information to be recognized may be divided into different character sets.
It should be noted that, in step S101, dividing the characters into character sets specifically means dividing the characters at designated positions in the information to be recognized into a character set. Characters at different designated positions are divided into different character sets, so that a plurality of different character sets is obtained. The union of the character sets includes all the characters in the information to be recognized, and at least two character sets have an intersection.
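The division in step S101 can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function name `split_into_character_sets` and the use of 0-based positions are assumptions, and the three position groups (first three digits, first seven digits, last eight digits) are the ones used in the mobile phone examples of this description.

```python
from itertools import combinations


def split_into_character_sets(info, position_groups):
    """Return one {position: character} mapping per group of designated positions.

    Hypothetical helper: it also checks the two properties stated for S101.
    """
    covered = set().union(*position_groups)
    if covered != set(range(len(info))):
        raise ValueError("the character sets must jointly cover every character")
    if not any(set(a) & set(b) for a, b in combinations(position_groups, 2)):
        raise ValueError("at least two character sets must intersect")
    return [{pos: info[pos] for pos in group} for group in position_groups]


# Position groups (0-based): first three, first seven, and last eight digits.
groups = [range(0, 3), range(0, 7), range(3, 11)]
character_sets = split_into_character_sets("13812348888", groups)
```

Keeping the position next to each character makes it possible to restore the original order later, which step S201 relies on.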
S102, respectively determining the component risk values corresponding to the character sets.
After the characters that carry meaning are divided into different character sets, the component risk value of each character set is determined one by one. A component risk value is a quantized value of the degree of risk corresponding to a character set.
It should be noted that the component risk value in the embodiment of the present application reflects the value degree of the characters in the character set, and reflects the risk degree through the value degree.
Specifically, still taking the above mobile phone number 13812348888 as an example, suppose the last four digits "8888" are divided into a character set. The probability that four consecutive digits are all identical is very small, so the value degree of the character set containing these four digits is very high. In an actual application scenario, information to be identified that contains such a character set is more likely to be stolen; that is, the risk of theft associated with the character set is higher.
S103, determining a comprehensive risk value of the information to be identified according to the component risk value corresponding to each character set.
Because the characters contained in each character set are all characters of the information to be recognized, the degree of risk of the information as a whole can be reflected by the degrees of risk of the individual character sets. That is, the comprehensive risk value of the whole information to be recognized can be determined according to the component risk value corresponding to each character set. Of course, in the embodiment of the present application, the comprehensive risk value may be derived from the component risk values in a plurality of ways, such as accumulation or averaging, and this is not limited in this application.
S104, processing the information to be identified according to the comprehensive risk value.
In this embodiment of the application, the comprehensive risk value reflects the degree of risk of the information to be identified. Specifically, the larger the comprehensive risk value, the higher the degree of risk and the greater the security threat to the information to be identified. For example, information to be identified whose comprehensive risk value is excessively high needs to be processed in combination with a corresponding risk control system; the processing may consist of raising the security monitoring level, adding security protection measures, and the like. In practical application, a corresponding risk threshold may be preset, and when the determined comprehensive risk value of the information to be identified is higher than the risk threshold, corresponding risk control processing is performed on the information to be identified.
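Steps S103 and S104 can be sketched as follows, assuming the two aggregation methods named above (accumulation and averaging) and a preset threshold. The function names and the `method` parameter are hypothetical, and the numbers used in any call are illustrative only.

```python
def comprehensive_risk(component_values, method="sum"):
    """Combine component risk values by accumulation ("sum") or averaging ("mean")."""
    if not component_values:
        raise ValueError("at least one component risk value is required")
    total = sum(component_values)
    if method == "sum":
        return total
    if method == "mean":
        return total / len(component_values)
    raise ValueError("method must be 'sum' or 'mean'")


def needs_risk_control(component_values, threshold, method="sum"):
    """Return True when the comprehensive risk value exceeds the preset threshold."""
    return comprehensive_risk(component_values, method) > threshold
```

Which aggregation is used, and how the threshold is chosen, is left open by the description; the sketch only fixes the comparison "higher than the risk threshold".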
Through the steps, the characters with corresponding meanings in the information to be recognized are divided into different character sets, after the component risk values corresponding to the character sets are determined, the comprehensive risk value corresponding to the information to be recognized can be accurately determined, without depending on subjective judgment, and when the component risk values corresponding to the character sets are determined, the pre-stored recognized information is used as a basis, so that the actual value degree of the information to be recognized can be more accurately reflected.
In the embodiment of the present application, since the characters in different character sets have different meanings, different manners will be adopted when determining the component risk values corresponding to different character sets. Specifically, the method comprises the following steps:
The first method comprises the following steps:
As shown in fig. 2, the process of determining the component risk value corresponding to each character set in the first method specifically includes:
S201, arranging the characters in the character set according to the sequence of the characters in the information to be recognized to obtain a character sequence corresponding to the character set.
For example, the first to third digits of the mobile phone number are 1, 3 and 8, respectively. If the first, second and third digits of the mobile phone number are designated and divided into a character set without regard to order, an order such as 381 or 813 may be formed. In that case, the three digits in the character set no longer represent the attribute type of the mobile phone number, so the component risk value corresponding to the character set cannot be accurately determined.
Therefore, in the embodiment of the present application, after the characters in the information to be recognized are divided into different character sets, the characters divided into the character sets are arranged, so that the characters conform to the sequence of the characters in the information to be recognized, that is, the character sequence corresponding to the character set is obtained after the characters are arranged, and the meaning of the characters is not changed.
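A minimal sketch of step S201, assuming the character set is represented by the 0-based positions of its characters (the function name `character_sequence` is hypothetical):

```python
def character_sequence(info, positions):
    """Arrange the designated characters in the order they occupy in `info`."""
    return "".join(info[pos] for pos in sorted(set(positions)))


# The designated positions may arrive in any order; the sequence is still "138".
first_sequence = character_sequence("13812348888", [2, 0, 1])
```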
S202, determining, as the first ratio, the ratio of information having the same character sequence among the pre-stored identified normal information.
In an actual application scenario, account information and the information bound to it are both stored in corresponding devices (e.g., servers). Illegal operations, such as an account being stolen and then used, may occur on account information; the corresponding device monitors whether such illegal operations occur and accordingly identifies the information bound to the account information as normal information or abnormal information. Of course, in practical applications, whether each piece of identified information is normal may also be determined by existing approaches such as network behavior monitoring and analysis, which does not constitute a limitation to the present application.
Therefore, in the embodiment of the present application, each piece of pre-stored identified normal information may be information that is pre-stored in the corresponding device and is identified as normal, for example, in a certain website, different mobile phone numbers bound to different account information are identified as normal mobile phone numbers after being subjected to corresponding identification processing, which are each piece of pre-stored identified normal information.
Information containing the above character sequence may appear among the identified normal information or among the identified abnormal information. Therefore, the proportion of the identified normal information that contains the character sequence is counted (the first ratio).
S203, determining, as the second ratio, the ratio of information having the same character sequence among the pre-stored identified abnormal information.
Similarly to the first ratio, the pre-stored identified abnormal information may be information that is pre-stored in the corresponding device and identified as abnormal, such as blacklisted mobile phone numbers obtained after the corresponding identification processing.
S204, determining the ratio of the first ratio to the second ratio.
The ratio of the first ratio to the second ratio indicates how likely information containing the character sequence is to be normal information or abnormal information. Specifically, if the ratio is much greater than 1, that is, the first ratio is much greater than the second ratio, then the share of identified normal information containing the character sequence is much greater than the share of identified abnormal information containing it, so information containing the character sequence can be determined to be very likely normal information.
S205, determining the first component risk value corresponding to the character set according to the ratio.
It should be noted that, in an actual application scenario, because the amount of pre-stored identified information is huge, the ratio of the first ratio to the second ratio may be large, which increases the computation required in subsequent processing. To simplify the computation, this embodiment of the present application may apply a logarithm to the ratio. That is, in step S205, determining the first component risk value corresponding to the character set according to the ratio specifically comprises: determining the logarithm of the ratio, and determining the first component risk value corresponding to the character set according to the logarithm. If the logarithm were used directly as the first component risk value, it could be less than zero (the logarithm of a number smaller than 1 is negative), which would introduce a certain error into the comprehensive risk value determined from the first component risk value.
Therefore, more specifically, determining the first component risk value corresponding to the character set according to the logarithm means using the sum of the logarithm and a preset adjusting constant as the first component risk value, so that the preset adjusting constant offsets the error caused when the logarithm is less than zero.
The adjusting constant is preset so that, for all character sets, the sum of the logarithm of the ratio of the first ratio to the second ratio and the adjusting constant is a value greater than zero; a value less than zero does not occur.
In a scenario provided in the embodiment of the present application, if the information to be identified is a mobile phone number to be identified, the character set is a number set formed by several digits contained in that mobile phone number. In this case, when the first three digits of the mobile phone number to be identified are divided into a first character set, the digits in the first character set are arranged according to their order in the mobile phone number to obtain a first digit sequence corresponding to the first character set. In combination with the first method, the formula
S_1 = log(p_1 / p_2) + C
may then be used to determine the first component risk value corresponding to the first character set.
Wherein S_1 is the first component risk value corresponding to the first character set.
p_1 is the ratio of mobile phone numbers containing the first digit sequence among the pre-stored identified normal mobile phone numbers.
p_2 is the ratio of mobile phone numbers containing the first digit sequence among the pre-stored identified abnormal mobile phone numbers.
C is a preset constant value.
The first method is described in detail below using an application example:
Assume that the mobile phone number is still 13812348888 and that it is bound to account A. After the server receives the registration of account A, the server identifies the mobile phone number 13812348888 bound to account A. The server divides the first three digits of the mobile phone number into a first character set, and arranges the digits in the character set according to the order of the first three digits in the mobile phone number to obtain a first character sequence "138".
Assume that 10000 mobile phone numbers identified as normal are pre-stored in the server (in practical application the number of accounts stored in the server is huge; 10000 is used only for convenience of description), and that 5000 of these 10000 normal mobile phone numbers contain the first character sequence "138". The first ratio p_1 of mobile phone numbers containing the first character sequence "138" among the normal mobile phone numbers is therefore p_1 = 5000/10000 = 0.5.
Assume that 100 mobile phone numbers identified as abnormal are pre-stored in the server, and that 2 of these 100 abnormal mobile phone numbers contain the first character sequence "138". The second ratio p_2 of mobile phone numbers containing the first character sequence "138" among the pre-stored abnormal mobile phone numbers is therefore p_2 = 2/100 = 0.02.
After the first ratio p_1 and the second ratio p_2 are obtained, the ratio of p_1 to p_2 can be determined: p_1/p_2 = 0.5/0.02 = 25. Since the ratio is much greater than 1, a mobile phone number containing the character sequence "138" is very likely a normal mobile phone number.
Meanwhile, assuming that the adjusting constant C is 8, the first component risk value of the first character set is, according to the above formula, S_1 = log(25) + 8.
As can be seen from the above example, determining the component risk value of the character set from the first ratio and the second ratio makes it possible to quantify more accurately the possibility that the information to be identified is normal information or abnormal information. The higher the first component risk value, the higher the possibility that the information to be identified is normal information and the higher its risk of being stolen; conversely, the lower the first component risk value, the higher the possibility that the information is abnormal information and the lower its risk of being stolen.
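The first method (S202 to S205) can be sketched as follows. Several points are assumptions rather than statements of the application: the function names are hypothetical, the character sequence is matched at a fixed starting position, the logarithm defaults to the natural logarithm because the description does not name a base, and a ratio of zero is rejected rather than smoothed.

```python
import math


def share_with_sequence(records, sequence, start=0):
    """Fraction of records whose characters beginning at `start` equal `sequence`."""
    if not records:
        raise ValueError("no pre-stored records")
    hits = sum(1 for record in records if record.startswith(sequence, start))
    return hits / len(records)


def first_component_risk(p1, p2, c, base=math.e):
    """Logarithm of the ratio p1/p2 plus the preset adjusting constant c.

    The base is an assumption; the description only says "logarithm".
    """
    if p1 <= 0 or p2 <= 0:
        raise ValueError("both ratios must be positive for the logarithm")
    return math.log(p1 / p2, base) + c


# Values from the application example: p1 = 0.5, p2 = 0.02, C = 8.
s1 = first_component_risk(0.5, 0.02, 8)
```

With these inputs the ratio is 25, so `s1` equals the logarithm of 25 plus 8 in whichever base is chosen.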
The second method comprises the following steps:
As shown in fig. 3, the process of determining the component risk value corresponding to each character set in the second method specifically includes:
S301, arranging the characters in the character set according to the sequence of the characters in the information to be recognized to obtain a character sequence corresponding to the character set.
Similar to the first method, when the characters at the designated positions in the information to be recognized are divided into a character set, they are not necessarily placed in the character set in their original order. Therefore, the characters divided into the character set are arranged to obtain the character sequence corresponding to the character set.
S302, determining, among the pre-stored identified information, the account information corresponding to each piece of identified information containing the character sequence.
Since each account information is bound to the corresponding information in the embodiment of the present application, the account information bound to any piece of identified information can be uniquely determined.
In the second method, "each account information" below refers to the account information corresponding to the identified information containing the character sequence.
S303, determining the service level of each account information.
In a practical application scenario, a user can use his or her own account information to obtain various business services. The more business services are obtained through a piece of account information, the more often the user uses that account information, and the higher the possibility that it is normal account information. Accordingly, a level may be preset for each business service.
For example, if the level of the business service associated with a bank card is set to 5 in advance, and a certain account information binds the corresponding bank card and opens the business associated with the bank card, then the business level corresponding to the account information is 5.
Of course, if a plurality of business services are used by one piece of account information, the business level of the account information is the sum of the levels of those business services. For example, if two business services are opened for a piece of account information, and the levels of the two business services are 3 and 4 respectively, then the business level of the account information is 7.
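The summation rule above can be sketched as follows. The function name and the service names are hypothetical placeholders; only the levels 5, 3 and 4 come from the examples in the text.

```python
def business_level(opened_services, service_levels):
    """Sum the preset levels of every business service opened for an account."""
    return sum(service_levels[name] for name in opened_services)


# Hypothetical service names with the preset levels used in the examples.
preset_levels = {"bank_card": 5, "service_a": 3, "service_b": 4}
level = business_level(["service_a", "service_b"], preset_levels)
```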
In practical applications, the determination of the business level of the account information is not limited to the above-mentioned manner, and the business level of the account information may be determined according to the activity of the account information, the frequency of the business service used by the account information, and the like, which does not limit the present application.
S304, counting the number of the account information with different service levels according to the service level of each account information.
Generally, the types of business services are limited, and there are many cases where the same business service is used for each account information, that is, the business levels of the account information are the same. In the embodiment of the application, the number of account information with the same service level needs to be determined, so after the service level of each account information is determined, the number of account information corresponding to each service level is counted.
S305, in each account information, the ratio of account information of different business grades is determined.
Once the number of account information corresponding to each service level is known, the proportion of the account information of each service level among all the account information corresponding to the identified information containing the character sequence can be determined. This proportion directly reflects the degree to which the account information uses business services.
S306, determining a second component risk value corresponding to the character set according to the service level of each account information and the proportion of the account information with different service levels.
After the service level of each account information and the proportion of the account information with different service levels are determined, the service level distribution of all the identified information containing the character sequence can be indicated.
In a scenario provided in this embodiment of the present application, if the information to be identified is a mobile phone number to be identified, the character set is a number set formed by several digits contained in that mobile phone number. In this case, when the first seven digits of the mobile phone number to be identified are divided into a second character set, the digits in the second character set are arranged according to their order in the mobile phone number to obtain a second digit sequence corresponding to the second character set. In combination with the second method, the formula
S_2 = Σ_i w(i) * Prob(i)
may then be used to determine the second component risk value corresponding to the second character set.
Wherein S_2 is the second component risk value corresponding to the second character set.
w(i) is the i-th service level among the determined service levels.
Prob(i) is the proportion of account information having the i-th service level among all the determined account information.
It should be noted that, in the embodiment of the present application, the first seven digits of the eleven-digit mobile phone number are divided into the second character set because the first three digits together with the fourth to seventh digits identify mobile phone numbers that share the same call priority under a given attribute type (e.g., the same carrier), or that share the same home location under a given attribute type. That is, mobile phone numbers having the same features can be identified by the first seven digits.
The second method is specifically described by using application examples as follows:
assuming that the cell phone number is still 13812348888, the cell phone number is bound to account a, and then when the server receives the registration of account a, the server identifies the cell phone number 13812348888 bound to account a. The server divides the first seven digits of the mobile phone number into a second character set, and arranges the digits in the character set according to the sequence of the first seven digits in the mobile phone number to obtain a second character sequence '1381234'.
The server will determine all cell phone numbers containing the second character sequence "1381234" among the pre-stored recognized cell phone numbers. It is assumed that the number of mobile phone numbers containing the second character sequence "1381234" is 1000 in total. Then, the server will determine the account information bound to the 1000 mobile phone numbers respectively, and correspondingly, the server will determine 1000 account information.
Then, the server determines the service levels of the 1000 pieces of account information according to a preset service level standard. The server may determine the service level according to the service used by the account information, and of course, the server may determine the service level of the account information in various manners such as a preset level standard of each service, and during actual application, the server may adjust and set according to the actual application requirements, which does not limit the present application.
Assume that two service levels are present among the 1000 pieces of account information: 900 pieces have the 1st service level w(1), with w(1) = 5, and 100 pieces have the 2nd service level w(2), with w(2) = 4. Then the proportion of account information with service level 5 among the 1000 pieces is Prob(1) = 900/1000 = 0.9, and the proportion of account information with service level 4 is Prob(2) = 100/1000 = 0.1.
Thus, according to the formula, the server can determine the second component risk value corresponding to the second character set containing the second character sequence "1381234": S_2 = 0.9 * 5 + 0.1 * 4 = 4.9. The second component risk value is close to the service level w(1); that is, the service level of the account information corresponding to mobile phone numbers containing the second character sequence "1381234" is generally at the level of w(1).
As can be seen from the above example, the account information corresponding to the identified information containing the character sequence is determined, the service level of the account information is determined, the degree of the service used by the account information can be reflected, and meanwhile, the service level of the account information corresponding to the identified information containing the character sequence can be integrally quantized by combining the counted number of account information corresponding to different service levels. The larger the second component risk value is, the higher the possibility that the information to be identified is normal information is and the higher the possibility that the information is at risk of theft is, and conversely, the higher the possibility that the information is abnormal information is and the lower the possibility that the information is at risk of theft is.
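Steps S304 to S306 can be sketched as follows, assuming the service level of each matching account has already been determined in S303. The function name `second_component_risk` is hypothetical; the data reproduce the application example (900 accounts at level 5 and 100 accounts at level 4).

```python
from collections import Counter


def second_component_risk(account_levels):
    """Weighted sum of service levels, each weighted by its share of accounts."""
    if not account_levels:
        raise ValueError("no account information matched the character sequence")
    counts = Counter(account_levels)   # S304: number of accounts per service level
    total = len(account_levels)
    # S305 and S306: share of each level, then the sum of level * share.
    return sum(level * count / total for level, count in counts.items())


s2 = second_component_risk([5] * 900 + [4] * 100)
```

The result is the mean service level of the matching accounts, which is why it lands close to the dominant level w(1) = 5 in the example.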
The third method comprises the following steps:
As shown in fig. 4, the process of determining the component risk value corresponding to each character set in the third method specifically includes:
S401, arranging the characters in the character set according to the sequence of the characters in the information to be recognized to obtain a character sequence corresponding to the character set.
Similar to the first method and the second method, the characters in the character set are arranged after the corresponding characters are divided into the character set.
S402, identifying characteristic characters in the character sequence.
In the embodiment of the application, the characteristic characters include repeated characters and/or sequential characters. Repeated characters are at least two consecutive identical characters, such as aaa, bb, and cccc. Sequential characters are at least three consecutive characters arranged in a certain character order, such as abcd, 789, 321, and 1234.
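One possible way to identify characteristic characters in step S402 is sketched below. It is only an illustration under stated assumptions: the function name is hypothetical, "a certain character order" is interpreted as neighbouring code points that rise or fall by exactly one, and only maximal runs are reported.

```python
def find_feature_characters(sequence):
    """Return (repeated, sequential) lists of maximal runs found in `sequence`."""

    def maximal_runs(step_ok, min_len):
        runs, start = [], 0
        for i in range(1, len(sequence) + 1):
            # A run ends at the end of the string or where the step rule breaks.
            if i == len(sequence) or not step_ok(sequence[i - 1], sequence[i]):
                if i - start >= min_len:
                    runs.append(sequence[start:i])
                start = i
        return runs

    repeated = maximal_runs(lambda a, b: a == b, 2)                 # e.g. "8888"
    ascending = maximal_runs(lambda a, b: ord(b) - ord(a) == 1, 3)  # e.g. "1234"
    descending = maximal_runs(lambda a, b: ord(a) - ord(b) == 1, 3) # e.g. "321"
    return repeated, ascending + descending


repeated, sequential = find_feature_characters("12348888")
```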
In addition, for the recognition of the characteristic character, a character recognition algorithm in the prior art can be adopted, and the method does not constitute a limitation to the application.
S403, when the characteristic character is recognized, determining the weight value and the characteristic value of the characteristic character.
In a character sequence, different characters can be arranged and combined in a great number of ways. Most character arrangements are random and unordered, and characteristic characters arise in only a few cases; that is, characteristic characters occur with a certain probability.
Therefore, in the embodiment of the present application, the weight value of the feature character is quantized according to the probability of occurrence of the feature character, and the feature value of the feature character is quantized according to the number of characters included in the feature character. That is, the step S403 specifically includes: determining the probability of the characteristic character appearing in the character sequence; determining the weight value of the characteristic character according to the probability; performing word segmentation on the characteristic characters to obtain character units; and determining the characteristic value of the characteristic character according to the obtained number of the character units.
In this embodiment, when the N-gram language model is used to segment the characteristic character, the characteristic character is first divided into the smallest character units (N = 1), and the number of characters in each character unit is then increased step by step until the characteristic character is divided into a single character unit (N equal to the number of characters contained in the characteristic character).
For example, the characteristic character 8888 is segmented using the N-gram language model. Under the 1-gram word segmentation method, the characteristic character is divided into 4 character units: 8, 8, 8 and 8. Under the 2-gram word segmentation method, it is divided into 3 character units: 88, 88 and 88. Under the 3-gram word segmentation method, it is divided into 2 character units: 888 and 888. Under the 4-gram word segmentation method, it is divided into 1 character unit: 8888.
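The segmentation in the example above corresponds to sliding a window of N characters over the characteristic character. A minimal sketch, with the hypothetical function name `ngram_units`:

```python
def ngram_units(text, n):
    """Split `text` into its overlapping character units of length n."""
    if not 1 <= n <= len(text):
        raise ValueError("n must be between 1 and the number of characters")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Number of character units for N = 1..4 on the characteristic character "8888".
unit_counts = [len(ngram_units("8888", n)) for n in range(1, 5)]
```

A characteristic character of length L therefore yields L - N + 1 character units under the N-gram method.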
S404, determining a third component risk value corresponding to the character set according to the weight value and the characteristic value of the characteristic character.
For the third method, in a scenario provided in this embodiment of the application, when the information to be identified is a mobile phone number to be identified and the last eight digits of that mobile phone number are divided into a third character set, the digits in the third character set are arranged according to their order in the mobile phone number to obtain a third digit sequence corresponding to the third character set.
When a repeated number is identified, it is segmented to obtain different digit units. In this case, the formula
S_c(n) = Σ_{j=1..n} j * (tf_j - 1)
may be used to determine the feature value of the repeated number.
Wherein S_c(n) is the feature value of the repeated number, and the argument n represents the number of digits contained in the repeated number.
tf_j is the number of character units obtained after the repeated number is segmented using the j-th word segmentation method.
j represents the j-th word segmentation method; each digit unit obtained by the j-th word segmentation method contains j characters. That is, j is the value of N when the N-gram language model is used for word segmentation.
For example, continuing the example above in which the characteristic character "8888" is segmented with the N-gram language model, the above formula gives the feature value of the repeated digits "8888" as:
$S_c(4)=1\times(4-1)+2\times(3-1)+3\times(2-1)+4\times(1-1)=10$.
Here, for the term $2\times(3-1)$, the characteristic character "8888" is divided into 3 character units 88, 88 and 88 under the 2-gram word segmentation method; the number "2" is the number of characters contained in each character unit, and the number "3" is the number of character units. The other terms in the formula are obtained by analogy.
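The arithmetic of this example can be checked with a short Python sketch. It assumes, as in the "8888" example, that a run of n repeated characters yields n - j + 1 character units under the j-gram word segmentation method; the function name `repeated_feature_value` is hypothetical.

```python
def repeated_feature_value(n: int) -> int:
    """Sum j * (tf_j - 1) over j = 1..n, where tf_j = n - j + 1."""
    total = 0
    for j in range(1, n + 1):
        tf_j = n - j + 1  # number of j-character units in an n-character run
        total += j * (tf_j - 1)
    return total


# Four repeated digits, e.g. "8888": 1*(4-1) + 2*(3-1) + 3*(2-1) + 4*(1-1)
print(repeated_feature_value(4))
```

Under the same assumption, a three-character run gives 1*(3-1) + 2*(2-1) + 3*(1-1) = 4.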
In practical application scenarios, a sequential number usually contains at least three digits, whereas a repeated number contains at least two digits. That is, word segmentation starts from at least three digits for a sequential number and from at least two digits for a repeated number. It can be seen that, when the feature values are determined, the number of characters counted for a sequential number is one fewer than the number counted for a repeated number.
Thus, when a sequential number is identified, the number of characters contained in the sequential number is determined, and the feature value of the sequential number may be determined by the formula
$S_s(n')=S_c(n'-1)$
where $S_s$ is the feature value of the sequential number, and the argument $n'$ is the number of characters contained in the sequential number.
For example, when the feature value of the five-digit sequential number "12345" is determined, it is the same as the feature value of a four-digit repeated number such as "8888". Using the above formula, the feature value of the sequential number "12345" is:
$S_s(5)=S_c(4)=1\times(4-1)+2\times(3-1)+3\times(2-1)+4\times(1-1)=10$.
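The relationship between the two feature values can be sketched as follows. The sketch assumes that a run of n repeated characters yields n - j + 1 units under j-gram segmentation; both function names are hypothetical.

```python
def repeated_feature_value(n: int) -> int:
    """Feature value of n repeated characters: sum of j * (tf_j - 1)."""
    return sum(j * ((n - j + 1) - 1) for j in range(1, n + 1))


def sequential_feature_value(n_prime: int) -> int:
    """Feature value of a sequential number with n_prime characters.

    A sequential number is scored like a repeated number that is one
    character shorter.
    """
    return repeated_feature_value(n_prime - 1)


print(sequential_feature_value(5))  # "12345" is scored like "8888"
```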
After the feature values of the repeated digits and/or the sequential digits are determined, the formula
$S_3=w\,(S_c+S_s+1)$
can be used to determine the third component risk value corresponding to the third character set,
where $S_3$ is the third component risk value corresponding to the third character set, and
$w$ is the reciprocal of the probability that the identified repeated digits and sequential digits appear in the third digit sequence.
If only repeated digits or only sequential digits appear in the third digit sequence, the probability value of the repeated digits (or sequential digits) appearing in the third digit sequence is determined, and the reciprocal of the probability value is used as the weight value w of the feature character. If the repeated number and the sequential number appear in the third number sequence at the same time, determining the probability value of the repeated number and the sequential number appearing in the third number sequence at the same time, and taking the reciprocal of the probability value as the weight value of the characteristic character when the repeated number and the sequential number appear at the same time.
The third method is described below using an application example:
Assume that the mobile phone number is still 13812348888 and that it is bound to account A. When the server receives the registration of account A, it identifies the mobile phone number 13812348888 bound to account A. The server divides the last eight digits of the mobile phone number into a third character set and arranges the digits in the character set in the order in which they appear in the mobile phone number, obtaining the third character sequence "12348888".
Obviously, the third character sequence "12348888" contains feature characters, namely both the sequential number "1234" and the repeated number "8888". To determine the weight value w of the feature characters, the probability that the sequential number and the repeated number appear together in an eight-digit number such as the third character sequence is determined.
Specifically, each position of the third character sequence can take any of the 10 digits 0 to 9, so the eight positions of the third character sequence admit $10^8$ arrangements in total. Among these arrangements, the sequential number "1234" and the repeated number "8888" appear together in only two cases: "12348888" and "88881234". The probability that the sequential number and the repeated number appear together in the third character sequence is therefore $2/10^8$, and, according to the above formula, $w=10^8/2$. Obviously, this value of w is large and inconvenient for subsequent calculation, so in practical application, the value of w may be simplified by squaring and taking a logarithm, and it is assumed that in this application example, the value of w is squared 7 times, so that the simplified value of w ≈ 22.4.
Then, the server determines the feature values of the repeated number "8888" and the sequential number "1234" respectively. For the repeated number "8888", the feature value is $S_c(4)=10$; for the sequential number "1234", the feature value is $S_s(4)=S_c(3)=4$.
Thus, according to the above formula, the third component risk value of the third character sequence is $S_3=22.4\times(10+4+1)=336$.
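The application example can be retraced with a small Python sketch. The function names are hypothetical. The simplified weight 22.4 is taken directly from the text as a given input; the simplification step that produces it is not reproduced here.

```python
def weight_from_probability(favourable: int, total: int) -> float:
    """Weight w as the reciprocal of the probability favourable / total."""
    if favourable <= 0 or total <= 0:
        raise ValueError("counts must be positive")
    return total / favourable


def third_component_risk(w: float, s_c: int = 0, s_s: int = 0) -> float:
    """Third component risk value: w * (S_c + S_s + 1)."""
    return w * (s_c + s_s + 1)


# Two favourable arrangements ("12348888", "88881234") out of 10**8.
raw_w = weight_from_probability(2, 10 ** 8)

# Simplified weight quoted in the application example (taken as given).
simplified_w = 22.4
s3 = third_component_risk(simplified_w, s_c=10, s_s=4)
print(raw_w, round(s3, 1))
```

Passing `s_c=0` or `s_s=0` covers the case in which only sequential digits or only repeated digits are identified.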
As can be seen from the above example, when the third component risk value of the third character set is determined by the third method, the more digits the feature characters in the third character set contain, the larger the weight value and the feature value of the feature characters, which indicates that the information to be identified is more valuable. The larger the third component risk value, the more likely the information to be identified is normal information and the higher its risk of being stolen; conversely, the smaller the third component risk value, the more likely the information is abnormal information and the lower its risk of being stolen.
At this point, the three methods have respectively determined three component risk values of the information to be identified, so that an overall comprehensive risk value of the information to be identified can be determined from the component risk values. In the embodiment of the present application, determining the comprehensive risk value of the information to be identified specifically includes: taking the geometric mean of the component risk values corresponding to the character sets to obtain the comprehensive risk value of the information to be identified.
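For example, following the examples of the first to third methods above, the comprehensive risk value of the mobile phone number "13812348888" is obtained as the geometric mean of its three component risk values, that is, $\sqrt[3]{S_1 S_2 S_3}$.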
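The geometric averaging step can be sketched in Python as follows. The function name `comprehensive_risk` is hypothetical, and the sketch assumes that all component risk values are positive, since a geometric mean is otherwise undefined. The input values in the usage line are purely illustrative and are not taken from the embodiment.

```python
import math
from typing import Iterable


def comprehensive_risk(component_values: Iterable[float]) -> float:
    """Geometric mean of the component risk values (assumed positive)."""
    values = list(component_values)
    if not values:
        raise ValueError("at least one component risk value is required")
    if any(v <= 0 for v in values):
        raise ValueError("component risk values must be positive")
    # Average the logarithms, then exponentiate, to avoid large products.
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Illustrative values only: the geometric mean of 2, 4 and 8 is close to 4.
print(round(comprehensive_risk([2.0, 4.0, 8.0]), 6))
```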
The larger the comprehensive risk value of the information to be identified, the more valuable the information and the greater the risk of its being stolen. In practical application, therefore, when the determined comprehensive risk value of the information to be identified is larger than a preset risk value, the monitoring level of the information to be identified and of the account information bound to it can be adjusted accordingly, so as to prevent the information to be identified from being stolen.
In addition, suppose that after the comprehensive risk value of the information to be identified bound to a piece of account information has been determined by the above method, new information to be identified is bound to that account information at some later time, and the comprehensive risk value of the new information to be identified is far lower than that of the original information to be identified. In that case, the account information is likely to have been stolen, and the monitoring level of the account information can be raised.
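The two monitoring conditions described above can be sketched as a single check. The embodiment does not specify the preset risk value or how much lower counts as "far lower", so the parameters `preset_risk_value` and `drop_factor`, their defaults, and the function name are all assumptions made for illustration.

```python
from typing import Optional


def should_raise_monitoring(new_value: float,
                            previous_value: Optional[float] = None,
                            preset_risk_value: float = 100.0,
                            drop_factor: float = 0.1) -> bool:
    """Decide whether the monitoring level should be raised.

    preset_risk_value and drop_factor are hypothetical thresholds.
    """
    # Condition 1: the information itself is valuable enough to be a target.
    if new_value > preset_risk_value:
        return True
    # Condition 2: newly bound information scores far below the original,
    # which may indicate that the account has been taken over.
    if previous_value is not None and new_value < previous_value * drop_factor:
        return True
    return False


print(should_raise_monitoring(150.0))
print(should_raise_monitoring(5.0, previous_value=90.0))
```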
Of course, the mobile phone number is used above only as an example of the information to be identified. The information processing method based on risk identification provided in the embodiment of the present application may also be used to identify the risk of other information to be identified and to perform processing based on that risk; for example, the information to be identified may also be an email address, a certificate number, and the like.
Based on the same idea as the information processing method based on risk identification provided above, the embodiment of the present application further provides an information processing apparatus based on risk identification, as shown in fig. 5.
The information processing apparatus based on risk identification in fig. 5 includes a character division module 501, a component risk value module 502, a comprehensive risk value module 503, and a processing module 504, wherein:
The character division module 501 is configured to divide the characters included in the information to be identified into different character sets.
The component risk value module 502 is configured to determine the component risk value corresponding to each character set.
The comprehensive risk value module 503 is configured to determine a comprehensive risk value of the information to be identified according to the component risk value corresponding to each character set.
The processing module 504 is configured to process the information to be identified according to the comprehensive risk value.
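The cooperation of the four modules can be sketched as a simple pipeline. This is one possible arrangement only, not the apparatus of fig. 5 itself: the class name, field names, and the placeholder callables in the usage example (in particular the scoring rule and the threshold 5) are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RiskIdentificationApparatus:
    divide: Callable[[str], Sequence[str]]        # role of module 501
    component_risk: Callable[[int, str], float]   # role of module 502
    combine: Callable[[Sequence[float]], float]   # role of module 503
    process: Callable[[str, float], str]          # role of module 504

    def run(self, info: str) -> str:
        char_sets = self.divide(info)
        components = [self.component_risk(i, s) for i, s in enumerate(char_sets)]
        overall = self.combine(components)
        return self.process(info, overall)


# Placeholder callables, for illustration only.
apparatus = RiskIdentificationApparatus(
    divide=lambda info: [info[:3], info[:7], info[-8:]],
    component_risk=lambda index, chars: float(len(chars)),  # dummy score
    combine=lambda values: math.prod(values) ** (1 / len(values)),
    process=lambda info, value: "monitor" if value > 5 else "normal",
)
result = apparatus.run("13812348888")
print(result)
```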
The character division module 501 is specifically configured to divide the characters at specified positions in the information to be identified into respective character sets, where the union of the character sets includes all the characters in the information to be identified, and at least two of the character sets intersect.
In the embodiment of the present application, since the characters in different character sets have different meanings, different manners will be adopted when determining the component risk values corresponding to different character sets. Specifically, the method comprises the following steps:
As shown in fig. 6, when determining the first component risk value, the component risk value module specifically includes:
The character arrangement sub-module 601 is configured to arrange the characters in the character set according to the sequence of the characters in the information to be identified, so as to obtain a character sequence corresponding to the character set.
The first proportion sub-module 602 is configured to determine, as a first proportion, the proportion of information having the same character sequence among the pre-stored normal information.
The second proportion sub-module 603 is configured to determine, as a second proportion, the proportion of information having the same character sequence among the pre-stored identified abnormal information.
The ratio sub-module 604 is configured to determine the ratio of the first proportion to the second proportion.
The first component risk value sub-module 605 is configured to determine the first component risk value corresponding to the character set according to the ratio.
When the first component risk value is too large, in order to simplify subsequent operations, the first component risk value sub-module 605 is specifically configured to determine the logarithm of the ratio and to determine the first component risk value corresponding to the character set according to the logarithm.
In another manner of the embodiment of the present application, the first component risk value sub-module 605 is specifically configured to use the sum of the logarithm and a preset adjustment constant as the first component risk value corresponding to the character set.
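The calculation performed by sub-modules 604 and 605 can be sketched as follows. The base of the logarithm is not stated in this section, so the natural logarithm used here is an assumption; the function name and the sample proportions are hypothetical.

```python
import math


def first_component_risk(p_normal: float, p_abnormal: float, c: float = 0.0) -> float:
    """Logarithm of the ratio of the two proportions, plus an adjustment constant.

    p_normal:   proportion among pre-stored normal information
    p_abnormal: proportion among pre-stored identified abnormal information
    c:          preset adjustment constant (natural log base is an assumption)
    """
    if p_normal <= 0 or p_abnormal <= 0:
        raise ValueError("both proportions must be positive")
    return math.log(p_normal / p_abnormal) + c


# Illustrative proportions only: the sequence is four times as common
# among normal information as among abnormal information.
print(first_component_risk(0.04, 0.01, c=1.0))
```

A value above c indicates that the character sequence is more common among normal information than among abnormal information.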
As shown in fig. 7, when determining the second component risk value, the component risk value module specifically includes:
The character arrangement sub-module 701 is configured to arrange the characters in the character set according to the sequence of the characters in the information to be identified, so as to obtain a character sequence corresponding to the character set.
The account information sub-module 702 is configured to determine, among the pre-stored identified information, the account information corresponding to each piece of identified information that contains the character sequence.
The service level sub-module 703 is configured to determine the service level of each piece of account information and to count the number of pieces of account information at each service level.
The proportion sub-module 704 is configured to determine, for each service level, the proportion of account information of that service level among all the account information.
The second component risk value sub-module 705 is configured to determine a second component risk value corresponding to the character set according to the service level of each piece of account information and the proportions of account information of the different service levels.
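The counting and proportion steps of sub-modules 703 and 704 can be sketched as follows. How sub-module 705 combines the service levels with their proportions is not spelled out in this section; the proportion-weighted sum of numeric service levels used below is an assumption made for illustration, and the function name is hypothetical.

```python
from collections import Counter
from typing import Sequence


def second_component_risk(service_levels: Sequence[int]) -> float:
    """Combine service levels with their proportions.

    Assumption: each service level is a number, and the component risk
    value is the sum of level * proportion over the distinct levels.
    """
    if not service_levels:
        raise ValueError("at least one account is required")
    counts = Counter(service_levels)      # accounts per service level
    total = len(service_levels)
    return sum(level * (count / total) for level, count in counts.items())


# Four accounts matched the character sequence, with levels 1, 1, 2 and 4.
print(second_component_risk([1, 1, 2, 4]))
```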
As shown in fig. 8, when determining the third component risk value, the component risk value module specifically includes:
The character arrangement sub-module 801 is configured to arrange the characters in the character set according to the sequence of the characters in the information to be identified, so as to obtain a character sequence corresponding to the character set.
The recognition sub-module 802 is configured to recognize the feature characters in the character sequence.
The feature character sub-module 803 is configured to determine a weight value and a feature value of the feature characters when the feature characters are recognized.
The third component risk value sub-module 804 is configured to determine a third component risk value corresponding to the character set according to the weight value and the feature value of the feature characters.
Wherein the characteristic characters comprise repeated characters and/or sequential characters.
The feature character sub-module 803 is specifically configured to: determining the probability of the characteristic character appearing in the character sequence, determining the weight value of the characteristic character according to the probability, performing word segmentation on the characteristic character to obtain character units, and determining the characteristic value of the characteristic character according to the number of the obtained character units.
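Recognition of repeated and sequential digits, as performed by sub-module 802, can be sketched as a scan for runs. The sketch assumes that a sequential number ascends by exactly 1 from digit to digit, that repeated runs need at least two digits, and that sequential runs need at least three; the function name and parameters are hypothetical.

```python
def find_feature_runs(digits: str, min_repeat: int = 2, min_sequential: int = 3):
    """Return (repeated_runs, sequential_runs) found in a digit string."""
    repeated, sequential = [], []

    def scan(step: int, min_len: int, out: list) -> None:
        # Collect maximal runs in which each digit exceeds the previous by `step`.
        start = 0
        for i in range(1, len(digits) + 1):
            if i == len(digits) or int(digits[i]) - int(digits[i - 1]) != step:
                if i - start >= min_len:
                    out.append(digits[start:i])
                start = i

    scan(0, min_repeat, repeated)        # equal neighbours: repeated digits
    scan(1, min_sequential, sequential)  # neighbours differing by 1: sequential
    return repeated, sequential


print(find_feature_runs("12348888"))
```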
In one scenario of the embodiment of the application, the information to be identified is specifically a mobile phone number to be identified, and each character set is specifically a digit set formed by a plurality of digits contained in the mobile phone number to be identified. The character division module 501 is specifically configured to divide the first three digits contained in the mobile phone number to be identified into a first character set, divide the first seven digits contained in the mobile phone number to be identified into a second character set, and divide the last eight digits contained in the mobile phone number to be identified into a third character set.
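The division described in this scenario can be sketched with string slicing. The function name is hypothetical, and the check for an 11-digit number follows the mobile phone number format used in the examples of this application.

```python
def divide_mobile_number(number: str) -> tuple[str, str, str]:
    """Split an 11-digit mobile number into the three character sets."""
    if len(number) != 11 or not number.isdigit():
        raise ValueError("expected an 11-digit mobile phone number")
    first = number[:3]     # first three digits
    second = number[:7]    # first seven digits
    third = number[-8:]    # last eight digits
    return first, second, third


print(divide_mobile_number("13812348888"))
```

The three sets overlap (the first set is contained in the second, and the second shares digits four to seven with the third), and together they cover every digit of the number.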
In this scenario, when determining the first component risk value, the component risk value module is specifically configured to: for the first character set, arrange the digits in the first character set according to the sequence of the digits in the mobile phone number to be identified, so as to obtain a first digit sequence corresponding to the first character set;
determine, using the formula $S_1=\log(p_1/p_2)+c$, a first component risk value corresponding to the first character set;
where $S_1$ is the first component risk value corresponding to the first character set;
$p_1$ is the proportion of mobile phone numbers containing the first digit sequence among the pre-stored normal mobile phone numbers;
$p_2$ is the proportion of mobile phone numbers containing the first digit sequence among the pre-stored identified abnormal mobile phone numbers;
$c$ is a preset constant.
When determining the second component risk value, the component risk value module is specifically configured to: for the second character set, arrange the digits in the second character set according to the sequence of the digits in the mobile phone number to be identified, so as to obtain a second digit sequence corresponding to the second character set;
determine, among the pre-stored identified information, the account information corresponding to each identified mobile phone number that contains the second digit sequence;
determine the service level of each piece of account information;
determine, using the formula for $S_2$, a second component risk value corresponding to the second character set;
where $S_2$ is the second component risk value corresponding to the second character set;
$w(i)$ denotes the i-th service level among the service levels;
$prob(i)$ is the proportion of account information of the i-th service level among all the account information.
When determining the third component risk value, the component risk value module is specifically configured to: for the third character set, arrange the digits in the third character set according to the sequence of the digits in the mobile phone number to be identified, so as to obtain a third digit sequence corresponding to the third character set;
identify repeated digits and/or sequential digits in the third digit sequence;
when repeated digits are identified, perform word segmentation on the repeated digits to obtain different digit units, and determine, using the formula $S_c(n)=\sum_{j=1}^{n} j\,(tf_j-1)$, a feature value of the repeated digits;
where $S_c$ is the feature value of the repeated digits, and $n$ is the number of digits contained in the repeated digits;
$tf_j$ is the number of character units obtained after the repeated digits are segmented by the j-th word segmentation method;
$j$ denotes the j-th word segmentation method, and each digit unit obtained by the j-th word segmentation method contains j characters;
when a sequential number is identified, determine the number of characters contained in the sequential number, and determine, using the formula $S_s(n')=S_c(n'-1)$, a feature value of the sequential number;
where $S_s$ is the feature value of the sequential number;
$n'$ is the number of characters contained in the sequential number;
determine, using the formula $S_3=w\,(S_c+S_s+1)$, a third component risk value corresponding to the third character set;
where $S_3$ is the third component risk value corresponding to the third character set;
$w$ is the reciprocal of the probability that the identified repeated digits and sequential digits appear in the third digit sequence.
After the first to third component risk values are determined, the comprehensive risk value module is specifically configured to take the geometric mean of the component risk values corresponding to the character sets to obtain the comprehensive risk value of the information to be identified.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a series of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Moreover, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.