CN113886784A

CN113886784A - Password guessing method for improving guessing efficiency of small training set based on corpus

Info

Publication number: CN113886784A
Application number: CN202111478071.9A
Authority: CN
Inventors: 甘晓春; 陈猛; 陈虎; 李东
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2021-12-06
Filing date: 2021-12-06
Publication date: 2022-01-04
Anticipated expiration: 2041-12-06
Also published as: CN113886784B

Abstract

The invention discloses a corpus-based password guessing method for improving guessing efficiency with a small training set, and relates to the technical field of data processing and prediction. The method comprises the following steps: constructing a corpus Γ; based on the corpus Γ, training on the training password set PWD_TRAIN to obtain the training results, namely the probability q(r) of each rule r in the password guessing rule set R and the probability p(w) of each word w in Γ; generating a dictionary D(S) with S guesses according to the training results and the corpus Γ; and measuring the cracking rate of D(S) on the test password set PWD_TEST. By using the corpus Γ to expand the vocabulary found in the training set PWD_TRAIN, the invention can effectively improve the cracking rate on the test password set when the training set is small.

Description

Password guessing method for improving guessing efficiency of small training set based on corpus

Technical Field

The invention relates to the technical field of data processing and prediction, in particular to a password guessing method for improving guessing efficiency of a small training set based on a corpus.

Background

The basic method of password guessing is to try passwords that the user may have used until the correct password is found, or until a predetermined number of guesses is reached and the attempt is abandoned. To improve guessing efficiency, passwords that the user is more likely to have chosen must therefore be guessed first. Existing password guessing methods mainly include brute force, dictionary transformation, Markov models, and Probabilistic Context-Free Grammars (PCFG).

Brute force is the most traditional password guessing method; its main defect is that only short passwords can be guessed. Because the total number of guesses is limited, brute-force guessing over the full set of keyboard characters usually does not go beyond 9 characters, and brute-force guessing over lowercase letters and digits only usually does not go beyond 11 characters.

The dictionary transformation method (Emin Islam Tatli, "Cracking more password hashes with patterns", IEEE Trans. on Information Forensics and Security, vol.10, no.8, pp.1656-1665, 2015.) transforms a source password into passwords to be guessed according to password transformation rules (e.g., the rockyou-30000 rule base in oclHashcat). This method is very common in practice, but its effectiveness depends on the source password set, and it cannot make an effective guess for a password that has no counterpart in the source password set.

The Markov model method (Jerry Ma, Weining Yang, Min Luo, Ninghui Li, "A study of probabilistic password models", in Proc. IEEE Symposium on Security and Privacy, pp.689-704, 2014; Markus Dürmuth, Fabian Angelstorf, Claude Castelluccia, Daniele Perito, Abdelberi Chaabane, "OMEN: Faster password guessing using an ordered Markov enumerator", in Proc. the 7th Symposium on ESSoS, pp.119-132, 2015.) builds a transition probability matrix between the letters of a training password set and uses it to predict the probability of the next letter. Its main strength is that it does not depend on a corpus: it discovers common words in passwords on its own and handles common variants of those words effectively. Its disadvantages are that a high-order Markov process is required to "remember" longer words, and that the semantics of the model are not well defined.

The PCFG method (Matt Weir, Sudhir Aggarwal, Breno de Medeiros, Bill Glodek, "Password cracking using probabilistic context-free grammars", in Proc. 30th IEEE Symposium on Security and Privacy, 2009, pp.391-405.) is divided into a training phase and a guessing phase. In the training phase, the passwords of the training set are segmented according to character type, and the probability of each structure and of each word is obtained by counting. For example, the password "spring2021!" corresponds to the structure [A6][D4][S1], in which A6 represents a string of 6 letters, D4 represents 4 digits, and S1 represents 1 special symbol. If 10 out of 1000 training passwords have the structure [A6][D4][S1], the probability of this structure is 0.01. The PCFG method assumes that the probability of a string being a password = the probability of its structure × the probability of each word in the string. In the guessing phase, a NEXT algorithm is used to generate the strings to be guessed in descending order of probability. The improved PCFG method (Shiva Houshmand, Sudhir Aggarwal, Randy Flood, "Next Gen PCFG Password Cracking", IEEE Trans. on Information Forensics and Security, vol.10, no.8, pp.1776-1791, 2015.) adds sets of keyboard strings and applies Laplace smoothing to the vocabulary frequencies of the corpus. This partly compensates for the limitation of segmenting by character type in the original PCFG method and further enriches the content of the corpus. Although the PCFG method generates dictionaries slowly, its guessing efficiency can be estimated effectively with the Monte Carlo sampling method (Dell'Amico, M. & Filippone, M., Monte Carlo Strength Evaluation: Fast and Reliable Password Checking, Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, ACM, 2015, 158-169.).
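The character-type segmentation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the implementation of the cited PCFG work; the function names `char_class`, `pcfg_structure` and `structure_probabilities` are hypothetical, and treating every non-letter, non-digit character as class S is an assumption.

```python
from collections import Counter
from itertools import groupby


def char_class(ch: str) -> str:
    """Map a character to a PCFG class label: A (letter), D (digit), S (other)."""
    if ch.isalpha():
        return "A"
    if ch.isdigit():
        return "D"
    return "S"


def pcfg_structure(password: str) -> str:
    """Split a password into runs of one character type, e.g. [A6][D4][S1]."""
    return "".join(
        f"[{label}{len(list(run))}]"
        for label, run in groupby(password, key=char_class)
    )


def structure_probabilities(passwords):
    """Relative frequency of each structure in a training set."""
    counts = Counter(pcfg_structure(pwd) for pwd in passwords)
    return {structure: n / len(passwords) for structure, n in counts.items()}


print(pcfg_structure("spring2021!"))
print(pcfg_structure("ilovemike"))
```

A password made only of lowercase letters, such as "ilovemike", collapses into a single letter segment under this scheme, which is the limitation the description goes on to discuss.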

Overseas researchers (Ji, S.; Yang, S.; Hu, X.; Han, W.; Li, Z. & Beyah, R., Zero-Sum Password Cracking Game: A Large-Scale Empirical Study on the Crackability, Correlation, and Security of Passwords, IEEE Transactions on Dependable and Secure Computing, 2017, 14, 550-564; Ur, B.; Segreti, S. M.; Bauer, L.; Christin, N.; Cranor, L. F.; Komanduri, S.; Kurilova, D.; Mazurek, M. L.; Melicher, W. & Shay, R., Measuring Real-World Accuracies and Biases in Modeling Password Guessability, 2015.) have systematically evaluated password guessing methods and found that the PCFG method is among the most effective, and that it also adapts to password sets of different language types. Therefore, the PCFG method has gradually become the mainstream method in academic research on password guessing. Furthermore, the PCFG method can also be used for targeted attacks (Wang, D.; Zhang, Z.; Wang, P.; Yan, J. & Huang, X., Targeted Online Password Guessing: An Underestimated Threat, Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, ACM, 2016.), i.e., to generate a set of password guesses from combinations of a user's personal information.

The PCFG method combines password information at two levels, structure and vocabulary, and is comparatively efficient. It nevertheless has significant limitations, mainly in the following respects: 1) The description of the password structure uses only the character type to delimit the words of a password, so a password formed from several words of the same type is hard to split. For example, the password "ilovemike" consists entirely of lowercase letters and is treated as a single word by the PCFG, so the inherent structure of the password is not captured. 2) Except for keyboard strings, the vocabulary generated by the PCFG comes only from the training set. 3) The structures generated are likewise only those that appear in the training set. As a direct result, the PCFG method depends heavily on the training set, and the guessing dictionary it generates cannot contain words or structural patterns that do not appear in the training set.

In summary, methods such as PCFG offer high guessing efficiency and adapt to multiple languages. However, existing research on password guessing has mainly been carried out on large-scale real password sets, and research on learning from small training sets is relatively scarce. The main difficulty is that a small training set contains a limited number of passwords, and existing training methods lack the necessary ability to generalize vocabulary and structure, so the vocabulary and guessing rules that can be learned are very limited.

Disclosure of Invention

The invention aims to solve the problem that existing password guessing methods perform poorly when the training set is small. Compared with the traditional PCFG password guessing method, the invention makes the following improvements: 1) The traditional PCFG method segments a training password by character type and can hardly separate several words of the same character type within a password. By using a corpus-based word segmentation method, the invention can separate words of the same character type within a password. 2) The learning process of the PCFG method can only find the words that appear in the training set and their probabilities, so the generated dictionary can only contain words that appear in the training set. When the training set is small, the vocabulary of the passwords in the generated dictionary is limited, which makes guessing inefficient. Based on existing large-scale natural language corpora, the invention adds words of the same type that do not appear in the training set and uses a smoothing method to compute the probability of every word in the corpus, so the generated dictionary can contain words that do not appear in the training set. This effectively reduces the dependence on the training set and expands the dictionary with words of the same type. 3) To estimate the cracking rate, the invention first estimates the probability that corresponds to the specified number of guesses, then computes the maximum probability of each password in the test set and compares it with that probability; any password whose probability is greater must be in the dictionary generated by the method. This effectively improves the efficiency of measuring the cracking rate.

The purpose of the invention is realized by at least one of the following technical solutions.

A corpus-based password guessing method for improving guessing efficiency with a small training set comprises the following steps:

S1. Construct a corpus Γ comprising four types of corpus sets, and determine the structure of the password guessing rules;

S2. Based on the corpus Γ, generate a password guessing rule r for each training password pwd in the training password set PWD_TRAIN, obtaining a password guessing rule set R composed of these password guessing rules;

S3. Based on the corpus Γ and the password guessing rule set R, compute the probability of every word w in the corpus Γ, denoted p(w, PWD_TRAIN), w∈Γ, and the probability of every password guessing rule r in the rule set R, denoted q(r, PWD_TRAIN), r∈R;

S4. Generate a dictionary D(S) with S guesses, and use the dictionary D(S) to carry out password guessing.

Further, in step S1, a corpus Γ with the following features is constructed:

Feature 1: the corpus Γ comprises |Γ| corpus sets, Γ = {C_i | 1 ≤ i ≤ |Γ|}, where C_i is the i-th corpus set;

Feature 2: each corpus set contains words of the same type and the same length;

Feature 3: the vocabulary types of the corpus sets are language, country and region, general, and brute-force; language corpora include words, surnames and given names in different languages (e.g., English, Russian); country and region corpora include place names and telephone numbers; general corpora include common keyboard character sequences, years, and date formats;

Feature 4: in every non-brute-force corpus set, all words have the same length, which is greater than or equal to 4;

Feature 5: the word length of a brute-force corpus set is less than or equal to 3, and the brute-force corpus sets are divided into lowercase letters, uppercase letters, digits, and special symbols; there are 12 brute-force corpus sets in total: ASCII lowercase letters of length 1 to 3, [az_1], [az_2], [az_3] (containing 26, 26², 26³ words respectively); ASCII uppercase letters of length 1 to 3, [AZ_1], [AZ_2], [AZ_3] (26, 26², 26³ words respectively); digits of length 1 to 3, [09_1], [09_2], [09_3] (10, 10², 10³ words respectively); and other printable ASCII characters of length 1 to 3, [SP_1], [SP_2], [SP_3] (33, 33², 33³ words respectively);

Feature 6: no two corpus sets in the corpus Γ contain the same word;

The number of words in the i-th corpus set C_i is denoted |C_i|, and the length of its words is denoted l(C_i);

A password guessing rule r is formed by concatenating several corpus sets and is written r = [C₁]…[C_s], C₁,…,C_s ∈ Γ; s is the number of segments of the password guessing rule r, denoted d(r);

The product |C₁| × … × |C_s| is called the corpus space size of the password guessing rule r, denoted S(r);

|R| mutually distinct password guessing rules r form the password guessing rule set R.
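As a small worked illustration of these definitions, the sketch below counts the brute-force corpus sets of Feature 5 and computes the corpus space size of a rule, assuming that S(r) is the product of the sizes of the corpus sets in the rule (which is what the figures in Example 2 of the description suggest). The names `ALPHABETS`, `BRUTE_FORCE_SIZES` and `corpus_space_size` are hypothetical, and the sizes of EN_4 and EN_5 are the toy values used in Example 2.

```python
from math import prod

# Alphabet sizes behind the 12 brute-force sets of Feature 5.
ALPHABETS = {"az": 26, "AZ": 26, "09": 10, "SP": 33}

# One set per alphabet and per length 1..3, e.g. az_3 holds 26**3 strings.
BRUTE_FORCE_SIZES = {
    f"{label}_{length}": size ** length
    for label, size in ALPHABETS.items()
    for length in (1, 2, 3)
}


def corpus_space_size(rule, set_sizes):
    """Product of |C_i| over the corpus sets C_1..C_s that make up the rule."""
    return prod(set_sizes[name] for name in rule)


# Toy sizes for two English sets, taken from Example 2 of the description.
set_sizes = {**BRUTE_FORCE_SIZES, "EN_4": 3, "EN_5": 2}

for rule in (["EN_4", "EN_4"], ["EN_5", "az_3"], ["EN_4", "az_2", "az_2"]):
    print(rule, "segments:", len(rule), "space:", corpus_space_size(rule, set_sizes))
```

A rule built from dictionary words has a far smaller corpus space than one that falls back on brute-force sets, which is why the corpus space size is a useful tie-breaker between rules with the same number of segments.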

Further, in step S2, the training password set PWD_TRAIN comprises a number of training passwords pwd, and generating the password guessing rule r of a specific training password pwd based on the corpus Γ comprises the following steps:

Based on the corpus Γ, construct the directed acyclic graph G = <V, E> of a single training password pwd, where each edge of G is labeled with the corpus set of Γ to which the character substring from the start vertex to the end vertex of the edge belongs;

Generate all paths of the directed acyclic graph G from the start vertex to the end vertex; each path corresponds to one way of segmenting the training password pwd, and each segmentation corresponds to one guessing rule;

From all possible guessing rules, select the one with the fewest segments as the password guessing rule r of the training password pwd; if several guessing rules have the minimum number of segments, select the one with the smallest corpus space size as the password guessing rule r;

Finally, the password guessing rule set R composed of the password guessing rules r is obtained.

Further, in step S3, for the corpus Γ and the training password set PWD_TRAIN, the probability of every word w in the corpus Γ and the probability of every password guessing rule r in the rule set R are computed. The specific steps are as follows:

The probability of each password guessing rule r in the password guessing rule set R is denoted q(r, PWD_TRAIN), r∈R;

The probabilities of the password guessing rules r in the rule set R have the following characteristics:

1) every password guessing rule r in the rule set R is generated by executing step S2 on a training password pwd in the training password set PWD_TRAIN;

2) the probabilities of all password guessing rules r in the rule set R sum to 1;

3) the probability of each password guessing rule r in the rule set R is proportional to its frequency of occurrence in the training password set PWD_TRAIN;

The probability of each word w in the corpus Γ is denoted p(w, PWD_TRAIN), w∈Γ;

The probabilities of the words in the corpus Γ have the following characteristics:

1) the frequency with which each word of the corpus Γ occurs in the training set is counted; then, in each corpus set C, the frequency of every word is increased by 1, so that the frequency of words that do not occur in the training set is not 0;

2) the probability of each word in a corpus set C equals the frequency of the word obtained in step 1) divided by the sum of the frequencies of all words in that corpus set;

3) the probabilities of the words in each corpus set C sum to 1;

4) if a particular word in a corpus set C does not occur in the training password set, its probability is inversely proportional to the sum of the frequencies, as obtained in step 1), of all words in the corpus set C;

5) if a particular word in a corpus set C occurs in the training password set, its probability is proportional to the frequency of the word in the training password set and inversely proportional to the sum of the frequencies, as obtained in step 1), of all words in the corpus set C.

Further, in step S4, for a password guessing rule r = [C₁]…[C_s] in the password guessing rule set R and s words w₁,…,w_s satisfying w₁∈C₁, w₂∈C₂, …, w_s∈C_s, C₁,…,C_s∈Γ, the string w₁|…|w_s is called a legal vocabulary combination based on the corpus Γ and the password guessing rule set R, where '|' is the string concatenation operation;

The probability Prob(w₁|…|w_s) that the legal vocabulary combination w₁|…|w_s becomes a password is defined as:

Prob(w₁|…|w_s) = q(r, PWD_TRAIN) × ∏_{1≤i≤s} p(w_i, PWD_TRAIN);

Given a number of guesses S, if an ordered sequence D(S) = <cp₁, cp₂, …, cp_S> of legal vocabulary combinations based on the corpus Γ and the password guessing rule set R satisfies:

Condition 1: Prob(cp_j) ≥ Prob(cp_{j+1}), 1 ≤ j ≤ S−1;

Condition 2: the probability of the last legal vocabulary combination cp_S in the ordered sequence D(S) of S legal vocabulary combinations is greater than or equal to the probability of becoming a password of every legal vocabulary combination that does not occur in D(S);

then D(S) is called the ordered dictionary with S guesses, and the probability Prob(cp_S) that the S-th legal vocabulary combination in D(S) becomes a password is denoted α(S).

Further, a string str may be described by several legal vocabulary combinations, each with a different probability of becoming a password;

The probability Prob(str) that the string str becomes a password is defined as the maximum of the probabilities of becoming a password over all legal vocabulary combinations that correspond to the string; if a string cannot be described as a legal vocabulary combination, its probability of becoming a password is 0.

The ordered dictionary D(S) containing S legal vocabulary combinations has the following properties:

Property 1: if the probability Prob(cp) that a legal vocabulary combination cp becomes a password is greater than α(S), then cp must belong to D(S).

Property 2: if the probability Prob(str) that a string str becomes a password is greater than α(S), then str must belong to D(S).
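The definitions of Prob, D(S) and α(S) can be illustrated by brute-force enumeration on a toy model. The sketch below is not the NEXT algorithm and would not scale to real corpora; the rule and word probabilities are illustrative values in the style of Example 3 of the description (only two rules are kept, so the rule probabilities here do not sum to 1), and the names `legal_combinations`, `ordered_dictionary` and `string_probability` are hypothetical.

```python
from fractions import Fraction as F
from itertools import product
from math import prod

# Toy training result: rule probabilities q(r) and word probabilities p(w) per set.
q = {("EN_4", "EN_4"): F(1, 2), ("EN_5", "EN_4"): F(1, 4)}
p = {
    "EN_4": {"love": F(4, 9), "blue": F(3, 9), "rain": F(2, 9)},
    "EN_5": {"green": F(2, 3), "lover": F(1, 3)},
}


def legal_combinations(q, p):
    """Yield (words, rule, Prob) for every legal vocabulary combination."""
    for rule, q_r in q.items():
        for words in product(*(p[name] for name in rule)):
            prob = q_r * prod(p[name][w] for name, w in zip(rule, words))
            yield words, rule, prob


def ordered_dictionary(q, p, S):
    """First S legal combinations in non-increasing order of probability."""
    return sorted(legal_combinations(q, p), key=lambda c: c[2], reverse=True)[:S]


def string_probability(text, q, p):
    """Maximum probability over all legal combinations that spell `text`."""
    probs = [pr for words, _, pr in legal_combinations(q, p) if "".join(words) == text]
    return max(probs, default=F(0))


D = ordered_dictionary(q, p, S=4)
alpha = D[-1][2]  # alpha(S): probability of the S-th combination
for words, rule, prob in D:
    print("|".join(words), rule, prob)
print("alpha(4) =", alpha)
```

Property 1 can be checked directly on this toy model: every combination whose probability exceeds `alpha` appears in `D`, while combinations tied with `alpha` may or may not have made the cut.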

Compared with the prior art, the invention has the advantages that:

1) The training password set is segmented on the basis of a natural language corpus, and corpus-based password guessing rules are generated. Compared with the PCFG method, the guessing rules generated better reflect the inherent meaning behind how passwords are set.

2) The vocabulary of the generated dictionary is expanded with a natural language corpus, which effectively solves the problem that the vocabulary discovered by the PCFG method depends entirely on the training set, and overcomes the poor guessing performance of that method when the training set is small.

Drawings

FIG. 1 is a flowchart illustrating steps of a password guessing method for improving guessing efficiency of a small training set based on a corpus according to the present invention.

FIG. 2 is a simplified directed acyclic graph generated from the training password "loverain" in an embodiment of the present invention.

FIG. 3 is a schematic comparison of the cracking rate with that of the PCFG on the Rockyou password set in an embodiment of the present invention.

FIG. 4 is a schematic comparison of the cracking rate with that of the PCFG on the CSDN password set in an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Example 1:

A corpus-based password guessing method for improving guessing efficiency with a small training set, as shown in FIG. 1, comprises the following steps:

A corpus Γ with the following features is constructed:

Feature 6: no two corpus sets in the corpus Γ contain the same word;

The product of the sizes of the corpus sets that make up a password guessing rule r is called the corpus space size of the password guessing rule r, denoted S(r);

The training password set PWD_TRAIN comprises a number of training passwords pwd, and generating the password guessing rule r of a specific training password pwd based on the corpus Γ comprises the following steps:

From all possible guessing rules, the one with the fewest segments is selected as the password guessing rule r of the training password pwd; if several guessing rules have the minimum number of segments, the one with the smallest corpus space size is selected as the password guessing rule r;

In this embodiment, rule generation for a training password is shown in Algorithm-1;

Algorithm-1: Rule generation for a training password

Input: (1) a training password of n characters, pwd = c_1…c_n

(2) the corpus Γ = {C_i}

Output: the rule r corresponding to pwd

Intermediate variables: (1) the vertex set V and the edge set E of the directed acyclic graph G

(2) the temporary rule sets R₀ and R₁

1. Build the vertex set of the graph, V = {v_1, …, v_n, v_{n+1}}, and set E = ∅

2. Loop over the substrings c_i…c_j of c_1…c_n, where 1 ≤ i ≤ j ≤ n

2.1 If there is a corpus set C such that c_i…c_j ∈ C, C ∈ Γ, then

2.1.1 E = E ∪ {<(v_i, v_{j+1}), C>}

3. R₀ = ∅

4. Loop over all paths path from v₁ to v_{n+1}

4.1 Let the sequence of edges traversed by path be <<v₁, v₂', C₁>, …, <v_k', v_{n+1}, C_s>>

4.2 R₀ = R₀ ∪ {[C₁]…[C_s]}

5. d_min = min{d(r) | r ∈ R₀}

6. R₁ = {r | r ∈ R₀ ∧ d(r) = d_min}

7. Return the rule r in R₁ with the smallest corpus space size
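Algorithm-1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the corpus is a toy one (EN_4, EN_5, single digits, and lowercase brute-force sets of length 1 to 3), single-character substrings are allowed as edges so that sets such as [09_1] can match, the corpus space size is taken as the product of the set sizes, and the names `brute_force_set`, `all_rules` and `generate_rule` are hypothetical.

```python
from itertools import product
from math import prod
from string import ascii_lowercase, digits


def brute_force_set(alphabet, length):
    """All strings of the given length over the alphabet."""
    return {"".join(chars) for chars in product(alphabet, repeat=length)}


# Toy corpus: two English sets plus a few brute-force sets.
corpus = {
    "EN_4": {"love", "rain", "blue"},
    "EN_5": {"lover", "green"},
    "09_1": brute_force_set(digits, 1),
    **{f"az_{k}": brute_force_set(ascii_lowercase, k) for k in (1, 2, 3)},
}


def all_rules(pwd, corpus):
    """Steps 1-4: build the DAG edges, then enumerate every start-to-end path."""
    n = len(pwd)
    edges = {}  # start position -> list of (end position, corpus set name)
    for i in range(n):
        for j in range(i + 1, n + 1):
            for name, words in corpus.items():
                if pwd[i:j] in words:
                    edges.setdefault(i, []).append((j, name))

    def walk(pos):
        if pos == n:
            yield ()
            return
        for nxt, name in edges.get(pos, []):
            for rest in walk(nxt):
                yield (name,) + rest

    return list(walk(0))


def generate_rule(pwd, corpus):
    """Steps 5-7: fewest segments first, then smallest corpus space size."""
    rules = all_rules(pwd, corpus)
    if not rules:
        return None  # the password cannot be segmented with this corpus
    return min(rules, key=lambda r: (len(r), prod(len(corpus[c]) for c in r)))


print(generate_rule("loverain", corpus))
print(generate_rule("love3", corpus))
```

Enumerating every path is exponential in the worst case; a dynamic program over positions that keeps only the best (segments, space) pair per vertex would give the same answer, but the exhaustive form mirrors the steps of Algorithm-1 more directly.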

For the corpus Γ and the training password set PWD_TRAIN, the probability of every word w in the corpus Γ and the probability of every password guessing rule r in the rule set R are computed. The specific steps are as follows:

The probabilities of the words in the corpus Γ have the following characteristics:

In this embodiment, the calculation of the rule probabilities and the vocabulary probabilities is shown in Algorithm-2;

Algorithm-2: Calculation of rule probabilities and vocabulary probabilities

Input: (1) the training password set PWD_TRAIN

(2) the corpus Γ = {C_i}

Output: (1) the password guessing rule set R;

(2) the probability q(r, PWD_TRAIN) of each rule r in R, r∈R;

(3) the probability p(w, PWD_TRAIN) of each word w in Γ, w∈Γ;

Intermediate variables: (1) the frequency f(w, PWD_TRAIN) of each word in Γ, w∈Γ

1. R = ∅

2. f(w, PWD_TRAIN) = 0, w∈Γ

3. Loop over all pwd ∈ PWD_TRAIN

3.1 Use Algorithm-1 to compute the rule r = [C₁]…[C_s] corresponding to pwd = c_1…c_n

3.2 If r∈R then

3.2.1 q(r, PWD_TRAIN) = q(r, PWD_TRAIN) + 1/|PWD_TRAIN|

3.3 Otherwise

3.3.1 R = R ∪ {r}

3.3.2 q(r, PWD_TRAIN) = 1/|PWD_TRAIN|

3.4 t = 1

3.5 Loop over i from 1 to s

3.5.1 w = c_t…c_{t+l(C_i)−1}

3.5.2 f(w, PWD_TRAIN) = f(w, PWD_TRAIN) + 1

3.5.3 t = t + l(C_i)

4. Loop over all C_i ∈ Γ

4.1 fsum = ∑_{w∈C_i} (f(w, PWD_TRAIN) + 1)

4.2 Loop over the words w in C_i

4.2.1 If f(w, PWD_TRAIN) ≠ 0 then

4.2.1.1 p(w, PWD_TRAIN) = (f(w, PWD_TRAIN) + 1)/fsum

4.2.2 Otherwise

4.2.2.1 p(w, PWD_TRAIN) = 1/fsum
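The training step of Algorithm-2 can be sketched as below. This is a hedged illustration, not the patented code: `segment` is a compact dynamic-programming stand-in for Algorithm-1 (fewest segments, then smallest corpus space), the corpus is a toy one with only EN_4, EN_5 and single digits, the smoothing adds 1 to every word count as described in the text, and the names `segment` and `train` are hypothetical. Passwords that cannot be segmented with the toy corpus are skipped, which is an assumption of this sketch.

```python
from collections import Counter
from fractions import Fraction


def segment(pwd, corpus):
    """Return (rule, words) with the fewest segments, ties broken by corpus space."""
    n = len(pwd)
    best = {n: (0, 1, (), ())}  # position -> (segments, space, rule, words)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            if j not in best:
                continue
            for name, words in corpus.items():
                if pwd[i:j] in words:
                    d, space, rule, ws = best[j]
                    candidates.append(
                        (d + 1, space * len(words), (name,) + rule, (pwd[i:j],) + ws)
                    )
        if candidates:
            best[i] = min(candidates)
    return best.get(0, (None, None, None, None))[2:]


def train(passwords, corpus):
    """Rule probabilities q and add-one smoothed word probabilities p."""
    rule_counts, freq = Counter(), Counter()
    for pwd in passwords:
        rule, words = segment(pwd, corpus)
        if rule is None:
            continue  # not segmentable with this toy corpus
        rule_counts[rule] += 1
        freq.update(zip(rule, words))  # count each (set name, word) pair
    q = {r: Fraction(c, len(passwords)) for r, c in rule_counts.items()}
    p = {}
    for name, words in corpus.items():
        fsum = sum(freq[(name, w)] + 1 for w in words)
        for w in words:
            p[(name, w)] = Fraction(freq[(name, w)] + 1, fsum)
    return q, p


corpus = {
    "EN_4": {"love", "rain", "blue"},
    "EN_5": {"lover", "green"},
    "09_1": set("0123456789"),
}
q, p = train(["loverain", "loveblue", "greenblue", "love3"], corpus)
print(q)
print(p[("EN_4", "love")], p[("EN_5", "lover")])
# Probability of the legal combination lover|love under rule [EN_5][EN_4]:
print(float(q[("EN_5", "EN_4")] * p[("EN_5", "lover")] * p[("EN_4", "love")]))
```

Because every word count is incremented by one, a corpus word such as "lover" that never appears in the training passwords still receives a non-zero probability, which is what lets the generated dictionary reach beyond the training set.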

S4. Generate a dictionary D(S) with S guesses, and use the dictionary D(S) to carry out password guessing;

For a password guessing rule r = [C₁]…[C_s] in the password guessing rule set R and s words w₁,…,w_s satisfying w₁∈C₁, w₂∈C₂, …, w_s∈C_s, C₁,…,C_s∈Γ, the string w₁|…|w_s is called a legal vocabulary combination based on the corpus Γ and the password guessing rule set R, where '|' is the string concatenation operation;

Condition 1: Prob(cp_j) ≥ Prob(cp_{j+1}), 1 ≤ j ≤ S−1;

then D(S) is called the ordered dictionary with S guesses, and the probability Prob(cp_S) that the S-th legal vocabulary combination in D(S) becomes a password is denoted α(S).

A string str may be described by several legal vocabulary combinations, each with a different probability of becoming a password;

In this embodiment, the ordered dictionary D(S) may be generated using the NEXT algorithm of the reference (Matt Weir, Sudhir Aggarwal, Breno de Medeiros, Bill Glodek, "Password cracking using probabilistic context-free grammars", in Proc. 30th IEEE Symposium on Security and Privacy, 2009, pp.391-405.).

S5. Estimate, from the number of guesses, the probability of the last legal vocabulary combination in the dictionary D(S);

In this embodiment, because generating the dictionary D(S) is slow and, when the number of guesses S is large, the dictionary requires a large amount of storage, it is difficult to evaluate the cracking rate by actually generating the dictionary; based on the password guessing rule set R, {q(r, PWD_TRAIN) | r∈R}, Γ, and {p(w, PWD_TRAIN) | w∈Γ}, for a given probability β, the Monte Carlo sampling method introduced in (Dell'Amico, M. & Filippone, M., Monte Carlo Strength Evaluation: Fast and Reliable Password Checking, Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, ACM, 2015, 158-169.) is used to estimate the number of passwords whose probability is greater than β; this process is denoted N(β);

The estimated value of α(S) is computed as follows:

First, a first probability value α₀ and a second probability value α₁ are initialized such that S lies between the numbers of guesses N(α₀) and N(α₁) estimated for them by the Monte Carlo sampling method; then α₀ and α₁ are adjusted repeatedly so that N((α₀+α₁)/2) approaches S; when |N((α₀+α₁)/2) − S| < 0.1S, (α₀+α₁)/2 is taken as the estimated value of α(S).

In this embodiment, the estimation of the probability that the last vocabulary combination of the dictionary D(S) becomes a password is shown in Algorithm-3;

Algorithm-3: Estimate the probability that the last vocabulary combination of the dictionary D(S) becomes a password

Input: (1) the guessing rule set R;

(2) the probability of each rule in R, {q(r, PWD_TRAIN) | r∈R};

(3) the corpus Γ;

(4) the probability of each word in Γ, {p(w, PWD_TRAIN) | w∈Γ}

(5) the number of guesses S

Output: the estimated value of α(S)

1. Select α₀ and α₁ such that N(α₀) < S < N(α₁)

2. While |N((α₀+α₁)/2) − S| ≥ 0.1S, loop

2.1 If N((α₀+α₁)/2) < S then

2.1.1 α₀ = (α₀+α₁)/2

2.2 Otherwise

2.2.1 α₁ = (α₀+α₁)/2

3. Return (α₀+α₁)/2
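The bisection of Algorithm-3 can be sketched as follows, with the counting function N(β) passed in as a callable. In the description N(β) is a Monte Carlo estimate; here, purely as an assumption for the demonstration, it is replaced by an exact count over a small synthetic list of probabilities. The names `estimate_alpha` and `exact_count`, the `max_iter` safety cap, and the toy probabilities are all hypothetical.

```python
def estimate_alpha(count_above, S, alpha0, alpha1, max_iter=200):
    """Bisection for alpha(S).

    count_above(beta) plays the role of N(beta): the (estimated) number of
    passwords with probability greater than beta. The starting values must
    satisfy count_above(alpha0) < S < count_above(alpha1), so alpha0 > alpha1.
    """
    if not count_above(alpha0) < S < count_above(alpha1):
        raise ValueError("need N(alpha0) < S < N(alpha1)")
    for _ in range(max_iter):  # safety cap, not part of the described algorithm
        mid = (alpha0 + alpha1) / 2
        n_mid = count_above(mid)
        if abs(n_mid - S) < 0.1 * S:
            return mid
        if n_mid < S:
            alpha0 = mid  # threshold too high: too few passwords above it
        else:
            alpha1 = mid  # threshold too low: too many passwords above it
    return (alpha0 + alpha1) / 2


# Synthetic stand-in for the model: 39 "passwords" with probabilities 2^-1 .. 2^-39.
toy_probs = [0.5 ** k for k in range(1, 40)]


def exact_count(beta):
    return sum(1 for pr in toy_probs if pr > beta)


alpha_hat = estimate_alpha(exact_count, S=20, alpha0=0.5, alpha1=1e-12)
print(alpha_hat, exact_count(alpha_hat))
```

Because N(β) decreases as β grows, the invariant N(α₀) < S < N(α₁) is preserved by each update, and the loop stops as soon as the midpoint count is within 10% of S.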

S6. Without actually generating the dictionary D(S), estimate the cracking rate of the dictionary D(S) on the test password set;

Based on the estimated value of α(S), the probability of the last legal vocabulary combination of the dictionary D(S), obtained in step S5, the probability that each string in the test password set becomes a password is computed in turn; if this probability is greater than the estimated value of α(S), the string belongs to the dictionary D(S);

The number of strings in the test password set that belong to D(S), divided by the total number of strings in the test password set, is the cracking rate of the invention on the test password set when the number of guesses is S.

In this embodiment, the measurement of the cracking rate on the test password set PWD_TEST is shown in Algorithm-4;

Algorithm-4: Cracking rate on the test password set PWD_TEST

Input: (1) the guessing rule set R;

(2) the probability of each rule in R, {q(r, PWD_TRAIN) | r∈R};

(3) the corpus Γ;

(4) the probability of each word in Γ, {p(w, PWD_TRAIN) | w∈Γ}

(5) the number of guesses S

(6) the test password set PWD_TEST.

Output: the cracking rate γ(PWD_TRAIN, Γ, PWD_TEST, S) on the test password set PWD_TEST of the dictionary D(S) generated with S guesses from the training set PWD_TRAIN and the corpus Γ

1. g = 0

2. Based on R, {q(r, PWD_TRAIN) | r∈R}, Γ, {p(w, PWD_TRAIN) | w∈Γ} and S, compute the estimated value of α(S) using Algorithm-3

3. Loop over pwd ∈ PWD_TEST

3.1 If Prob(pwd) > the estimated value of α(S), then

3.1.1 g = g + 1

4. Return g/|PWD_TEST|
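Algorithm-4 reduces to a threshold test on Prob(pwd), which can be sketched as below. The sketch assumes a toy training result (two rules and two English word sets with illustrative probabilities in the style of Example 3) and takes the threshold `alpha_hat` as a given number instead of computing it with Algorithm-3; the names `string_probability` and `cracking_rate` are hypothetical.

```python
# Toy training result: rule probabilities q(r) and word probabilities p(w) per set.
q = {("EN_4", "EN_4"): 0.5, ("EN_5", "EN_4"): 0.25}
p = {
    "EN_4": {"love": 4 / 9, "blue": 3 / 9, "rain": 2 / 9},
    "EN_5": {"green": 2 / 3, "lover": 1 / 3},
}


def string_probability(s, q, p):
    """Prob(str): maximum over all legal vocabulary combinations that spell s."""
    best = 0.0

    def walk(pos, rule, prob):
        nonlocal best
        if pos == len(s):
            if rule in q:  # only segmentations matching a learned rule count
                best = max(best, q[rule] * prob)
            return
        for name, words in p.items():
            for end in range(pos + 1, len(s) + 1):
                word = s[pos:end]
                if word in words:
                    walk(end, rule + (name,), prob * words[word])

    walk(0, (), 1.0)
    return best


def cracking_rate(test_set, q, p, alpha_hat):
    """Fraction of test passwords whose probability exceeds the threshold."""
    g = sum(1 for pwd in test_set if string_probability(pwd, q, p) > alpha_hat)
    return g / len(test_set)


test_set = ["lovelove", "greenrain", "rainrain", "hello123"]
print(cracking_rate(test_set, q, p, alpha_hat=0.03))
```

The point of the threshold test is that no dictionary is materialized: each test password is scored once, so the cost grows with the size of the test set and not with the number of guesses S.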

As shown in FIG. 1, implementing the invention requires two parts, data and software. The data comprise the corpus Γ, the training password set PWD_TRAIN, and the test password set PWD_TEST. The software comprises two parts: training software and cracking rate detection software. Algorithm-1 and Algorithm-2 are carried out in the training software, and Algorithm-3 and Algorithm-4 are carried out by the cracking rate detection software.

Example 2: generating the password guessing rule for the training password "loverain";

In addition to the brute-force corpus sets, the corpus Γ_sample contains the English vocabulary set EN_4 = {love, rain, blue} of 4-character words and the English vocabulary set EN_5 = {lover, green} of 5-character words. The simplified directed acyclic graph generated by Algorithm-1 is shown in FIG. 2. From this graph, the following rules can be generated:

r_1-loverain: [EN_4][EN_4], with 2 segments and a corpus space size of 3 × 3 = 9;

r_2-loverain: [EN_5][az_3], with 2 segments and a corpus space size of 2 × 26³ = 35152;

r_3-loverain: [EN_4][az_2][az_2], with 3 segments and a corpus space size of 3 × 26² × 26² = 1370928;

Of all the password guessing rules that "loverain" can produce, r_1-loverain has the fewest segments and, among the guessing rules with the fewest segments, the smallest corpus space size. Thus, the password guessing rule obtained for this training password is r_1-loverain: [EN_4][EN_4].

Example 3: the training result of the training password set PWD_TRAIN_SAMPLE and several legal vocabulary combinations;

The training password set is PWD_TRAIN_SAMPLE = {loverain, loveblue, greenblue, love3}.

After applying Algorithm-2, the training results obtained are:

1) the password guessing rule set R contains 3 password guessing rules: r₁ = [EN_4][EN_4], r₂ = [EN_5][EN_4], r₃ = [EN_4][09_1];

2) the probabilities of the password guessing rules: q(r₁) = 0.5, q(r₂) = 0.25, q(r₃) = 0.25;

3) the probabilities of the words in EN_4: p("love") = 4/9, p("blue") = 3/9, p("rain") = 2/9; the probabilities of the words in EN_5: p("green") = 2/3, p("lover") = 1/3; in 09_1, p("3") = 2/27 and all other probabilities are 1/27;

Two legal vocabulary combinations that can be generated from the above training results are given below, together with their probabilities of becoming a password (to 4 decimal places).

Prob(lover|love) = q(r₂) × p(lover) × p(love) = 0.0370;

Prob(blue|4) = q(r₃) × p(blue) × p(4) = 0.0031.

Example 4: comparison with the PCFG method;

Two large-scale password sets, Rockyou and CSDN, are used. The passwords of each set are randomly divided into a training set and a test set in the proportions 3:1, 1:3 and 1:27 respectively. The training set is trained with the classical PCFG method and with the method provided by the invention, and the cracking rate on the test set is measured, as shown in FIGS. 3 and 4.

The following characteristics can be seen from the above tests:

1) The cracking rates of both the PCFG method and the invention depend on the size of the training set: the larger the training set, the higher the cracking rate. With the small training set (1:27) and 10¹¹ guesses, the cracking rates of the invention and the PCFG method on the Rockyou password set are 85.59% and 51.68% respectively, and on the CSDN password set 74.10% and 30.03% respectively, which are relative improvements of 65.6% and 146% respectively. The invention is a clear improvement over the classical PCFG method.
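The relative improvements quoted above follow from the two pairs of cracking rates. The short check below assumes that each pair is (rate of the invention, rate of the PCFG method) in percent and that "relative improvement" means the increase divided by the PCFG rate; the function name `relative_improvement` is hypothetical.

```python
def relative_improvement(new_rate, old_rate):
    """Increase of new_rate over old_rate, as a percentage of old_rate."""
    return (new_rate - old_rate) / old_rate * 100


first_pair = relative_improvement(85.59, 51.68)
second_pair = relative_improvement(74.10, 30.03)
print(f"{first_pair:.1f}%  {second_pair:.1f}%")
```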

2) As the number of guesses increases, the cracking rate of the PCFG method does not increase significantly. With the small training set (1:27), when the number of guesses is raised from 10¹¹ to 10¹⁴, the cracking rate of the PCFG method on Rockyou rises only from 51.68% to 52.97%, a gain of 1.29 percentage points. Under the same conditions, the cracking rate of the invention on Rockyou rises from 85.59% to 92.53%, a gain of 6.94 percentage points. The cracking rate of the invention therefore increases more markedly than that of the PCFG method as the number of guesses grows.

3) The invention is insensitive to the size of the training set at the same number of guesses. For example, for the Rockyou password set with 10¹¹ guesses, going from the large training set (3:1) to the small training set (1:27), the cracking rate of the PCFG method drops sharply from 70.79% to 51.68%, a decrease of 19.11 percentage points, while the cracking rate of the invention drops from 87.70% to 85.59%, a decrease of only 2.11 percentage points.

The test shows that the invention expands the vocabulary in the training set, has higher cracking rate than the prior PCFG method under the same guess times, and can still keep more stable cracking rate under the condition of reducing the training set.
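The percentages quoted in observations 1) to 3) can be checked with a few lines of Python. This is only an arithmetic sketch over the figures stated in the text; the helper names `relative_gain` and `point_change` are hypothetical. Note that the second relative improvement works out to roughly 146.8%, which the text gives as 146%.

```python
def relative_gain(new_rate, old_rate):
    """Relative improvement of new_rate over old_rate, in percent."""
    return (new_rate - old_rate) / old_rate * 100

def point_change(start, end):
    """Absolute change from start to end, in percentage points."""
    return round(end - start, 2)

# Observation 1: 1:27 split, 10^11 guesses, invention vs PCFG.
print(round(relative_gain(85.59, 51.68), 1))  # first pair of rates
print(round(relative_gain(74.10, 30.03), 1))  # second pair of rates

# Observation 2: raising the guesses from 10^11 to 10^14 on Rockyou.
print(point_change(51.68, 52.97))  # PCFG
print(point_change(85.59, 92.53))  # invention

# Observation 3: shrinking the training set from 3:1 to 1:27 on Rockyou.
print(point_change(70.79, 51.68))  # PCFG
print(point_change(87.70, 85.59))  # invention
```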

Claims

1. A password guessing method for improving guessing efficiency of a small training set based on a corpus is characterized by comprising the following steps of:

S1, constructing a corpus Γ comprising four types of corpus sets, and determining the structure of the password guessing rules; the corpus Γ is constructed with the following features:

feature 3, the vocabulary types of the corpus sets comprise language, country and region, general, and brute-force corpora;

feature 5, the vocabulary length in the brute-force corpus sets is less than or equal to 3, and the brute-force corpus sets are divided into lowercase letters, uppercase letters, digits and special symbols;

feature 6, no two corpus sets in the corpus Γ contain the same vocabulary;

S3, based on the corpus Γ and the password guessing rule set R, computing the probability of every vocabulary w in the corpus Γ, and computing the probability of each password guessing rule r in the password guessing rule set R;

S4, generating a dictionary D(S) with S guesses, and performing password guessing using the dictionary D(S).

2. The corpus-based password guessing method for improving guessing efficiency of small training sets according to claim 1,

is called the corpus space size S(r) of the password guessing rule r;

3. The corpus-based password guessing method for improving guessing efficiency of a small training set according to claim 1, characterized in that in step S2, the training password set PWD_TRAIN comprises a plurality of training passwords pwd, and the password guessing rule r of a specific training password pwd is generated based on the corpus Γ.

4. The corpus-based password guessing method for improving guessing efficiency of a small training set according to claim 1, wherein in step S2, a directed acyclic graph G=<V, E> of a single training password pwd is constructed based on the corpus Γ, wherein each edge of the directed acyclic graph G corresponds to the corpus set of the corpus Γ to which the character substring from the start point to the end point of the edge belongs;

5. The corpus-based password guessing method for improving guessing efficiency of a small training set according to claim 1, wherein in step S3, the probability of each password guessing rule r in the password guessing rule set R is denoted q(r, PWD_TRAIN), r∈R;

3) the probability of each password guessing rule r in the password guessing rule set R is proportional to its frequency of occurrence in the training password set PWD_TRAIN.

6. The corpus-based password guessing method for improving guessing efficiency of a small training set according to claim 1, wherein in step S3, the probability of every vocabulary w in the corpus Γ is denoted p(w, PWD_TRAIN), w∈Γ;

the probability of each vocabulary in the corpus Γ has the following characteristics:

1) the frequency of each vocabulary of the corpus Γ in the training set is counted, and then the frequency of every vocabulary in each corpus set C is increased by 1, so that the frequency of a vocabulary that does not appear in the training set is not 0;

7. The corpus-based password guessing method for improving guessing efficiency of a small training set according to claim 1, wherein in step S4, if a password guessing rule r=[C₁]…[C_s] in the password guessing rule set R and s vocabularies w₁,…,w_s satisfy w₁∈C₁, w₂∈C₂, …, w_s∈C_s, with C₁,…,C_s∈Γ, then w₁|…|w_s is called a legal vocabulary combination based on the corpus Γ and the password guessing rule set R, wherein '|' is the string concatenation operation;

Prob(w₁|…|w_s) = q(r, PWD_TRAIN) × ∏_{1≤i≤s} p(w_i, PWD_TRAIN).

8. The corpus-based password guessing method for improving guessing efficiency of a small training set according to claim 7, wherein, given a number of guesses S, an ordered sequence D(S)=<cp₁, cp₂, …, cp_S> of S legal vocabulary combinations based on the corpus Γ and the password guessing rule set R satisfies the following conditions:

condition 1, Prob(cp_j) ≥ Prob(cp_{j+1}), 1≤j≤S-1;

9. The corpus-based password guessing method for improving guessing efficiency of a small training set according to claim 8, wherein one character string str may correspond to a plurality of legal vocabulary combinations, each of which has a different probability of becoming a password;

10. The corpus-based password guessing method for improving guessing efficiency of a small training set according to claim 9, wherein the ordered dictionary D(S) comprising S legal vocabulary combinations has the following properties:

property 1, if the probability Prob(cp) of a legal vocabulary combination cp becoming a password is greater than α(S), then the legal vocabulary combination cp must belong to D(S);