CN111553155A

CN111553155A - Password word segmentation system and method based on semantic structure

Info

Publication number: CN111553155A
Application number: CN202010356699.0A
Authority: CN
Inventors: 邱卫东; 贾兴磊; 田昊; 郭捷; 唐鹏
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2020-04-29
Filing date: 2020-04-29
Publication date: 2020-08-18
Anticipated expiration: 2040-04-29
Also published as: CN111553155B

Abstract

A password word segmentation system and method based on semantic structure. The system comprises a preprocessing module, an NLP semantic extraction module and a non-NLP semantic labeling module, wherein: the preprocessing module receives a password to be segmented, extracts the special semantic factors in the password that cannot be identified in subsequent steps, pre-segments the remaining part by character type, outputs the alphabetic part to the NLP semantic extraction module, and outputs the non-alphabetic part to the non-NLP semantic labeling module; the NLP semantic extraction module uses an NLP tool to segment the alphabetic part of the password into its semantic factors; and the non-NLP semantic labeling module semantically labels the parts of the password that the NLP tool cannot segment. The invention segments a password according to the semantic information it contains, with the aid of corpora, identifies the semantic structure of the password, and can accurately segment passwords set by both Chinese and English users.

Description

Password word segmentation system and method based on semantic structure

Technical Field

The invention relates to a technology in the field of computer security, in particular to a password word segmentation system and method based on a semantic structure.

Background

Because of their high security and usability, text passwords are still widely used for user authentication in computer systems and online services. Because most passwords are chosen by the users themselves, users often select character strings containing specific semantics or patterns so that the password is easy to remember. Research on the semantic structure of passwords is therefore of great significance for improving users' password security.

Unlike natural language, a password has no fixed syntactic structure: when setting a password, a user can combine various semantic factors at will within the rules of the website. Word segmentation methods designed for natural language are therefore not suitable for segmenting passwords.

Most previous research on the semantic structure of passwords targets the passwords of English users. Because passwords set by English users and by Chinese users differ in certain respects, segmentation methods proposed for English users' passwords often perform poorly on leaked Chinese password databases. In recent years a number of researchers have begun to study Chinese users' passwords, and their work shows that adding extra semantic information to a segmentation system is effective; however, which information to add and how to add it remain matters of subjective judgment, and no systematic method exists.

Disclosure of Invention

In view of the above defects in the prior art, the invention provides a password word segmentation system and method based on semantic structure, which segments a password according to the semantic information it contains, with the aid of corpora, identifies the semantic structure of the password, and accurately segments passwords set by both Chinese and English users.

The invention is realized by the following technical scheme:

The invention relates to a password word segmentation system based on semantic structure, which comprises a preprocessing module, a natural language processing (NLP) semantic extraction module and a non-natural-language-processing (non-NLP) semantic labeling module, wherein: the preprocessing module receives a password to be segmented, extracts the special semantic factors in the password that cannot be identified in subsequent steps, pre-segments the remaining part by character type, outputs the alphabetic part to the NLP semantic extraction module, and outputs the non-alphabetic part to the non-NLP semantic labeling module; the NLP semantic extraction module uses an NLP tool to segment the alphabetic part of the password into its semantic factors; and the non-NLP semantic labeling module semantically labels the parts of the password that the NLP tool cannot segment.

The special semantic factors comprise: keyboard structure, website, email.

The part which can not be segmented by the NLP tool comprises: numbers, special characters.

The preprocessing module comprises a keyboard structure extraction unit, an email extraction unit, a website extraction unit and a character segmentation unit, wherein: the keyboard structure extraction unit extracts the part of the password that follows the layout of the keyboard keys, namely the keyboard structure in the password; the email extraction unit extracts an email address contained in the password; the website extraction unit extracts a website address contained in the password; and the character segmentation unit segments the password by character type.

The NLP semantic extraction module comprises a word segmentation unit, a part-of-speech (POS) tagging unit and a semantic classification unit, wherein: the word segmentation unit uses the Natural Language Toolkit (NLTK) to segment the alphabetic part received from the preprocessing module and outputs the result to the POS unit; the POS unit tags all input factors with the POS module of NLTK and outputs the semantic factors that need further classification to the semantic classification unit; the semantic classification unit further classifies named entity factors by string matching, labeling them as place name, month, male name, female name or Chinese name abbreviation; it looks up unidentified factors in a pinyin list and labels those that match as pinyin; an unidentified factor that does not match is labeled as an abbreviation when it satisfies the rule 'longer than 3 characters and consisting entirely of consonant letters', and otherwise keeps its unidentified label.

The semantic factors needing further classification comprise: named entity, unidentified segment.
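The classification rule described above can be illustrated with a minimal Python sketch. The word lists are tiny hypothetical stand-ins for the named entity and pinyin lists, the function name `classify` is invented for illustration, and treating 'y' as a consonant is an assumption; the tag names follow those used in the detailed description.

```python
# Hypothetical, tiny stand-ins for the lists the semantic classification unit matches against.
PINYIN = {"wang", "zhang", "li", "wo", "ai", "ni"}
NAMED_ENTITIES = {
    "london": "location",
    "july": "month",
    "james": "male_name",
    "mary": "female_name",
    "zxl": "cn_name_abbr",
}
VOWELS = set("aeiou")  # assumption: every other letter, including 'y', counts as a consonant


def classify(segment: str, pos_tag: str) -> str:
    """Refine an NP (named entity) or NN (unidentified) tag; other tags pass through."""
    s = segment.lower()
    if pos_tag == "NP":
        # Named entities are resolved by plain string matching against the lists.
        return NAMED_ENTITIES.get(s, "NP")
    if pos_tag == "NN":
        if s in PINYIN:
            return "PY"
        # Longer than 3 characters and made up of consonant letters only.
        if len(s) > 3 and s.isalpha() and not (set(s) & VOWELS):
            return "abbr"
        return "NN"
    return pos_tag


print(classify("wang", "NN"))    # PY
print(classify("bjtv", "NN"))    # abbr
print(classify("London", "NP"))  # location
```

A three-letter consonant string such as "qqq" keeps its NN tag here, because the rule requires a length strictly greater than 3.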

The non-NLP semantic labeling module comprises a digit labeling unit and a special character labeling unit, wherein: the digit labeling unit gives digit segments that carry specific semantics the corresponding label and labels digit segments of unknown semantics according to their length; the special character labeling unit labels special character segments according to their length.

The specific semantics comprise: date, year, mobile phone number.

Technical effects

The invention solves the problem of segmenting passwords in different languages and from different leaked password databases.

Compared with the prior art, before formally segmenting the password the invention first extracts semantic factors that span several character types, such as keyboard structures, email addresses and website addresses. This avoids the loss of semantics caused by segmenting purely by character type, allows keyboard structures contained in the password to be extracted effectively, and improves segmentation accuracy. The invention also adds a number of semantic factors to the segmentation system, including place names, Chinese name abbreviations, pinyin, abbreviations, mobile phone numbers, keyboard structures, website addresses and email addresses, which further improves segmentation accuracy and enables segmentation of passwords from Chinese websites.

Drawings

FIG. 1 is a schematic diagram of the system of the present invention.

Detailed Description

As shown in FIG. 1, this embodiment relates to a password word segmentation system based on semantic structure, which comprises a preprocessing module, an NLP semantic extraction module and a non-NLP semantic labeling module, wherein: the preprocessing module is connected to the NLP semantic extraction module and passes it the alphabetic parts obtained by pre-segmentation; the preprocessing module is also connected to the non-NLP semantic labeling module and passes it the digit and special character parts obtained by pre-segmentation.

The preprocessing module predefines the extraction of three special semantic factors (keyboard structure, website address, email address). In the keyboard structure extraction unit, for a substring c_1 c_2 ... c_n of the password, when every pair c_i and c_(i+1) is adjacent on the keyboard and their <shift> key states are the same, the substring is judged to be a keyboard structure ([KB]); the specific tag is determined by its length ([KB4], [KB5], ...). The website extraction unit detects whether the password contains a website address through the prefixes 'www.' and 'http://': when 'www.' or 'http://' is detected, a later substring matches an entry in a list of common domain name suffixes, and one or more strings separated by '.' lie between the two, the string from the prefix to the domain name suffix is judged to be a website address ([website]). The email extraction unit takes '@' + domain name as the email format: when a string of this form is matched in the password, it is judged to be an email address ([email]), while the user name before '@' is retained as an ordinary string and segmented in the subsequent steps.
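The keyboard structure test can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the US QWERTY layout tables, the definition of "adjacent" as distinct keys at most one row and one column apart, the minimum run length of 4 (suggested by the first tag being [KB4]), and the names `linked` and `extract_keyboard` are all hypothetical.

```python
# Assumed US QWERTY layout; the column index is the position within each row string.
BASE_ROWS = ["1234567890-=", "qwertyuiop[]\\", "asdfghjkl;'", "zxcvbnm,./"]
SHIFT_ROWS = ["!@#$%^&*()_+", "QWERTYUIOP{}|", 'ASDFGHJKL:"', "ZXCVBNM<>?"]

# character -> (row, column, shift state)
KEYS = {}
for shifted, rows in ((False, BASE_ROWS), (True, SHIFT_ROWS)):
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            KEYS[ch] = (r, c, shifted)


def linked(a, b):
    """True when a and b are distinct neighbouring keys typed with the same <shift> state."""
    if a not in KEYS or b not in KEYS:
        return False
    (r1, c1, s1), (r2, c2, s2) = KEYS[a], KEYS[b]
    return (s1 == s2 and (r1, c1) != (r2, c2)
            and abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1)


def extract_keyboard(password, min_len=4):
    """Return (substring, tag) pairs; text outside any keyboard structure is tagged None."""
    out, i, rest_start, n = [], 0, 0, len(password)
    while i < n:
        j = i
        # Extend the run while each consecutive pair of characters is linked.
        while j + 1 < n and linked(password[j], password[j + 1]):
            j += 1
        length = j - i + 1
        if length >= min_len:
            if rest_start < i:
                out.append((password[rest_start:i], None))
            out.append((password[i:j + 1], f"KB{length}"))
            rest_start = j + 1
        i = j + 1
    if rest_start < n:
        out.append((password[rest_start:], None))
    return out


print(extract_keyboard("1qaziloveyou123@"))
# [('1qaz', 'KB4'), ('iloveyou123@', None)]
```

Because the <shift> state must match, "!QAZ" is tagged [KB4] while "1QAZ" is not: '1' is typed without <shift> and 'Q' with it.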

The NLP semantic extraction module comprises a word segmentation unit, a part-of-speech (POS) tagging unit and a semantic classification unit, wherein: the word segmentation unit recognizes named entities with a two-pass segmentation algorithm: first, named entity lists covering four semantic factors ([location], [month], [male_name], [female_name]) are added to the NLTK tool for segmentation; when unrecognized segments remain after the first pass, named entity lists covering five semantic factors ([location], [month], [male_name], [female_name], [cn_name_abbr]) are added to the NLTK tool for a second segmentation. The semantic classification unit labels each [NP] segment as one of the named entities ([location], [month], [male_name], [female_name], [cn_name_abbr]) by string matching against the named entity lists. Each [NN] segment is first checked by string matching: when it is in the pinyin list it is labeled [PY]; otherwise, when it is longer than 3 characters and consists entirely of consonant letters, it is judged to be a possible English abbreviation ([abbr]); otherwise the [NN] label is kept unchanged.
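The two-pass idea can be illustrated without NLTK. The sketch below swaps in a simple dictionary-based dynamic programme in place of the NLTK-based segmenter, so it shows only the control flow (segment with the English lists first, retry with the Chinese name abbreviations when something is left unrecognized). All vocabularies and the names `segment` and `two_pass` are hypothetical.

```python
# Hypothetical stand-ins: corpus vocabulary, the four English entity lists, and name abbreviations.
ENGLISH = {"i", "love", "you", "my", "name"}
ENGLISH_ENTITIES = {"london", "july", "james", "mary"}
CN_NAME_ABBR = {"zxl", "wjl"}


def segment(text, vocab):
    """Split text into (piece, known) pairs, minimising unmatched characters, then piece count."""
    n = len(text)
    # best[i] = (unmatched characters, piece count, pieces) for the prefix text[:i]
    best = [(0, 0, [])] + [None] * n
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            known = piece in vocab
            if not known and i - j > 1:
                continue  # unknown material is consumed one character at a time
            prev = best[j]
            cand = (prev[0] + (0 if known else 1), prev[1] + 1, prev[2] + [(piece, known)])
            if best[i] is None or cand[:2] < best[i][:2]:
                best[i] = cand
    # Merge neighbouring unknown characters into a single unrecognized segment.
    merged = []
    for piece, known in best[n][2]:
        if merged and not known and not merged[-1][1]:
            merged[-1] = (merged[-1][0] + piece, False)
        else:
            merged.append((piece, known))
    return merged


def two_pass(text):
    """First pass without Chinese name abbreviations; second pass only if something is unrecognized."""
    first = segment(text, ENGLISH | ENGLISH_ENTITIES)
    if all(known for _, known in first):
        return first
    return segment(text, ENGLISH | ENGLISH_ENTITIES | CN_NAME_ABBR)


print(two_pass("iloveyou"))  # [('i', True), ('love', True), ('you', True)]
print(two_pass("lovezxl"))   # [('love', True), ('zxl', True)]
```

Keeping the abbreviation list out of the first pass means a short abbreviation cannot split an otherwise recognizable English word; it is consulted only when the English lists leave a segment unexplained.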

This embodiment also relates to a password word segmentation method based on semantic structure using the above system, which specifically comprises the following steps:

S1) The preprocessing module reads the password P to be segmented.

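S2) In the keyboard structure extraction unit, for a substring c_1 c_2 ... c_n of the password, when every pair c_i and c_(i+1) is adjacent on the keyboard and their <shift> key states are the same, the substring is judged to be a keyboard structure ([KB]); the specific tag is determined by its length ([KB4], [KB5], ...).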

S3) The website extraction unit detects whether the password contains a website address through the prefixes 'www.' and 'http://': when 'www.' or 'http://' is detected, a later substring matches an entry in a list of common domain name suffixes, and one or more strings separated by '.' lie between the two, the string from the prefix to the domain name suffix is judged to be a website address ([website]).

S4) The email extraction unit takes '@' + domain name as the email format: when a string of this form is matched in the password, it is judged to be an email address ([email]), while the user name before '@' is retained as an ordinary string and segmented in the subsequent steps.

S5) The unlabeled part of the password is output to the character segmentation unit, pre-segmented by character type (digits, letters and special characters), and the resulting segments are labeled [number], [word] and [special] respectively.

After the preprocessing module, the password 1qaziloveyou123@ becomes (1qaz, KB4), (iloveyou, word), (123, number), (@, special).
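The website, email and character-type steps of the preprocessing stage can be sketched with regular expressions. This is an illustrative sketch only: the short suffix list, the exact patterns and the names `pre_segment` and `preprocess` are assumptions, and keyboard structure extraction is left out so that the example stays short (which is why "1qaz" is not part of the input below).

```python
import re

SUFFIXES = ("com", "cn", "net", "org")  # hypothetical short list of common domain suffixes
DOMAIN = rf"(?:[a-z0-9-]+\.)+(?:{'|'.join(SUFFIXES)})"

# A website starts with 'http://' or 'www.'; an email factor is '@' followed by a domain name.
FACTOR = re.compile(rf"(?P<website>(?:http://|www\.){DOMAIN})|(?P<email>@{DOMAIN})", re.I)

# Runs of a single character type, named after the labels used in the example above.
CHAR_RUNS = re.compile(r"(?P<word>[A-Za-z]+)|(?P<number>[0-9]+)|(?P<special>[^A-Za-z0-9]+)")


def pre_segment(text):
    """Split text into runs of one character type, tagged word / number / special."""
    return [(m.group(), m.lastgroup) for m in CHAR_RUNS.finditer(text)]


def preprocess(password):
    """Extract website and email factors first, then pre-segment everything else."""
    out, pos = [], 0
    for m in FACTOR.finditer(password):
        out += pre_segment(password[pos:m.start()])
        out.append((m.group(), m.lastgroup))
        pos = m.end()
    return out + pre_segment(password[pos:])


print(preprocess("iloveyou123@"))
# [('iloveyou', 'word'), ('123', 'number'), ('@', 'special')]
print(preprocess("tom1990@163.com"))
# [('tom', 'word'), ('1990', 'number'), ('@163.com', 'email')]
```

In the second call the user name before '@' is not part of the email factor; it falls through to the character-type split and is segmented like any other string, as the email extraction step describes.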

S6) The segments labeled [word] are output to NLTK. The corpora used during segmentation are the Brown corpus and the Web Text corpus, to which several named entity lists are added, representing 5 semantic factors: four English semantic factors ([location], [month], [male_name], [female_name]) and Chinese name abbreviations ([cn_name_abbr]). In the first pass the Chinese name abbreviations are left out and only the four English named entity lists are added; when the segmentation result contains unidentified segments, the Chinese name abbreviations are added for a second pass.

S7) After segmentation, the result is output to the POS unit for semantic labeling with the following semantic factors: pronouns ([PRON]), nouns ([NOUN]), determiners ([DET]), adjectives ([ADJ]), verbs ([VERB]), prepositions ([ADP]), adverbs ([ADV]), particles ([PRT]), conjunctions ([CONJ]), English words representing numbers ([NUM]) and affixes ([X]). POS labeling uses a sequential backoff tagger: a Brown trigram tagger is tried first, then a bigram tagger, and finally a unigram tagger. Segments that appear in the named entity lists are labeled [NP], and unidentified segments are labeled [NN].

S8) After POS labeling, the [NP] and [NN] segments undergo further semantic classification. Each [NP] segment is labeled as one of the named entities ([location], [month], [male_name], [female_name], [cn_name_abbr]) by string matching against the named entity lists.

S9) Each [NN] segment is first checked by string matching: when it is in the pinyin list it is labeled [PY]; otherwise, when it is longer than 3 characters and consists entirely of consonant letters, it is judged to be a possible English abbreviation ([abbr]); otherwise the [NN] label is kept unchanged.

S10) The segments labeled [number] in S5) are output to the digit semantic classification unit: a 4-digit segment with a value between 1900 and 2020 is considered a year and labeled [year]; a 6-digit segment that satisfies the YYMMDD date format is judged to be a date and labeled [YYMMDD]; an 8-digit segment that satisfies the YYYYMMDD date format is judged to be a date and labeled [YYYYMMDD]; an 11-digit segment that satisfies the mobile phone number format is judged to be a mobile phone number and labeled [mobilephone]; the remaining digit segments are labeled [num1], [num2], ... according to their length.

S11) The [special] segments from S5) are input to the special character labeling unit and labeled [spec1], [spec2], ... according to their length.

S12) The labels of all segments are combined in order to form the semantic structure of the password.
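Steps S10) to S12) can be sketched as follows. The mobile phone pattern is an assumption (the text only says "the format of the mobile phone number"), as are the function names, the use of `datetime.strptime` to validate dates, and the choice to join the labels into one bracketed string; the POS tags in the usage example are illustrative rather than real tagger output.

```python
import re
from datetime import datetime

# Assumption: an 11-digit mainland-China style mobile number starting with 1 and a digit 3-9.
MOBILE = re.compile(r"1[3-9]\d{9}")


def is_date(digits, fmt):
    """True when the digit string parses as a real calendar date in the given format."""
    try:
        datetime.strptime(digits, fmt)
        return True
    except ValueError:
        return False


def label_number(seg):
    """Label a digit segment as in S10): specific semantics first, length otherwise."""
    n = len(seg)
    if n == 4 and 1900 <= int(seg) <= 2020:
        return "year"
    if n == 6 and is_date(seg, "%y%m%d"):
        return "YYMMDD"
    if n == 8 and is_date(seg, "%Y%m%d"):
        return "YYYYMMDD"
    if n == 11 and MOBILE.fullmatch(seg):
        return "mobilephone"
    return f"num{n}"


def label_special(seg):
    """Label a special character segment by its length, as in S11)."""
    return f"spec{len(seg)}"


def semantic_structure(segments):
    """Combine the labels of all (text, tag) segments in order, as in S12)."""
    labels = []
    for text, tag in segments:
        if tag == "number":
            tag = label_number(text)
        elif tag == "special":
            tag = label_special(text)
        labels.append(f"[{tag}]")
    return "".join(labels)


segments = [("1qaz", "KB4"), ("i", "PRON"), ("love", "VERB"), ("you", "PRON"),
            ("123", "number"), ("@", "special")]
print(semantic_structure(segments))  # [KB4][PRON][VERB][PRON][num3][spec1]
```

The checks run from the most specific meaning to the least, so a segment such as "2021" that fails the year range falls through to the length-based label [num4].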

In this embodiment, 13 leaked password databases were selected, comprising 6 Chinese databases (CSDN, Tianya, Youku, 17173, Aipai, Duduniu) and 7 English databases (LinkedIn, Zoosk, Myspace, Rockyou, MyHeritage, Gmail, Webhost), to test the segmentation performance of the method; the specific results are shown in Table 1.

TABLE 1

The test treats a segmentation result that contains no [NN] factor as a successful segmentation of the password. The embodiment achieves a high segmentation success rate on both the Chinese and the English leaked password databases; in particular, the success rate on the Chinese databases exceeds 90 percent, which demonstrates the effectiveness of the embodiment.
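The success criterion can be expressed in a few lines of Python. The label sequences below are made-up examples used only to show the calculation, not results from any of the tested databases, and the function name `success_rate` is hypothetical.

```python
def success_rate(structures):
    """Fraction of passwords whose label sequence contains no NN factor."""
    if not structures:
        return 0.0
    successes = sum(1 for labels in structures if "NN" not in labels)
    return successes / len(structures)


# Hypothetical label sequences for four passwords.
example = [["KB4", "PRON", "VERB"], ["NN", "num3"], ["PY", "year"], ["abbr", "spec1"]]
print(success_rate(example))  # 0.75
```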

In a comparison with the prior art, four leaked password databases were tested: two Chinese databases (17173 and Aipai) and two English databases (LinkedIn and Gmail). The segmentation success rates of this embodiment on the four databases are 92.22%, 91.24%, 79.37% and 84.19% respectively, clearly higher than the 65.17%, 60.88%, 62.26% and 67.14% obtained with the prior art.

The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims

1. A password word segmentation system based on semantic structure, comprising: a preprocessing module, a natural language processing (NLP) semantic extraction module and a non-natural-language-processing (non-NLP) semantic labeling module, wherein: the preprocessing module receives a password to be segmented, extracts the special semantic factors in the password that cannot be identified in subsequent steps, pre-segments the remaining part by character type, outputs the alphabetic part to the NLP semantic extraction module, and outputs the non-alphabetic part to the non-NLP semantic labeling module; the NLP semantic extraction module uses an NLP tool to segment the alphabetic part of the password into its semantic factors; and the non-NLP semantic labeling module semantically labels the parts of the password that the NLP tool cannot segment.

2. The system of claim 1, wherein the NLP semantic extraction module comprises: a word segmentation unit, a part-of-speech (POS) tagging unit and a semantic classification unit, wherein: the word segmentation unit uses the Natural Language Toolkit (NLTK) to segment the alphabetic part received from the preprocessing module and outputs the result to the POS unit; the POS unit tags all input factors with the POS module of NLTK and outputs the semantic factors that need further classification to the semantic classification unit; the semantic classification unit further classifies named entity factors by string matching, labeling them as place name, month, male name, female name or Chinese name abbreviation; it looks up unidentified factors in a pinyin list and labels those that match as pinyin; an unidentified factor that does not match is labeled as an abbreviation when it satisfies the rule 'longer than 3 characters and consisting entirely of consonant letters', and otherwise keeps its unidentified label.

3. The system of claim 1, wherein the non-NLP semantic labeling module comprises: a digit labeling unit and a special character labeling unit, wherein: the digit labeling unit gives digit segments that carry specific semantics the corresponding label and labels digit segments of unknown semantics according to their length; the special character labeling unit labels special character segments according to their length.

4. The system according to claim 2, wherein the recognition of named entities by the word segmentation unit specifically comprises: using a two-pass segmentation algorithm in which named entity lists covering four semantic factors ([location], [month], [male_name], [female_name]) are first added to the NLTK tool for segmentation, and, when unrecognized segments remain after the first pass, named entity lists covering five semantic factors ([location], [month], [male_name], [female_name], [cn_name_abbr]) are added to the NLTK tool for a second segmentation.

5. The system of claim 2, wherein the semantic classification unit labels [NP] segments as named entities ([location], [month], [male_name], [female_name], [cn_name_abbr]); each [NN] segment is first checked by string matching: when it is in the pinyin list it is labeled [PY]; otherwise, when it is longer than 3 characters and consists entirely of consonant letters, it is judged to be a possible English abbreviation ([abbr]); otherwise the [NN] label is kept unchanged.

6. A semantic structure based password segmentation method based on the system of any preceding claim, comprising the steps of:

S1) the preprocessing module reads the password P to be segmented;

S2) in the keyboard structure extraction unit, for a substring c_1 c_2 ... c_n of the password, when every pair c_i and c_(i+1) is adjacent on the keyboard and their <shift> key states are the same, the substring is judged to be a keyboard structure ([KB]), and its label is determined by its length ([KB4], [KB5], ...);

S3) the website extraction unit detects whether the password contains a website address through the prefixes 'www.' and 'http://': when 'www.' or 'http://' is detected, a later substring matches an entry in a list of common domain name suffixes, and one or more strings separated by '.' lie between the two, the string from the prefix to the domain name suffix is judged to be a website address ([website]);

S4) the email extraction unit takes '@' + domain name as the email format and retains the user name before '@' as an ordinary string; when a string of the form '@' + domain name is matched in the password, it is judged to be an email address ([email]);

S5) the unlabeled part of the password is output to the character segmentation unit, pre-segmented by character type (digits, letters and special characters), and the resulting segments are labeled [number], [word] and [special] respectively;

S6) the segments labeled [word] are output to NLTK, wherein the corpora used during segmentation are the Brown corpus and the Web Text corpus, to which several named entity lists are added;

S7) the segmentation result is output to the POS unit for semantic labeling with the following semantic factors: pronouns ([PRON]), nouns ([NOUN]), determiners ([DET]), adjectives ([ADJ]), verbs ([VERB]), prepositions ([ADP]), adverbs ([ADV]), particles ([PRT]), conjunctions ([CONJ]), English words representing numbers ([NUM]) and affixes ([X]);

S8) after POS labeling, the [NP] and [NN] segments undergo further semantic classification; each [NP] segment is labeled as one of the named entities ([location], [month], [male_name], [female_name], [cn_name_abbr]) by string matching against the named entity lists;

S9) each [NN] segment is first checked by string matching: when it is in the pinyin list it is labeled [PY]; otherwise, when it is longer than 3 characters and consists entirely of consonant letters, it is judged to be an English abbreviation ([abbr]); otherwise the [NN] label is kept unchanged;

S10) the segments labeled [number] in S5) are output to the digit semantic classification unit: a 4-digit segment with a value between 1900 and 2020 is considered a year and labeled [year]; a 6-digit segment that satisfies the YYMMDD date format is judged to be a date and labeled [YYMMDD]; an 8-digit segment that satisfies the YYYYMMDD date format is judged to be a date and labeled [YYYYMMDD]; an 11-digit segment that satisfies the mobile phone number format is judged to be a mobile phone number and labeled [mobilephone]; the remaining digit segments are labeled [num1], [num2], ... according to their length;

S11) the [special] segments from S5) are input to the special character labeling unit and labeled [spec1], [spec2], ... according to their length.

7. The method of claim 6, wherein the named entity lists in step S6) comprise: four English semantic factors ([location], [month], [male_name], [female_name]) and Chinese name abbreviations ([cn_name_abbr]); in the first pass the Chinese name abbreviations are left out and only the four English named entity lists are added for segmentation, and when the segmentation result contains unidentified segments, the Chinese name abbreviations are added for a second segmentation.

8. The method of claim 6, wherein in step S7) POS labeling uses a sequential backoff tagger, in which a Brown trigram tagger is tried first, then a bigram tagger, and finally a unigram tagger; segments that appear in the named entity lists are labeled [NP], and unrecognized segments are labeled [NN].