CN111553155B - Password word segmentation system and method based on semantic structure - Google Patents


Info

Publication number
CN111553155B
Authority
CN
China
Prior art keywords
semantic
password
name
unit
nlp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010356699.0A
Other languages
Chinese (zh)
Other versions
CN111553155A (en
Inventor
邱卫东
贾兴磊
田昊
郭捷
唐鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiao Tong University
Original Assignee
Shanghai Jiao Tong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiao Tong University filed Critical Shanghai Jiao Tong University
Priority to CN202010356699.0A priority Critical patent/CN111553155B/en
Publication of CN111553155A publication Critical patent/CN111553155A/en
Application granted granted Critical
Publication of CN111553155B publication Critical patent/CN111553155B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 User authentication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

A password segmentation system and method based on semantic structure. The system comprises a preprocessing module, an NLP semantic extraction module and a non-NLP semantic annotation module, wherein: the preprocessing module receives a password to be segmented, extracts the special semantic factors in the password that cannot be identified in subsequent steps, pre-segments the remaining parts according to character type, outputs the letter parts to the NLP semantic extraction module, and outputs the non-letter parts to the non-NLP semantic annotation module; the NLP semantic extraction module segments the letter parts of the password using an NLP tool to obtain multiple semantic factors; the non-NLP semantic annotation module semantically annotates the parts of the password that cannot be segmented by the NLP tool. Using a corpus, the invention segments a password according to the semantic information it contains and identifies its semantic structure, and it can accurately segment passwords set by both Chinese and English users.

Description

Password segmentation system and method based on semantic structure

Technical Field

The present invention relates to a technology in the field of computer security, specifically a password segmentation system and method based on semantic structure.

Background Art

Because text passwords offer good security and usability, they are still widely used for user authentication in computer systems and online services. Most passwords are chosen by the users themselves, and to make them easy to remember, users tend to pick strings that carry specific semantics or follow specific patterns. Studying the semantic structure of passwords is therefore important for improving the security of user passwords.

Unlike natural language, passwords have no fixed grammatical structure: when setting a password, a user can combine semantic factors arbitrarily within the rules of the website. Word segmentation methods designed for natural language are therefore not suitable for password segmentation.

Most earlier studies of password semantic structure targeted the passwords of English users. Because passwords set by English users and Chinese users differ, segmentation methods proposed for English passwords often perform poorly on leaked Chinese password databases. In recent years several researchers have begun to study the passwords of Chinese users. Their work shows that adding extra semantic information to the segmentation system is effective, but which information to add and how to add it remains a subjective judgment, and no systematic method exists.

Summary of the Invention

In view of the above deficiencies in the prior art, the present invention proposes a password segmentation system and method based on semantic structure. Using a corpus, it segments a password according to the semantic information the password contains, identifies the semantic structure of the password, and can accurately segment passwords set by both Chinese and English users.

The present invention is achieved through the following technical solutions:

The invention relates to a password segmentation system based on semantic structure, comprising: a preprocessing module, a natural language processing (NLP) semantic extraction module and a non-natural language processing (non-NLP) semantic annotation module, wherein: the preprocessing module receives a password to be segmented, extracts the special semantic factors in the password that cannot be identified in subsequent steps, pre-segments the remaining parts according to character type, outputs the letter parts to the NLP semantic extraction module, and outputs the non-letter parts to the non-NLP semantic annotation module; the NLP semantic extraction module segments the letter parts of the password using an NLP tool to obtain multiple semantic factors; the non-NLP semantic annotation module semantically annotates the parts of the password that cannot be segmented using the NLP tool.

The special semantic factors include: keyboard structures, websites and email addresses.

The parts that cannot be segmented using the NLP tool include: numbers and special characters.

The preprocessing module includes: a keyboard structure extraction unit, an email extraction unit, a website extraction unit and a character segmentation unit, wherein: the keyboard structure extraction unit extracts the part of the password that follows the layout of the keyboard keys, that is, the keyboard structure in the password; the email extraction unit extracts any email address contained in the password; the website extraction unit extracts any website contained in the password; and the character segmentation unit segments the password according to character type.

The NLP semantic extraction module includes: a word segmentation unit, a part-of-speech (POS) tagging unit and a semantic classification unit, wherein: the word segmentation unit uses the Natural Language Toolkit (NLTK) to segment the letter parts received from the preprocessing module and outputs the result to the POS unit; the POS unit uses the POS module of NLTK to tag each input factor and outputs the semantic factors that need further classification to the semantic classification unit; the semantic classification unit uses string matching to further classify named entity factors into the categories place name, month, male name, female name and Chinese name abbreviation. It also matches unrecognized factors against a pinyin list and labels the matched factors as pinyin; an unmatched factor that consists only of consonant letters and is longer than 3 characters is labeled as an abbreviation, and any other unmatched factor remains labeled as unrecognized.

The semantic factors that need further classification include: named entities and unrecognized segments.

The non-NLP semantic annotation module includes: a number annotation unit and a special character annotation unit, wherein: the number annotation unit labels number segments that carry specific semantics accordingly and labels number segments of unknown semantics by their length; the special character annotation unit labels special character segments by their length.

The specific semantics include: date, year and mobile phone number.

Technical Effects

The present invention solves the problem of segmenting passwords in different languages and from different leaked databases.

Compared with the prior art, the present invention extracts semantic factors that contain multiple character types, such as keyboard structures, email addresses and websites, from the password before formal segmentation. This avoids the semantic loss that segmentation by character type would otherwise cause, allows the keyboard structures contained in a password to be extracted effectively, and improves segmentation accuracy. The invention also adds multiple semantic factors to the segmentation system, including place names, Chinese name abbreviations, pinyin, abbreviations, mobile phone numbers, keyboard structures, websites and email addresses, which further improves accuracy and enables the segmentation of passwords from Chinese websites.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of the system structure of the present invention.

Detailed Description

As shown in FIG. 1, this embodiment relates to a password segmentation system based on semantic structure, comprising: a preprocessing module, an NLP semantic extraction module and a non-NLP semantic annotation module, wherein: the preprocessing module is connected to the NLP semantic extraction module and passes it the letter parts obtained by pre-segmentation during preprocessing; the preprocessing module is also connected to the non-NLP semantic annotation module and passes it the number and special character parts obtained by pre-segmentation during preprocessing.

The preprocessing module predefines the extraction of three special semantic factors: keyboard structure, website and email address.

In the keyboard structure extraction unit, for a substring of the password [formula image BDA0002473712320000021], when the characters given in [formula image BDA0002473712320000035] are adjacent on the keyboard and have the same <shift> key state, the substring [formula image BDA0002473712320000031] is judged to be a keyboard structure ([KB]). The specific label is determined by its length ([KB4], [KB5], ...).

The website extraction unit detects whether the password contains a website through the prefixes “www.” and “http://”. When “www.” or “http://” is detected, a later substring matches an entry in the list of common domain name suffixes, and the text between the two consists of one or more strings separated by “.”, the string from the prefix to the domain name suffix is judged to be a website ([Website]).

In the email extraction unit, the format “‘@’ + domain name” is taken as the email format, while the user name before “@” is kept as an ordinary string and segmented in later steps. When a string of the form “‘@’ + domain name” is matched in the password, it is judged to be an email address ([email]).

The NLP semantic extraction module includes: a word segmentation unit, a part-of-speech (POS) tagging unit and a semantic classification unit. To identify named entities, the word segmentation unit uses a two-pass segmentation algorithm: in the first pass, named entity lists covering four semantic factors ([location], [month], [male_name], [female_name]) are added to the NLTK tool for segmentation; if unrecognized segments remain after the first pass, named entity lists covering five semantic factors ([location], [month], [male_name], [female_name], [cn_name_abbr]) are added to the NLTK tool for a second pass. The semantic classification unit labels each [NP] segment as one of the named entities ([location], [month], [male_name], [female_name], [cn_name_abbr]) by string matching against the named entity lists. For an [NN] segment, string matching is applied first: if the segment is in the pinyin list, it is labeled [PY]; otherwise, if its length is greater than 3 and it consists only of consonant letters, it is judged to be a probable English abbreviation ([abbr]); if neither case applies, the [NN] label is kept.

Based on the above system, the password segmentation method based on semantic structure of this embodiment comprises the following steps:

S1) The preprocessing module reads the password P to be segmented.

S2) In the keyboard structure extraction unit, for a substring of the password [formula image BDA0002473712320000032], when the characters given in [formula image BDA0002473712320000033] are adjacent on the keyboard and have the same <shift> key state, the substring [formula image BDA0002473712320000034] is judged to be a keyboard structure ([KB]). The specific label is determined by its length ([KB4], [KB5], ...).
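
As a rough illustration of S2, the following Python sketch finds runs of characters that are adjacent on the keyboard and share the same <shift> state. The QWERTY layout tables, the staggered-row adjacency rule, the function names and the minimum run length of 4 (suggested by [KB4] being the shortest label listed) are all assumptions made for this example; the patent text does not specify them.

```python
# Assumed US QWERTY layout; row strings are aligned so that index c in one row
# sits below indices c and c+1 of the row above (staggered keys).
ROWS = ["1234567890-=", "qwertyuiop[]", "asdfghjkl;'", "zxcvbnm,./"]
SHIFTED = ["!@#$%^&*()_+", "QWERTYUIOP{}", 'ASDFGHJKL:"', "ZXCVBNM<>?"]

# Map every character to (row, column, shift state).
KEYS = {}
for shifted, layout in ((False, ROWS), (True, SHIFTED)):
    for r, row in enumerate(layout):
        for c, ch in enumerate(row):
            KEYS[ch] = (r, c, shifted)


def adjacent(a, b):
    """True if a and b are neighbouring keys typed with the same <shift> state."""
    if a not in KEYS or b not in KEYS:
        return False
    ra, ca, sa = KEYS[a]
    rb, cb, sb = KEYS[b]
    if sa != sb:
        return False
    if ra == rb:
        return abs(ca - cb) == 1
    if abs(ra - rb) != 1:
        return False
    # Order the two keys so that (ru, cu) is the one in the upper row.
    (ru, cu), (rl, cl) = sorted([(ra, ca), (rb, cb)])
    return cu in (cl, cl + 1)


def keyboard_runs(password, min_len=4):
    """Return (start, end, label) for each maximal keyboard run of at least min_len."""
    runs, start = [], 0
    for i in range(1, len(password) + 1):
        if i == len(password) or not adjacent(password[i - 1], password[i]):
            if i - start >= min_len:
                runs.append((start, i, f"KB{i - start}"))
            start = i
    return runs
```

Under these assumptions, `keyboard_runs("1qaziloveyou123@")` reports only the run `1qaz`, labeled `KB4`, which agrees with the preprocessing example given for this embodiment.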

S3) The website extraction unit detects whether the password contains a website through the prefixes “www.” and “http://”. When “www.” or “http://” is detected, a later substring matches an entry in the list of common domain name suffixes, and the text between the two consists of one or more strings separated by “.”, the string from the prefix to the domain name suffix is judged to be a website ([Website]).

S4) The email extraction unit takes the format “‘@’ + domain name” as the email format, while the user name before “@” is kept as an ordinary string and segmented in later steps. When a string of the form “‘@’ + domain name” is matched in the password, it is judged to be an email address ([email]).
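
The rules in S3 and S4 can be approximated with two regular expressions. This is a minimal sketch: the suffix list, the set of characters allowed between dots, and the names `SUFFIXES`, `WEBSITE_RE`, `EMAIL_RE` and `extract_special` are hypothetical, since the patent does not enumerate its list of common domain name suffixes.

```python
import re

# Hypothetical list of common domain name suffixes (not given in the patent).
SUFFIXES = ("com", "net", "org", "cn")
_SUFFIX = "(?:" + "|".join(SUFFIXES) + ")"
# One or more strings, each followed by ".", before the suffix.
_LABELS = r"(?:[A-Za-z0-9-]+\.)+"

# Website: prefix "http://" or "www.", dot-separated strings, known suffix.
WEBSITE_RE = re.compile(r"(?:http://|www\.)" + _LABELS + _SUFFIX)
# Email: only "@" + domain name is extracted; the user name is left for later steps.
EMAIL_RE = re.compile("@" + _LABELS + _SUFFIX)


def extract_special(password):
    """Return sorted (start, end, label) spans for websites and email domains."""
    found = []
    for label, pattern in (("Website", WEBSITE_RE), ("email", EMAIL_RE)):
        for m in pattern.finditer(password):
            found.append((m.start(), m.end(), label))
    return sorted(found)
```

For example, in `zhangsan@163.com` only `@163.com` is marked as `email`, so `zhangsan` remains available for word segmentation, as S4 describes.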

S5) The unlabeled parts of the password are output to the character segmentation unit and pre-segmented according to character type (numbers, letters, special characters), labeled respectively as number ([number]), letter ([word]) and special character ([special]).

After the preprocessing module, the password 1qaziloveyou123@ becomes (1qaz, KB4), (iloveyou, word), (123, number), (@, special).
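
The character-type pre-segmentation of S5 can be sketched with a single regular expression. The character classes below (ASCII letters, ASCII digits, everything else) and the names `TOKEN_RE` and `pre_segment` are assumptions for illustration.

```python
import re

# Each named group corresponds to one of the three labels used in S5.
TOKEN_RE = re.compile(
    r"(?P<word>[A-Za-z]+)|(?P<number>[0-9]+)|(?P<special>[^A-Za-z0-9]+)"
)


def pre_segment(text):
    """Split text into maximal runs of one character type, with their labels."""
    return [(m.group(), m.lastgroup) for m in TOKEN_RE.finditer(text)]
```

Applied to the part of the example password that remains after `1qaz` has been extracted, `pre_segment("iloveyou123@")` yields the `word`, `number` and `special` segments listed above.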

S6) The segments labeled [word] are output to NLTK. The corpora used for segmentation are the Brown corpus and the Web Text corpus, to which several named entity lists are added. These lists represent five semantic factors: four English semantic factors ([location], [month], [male_name], [female_name]) and Chinese name abbreviations ([cn_name_abbr]). In the first pass the Chinese name abbreviations are left out and only the four English named entity lists are used for segmentation; if the result contains unrecognized segments, the Chinese name abbreviations are added and segmentation is run a second time.
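
The two-pass control flow of S6 can be sketched as follows. Note that this is a stand-in, not the NLTK-based segmentation the patent uses: the text does not describe the segmentation algorithm itself, so a simple dictionary-based dynamic program is assumed here, and the tiny lexicons and the names `segment` and `two_pass_segment` are hypothetical.

```python
def segment(text, lexicon):
    """Split text into lexicon words, minimising the number of unrecognised
    characters first and the number of tokens second (assumed criterion)."""
    # best[i] = (unknown chars, token count, tokens) for the prefix text[:i]
    best = [(0, 0, [])]
    for i in range(1, len(text) + 1):
        u, k, toks = best[i - 1]
        # Fallback: treat the last character as unrecognised.
        cand = (u + 1, k + 1, toks + [(text[i - 1], False)])
        for j in range(i):
            if text[j:i] in lexicon:
                u, k, toks = best[j]
                cand = min(cand, (u, k + 1, toks + [(text[j:i], True)]))
        best.append(cand)
    # Merge neighbouring unrecognised characters into one segment.
    merged = []
    for tok, known in best[-1][2]:
        if not known and merged and not merged[-1][1]:
            merged[-1] = (merged[-1][0] + tok, False)
        else:
            merged.append((tok, known))
    return merged


def two_pass_segment(text, english, cn_name_abbr):
    """First pass without Chinese name abbreviations; second pass adds them
    only if the first pass left unrecognised segments."""
    first = segment(text, english)
    if all(known for _, known in first):
        return first
    return segment(text, english | cn_name_abbr)
```

With the toy lexicons `{"i", "love", "you"}` and `{"zxl"}`, `iloveyou` is fully segmented in the first pass, while `lovezxl` needs the second pass before `zxl` is recognised.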

S7) After segmentation, the result is output to the POS unit for semantic tagging with the following semantic factors, each of which has a specific meaning: pronoun ([PRON]), noun ([NOUN]), determiner ([DET]), adjective ([ADJ]), verb ([VERB]), preposition ([ADP]), adverb ([ADV]), particle ([PRT]), conjunction ([CONJ]), English number word ([NUM]) and affix ([X]). POS tagging uses a sequential backoff tagger: a Brown trigram tagger is tried first, then a bigram tagger, and finally a unigram tagger. Segments that appear in the named entity lists are labeled [NP], and unrecognized segments are labeled [NN].

S8) After POS tagging, the [NP] and [NN] segments are further classified semantically. Each [NP] segment is labeled as one of the named entities ([location], [month], [male_name], [female_name], [cn_name_abbr]) by string matching against the named entity lists.

S9) For an [NN] segment, string matching is applied first: if the segment is in the pinyin list, it is labeled [PY]; otherwise, if its length is greater than 3 and it consists only of consonant letters, it is judged to be a probable English abbreviation ([abbr]); if neither case applies, the [NN] label is kept.
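
The decision order in S9 can be written as a short function. The pinyin list here is a tiny illustrative sample, and treating every letter other than a, e, i, o, u (including y) as a consonant is an assumption; `PINYIN` and `classify_nn` are hypothetical names.

```python
# Tiny illustrative pinyin list; a real list would hold all pinyin syllables.
PINYIN = {"wo", "ai", "ni", "zhang", "wang", "li"}
VOWELS = set("aeiou")


def classify_nn(segment):
    """Relabel an [NN] segment as PY, abbr, or leave it as NN."""
    s = segment.lower()
    if s in PINYIN:
        return "PY"
    # Longer than 3 characters and made up of consonant letters only.
    if len(s) > 3 and s.isalpha() and not (set(s) & VOWELS):
        return "abbr"
    return "NN"
```

The pinyin check comes first, so a segment that is in the pinyin list is never relabeled as an abbreviation.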

S10) The segments labeled [number] in S5) are output to the number semantic classification unit. A 4-digit segment whose value is between 1900 and 2020 is regarded as a year and labeled [year]; a 6-digit segment that satisfies the YYMMDD date format is judged to be a date and labeled [YYMMDD]; an 8-digit segment that satisfies the YYYYMMDD date format is judged to be a date and labeled [YYYYMMDD]; an 11-digit segment that satisfies the mobile phone number format is judged to be a mobile phone number and labeled [mobilephone]. The remaining number segments are labeled by their length as [num1], [num2], ....
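
The rules of S10 map directly onto a small classifier. In this sketch, date validity is checked with `datetime.strptime`, and the mobile phone pattern (11 digits, starting with 1, second digit 3 to 9) is an assumption, because the patent does not spell out the mobile phone number format. The function names are hypothetical.

```python
import re
from datetime import datetime

# Assumed format of a mainland China mobile number; not specified in the patent.
MOBILE_RE = re.compile(r"1[3-9][0-9]{9}")


def _is_date(segment, fmt):
    """True if segment parses as a real calendar date in the given format."""
    try:
        datetime.strptime(segment, fmt)
        return True
    except ValueError:
        return False


def classify_number(segment):
    """Label a digit-only segment following the rules of S10."""
    n = len(segment)
    if n == 4 and 1900 <= int(segment) <= 2020:
        return "year"
    if n == 6 and _is_date(segment, "%y%m%d"):
        return "YYMMDD"
    if n == 8 and _is_date(segment, "%Y%m%d"):
        return "YYYYMMDD"
    if n == 11 and MOBILE_RE.fullmatch(segment):
        return "mobilephone"
    return f"num{n}"
```

Segments that fail the semantic checks fall through to the length-based label, so `2021` becomes `num4` and an impossible date such as `991340` becomes `num6`.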

S11) The [special] segments from S5) are input to the special character annotation unit and labeled by their length as [spec1], [spec2], ....

S12) The labels of all segments are combined in order to form the semantic structure of the password.
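
S11 and S12 amount to length-based labeling followed by concatenation of the labels in order. The sketch below assumes the segments have already been labeled by the earlier steps; the sample segments and the names `label_special` and `semantic_structure` are made up for illustration.

```python
def label_special(segment):
    """S11: label a special-character segment by its length."""
    return f"spec{len(segment)}"


def semantic_structure(labeled_segments):
    """S12: join the labels of all segments, in order, into one structure string."""
    return "".join(f"[{label}]" for _, label in labeled_segments)


# Hypothetical, already-labeled segments of one password.
segments = [
    ("1qaz", "KB4"),
    ("zhang", "PY"),
    ("1995", "year"),
    ("@@", label_special("@@")),
]
structure = semantic_structure(segments)
```

Here `structure` is the string `[KB4][PY][year][spec2]`.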

In this embodiment, 13 leaked password databases were selected to test the segmentation performance of the method: 6 Chinese databases (CSDN, Tianya, Youku, 17173, Aipai, Dudu Niu) and 7 English databases (LinkedIn, Zoosk, Myspace, Rockyou, MyHeritage, Gmail, Webhost). The test results are shown in Table 1.

Table 1 [table image BDA0002473712320000051 not reproduced in this text]

In this test, a password counts as successfully segmented when its segmentation result contains no NN factor. This embodiment achieves a high segmentation success rate on both the Chinese and the English leaked databases; on the Chinese leaked databases in particular, the success rate exceeds 90% in every case, which demonstrates the effectiveness of this embodiment.
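
The success criterion can be expressed in a few lines. The label lists below are toy data invented for this sketch and are not results from the patent; `is_success` and `success_rate` are hypothetical names.

```python
def is_success(labels):
    """A password is segmented successfully if no segment is labeled NN."""
    return "NN" not in labels


def success_rate(structures):
    """Fraction of passwords whose label sequence contains no NN."""
    if not structures:
        raise ValueError("at least one password is required")
    return sum(is_success(labels) for labels in structures) / len(structures)


# Toy label sequences for four made-up passwords.
toy = [
    ["KB4", "PY", "num3"],
    ["NN", "year"],
    ["VERB", "spec1"],
    ["abbr", "NN"],
]
rate = success_rate(toy)
```

For this toy input, two of the four label sequences contain no `NN`, so `rate` is 0.5.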

For comparison with the prior art, four leaked databases were selected for testing: two Chinese databases (17173 and Aipai) and two English databases (LinkedIn and Gmail). The segmentation success rates of this embodiment on the four databases are 92.22%, 91.24%, 79.37% and 84.19% respectively, clearly higher than the 65.17%, 60.88%, 62.26% and 67.14% of the prior art.

Those skilled in the art may partially adjust the above specific implementation in different ways without departing from the principle and purpose of the present invention. The scope of protection of the present invention is defined by the claims and is not limited by the above specific implementation; every implementation within that scope is bound by the present invention.

Claims (8)

1. A password segmentation system based on semantic structure, comprising: a preprocessing module, a natural language processing (NLP) semantic extraction module and a non-NLP semantic annotation module, wherein: the preprocessing module receives a password to be segmented, extracts the special semantic factors in the password that cannot be identified in subsequent steps, pre-segments the remaining parts according to character type, outputs the letter parts to the NLP semantic extraction module, and outputs the non-letter parts to the non-NLP semantic annotation module; the NLP semantic extraction module segments the letter parts of the password using an NLP tool to obtain multiple semantic factors; the non-NLP semantic annotation module semantically annotates the parts of the password that cannot be segmented by the NLP tool;
the preprocessing module comprises: a keyboard structure extraction unit, an email extraction unit, a website extraction unit and a character segmentation unit, wherein: the keyboard structure extraction unit extracts the part of the password that follows the layout of the keyboard keys, that is, the keyboard structure in the password; the email extraction unit extracts the email address contained in the password; the website extraction unit extracts the website contained in the password; and the character segmentation unit segments the password according to character type.
2. The system of claim 1, wherein the NLP semantic extraction module comprises: a word segmentation unit, a part-of-speech (POS) tagging unit and a semantic classification unit, wherein: the word segmentation unit segments the letter parts input from the preprocessing module using the Natural Language Toolkit (NLTK) and outputs the result to the POS unit; the POS unit tags the input factors using the POS module of NLTK and outputs the semantic factors that need further classification to the semantic classification unit; the semantic classification unit further classifies the named entity factors by string matching, labeling them with the categories place name, month, male name, female name and Chinese name abbreviation, matches unrecognized factors against a pinyin list, labels the matched factors as pinyin, labels an unmatched factor as an abbreviation when it satisfies the rule of consisting of consonant letters with a length greater than 3, and otherwise keeps it labeled as an unrecognized factor.
3. The system of claim 1, wherein the non-NLP semantic annotation module comprises: a number annotation unit and a special character annotation unit, wherein: the number annotation unit labels number segments that carry specific semantics accordingly and labels number segments of unknown semantics by their length; the special character annotation unit labels special character segments by their length.
4. The system according to claim 2, wherein the recognition of named entities by the word segmentation unit is specifically: using a two-pass segmentation algorithm, in the first pass a named entity list containing four semantic factors [location], [month], [male_name] and [female_name] is added to the NLTK tool for segmentation, and when unrecognized segments still exist after the first pass, a named entity list containing five semantic factors [location], [month], [male_name], [female_name] and [cn_name_abbr] is added to the NLTK tool for a second segmentation pass.
5. The system of claim 2, wherein the semantic classification unit labels an [NP] segment as one of the named entities [location], [month], [male_name], [female_name], [cn_name_abbr] by string matching against the named entity list; for an [NN] segment, string matching is applied first: when the segment is in the pinyin list, it is labeled [PY]; otherwise, when its length is greater than 3 and it consists only of consonant letters, it is judged to be a probable English abbreviation [abbr]; when neither case applies, the [NN] label is kept unchanged.
6. A method of password segmentation based on semantic structures based on the system of any one of the preceding claims, comprising the steps of:
s1, a preprocessing module reads a password P to be segmented;
s2, in the keyboard structure extraction unit, for one substring in the password
Figure QLYQS_1
When->
Figure QLYQS_2
And
Figure QLYQS_3
adjacent on the keyboard, and it<shift>The key states are the same, the substring is determined +.>
Figure QLYQS_4
Is a keyboard structure [ KB ]]And its label is determined by its length; />
S3, detecting whether a Website exists in the password through prefixes of ' www ' and ' http:// ', and when ' www ' or ' http:// ' is detected and the sub-string is matched with a sub-string in a common domain name suffix list, one or more character strings separated by ' are arranged between the two sub-strings, judging that the character string from the prefix to the domain name suffix is the Website [ Website ];
s4, the email box extracting unit takes the format of the '@' + domain name as the format of the email box, and takes the user name before the '@' as a common character string; when the character string is matched with the character string in the format of the '@' + domain name, judging that the character string is an electronic mailbox [ email ];
S5, the unlabeled parts of the password are output to the character segmentation unit and pre-segmented by character type (digits, letters, and special characters), each resulting fragment being labeled as number, word (letters), or special accordingly;
S6, the fragments labeled as word are output to NLTK for word segmentation; the corpora used in the segmentation are the Brown corpus and the Web Text corpus, to which several named entity lists are added;
S7, the word segmentation result is output to the POS unit for semantic annotation, with the following labels: pronouns [PRON], nouns [NOUN], determiners [DET], adjectives [ADJ], verbs [VERB], prepositions [ADP], adverbs [ADV], particles [PRT], conjunctions [CONJ], English words representing numbers [NUM], and suffixes [X];
S8, after POS labeling, the [NN] segments are further semantically classified; by string matching against the named entity lists, each [NP] segment is labeled as one of the named entities [location], [month], [male_name], [female_name], [cn_name_abbr];
S9, for an [NN] segment, string matching is applied first: when the segment is in the pinyin list, it is labeled [PY]; otherwise, when its length is greater than 3 and its letters are all consonants, it is judged to be an English abbreviation [abbr]; when neither case applies, the [NN] label is kept unchanged;
S10, the segments labeled as numbers in S5 are output to the digit semantic classification unit: a 4-digit segment between 1900 and 2020 is regarded as a year and labeled as a year; a 6-digit segment that satisfies the YYMMDD date format is determined to be a date and labeled [YYMMDD]; an 8-digit segment that satisfies the YYYYMMDD date format is determined to be a date and labeled [YYYYMMDD]; an 11-digit segment that satisfies the mobile phone number format is determined to be a mobile phone number and labeled [mobile]; all remaining digit segments are labeled according to their length;
S11, the special-character segments from S5 are input to the special character labeling unit and labeled according to their length;
S12, the labels of all segments are concatenated in order to form the semantic structure of the password.
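As an illustration of the digit classification in step S10, the following is a minimal sketch rather than the patented implementation. Several details are assumptions not stated in the claim: the label strings `"year"` and `"D<length>"`, the mobile format (11 digits, leading 1, second digit 3 to 9), and accepting a YYMMDD date if it is valid in either the 1900s or the 2000s.

```python
import re
from datetime import date

# Assumption: mainland-China style mobile numbers; the claim only says "mobile phone number format".
MOBILE_RE = re.compile(r"1[3-9]\d{9}")


def _valid_date(year: int, month: int, day: int) -> bool:
    """True when (year, month, day) is a real calendar date."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False


def classify_digits(seg: str) -> str:
    """Label a digit-only segment following the order of checks in step S10."""
    n = len(seg)
    if n == 4 and 1900 <= int(seg) <= 2020:
        return "year"  # hypothetical label name
    if n == 6:
        yy, mm, dd = int(seg[:2]), int(seg[2:4]), int(seg[4:])
        # Assumption: the century is unknown, so accept either one.
        if any(_valid_date(c + yy, mm, dd) for c in (1900, 2000)):
            return "YYMMDD"
    if n == 8 and _valid_date(int(seg[:4]), int(seg[4:6]), int(seg[6:])):
        return "YYYYMMDD"
    if n == 11 and MOBILE_RE.fullmatch(seg):
        return "mobile"
    # Remaining digit segments are labeled by length (hypothetical "D<length>" form).
    return f"D{n}"
```

Checking the calendar validity with `datetime.date` rejects strings such as `"901301"` (month 13), which then fall back to the length-based label.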
7. The password segmentation method according to claim 6, wherein the named entity lists in step S6 include four English semantic factors, [location], [month], [male_name], and [female_name], and the Chinese name abbreviation [cn_name_abbr]; the Chinese name abbreviation list is not added at first: only the four English named entity lists are added for the first word segmentation, and when the segmentation result contains unrecognized fragments, the Chinese name abbreviation list is added for a second word segmentation.
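The two-pass idea in the claim above can be sketched as follows. This is an illustration only: the greedy longest-match `segment` function is a simplified stand-in for the NLTK-based segmentation, and the function names and example list contents are hypothetical.

```python
def segment(text, vocab):
    """Greedy longest-match segmentation (a stand-in for the real segmenter).

    Returns (fragment, label) pairs; runs of unmatched characters get label None.
    """
    out, i, unknown = [], 0, ""
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                if unknown:  # flush any pending unrecognized run
                    out.append((unknown, None))
                    unknown = ""
                out.append((text[i:j], vocab[text[i:j]]))
                i = j
                break
        else:
            # No vocabulary entry starts at position i.
            unknown += text[i]
            i += 1
    if unknown:
        out.append((unknown, None))
    return out


def two_pass_segment(text, english_lists, cn_abbr_list):
    """First pass with the English lists only; second pass adds [cn_name_abbr] if needed."""
    vocab = {w: label for label, words in english_lists.items() for w in words}
    result = segment(text, vocab)
    if any(label is None for _, label in result):
        for w in cn_abbr_list:
            vocab.setdefault(w, "cn_name_abbr")  # English entries keep priority
        result = segment(text, vocab)
    return result


# Hypothetical example lists, for illustration only.
ENGLISH = {"location": ["paris"], "month": ["may"],
           "male_name": ["john"], "female_name": ["mary"]}
CN_ABBR = ["zxl"]
```

With these example lists, `two_pass_segment("johnzxl", ENGLISH, CN_ABBR)` needs the second pass to recognize `zxl`, whereas `"maryparis"` is fully resolved in the first pass.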
8. The method of claim 6, wherein step S7 uses a sequential backoff tagger for POS labeling: a trigram tagger trained on the Brown corpus is applied first, then a bigram tagger, and finally a unigram tagger; segments appearing in the named entity lists are labeled [NP], and unidentified segments are labeled [NN].
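The backoff chain in the claim above (trigram, then bigram, then unigram, then a default label) can be illustrated in plain Python. This is a conceptual sketch under assumptions, not the claimed implementation: the toy training sentences, the function names, and the lookup tables are invented here, and the [NP] labeling of named entities is not modeled. In practice NLTK's n-gram taggers accept a `backoff` argument that chains them in the same spirit.

```python
from collections import Counter, defaultdict


def _context(prev_tags, word, n):
    """Context for an n-gram model: the previous n-1 tags plus the current word."""
    return (tuple(prev_tags[-(n - 1):]) if n > 1 else (), word)


def train(tagged_sents, n):
    """Map each seen context to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        tags = []
        for word, tag in sent:
            counts[_context(tags, word, n)][tag] += 1
            tags.append(tag)
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def backoff_tag(words, models, default="NN"):
    """Try each model from highest to lowest n; fall back to the default label."""
    tags = []
    for word in words:
        tag = default
        for n, table in models:
            ctx = _context(tags, word, n)
            if ctx in table:
                tag = table[ctx]
                break
        tags.append(tag)
    return list(zip(words, tags))


# Toy training data (hypothetical); the claim trains on the Brown corpus instead.
TRAIN = [[("i", "PRON"), ("love", "VERB"), ("you", "PRON")],
         [("my", "DET"), ("love", "NOUN")]]
MODELS = [(n, train(TRAIN, n)) for n in (3, 2, 1)]
```

In `backoff_tag(["hello", "you"], MODELS)`, the unseen word `hello` receives the default [NN], and `you` is resolved only by the unigram table because its preceding tag context was never observed in training.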
CN202010356699.0A 2020-04-29 2020-04-29 Password word segmentation system and method based on semantic structure Active CN111553155B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010356699.0A CN111553155B (en) 2020-04-29 2020-04-29 Password word segmentation system and method based on semantic structure

Publications (2)

Publication Number Publication Date
CN111553155A (en) 2020-08-18
CN111553155B (en) 2023-05-09

Family

ID=71999272

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010356699.0A Active CN111553155B (en) 2020-04-29 2020-04-29 Password word segmentation system and method based on semantic structure

Country Status (1)

Country Link
CN (1) CN111553155B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784227A (en) * 2021-01-04 2021-05-11 上海交通大学 Dictionary generating system and method based on password semantic structure
CN113657118B (en) * 2021-08-16 2024-05-14 好心情健康产业集团有限公司 Semantic analysis method, device and system based on call text

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109460552A (en) * 2018-10-29 2019-03-12 朱丽莉 Rule-based and corpus Chinese faulty wording automatic testing method and equipment

Also Published As

Publication number Publication date
CN111553155A (en) 2020-08-18

Similar Documents

Publication Publication Date Title
Tang et al. Email data cleaning
Sun et al. Enhancing Chinese word segmentation using unlabeled data
US8447588B2 (en) Region-matching transducers for natural language processing
US8266169B2 (en) Complex queries for corpus indexing and search
US8510097B2 (en) Region-matching transducers for text-characterization
Krstev et al. Using textual and lexical resources in developing serbian wordnet
El-Shishtawy et al. An accurate arabic root-based lemmatizer for information retrieval purposes
Ahmadi A tokenization system for the Kurdish language
Cing et al. Improving accuracy of part-of-speech (POS) tagging using hidden markov model and morphological analysis for Myanmar Language
Patil et al. Issues and challenges in marathi named entity recognition
CN111553155B (en) Password word segmentation system and method based on semantic structure
Sembok et al. Arabic word stemming algorithms and retrieval effectiveness
Huang et al. Words without boundaries: Computational approaches to Chinese word segmentation
Tufiş et al. DIAC+: A professional diacritics recovering system
Khan et al. Urdu word segmentation using machine learning approaches
Jain et al. Detection and correction of non word spelling errors in Hindi language
Attia et al. Gwu-hasp: Hybrid arabic spelling and punctuation corrector
Starko et al. Ukrainian Text Preprocessing in GRAC
CN112784227A (en) Dictionary generating system and method based on password semantic structure
Naemi et al. Informal-to-formal word conversion for persian language using natural language processing techniques
Kedtiwerasak et al. Thai keyword extraction using textrank algorithm
Kumar et al. Applications of stemming algorithms in information retrieval-a review
Ren et al. A hybrid approach to automatic Chinese text checking and error correction
Bar et al. Arabic multiword expressions
Nicolov et al. Efficient spam analysis for weblogs through url segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant