CN111553155B - Password word segmentation system and method based on semantic structure - Google Patents
- Publication number
- CN111553155B CN202010356699.0A
- Authority
- CN
- China
- Prior art keywords
- semantic
- password
- name
- unit
- nlp
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/30—Authentication, i.e. establishing the identity or authorisation of security principals
- G06F21/31—User authentication
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- Software Systems (AREA)
- Machine Translation (AREA)
Abstract
Description
Technical Field
The present invention relates to a technology in the field of computer security, and in particular to a password word segmentation system and method based on semantic structure.
Background Art
Because text passwords offer good security and usability, they are still widely used for user authentication in computer systems and online services. Since most passwords are defined by the users themselves, users tend to choose strings that carry specific semantics or patterns so that the passwords are easy to remember. The study of password semantic structure is therefore of great significance for improving password security.
Unlike natural language, passwords have no fixed grammatical structure: when setting a password, a user may combine semantic factors arbitrarily, subject only to the website's rules, so word segmentation methods designed for natural language are not directly applicable to passwords.
In the past, most research on the semantic structure of passwords targeted the passwords of English-speaking users. Because passwords chosen by English-speaking and Chinese users differ, segmentation methods developed for English passwords often perform poorly on Chinese leaked password databases. In recent years several researchers have begun to study Chinese user passwords, and studies have shown that adding extra semantic information to a segmentation system is effective; however, deciding what information to add and how to add it has remained a matter of subjective judgment, with no systematic method.
Summary of the Invention
In view of the above shortcomings of the prior art, the present invention proposes a password word segmentation system and method based on semantic structure, which segments a password according to the semantic information it contains, based on a corpus, identifies the semantic structure of the password, and segments passwords set by both Chinese and English users accurately.
The present invention is achieved through the following technical solutions.
The invention relates to a password word segmentation system based on semantic structure, comprising: a preprocessing module, a natural language processing (NLP) semantic extraction module, and a non-natural-language-processing (non-NLP) semantic annotation module, wherein: the preprocessing module receives the password to be segmented, extracts the special semantic factors that could not be identified in later steps, pre-segments the remaining parts by character type, outputs the letter parts to the NLP semantic extraction module, and outputs the non-letter parts to the non-NLP semantic annotation module; the NLP semantic extraction module segments the letter parts of the password with NLP tools to obtain various semantic factors; the non-NLP semantic annotation module semantically annotates the parts of the password that cannot be segmented with NLP tools.
The special semantic factors include keyboard structures, URLs, and email addresses.
The parts that cannot be segmented with NLP tools include digits and special characters.
The preprocessing module includes a keyboard structure extraction unit, an email extraction unit, a URL extraction unit, and a character segmentation unit, wherein: the keyboard structure extraction unit extracts the parts of the password related to the physical layout of keyboard keys, i.e. the keyboard structures in the password; the email extraction unit extracts email addresses contained in the password; the URL extraction unit extracts URLs contained in the password; and the character segmentation unit splits the password by character type.
The NLP semantic extraction module includes a word segmentation unit, a part-of-speech tagging (POS) unit, and a semantic classification unit, wherein: the word segmentation unit uses the Natural Language Toolkit (NLTK) to segment the letter parts received from the preprocessing module and outputs the result to the POS unit; the POS unit uses NLTK's POS module to tag each input factor and outputs the semantic factors that need further classification to the semantic classification unit; the semantic classification unit further classifies the named-entity factors by string matching, labelling them as place name, month, male name, female name, or Chinese name abbreviation, matches unrecognized factors against a pinyin list and labels matched factors as pinyin, labels unmatched factors that satisfy the rule "consonant-only string longer than three characters" as abbreviations, and otherwise leaves them labelled as unrecognized factors.
The semantic factors that need further classification include named entities and unrecognized fragments.
The non-NLP semantic annotation module includes a digit annotation unit and a special character annotation unit, wherein: the digit annotation unit labels digit fragments that carry specific semantics accordingly and labels digit fragments of unknown semantics by their length; the special character annotation unit labels special character fragments by their length.
The specific semantics include dates, years, and mobile phone numbers.
Technical Effects
The present invention solves the problem of segmenting passwords across different languages and different leaked password databases.
Compared with the prior art, the present invention extracts, before the formal segmentation step, the semantic factors that span multiple character types, such as keyboard structures, email addresses, and URLs. This avoids the semantic loss caused by splitting on character type alone and allows keyboard structures embedded in passwords to be extracted effectively, improving segmentation accuracy. The invention also adds many semantic factors to the segmentation system, including place names, Chinese name abbreviations, pinyin, abbreviations, mobile phone numbers, keyboard structures, URLs, and email addresses, which further improves accuracy and enables segmentation of passwords from Chinese websites.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the system structure of the present invention.
Detailed Description
As shown in Fig. 1, this embodiment relates to a password word segmentation system based on semantic structure, comprising a preprocessing module, an NLP semantic extraction module, and a non-NLP semantic annotation module, wherein: the preprocessing module is connected to the NLP semantic extraction module and transmits the letter parts obtained by pre-segmentation during preprocessing, and the preprocessing module is connected to the non-NLP semantic annotation module and transmits the digit and special-character parts obtained by pre-segmentation during preprocessing.
The preprocessing module predefines the extraction of three special semantic factors (keyboard structures, URLs, and email addresses). In the keyboard structure extraction unit, when the characters of a substring of the password are adjacent on the keyboard and share the same <shift> state, the substring is judged to be a keyboard structure ([KB]); the specific label is determined by its length ([KB4], [KB5], ...). The URL extraction unit checks for URLs via the prefixes "www." and "http://": when "www." or "http://" is detected and is followed by a substring that appears in a list of common domain suffixes, with one or more "."-separated strings in between, the span from the prefix through the domain suffix is judged to be a URL ([Website]). In the email extraction unit, the format "'@' + domain name" is taken as the email format, while the user name before "@" is kept as an ordinary string to be segmented in later steps; when a substring matching the "'@' + domain name" format is found, it is judged to be an email address ([email]).
The NLP semantic extraction module includes a word segmentation unit, a part-of-speech tagging (POS) unit, and a semantic classification unit used for named-entity recognition. The word segmentation unit uses a two-pass segmentation algorithm: in the first pass, a named-entity list covering four semantic factors ([location], [month], [male_name], [female_name]) is added to the NLTK tool for segmentation; if unrecognized segments remain after the first pass, a named-entity list covering five semantic factors ([location], [month], [male_name], [female_name], [cn_name_abbr]) is added to the NLTK tool and a second pass is performed. The semantic classification unit labels each [NP] segment as one of the named entities ([location], [month], [male_name], [female_name], [cn_name_abbr]) by string matching against the named-entity lists. For [NN] segments it first checks by string matching: if the segment is in the pinyin list, it is labelled [PY]; otherwise, if its length is greater than 3 and it consists entirely of consonants, it is judged to be a possible English abbreviation ([abbr]); in all other cases the [NN] label is kept unchanged.
Based on the above system, this embodiment provides a password word segmentation method based on semantic structure, comprising the following steps:
S1) The preprocessing module reads the password P to be segmented.
S2) In the keyboard structure extraction unit, when the characters of a substring of the password are adjacent on the keyboard and share the same <shift> state, the substring is judged to be a keyboard structure ([KB]); the specific label is determined by its length ([KB4], [KB5], ...).
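The following Python sketch shows one way the adjacency check of S2) could be realized. The unstaggered QWERTY grid, the treatment of vertical and diagonal neighbours, and the minimum length of 4 are illustrative assumptions; the patent only requires adjacent keys typed with the same <shift> state.

```python
# Minimal sketch of the keyboard-structure check in S2) (assumptions: simplified
# unstaggered QWERTY grid, two planes for the two <shift> states, minimum run of 4).
UNSHIFTED = ["1234567890-=", "qwertyuiop[]", "asdfghjkl;'", "zxcvbnm,./"]
SHIFTED   = ["!@#$%^&*()_+", "QWERTYUIOP{}", 'ASDFGHJKL:"', "ZXCVBNM<>?"]

def _grid(plane):
    return {ch: (r, c) for r, row in enumerate(plane) for c, ch in enumerate(row)}

PLANES = [_grid(UNSHIFTED), _grid(SHIFTED)]

def _adjacent(a: str, b: str) -> bool:
    """True if a and b are neighbouring keys typed with the same <shift> state."""
    for grid in PLANES:
        if a in grid and b in grid:
            (r1, c1), (r2, c2) = grid[a], grid[b]
            return abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1 and (r1, c1) != (r2, c2)
    return False

def extract_keyboard_structures(password: str, min_len: int = 4):
    """Yield (start, end, label) spans judged to be keyboard structures [KBn]."""
    i = 0
    while i < len(password):
        j = i
        while j + 1 < len(password) and _adjacent(password[j], password[j + 1]):
            j += 1
        if j - i + 1 >= min_len:
            yield i, j + 1, f"[KB{j - i + 1}]"
        i = j + 1

print(list(extract_keyboard_structures("1qaziloveyou123@")))  # [(0, 4, '[KB4]')]
```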
S3) The URL extraction unit checks for URLs via the prefixes "www." and "http://": when "www." or "http://" is detected and is followed by a substring that appears in a list of common domain suffixes, with one or more "."-separated strings in between, the span from the prefix through the domain suffix is judged to be a URL ([Website]).
S4) The email extraction unit takes the format "'@' + domain name" as the email format and keeps the user name before "@" as an ordinary string to be segmented in later steps; when a substring matching the "'@' + domain name" format is found, it is judged to be an email address ([email]).
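A minimal sketch of the URL and email checks in S3)–S4); the DOMAIN_SUFFIXES list is a small assumed sample, and the exact suffix list and matching rules of the real system are not specified here.

```python
import re

# Illustrative regexes for the S3)-S4) checks; the suffix list is an assumption.
DOMAIN_SUFFIXES = ("com", "cn", "net", "org", "edu")

URL_RE = re.compile(
    r"(?:www\.|http://)"        # required prefix
    r"(?:[A-Za-z0-9-]+\.)*"     # zero or more dot-separated labels
    r"[A-Za-z0-9-]+\.(?:%s)" % "|".join(DOMAIN_SUFFIXES)
)
EMAIL_RE = re.compile(
    r"@(?:[A-Za-z0-9-]+\.)+(?:%s)" % "|".join(DOMAIN_SUFFIXES)  # "'@' + domain name"
)

def label_special_factors(password: str):
    """Return (label, span) pairs for URLs and email domains found in the password."""
    hits = [("[Website]", m.span()) for m in URL_RE.finditer(password)]
    hits += [("[email]", m.span()) for m in EMAIL_RE.finditer(password)]
    return hits

print(label_special_factors("lily@163.comabc"))      # [('[email]', (4, 12))]
print(label_special_factors("www.sina.com.cn2008"))  # [('[Website]', (0, 15))]
```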
S5) The parts of the password not yet labelled are output to the character segmentation unit and pre-segmented by character type (digits, letters, special characters), labelled [number], [word], and [special] respectively.
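The character-type pre-segmentation of S5) can be sketched with a single regular expression over maximal runs of letters, digits, and other characters:

```python
import re

# Sketch of the S5) pre-segmentation: maximal runs of letters, digits and other
# characters are labelled [word], [number] and [special].
def pre_segment(fragment: str):
    pieces = []
    for m in re.finditer(r"[A-Za-z]+|\d+|[^A-Za-z\d]+", fragment):
        s = m.group()
        if s.isalpha():
            label = "[word]"
        elif s.isdigit():
            label = "[number]"
        else:
            label = "[special]"
        pieces.append((s, label))
    return pieces

print(pre_segment("iloveyou123@"))
# [('iloveyou', '[word]'), ('123', '[number]'), ('@', '[special]')]
```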
After the preprocessing module, the password 1qaziloveyou123@ becomes (1qaz, KB4), (iloveyou, word), (123, number), (@, special).
S6) The fragments labelled [word] are output to NLTK. The corpora used for segmentation are the Brown corpus and the Web Text corpus, to which several named-entity lists are added, representing five semantic factors: four English semantic factors ([location], [month], [male_name], [female_name]) and Chinese name abbreviations ([cn_name_abbr]). The first pass adds only the four English named-entity lists, without the Chinese name abbreviations; if the segmentation result contains unrecognized fragments, the Chinese name abbreviation list is added and a second segmentation pass is performed, as sketched below.
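The patent does not spell out how NLTK is driven internally; the sketch below is one plausible realization under stated assumptions, not the patented implementation. The Brown and Web Text vocabularies plus tiny sample named-entity lists (ENGLISH_ENTITIES, CN_NAME_ABBR are illustrative) form a dictionary, and a fewest-pieces word-break is run twice, first without and then with the Chinese name-abbreviation list.

```python
import nltk
from nltk.corpus import brown, webtext

# Assumed two-pass lookup for S6): a fewest-pieces word-break over a dictionary
# built from the Brown and Web Text corpora plus sample named-entity lists.
nltk.download("brown", quiet=True)
nltk.download("webtext", quiet=True)

ENGLISH_ENTITIES = {"london": "[location]", "may": "[month]",
                    "john": "[male_name]", "lily": "[female_name]"}      # sample entries
CN_NAME_ABBR = {"zhangsan": "[cn_name_abbr]", "lilei": "[cn_name_abbr]"}  # sample entries

BASE_VOCAB = {w.lower() for w in brown.words() if w.isalpha()} | \
             {w.lower() for w in webtext.words() if w.isalpha()}

def word_break(s, vocab):
    """Split s into the fewest dictionary pieces; unmatched chars become 1-char pieces."""
    best = [None] * (len(s) + 1)
    best[0] = []
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - 20), i):
            piece = s[j:i]
            if best[j] is not None and (piece in vocab or len(piece) == 1):
                cand = best[j] + [piece]
                if best[i] is None or len(cand) < len(best[i]):
                    best[i] = cand
    return best[len(s)]

def segment_letters(fragment):
    vocab = BASE_VOCAB | set(ENGLISH_ENTITIES)
    first = word_break(fragment.lower(), vocab)
    if all(p in vocab for p in first):
        return first
    # unrecognized pieces remain: second pass with Chinese name abbreviations added
    return word_break(fragment.lower(), vocab | set(CN_NAME_ABBR))

print(segment_letters("iloveyou"))  # ['i', 'love', 'you']
```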
S7) After segmentation, the result is output to the POS unit for semantic tagging with the following semantic factors: pronoun ([PRON]), noun ([NOUN]), determiner ([DET]), adjective ([ADJ]), verb ([VERB]), preposition ([ADP]), adverb ([ADV]), particle ([PRT]), conjunction ([CONJ]), English words representing numbers ([NUM]), and affix ([X]). POS tagging uses a sequential backoff tagger chain: a Brown trigram tagger is tried first, then a bigram tagger, and finally a unigram tagger. Segments that appear in the named-entity lists are tagged [NP], and unrecognized segments are tagged [NN].
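The backoff chain in S7) corresponds closely to NLTK's sequential backoff taggers trained on the Brown corpus; a sketch using the universal tagset, whose labels match those listed above:

```python
import nltk
from nltk.corpus import brown

# Sketch of the S7) backoff chain: a trigram tagger trained on the Brown corpus
# backs off to a bigram tagger, which backs off to a unigram tagger.
nltk.download("brown", quiet=True)
nltk.download("universal_tagset", quiet=True)

train_sents = brown.tagged_sents(tagset="universal")
t1 = nltk.UnigramTagger(train_sents)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t3 = nltk.TrigramTagger(train_sents, backoff=t2)

print(t3.tag(["i", "love", "you"]))
# tags depend on the Brown training data; tokens unseen in training may come out as None
```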
S8) After POS tagging, the [NP] and [NN] segments are further classified semantically. By string matching against the named-entity lists, each [NP] segment is labelled as one of the named entities ([location], [month], [male_name], [female_name], [cn_name_abbr]).
S9) For each [NN] segment, a string-matching check is performed first: if the segment is in the pinyin list, it is labelled [PY]; otherwise, if its length is greater than 3 and it consists entirely of consonants, it is judged to be a possible English abbreviation ([abbr]); in all other cases the [NN] label is kept unchanged.
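A sketch of the [NN] refinement rules in S9); the PINYIN_LIST shown is a tiny illustrative sample of the full pinyin syllable list assumed to be available to the system.

```python
# Sketch of the S9) rules: pinyin lookup, then the "consonant-only, length > 3" test.
PINYIN_LIST = {"wang", "zhang", "xiao", "chen", "wo", "ai", "ni"}  # sample entries
CONSONANTS = set("bcdfghjklmnpqrstvwxyz")

def classify_nn(segment: str) -> str:
    s = segment.lower()
    if s in PINYIN_LIST:
        return "[PY]"
    if len(s) > 3 and all(ch in CONSONANTS for ch in s):
        return "[abbr]"
    return "[NN]"

print(classify_nn("zhang"), classify_nn("qwrtps"), classify_nn("abcd"))
# [PY] [abbr] [NN]
```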
S10) The fragments labelled [number] in S5) are output to the digit semantic classification unit. A 4-digit fragment between 1900 and 2020 is regarded as a year and labelled [year]; a 6-digit fragment that satisfies the YYMMDD date format is judged to be a date and labelled [YYMMDD]; an 8-digit fragment that satisfies the YYYYMMDD date format is judged to be a date and labelled [YYYYMMDD]; an 11-digit fragment that satisfies the mobile phone number format is judged to be a mobile phone number and labelled [mobilephone]; the remaining digit segments are labelled [num1], [num2], ... according to their length.
S11) The [special] segments from S5) are input to the special character annotation unit and labelled [spec1], [spec2], ... according to their length.
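The digit and special-character labelling of S10)–S11) reduces to a few length and format checks; the mobile-number pattern below (11 digits starting with 1, mainland-China style) and the exact date validation are simplifying assumptions.

```python
import re
from datetime import datetime

# Sketch of the S10)-S11) labelling; year range follows the description,
# date and mobile-number checks are simplified assumptions.
def _valid_date(s: str, fmt: str) -> bool:
    try:
        datetime.strptime(s, fmt)
        return True
    except ValueError:
        return False

def label_digits(seg: str) -> str:
    if len(seg) == 4 and 1900 <= int(seg) <= 2020:
        return "[year]"
    if len(seg) == 6 and _valid_date(seg, "%y%m%d"):
        return "[YYMMDD]"
    if len(seg) == 8 and _valid_date(seg, "%Y%m%d"):
        return "[YYYYMMDD]"
    if len(seg) == 11 and re.fullmatch(r"1[3-9]\d{9}", seg):
        return "[mobilephone]"
    return f"[num{len(seg)}]"

def label_special(seg: str) -> str:
    return f"[spec{len(seg)}]"

print(label_digits("1995"), label_digits("19950218"), label_digits("123"), label_special("@"))
# [year] [YYYYMMDD] [num3] [spec1]
```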
S12) The labels of all fragments are concatenated in order to form the semantic structure of the password.
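For the worked example 1qaziloveyou123@, concatenating the fragment labels in order yields the final semantic structure; the tags assumed here for the letter part (i/love/you) are illustrative, not output quoted from the patented system.

```python
# Putting the pieces together for the worked example "1qaziloveyou123@";
# the [PRON]/[VERB] tags for the letter fragments are assumed for illustration.
fragments = [
    ("1qaz", "[KB4]"),                                        # keyboard structure (S2)
    ("i", "[PRON]"), ("love", "[VERB]"), ("you", "[PRON]"),   # NLP pass (S6-S9), assumed tags
    ("123", "[num3]"),                                        # digit labelling (S10)
    ("@", "[spec1]"),                                         # special character labelling (S11)
]
semantic_structure = "".join(tag for _, tag in fragments)
print(semantic_structure)  # [KB4][PRON][VERB][PRON][num3][spec1]
```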
In this embodiment, 13 leaked password databases are selected, including 6 Chinese databases (CSDN, Tianya, Youku, 17173, Aipai, Dudu Niu) and 7 English databases (LinkedIn, Zoosk, Myspace, Rockyou, MyHeritage, Gmail, Webhost), to test the segmentation performance of the method. The specific test results are shown in Table 1.
Table 1
In this test, a password is counted as successfully segmented when its segmentation result contains no [NN] factor. As can be seen, this embodiment achieves a high segmentation success rate on both the Chinese and English leaked databases; in particular, the success rate on every Chinese leaked database exceeds 90%, which demonstrates the effectiveness of this embodiment.
For comparison with the prior art, four leaked databases were selected for testing: two Chinese databases (17173 and Aipai) and two English databases (LinkedIn and Gmail). The segmentation success rates of this embodiment on the four databases are 92.22%, 91.24%, 79.37%, and 84.19%, respectively, which are significantly higher than the prior-art rates of 65.17%, 60.88%, 62.26%, and 67.14%.
The above specific embodiment may be locally adjusted in different ways by those skilled in the art without departing from the principle and purpose of the present invention. The protection scope of the present invention is defined by the claims and is not limited by the above specific embodiment; every implementation within that scope is bound by the present invention.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010356699.0A CN111553155B (en) | 2020-04-29 | 2020-04-29 | Password word segmentation system and method based on semantic structure |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010356699.0A CN111553155B (en) | 2020-04-29 | 2020-04-29 | Password word segmentation system and method based on semantic structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111553155A CN111553155A (en) | 2020-08-18 |
CN111553155B (en) | 2023-05-09 |
Family
ID=71999272
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010356699.0A CN111553155B (en) | Password word segmentation system and method based on semantic structure | 2020-04-29 | 2020-04-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111553155B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112784227A (en) * | 2021-01-04 | 2021-05-11 | Shanghai Jiao Tong University | Dictionary generating system and method based on password semantic structure |
CN113657118B (en) * | 2021-08-16 | 2024-05-14 | Haoxinqing Health Industry Group Co., Ltd. | Semantic analysis method, device and system based on call text |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109460552A (en) * | 2018-10-29 | 2019-03-12 | Zhu Lili | Rule-based and corpus Chinese faulty wording automatic testing method and equipment |
- 2020-04-29: CN CN202010356699.0A — patent CN111553155B (en), status: Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109460552A (en) * | 2018-10-29 | 2019-03-12 | Zhu Lili | Rule-based and corpus Chinese faulty wording automatic testing method and equipment |
Also Published As
Publication number | Publication date |
---|---|
CN111553155A (en) | 2020-08-18 |
Similar Documents
Publication | Title |
---|---|
Tang et al. | Email data cleaning |
Sun et al. | Enhancing Chinese word segmentation using unlabeled data |
US8447588B2 | Region-matching transducers for natural language processing |
US8266169B2 | Complex queries for corpus indexing and search |
US8510097B2 | Region-matching transducers for text-characterization |
Krstev et al. | Using textual and lexical resources in developing serbian wordnet |
El-Shishtawy et al. | An accurate arabic root-based lemmatizer for information retrieval purposes |
Ahmadi | A tokenization system for the Kurdish language |
Cing et al. | Improving accuracy of part-of-speech (POS) tagging using hidden markov model and morphological analysis for Myanmar Language |
Patil et al. | Issues and challenges in marathi named entity recognition |
CN111553155B (en) | Password word segmentation system and method based on semantic structure |
Sembok et al. | Arabic word stemming algorithms and retrieval effectiveness |
Huang et al. | Words without boundaries: Computational approaches to Chinese word segmentation |
Tufiş et al. | DIAC+: A professional diacritics recovering system |
Khan et al. | Urdu word segmentation using machine learning approaches |
Jain et al. | Detection and correction of non word spelling errors in Hindi language |
Attia et al. | Gwu-hasp: Hybrid arabic spelling and punctuation corrector |
Starko et al. | Ukrainian Text Preprocessing in GRAC |
CN112784227A (en) | Dictionary generating system and method based on password semantic structure |
Naemi et al. | Informal-to-formal word conversion for persian language using natural language processing techniques |
Kedtiwerasak et al. | Thai keyword extraction using textrank algorithm |
Kumar et al. | Applications of stemming algorithms in information retrieval-a review |
Ren et al. | A hybrid approach to automatic Chinese text checking and error correction |
Bar et al. | Arabic multiword expressions |
Nicolov et al. | Efficient spam analysis for weblogs through url segmentation |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |