CN111553155A - Password word segmentation system and method based on semantic structure - Google Patents

Publication number
CN111553155A
Authority
CN
China
Prior art keywords
semantic
password
segments
word segmentation
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010356699.0A
Other languages
Chinese (zh)
Other versions
CN111553155B (en)
Inventor
邱卫东
贾兴磊
田昊
郭捷
唐鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202010356699.0A
Publication of CN111553155A
Application granted
Publication of CN111553155B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 User authentication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

A password word segmentation system and method based on semantic structure. The system comprises a preprocessing module, an NLP semantic extraction module, and a non-NLP semantic labeling module. The preprocessing module receives a password to be segmented, extracts the special semantic factors that the subsequent steps cannot identify, pre-segments the remainder by character type, outputs the alphabetic part to the NLP semantic extraction module, and outputs the non-alphabetic part to the non-NLP semantic labeling module. The NLP semantic extraction module uses an NLP tool to segment the alphabetic part of the password into its semantic factors. The non-NLP semantic labeling module semantically labels the parts of the password that the NLP tool cannot segment. Using a corpus, the invention segments a password according to the semantic information it contains, identifies the semantic structure of the password, and can accurately segment passwords set by both Chinese users and English users.

Description

Password word segmentation system and method based on semantic structure
Technical Field
The invention relates to a technology in the field of computer security, in particular to a password word segmentation system and method based on a semantic structure.
Background
Because of their high security and usability, text passwords are still widely used for user authentication in computer systems and online services. Because most passwords are chosen by the users themselves, users often pick character strings that carry specific meanings or follow specific rules so that they are easy to remember. Research on the semantic structure of passwords is therefore of great significance for improving the security of user passwords.
Unlike natural language, a password has no fixed syntactic structure: when setting a password, a user can combine semantic factors freely within the rules of the website. Word segmentation methods designed for natural language are therefore not suitable for passwords.
Most previous research on the semantic structure of passwords targets the passwords of English users. Because passwords set by English users and Chinese users differ, segmentation methods proposed for English users' passwords often perform poorly on leaked Chinese password datasets. In recent years many researchers have begun to study Chinese users' passwords, and this work shows that adding extra semantic information to a word segmentation system is effective. However, which information to add and how to add it remain subjective judgments, and no systematic method exists.
Disclosure of Invention
To address the above defects in the prior art, the invention provides a password word segmentation system and method based on semantic structure, which uses a corpus to segment a password according to the semantic information it contains, identifies the semantic structure of the password, and can accurately segment passwords set by both Chinese users and English users.
The invention is realized by the following technical scheme:
The invention relates to a password word segmentation system based on semantic structure, comprising a preprocessing module, a natural language processing (NLP) semantic extraction module, and a non-natural language processing (non-NLP) semantic labeling module. The preprocessing module receives a password to be segmented, extracts the special semantic factors that the subsequent steps cannot identify, pre-segments the remainder by character type, outputs the alphabetic part to the NLP semantic extraction module, and outputs the non-alphabetic part to the non-NLP semantic labeling module. The NLP semantic extraction module uses an NLP tool to segment the alphabetic part of the password into its semantic factors. The non-NLP semantic labeling module semantically labels the parts of the password that the NLP tool cannot segment.
The special semantic factors comprise: keyboard structure, website, email.
The part which can not be segmented by the NLP tool comprises: numbers, special characters.
The preprocessing module comprises a keyboard structure extraction unit, an email extraction unit, a website extraction unit, and a character segmentation unit. The keyboard structure extraction unit extracts the part of the password that follows the layout of the keyboard keys, that is, the keyboard structure in the password; the email extraction unit extracts an email address contained in the password; the website extraction unit extracts a website contained in the password; and the character segmentation unit segments the password by character type.
The NLP semantic extraction module comprises a word segmentation unit, a part-of-speech tagging (POS) unit, and a semantic classification unit. The word segmentation unit uses the Natural Language Toolkit (NLTK) to segment the alphabetic part received from the preprocessing module and outputs the result to the POS unit. The POS unit tags all input factors using the POS module of NLTK and outputs the semantic factors that need further classification to the semantic classification unit. The semantic classification unit further classifies the named entity factors by string matching, labeling them as place names, months, male names, female names, or Chinese name abbreviations. It then matches the unidentified factors against a pinyin list and labels the matched factors as pinyin; an unmatched factor that consists entirely of consonant letters and is longer than 3 characters is labeled as an abbreviation; otherwise the factor remains labeled as unidentified.
The semantic factors needing further classification comprise: named entity, unidentified segment.
The non-NLP semantic labeling module comprises a digit labeling unit and a special character labeling unit. The digit labeling unit gives digit segments that carry specific semantics the corresponding label and labels digit segments with unknown semantics according to their length; the special character labeling unit labels special character segments according to their length.
The specific semantics comprise: date, year, mobile phone number.
Technical effects
The invention solves the problem of segmenting passwords from different languages and from different leaked datasets.
Compared with the prior art, the invention extracts semantic factors that span several character types, such as keyboard structures, email addresses, and websites, before the formal word segmentation of the password. This avoids the semantic loss caused by segmenting purely by character type, allows the keyboard structures contained in a password to be extracted effectively, and improves segmentation accuracy. The invention also adds semantic factors such as place names, Chinese name abbreviations, pinyin, abbreviations, mobile phone numbers, keyboard structures, websites, and email addresses to the segmentation system, which improves segmentation accuracy and enables segmentation of passwords from Chinese websites.
Drawings
FIG. 1 is a schematic diagram of the system of the present invention.
Detailed Description
As shown in FIG. 1, the present embodiment relates to a password word segmentation system based on semantic structure, which comprises a preprocessing module, an NLP semantic extraction module, and a non-NLP semantic labeling module. The preprocessing module is connected to the NLP semantic extraction module and passes it the alphabetic parts obtained by pre-segmentation; the preprocessing module is also connected to the non-NLP semantic labeling module and passes it the digit and special character parts obtained by pre-segmentation.
The preprocessing module predefines the extraction of three special semantic factors (keyboard structure, website, and email address). In the keyboard structure extraction unit, for a substring $c_1 c_2 \cdots c_n$ of the password, when every pair of consecutive characters $c_i$ and $c_{i+1}$ is adjacent on the keyboard and their <shift> key states are the same, the substring is judged to be a keyboard structure ([KB]), and its specific tag is determined by its length ([KB4], [KB5], ...). The website extraction unit detects whether a website exists in the password through the prefixes 'www.' and 'http://': when 'www.' or 'http://' is detected, a later substring matches an entry in a list of common domain name suffixes, and one or more character strings separated by '.' lie between the two, the character string from the prefix to the domain name suffix is judged to be a website ([Website]). The email extraction unit takes '@' + domain name as the format of an email address: when a character string of this format is matched in the password, it is judged to be an email address ([email]), while the user name before '@' is kept as an ordinary character string and is segmented in the subsequent steps.
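The keyboard structure rule can be illustrated with a minimal Python sketch. The details below are assumptions made for illustration and are not specified in the text: a US QWERTY layout, adjacency defined as neighbouring positions in a simple row/column grid (including diagonals), and a minimum length of 4 inferred from [KB4] being the shortest tag shown. The function name find_keyboard_structures is hypothetical.

```python
# Assumed US QWERTY layout; each string is one keyboard row.
ROWS = ["1234567890-=", "qwertyuiop[]", "asdfghjkl;'", "zxcvbnm,./"]
SHIFTED = ["!@#$%^&*()_+", "QWERTYUIOP{}", 'ASDFGHJKL:"', "ZXCVBNM<>?"]

# Map every character to (row, column, needs_shift).
KEY = {}
for shifted, rows in ((False, ROWS), (True, SHIFTED)):
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            KEY[ch] = (r, c, shifted)


def adjacent(a, b):
    """True if a and b are neighbouring keys typed with the same <shift> state."""
    if a not in KEY or b not in KEY:
        return False
    (r1, c1, s1), (r2, c2, s2) = KEY[a], KEY[b]
    return s1 == s2 and max(abs(r1 - r2), abs(c1 - c2)) == 1


def find_keyboard_structures(password, min_len=4):
    """Return (substring, tag) pairs for maximal runs of adjacent keys."""
    found = []
    start = 0
    for i in range(1, len(password) + 1):
        # A run ends at the end of the string or where adjacency breaks.
        if i == len(password) or not adjacent(password[i - 1], password[i]):
            if i - start >= min_len:
                found.append((password[start:i], f"KB{i - start}"))
            start = i
    return found
```

Under these assumptions, the leading "1qaz" of "1qaziloveyou123@" is reported as a keyboard structure, while the shorter run "123" is left for the later digit labeling.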
The NLP semantic extraction module comprises a word segmentation unit, a part-of-speech tagging (POS) unit, and a semantic classification unit used to recognize named entities. The word segmentation unit handles named entities with a two-pass segmentation algorithm: it first adds named entity lists covering four semantic factors ([location], [month], [male_name], [female_name]) to the NLTK tool and segments the input; when unrecognized segments remain after the first pass, it adds named entity lists covering five semantic factors ([location], [month], [male_name], [female_name], [cn_name_abbr]) to the NLTK tool and segments a second time. The semantic classification unit labels each [NP] segment as one of the named entities ([location], [month], [male_name], [female_name], [cn_name_abbr]) by string matching against the named entity lists. An [NN] segment is first checked by string matching: when it appears in the pinyin list, it is labeled [PY]; otherwise, when it is longer than 3 characters and consists entirely of consonant letters, it is judged to be a probable English abbreviation ([abbr]); otherwise the [NN] label is kept unchanged.
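The classification rules for [NP] and [NN] segments can be sketched as follows. This is an illustrative sketch only: the tiny named entity lists, the pinyin list, and the function name classify are hypothetical stand-ins, and the real lists used by the system are not given in the text.

```python
# Hypothetical miniature lists; the actual lists are not given in the text.
NAMED_ENTITIES = {
    "location": {"london", "beijing"},
    "month": {"january", "june"},
    "male_name": {"james", "david"},
    "female_name": {"mary", "linda"},
    "cn_name_abbr": {"zhw", "lxm"},
}
PINYIN = {"wo", "ai", "ni", "zhang", "wei"}
VOWELS = set("aeiou")


def classify(segment, pos_tag):
    """Refine the POS label of one segment according to the described rules."""
    seg = segment.lower()
    if pos_tag == "NP":
        # Named entities: plain string matching against each list.
        for label, names in NAMED_ENTITIES.items():
            if seg in names:
                return label
        return "NP"
    if pos_tag == "NN":
        if seg in PINYIN:
            return "PY"
        # Probable abbreviation: longer than 3 characters, consonants only.
        if len(seg) > 3 and seg.isalpha() and not (set(seg) & VOWELS):
            return "abbr"
        return "NN"
    return pos_tag
```

For example, classify("zhang", "NN") yields "PY", while a three-letter consonant string such as "xyz" keeps its [NN] label because it does not exceed the length threshold.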
The present embodiment also relates to a password word segmentation method based on semantic structure, implemented on the above system, which specifically includes the following steps:
S1) The preprocessing module reads the password P to be segmented.
S2) In the keyboard structure extraction unit, for a substring $c_1 c_2 \cdots c_n$ of the password, when every pair of consecutive characters $c_i$ and $c_{i+1}$ is adjacent on the keyboard and their <shift> key states are the same, the substring is judged to be a keyboard structure ([KB]), and its specific tag is determined by its length ([KB4], [KB5], ...).
S3) The website extraction unit detects whether a website exists in the password through the prefixes 'www.' and 'http://'. When 'www.' or 'http://' is detected, a later substring matches an entry in the list of common domain name suffixes, and one or more character strings separated by '.' lie between the two, the character string from the prefix to the domain name suffix is judged to be a website ([Website]).
S4) The email extraction unit takes '@' + domain name as the format of an email address. When a character string of this format is matched in the password, it is judged to be an email address ([email]); the user name before '@' is kept as an ordinary character string and is segmented in the subsequent steps.
S5) The unlabeled part of the password is output to the character segmentation unit, which pre-segments it by character type (digits, letters, and special characters) and labels the resulting segments [number], [word], and [special] respectively.
After preprocessing, the password 1qaziloveyou123@ becomes (1qaz, KB4), (iloveyou, word), (123, number), (@, special).
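The website, email, and character-type steps of the preprocessing can be sketched with regular expressions. This is a rough sketch under stated assumptions: the suffix list is a hypothetical four-entry example, domain labels are assumed to contain only letters, digits, and hyphens, the function names are invented for illustration, and keyboard structure extraction is left out to keep the example short.

```python
import re

# Hypothetical short list of common domain name suffixes.
SUFFIXES = ("com", "net", "org", "cn")
_suffix = "|".join(SUFFIXES)

# Special factors that must be extracted before splitting by character type.
SPECIAL_RE = re.compile(
    rf"(?P<website>(?:www\.|http://)(?:[A-Za-z0-9-]+\.)+(?:{_suffix}))"
    rf"|(?P<email>@(?:[A-Za-z0-9-]+\.)+(?:{_suffix}))"
)
# Runs of letters, digits, or anything else.
CHAR_RE = re.compile(
    r"(?P<word>[A-Za-z]+)|(?P<number>[0-9]+)|(?P<special>[^A-Za-z0-9]+)"
)


def by_char_type(text):
    """Split text into maximal runs of a single character type."""
    return [(m.group(), m.lastgroup) for m in CHAR_RE.finditer(text)]


def presegment(password):
    """Extract websites and email suffixes first, then split the rest."""
    segments, pos = [], 0
    for m in SPECIAL_RE.finditer(password):
        segments += by_char_type(password[pos:m.start()])
        segments.append((m.group(), m.lastgroup))
        pos = m.end()
    segments += by_char_type(password[pos:])
    return segments
```

With this sketch, "tom@163.com" becomes (tom, word), (@163.com, email): the user name before '@' stays an ordinary string for later word segmentation, as the email step describes.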
S6) The segments labeled [word] are output to NLTK. The corpora used for word segmentation are the Brown corpus and the Web Text corpus, to which several named entity lists are added. These lists represent five semantic factors: four English semantic factors ([location], [month], [male_name], [female_name]) and Chinese name abbreviations ([cn_name_abbr]). In the first pass the Chinese name abbreviations are not added, and segmentation uses only the four English named entity lists; when the segmentation result contains unidentified segments, the Chinese name abbreviations are added and a second pass is performed.
S7) After word segmentation, the result is output to the POS unit for semantic labeling. The following semantic factors with specific semantics are labeled: pronouns ([PRON]), nouns ([NOUN]), determiners ([DET]), adjectives ([ADJ]), verbs ([VERB]), prepositions ([ADP]), adverbs ([ADV]), particles ([PRT]), conjunctions ([CONJ]), English words representing numbers ([NUM]), and affixes ([X]). POS labeling uses a sequential backoff tagger: a trigram tagger trained on the Brown corpus is tried first, then a bigram tagger, and finally a unigram tagger. Segments that appear in the named entity lists are labeled [NP], and unidentified segments are labeled [NN].
S8) After POS labeling, the [NP] and [NN] segments are classified further. Each [NP] segment is labeled as one of the named entities ([location], [month], [male_name], [female_name], [cn_name_abbr]) by string matching against the named entity lists.
S9) An [NN] segment is first checked by string matching: when it appears in the pinyin list, it is labeled [PY]; otherwise, when it is longer than 3 characters and consists entirely of consonant letters, it is judged to be a probable English abbreviation ([abbr]); otherwise the [NN] label is kept unchanged.
S10) The segments labeled [number] in S5) are output to the digit semantic classification unit. A 4-digit segment whose value lies between 1900 and 2020 is considered a year and labeled [year]; a 6-digit segment that satisfies the YYMMDD date format is judged to be a date and labeled [YYMMDD]; an 8-digit segment that satisfies the YYYYMMDD date format is judged to be a date and labeled [YYYYMMDD]; an 11-digit segment that satisfies the mobile phone number format is judged to be a mobile phone number and labeled [mobilephone]; all remaining digit segments are labeled [num1], [num2], ... according to their length.
S11) The [special] segments from S5) are input to the special character labeling unit and labeled [spec1], [spec2], ... according to their length.
S12) The labels of all segments are combined in order to form the semantic structure of the password.
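Steps S10) to S12) can be sketched in Python as below. Several details are assumptions for illustration rather than statements from the text: the mobile phone format is taken to be 11 digits starting with 1 followed by a digit from 3 to 9, date validity is checked with the standard library's datetime.strptime, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Assumed mobile number format: 11 digits, "1" then a digit 3-9.
MOBILE_RE = re.compile(r"1[3-9]\d{9}")


def _is_date(digits, fmt):
    """True if digits parse as a real calendar date in the given format."""
    try:
        datetime.strptime(digits, fmt)
        return True
    except ValueError:
        return False


def label_number(digits):
    """Label a digit segment as year, date, mobile number, or numN."""
    n = len(digits)
    if n == 4 and 1900 <= int(digits) <= 2020:
        return "year"
    if n == 6 and _is_date(digits, "%y%m%d"):
        return "YYMMDD"
    if n == 8 and _is_date(digits, "%Y%m%d"):
        return "YYYYMMDD"
    if n == 11 and MOBILE_RE.fullmatch(digits):
        return "mobilephone"
    return f"num{n}"


def label_special(chars):
    """Label a special character segment by its length."""
    return f"spec{len(chars)}"


def semantic_structure(segments):
    """Combine the labels of (text, tag) segments in order."""
    labels = []
    for text, tag in segments:
        if tag == "number":
            labels.append(label_number(text))
        elif tag == "special":
            labels.append(label_special(text))
        else:
            labels.append(tag)  # already labeled by an earlier step
    return "".join(f"[{label}]" for label in labels)
```

In this sketch a [word] segment is passed through unchanged; in the full method it would first be replaced by the labels produced in steps S6) to S9).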
In this embodiment, 13 leaked password datasets were selected to test the segmentation performance of the method: 6 Chinese datasets (CSDN, Tianya, Youku, 17173, Aipai, Duduniu) and 7 English datasets (LinkedIn, Zoosk, Myspace, Rockyou, MyHeritage, Gmail, Webhost). The specific test results are shown in Table 1.
TABLE 1
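[Table 1 is an image in the original publication (BDA0002473712320000051); its per-dataset results are not reproduced in this text.]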
The test treats a password as successfully segmented when its segmentation result contains no [NN] factor. The results show that this embodiment achieves a high segmentation success rate on both the Chinese and the English leaked datasets; in particular, the success rate on the Chinese leaked datasets exceeds 90 percent, which demonstrates the effectiveness of this embodiment.
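The success criterion can be written down as a short Python sketch. The function name and the representation of a segmentation result as a list of labels are assumptions made for illustration.

```python
def segmentation_success_rate(structures):
    """Fraction of passwords whose label list contains no "NN" label.

    structures: one list of labels per password, e.g. ["KB4", "PY", "num3"].
    """
    if not structures:
        return 0.0
    succeeded = sum(1 for labels in structures if "NN" not in labels)
    return succeeded / len(structures)
```

For instance, if two of four segmented passwords still contain an [NN] label, the success rate under this criterion is 0.5.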
For comparison with the prior art, four leaked datasets were tested: two Chinese datasets (17173 and Aipai) and two English datasets (LinkedIn and Gmail). The segmentation success rates of this embodiment on them are 92.22%, 91.24%, 79.37%, and 84.19% respectively, clearly higher than the 65.17%, 60.88%, 62.26%, and 67.14% achieved by the prior art.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (8)

1. A semantic structure based password segmentation system, comprising: a preprocessing module, a natural language processing (NLP) semantic extraction module, and a non-natural language processing (non-NLP) semantic labeling module, wherein: the preprocessing module receives a password to be segmented, extracts the special semantic factors in the password that cannot be identified in the subsequent steps, pre-segments the remainder by character type, outputs the alphabetic part to the NLP semantic extraction module, and outputs the non-alphabetic part to the non-NLP semantic labeling module; the NLP semantic extraction module uses an NLP tool to segment the alphabetic part of the password to obtain its semantic factors; the non-NLP semantic labeling module semantically labels the part of the password that cannot be segmented by the NLP tool;
the preprocessing module comprises: a keyboard structure extraction unit, an email extraction unit, a website extraction unit, and a character segmentation unit, wherein: the keyboard structure extraction unit extracts the part of the password that follows the layout of the keyboard keys, namely the keyboard structure in the password, the email extraction unit extracts an email address contained in the password, the website extraction unit extracts a website contained in the password, and the character segmentation unit segments the password by character type.
2. The system of claim 1, wherein the NLP semantic extraction module comprises: a word segmentation unit, a part-of-speech tagging (POS) unit, and a semantic classification unit, wherein: the word segmentation unit uses the Natural Language Toolkit (NLTK) to segment the alphabetic part input from the preprocessing module and outputs the result to the POS unit; the POS unit tags all input factors using the POS module of NLTK and outputs the semantic factors that need further classification to the semantic classification unit; the semantic classification unit further classifies the named entity factors by string matching, labeling them as place names, months, male names, female names, or Chinese name abbreviations, matches the unidentified factors against a pinyin list and labels the matched factors as pinyin, labels an unmatched factor as an abbreviation when it consists entirely of consonant letters and is longer than 3 characters, and otherwise leaves the factor labeled as unidentified.
3. The system of claim 1, wherein the non-NLP semantic labeling module comprises: a digit labeling unit and a special character labeling unit, wherein: the digit labeling unit gives digit segments that carry specific semantics the corresponding label and labels digit segments with unknown semantics according to their length; the special character labeling unit labels special character segments according to their length.
4. The system according to claim 2, wherein the recognition of named entities by the word segmentation unit specifically comprises: using a two-pass segmentation algorithm, first adding named entity lists covering four semantic factors ([location], [month], [male_name], [female_name]) to the NLTK tool for word segmentation, and, when unrecognized segments still exist after the first pass, adding named entity lists covering five semantic factors ([location], [month], [male_name], [female_name], [cn_name_abbr]) to the NLTK tool for a second word segmentation.
5. The system of claim 2, wherein the semantic classification unit labels each [NP] segment as one of the named entities ([location], [month], [male_name], [female_name], [cn_name_abbr]) by string matching against the named entity lists; an [NN] segment is first checked by string matching: when it appears in the pinyin list, it is labeled [PY]; otherwise, when it is longer than 3 characters and consists entirely of consonant letters, it is judged to be a probable English abbreviation ([abbr]); otherwise the [NN] label is kept unchanged.
6. A semantic structure based password segmentation method based on the system of any preceding claim, comprising the steps of:
S1) the preprocessing module reads the password P to be segmented;
S2) in the keyboard structure extraction unit, for a substring $c_1 c_2 \cdots c_n$ of the password, when every pair of consecutive characters $c_i$ and $c_{i+1}$ is adjacent on the keyboard and their <shift> key states are the same, the substring is judged to be a keyboard structure ([KB]), and its label is determined by its length ([KB4], [KB5], ...);
S3) the website extraction unit detects whether a website exists in the password through the prefixes 'www.' and 'http://'; when 'www.' or 'http://' is detected, a later substring matches an entry in a list of common domain name suffixes, and one or more character strings separated by '.' lie between the two, the character string from the prefix to the domain name suffix is judged to be a website ([Website]);
S4) the email extraction unit takes '@' + domain name as the format of an email address and keeps the user name before '@' as an ordinary character string; when a character string in the format '@' + domain name is matched in the password, it is judged to be an email address ([email]);
S5) the unlabeled part of the password is output to the character segmentation unit, which pre-segments it by character type (digits, letters, and special characters) and labels the resulting segments [number], [word], and [special] respectively;
S6) the segments labeled [word] are output to NLTK, wherein the corpora used for word segmentation are the Brown corpus and the Web Text corpus, to which several named entity lists are added;
S7) the word segmentation result is output to the POS unit for semantic labeling, wherein the labels are: pronouns ([PRON]), nouns ([NOUN]), determiners ([DET]), adjectives ([ADJ]), verbs ([VERB]), prepositions ([ADP]), adverbs ([ADV]), particles ([PRT]), conjunctions ([CONJ]), English words representing numbers ([NUM]), and affixes ([X]);
S8) after POS labeling, the [NP] and [NN] segments are classified further; each [NP] segment is labeled as one of the named entities ([location], [month], [male_name], [female_name], [cn_name_abbr]) by string matching against the named entity lists;
S9) an [NN] segment is first checked by string matching: when it appears in the pinyin list, it is labeled [PY]; otherwise, when it is longer than 3 characters and consists entirely of consonant letters, it is judged to be an English abbreviation ([abbr]); otherwise the [NN] label is kept unchanged;
S10) the segments labeled [number] in S5) are output to the digit semantic classification unit, wherein a 4-digit segment whose value lies between 1900 and 2020 is considered a year and labeled [year]; a 6-digit segment that satisfies the YYMMDD date format is judged to be a date and labeled [YYMMDD]; an 8-digit segment that satisfies the YYYYMMDD date format is judged to be a date and labeled [YYYYMMDD]; an 11-digit segment that satisfies the mobile phone number format is judged to be a mobile phone number and labeled [mobilephone]; and all remaining digit segments are labeled [num1], [num2], ... according to their length;
S11) the [special] segments from S5) are input to the special character labeling unit and labeled [spec1], [spec2], ... according to their length;
S12) the labels of all segments are combined in order to form the semantic structure of the password.
7. The method of claim 6, wherein the named entity lists in step S6) comprise: four English semantic factors ([location], [month], [male_name], [female_name]) and Chinese name abbreviations ([cn_name_abbr]); in the first pass the Chinese name abbreviations are not added and only the four English named entity lists are used for word segmentation, and when the segmentation result contains unidentified segments, the Chinese name abbreviations are added for a second word segmentation.
8. The method of claim 6, wherein in step S7) POS labeling uses a sequential backoff tagger, first a trigram tagger trained on the Brown corpus, then a bigram tagger, and finally a unigram tagger, and wherein segments appearing in the named entity lists are labeled [NP] and unrecognized segments are labeled [NN].
CN202010356699.0A 2020-04-29 2020-04-29 Password word segmentation system and method based on semantic structure Active CN111553155B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010356699.0A CN111553155B (en) 2020-04-29 2020-04-29 Password word segmentation system and method based on semantic structure

Publications (2)

Publication Number Publication Date
CN111553155A true CN111553155A (en) 2020-08-18
CN111553155B CN111553155B (en) 2023-05-09

Family

ID=71999272

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010356699.0A Active CN111553155B (en) 2020-04-29 2020-04-29 Password word segmentation system and method based on semantic structure

Country Status (1)

Country Link
CN (1) CN111553155B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109460552A (en) * 2018-10-29 2019-03-12 朱丽莉 Rule-based and corpus Chinese faulty wording automatic testing method and equipment

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784227A (en) * 2021-01-04 2021-05-11 上海交通大学 Dictionary generating system and method based on password semantic structure
CN113657118A (en) * 2021-08-16 2021-11-16 北京好欣晴移动医疗科技有限公司 Semantic analysis method, device and system based on call text
CN113657118B (en) * 2021-08-16 2024-05-14 好心情健康产业集团有限公司 Semantic analysis method, device and system based on call text

Also Published As

Publication number Publication date
CN111553155B (en) 2023-05-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant