CN106168946A - A method for identifying username abbreviation - Google Patents

Publication number
CN106168946A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610474472.XA
Other languages
Chinese (zh)
Inventor
亚静
王玉斌
柳厅文
时金桥
李全刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-06-24
Filing date
2016-06-24
Publication date
2016-11-30
Application filed by Institute of Information Engineering of CAS
Priority to CN201610474472.XA
Publication of CN106168946A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a method for identifying username abbreviation, comprising the steps of: 1) filtering the characters of two or more usernames, retaining only English letters and digits; 2) splitting each filtered username into several consecutive fragments and taking the first character of each fragment to form a new character string; 3) obtaining the longest abbreviation length from the new character strings and, if this length is greater than or equal to a given threshold ΔL, determining that username abbreviation exists between the usernames. The retained English letters are uniformly converted to lowercase or uppercase; each fragment is a word or a single character; the fragments are obtained by segmentation against a specified dictionary; and a dynamic programming algorithm is used to obtain the longest abbreviation length from the new character strings.

Description

A method for identifying username abbreviation
Technical field
The present invention relates to the field of computing, and specifically to a method for identifying username abbreviation.
Background
In recent years the Internet has developed rapidly and reached every sector of society. People read news or watch videos on portal sites such as Sina, Sohu, and Tencent, and exchange information on social networks such as microblogs and online communities. When using these networks, people log in to accounts and fill in usernames. A username is a character string that the user supplies when registering on a website; it conforms to the site's rules and identifies the user, and it usually consists of English letters, digits, and a few special characters such as underscores.
When registering on a website, a user may find that the usual username is unavailable: usernames on the site must be unique and someone else has already registered it. The user may also avoid the usual username for other reasons, such as protecting personal privacy. To keep the name easy to remember, the user then abbreviates some words of the usual username and registers with the resulting new username. This is the username abbreviation phenomenon. Identifying abbreviation between usernames matters for Internet research. For example, when researchers mine social network data for user behavior analysis or personalized recommendation, they sometimes need to link the same or similar users across different social networks, and some linking methods rely on identified username abbreviations to measure how similar two usernames are.
Identifying username abbreviation means: given two usernames, determine whether some fragment of one username is an abbreviation of some fragment of the other. The usual approach is enumeration: enumerate a substring s_1 of one username and a substring s_2 of the other, and check whether s_1 is an abbreviated form of s_2. If such s_1 and s_2 exist, abbreviation is considered to exist between the two usernames, with s_1 an abbreviation of s_2; otherwise abbreviation is considered not to exist between them.
However, enumeration must go through all substrings of the usernames, so its time overhead is large and it is hard to apply to large-scale username abbreviation identification tasks.
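To give a feel for the cost of enumeration, the sketch below counts how many substring pairs (s_1, s_2) a brute-force check would have to consider. It only uses the fact that a string of length n has n(n+1)/2 non-empty substrings (counted by position); the function names are illustrative and do not come from the patent.

```python
def substring_count(s: str) -> int:
    """Number of non-empty substrings of s, counted by start/end position."""
    n = len(s)
    return n * (n + 1) // 2


def enumeration_pairs(a: str, b: str) -> int:
    """Number of (s1, s2) pairs a brute-force enumeration would examine."""
    return substring_count(a) * substring_count(b)


# The two usernames used later in Embodiment 1 (12 and 14 characters).
print(enumeration_pairs("zgxxidian123", "zhangguoxin012"))  # 8190
```

Each pair still needs its own abbreviation check, so the total work grows at least with the fourth power of the username length.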
Summary of the invention
In view of the above shortcoming, the present invention provides a method for identifying username abbreviation that does not need to enumerate the substrings of a username. It reduces the amount of computation and the time overhead, so username abbreviation can be identified more efficiently.
To solve the above technical problem, the present invention adopts the following technical solution:
A method for identifying username abbreviation, comprising the steps of:
1) filtering the characters of two or more usernames, retaining only English letters and digits;
2) splitting each filtered username into several consecutive fragments and taking the first character of each fragment to form a new character string;
3) obtaining the longest abbreviation length from the new character strings and, if this length is greater than or equal to a given threshold ΔL, determining that username abbreviation exists between the usernames.
Further, the retained English letters are uniformly converted to lowercase or uppercase.
Further, each fragment is a word or a single character.
Further, the fragments are obtained by segmentation against a specified dictionary.
Further, the dictionary includes personal names, place names, names of things, coined words, or other specified words, the specified words including nouns, verbs, adjectives, and adverbs.
Further, a dynamic programming algorithm is used to obtain the longest abbreviation length from the new character strings.
Further, the threshold ΔL is the minimum length of the username abbreviation form to be identified.
Further, when pinyin abbreviations of Chinese names are to be identified, ΔL ≥ 2.
Further, when abbreviations of English names are to be identified, ΔL = 2.
The beneficial effect of the invention is that the method identifies automatically whether abbreviation exists between usernames without enumerating all substrings of a username. Before identification, the usernames are segmented and abbreviated, which shortens the strings handled during identification compared with the prior art and thus reduces computation. During identification, the decision follows directly from the value of the longest abbreviation length, which is obtained by a dynamic programming algorithm; compared with enumerating substrings one by one as in the prior art, this avoids a large amount of repeated computation.
Brief description of the drawings
Fig. 1 is a flowchart of the method for identifying username abbreviation provided in the embodiment.
Detailed description
To make the above features and advantages of the present invention clearer, embodiments are described in detail below with reference to the accompanying drawing.
This embodiment provides a method for identifying username abbreviation. As shown in Fig. 1, given two usernames a and b, determining whether username abbreviation exists between a and b comprises the following steps:
1. Username preprocessing
A username usually contains English letters, digits, and some special characters such as underscores. This step removes the special characters from the username, where "special character" in the present invention means any character other than an English letter or a digit. Only English letters and digits are retained, and the English letters are uniformly converted to lowercase or uppercase; this embodiment uses lowercase.
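The preprocessing step can be sketched in a few lines. This is a minimal illustration, assuming "English letters" means the ASCII ranges A-Z and a-z; the function name `preprocess` and the `lowercase` flag are illustrative and not taken from the patent.

```python
import re


def preprocess(username: str, lowercase: bool = True) -> str:
    """Drop everything except ASCII letters and digits, then unify letter case."""
    kept = re.sub(r"[^A-Za-z0-9]", "", username)
    return kept.lower() if lowercase else kept.upper()


print(preprocess("wanter_123"))    # wanter123
print(preprocess("Zhang.Wei-88"))  # zhangwei88
```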
2. Segmenting and abbreviating the username
Using a dictionary W of meaningful words chosen by a specifier, the username is split into several consecutive fragments. Each fragment is either a word in W or a single character, and the number of fragments after segmentation should be as small as possible. The specifier of dictionary W may be the user or someone else. Meaningful words are, for example, personal names, place names, names of things, or other concepts that are meaningful to the specifier; such a concept may be a noun, verb, adjective, adverb, or a word of another part of speech, or a word coined by the specifier.
The preprocessed username can be segmented as follows. For ease of exposition, let the given username be u with length n, and let S_i denote the segmentation result of the substring running from the i-th character of u to the n-th character. The segmentation steps are:
(1) Initialize i = n + 1 and S_{n+1} = {}.
(2) Set i = i - 1 and check each word w in dictionary W in turn. Let m be the length of w. If w = u_i u_{i+1} … u_{i+m-1}, that is, w occurs in u starting at the i-th character, and S_i is either still unassigned or satisfies |S_{i+m}| + 1 < |S_i|, then set S_i = w ∪ S_{i+m} (w followed by the fragments of S_{i+m}). If S_i is still unassigned after all words in W have been checked, set S_i = u[i] ∪ S_{i+1}, where u[i] denotes the i-th character of u.
(3) Repeat step (2) until i = 1; S_1 is then the segmentation result of username u.
After the username has been segmented, the first character of each fragment in the segmentation result is taken to form a new character string, which serves as the abbreviated form of the original username.
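The segmentation and abbreviation steps can be sketched as follows. This is a hedged illustration rather than the patent's exact procedure: it uses 0-based suffix indices, and it always treats the single character as a candidate fragment (not only when no dictionary word matches) so that the fragment count is guaranteed to be minimal. The mini-dictionary `W` is a hypothetical handful of pinyin syllables chosen for the example.

```python
def segment(u: str, words: set[str]) -> list[str]:
    """Split u into the fewest consecutive fragments (dictionary words or single characters)."""
    n = len(u)
    # best[i] is the segmentation of the suffix u[i:]; best[n] is the empty suffix.
    best: list[list[str]] = [[] for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        candidate = [u[i]] + best[i + 1]  # single-character fragment
        for w in sorted(words):           # sorted only to make ties deterministic
            if w and u.startswith(w, i):
                alternative = [w] + best[i + len(w)]
                if len(alternative) < len(candidate):
                    candidate = alternative
        best[i] = candidate
    return best[0]


def abbreviate(fragments: list[str]) -> str:
    """Join the first character of every fragment."""
    return "".join(fragment[0] for fragment in fragments)


# Hypothetical dictionary: a few pinyin syllables of length >= 2.
W = {"zhang", "guo", "xin", "xi", "dian", "wan", "xia", "te"}

fragments = segment("zhangguoxin012", W)
print(fragments)              # ['zhang', 'guo', 'xin', '0', '1', '2']
print(abbreviate(fragments))  # zgx012
```

Storing a full fragment list per suffix keeps the sketch short; a production version would more likely store only the fragment count and a back-pointer.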
3. Computing the longest abbreviation length
Let the segmentation results of usernames a and b obtained in step 2 be X_a and X_b, and let their abbreviated forms be Y_a and Y_b. The longest abbreviation length m of usernames a and b is then computed. The longest abbreviation length is the length of the longest common part of the two abbreviated forms that satisfies a specific condition: for every character of the common part, of the two strings in the users' segmentation results that the character comes from before abbreviation, one is a single character and the other is a word. A dynamic programming algorithm is designed specifically to obtain this longest abbreviation length m, with the formula:
$$m = \max_{1 \le i \le |Y_a|,\; 1 \le j \le |Y_b|} f(i, j)$$
where Y_a[i] denotes the i-th character of string Y_a, Y_b[j] denotes the j-th character of string Y_b, |X_a[i]| denotes the length of the i-th string in X_a, |X_b[j]| denotes the length of the j-th string in X_b, |Y_a| denotes the length of string Y_a, and |Y_b| denotes the length of string Y_b.
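The recurrence for f(i, j) is not reproduced in the text above. The sketch below is therefore an assumption: it takes f(i, j) to be the length of the longest run of consecutive qualifying matches ending at Y_a[i] and Y_b[j], so that f(i, j) = f(i-1, j-1) + 1 when Y_a[i] = Y_b[j] and exactly one of X_a[i], X_b[j] is a single character, and f(i, j) = 0 otherwise. Under this reading the sketch reproduces the values m = 3 and m = 0 stated in the two embodiments.

```python
def longest_abbreviation_length(xa: list[str], xb: list[str]) -> int:
    """Longest common run of initials where one source fragment is a single
    character and the other is a word (assumed reading of f(i, j))."""
    ya = [fragment[0] for fragment in xa]
    yb = [fragment[0] for fragment in xb]
    # f[i][j]: length of the qualifying run ending at ya[i-1] and yb[j-1].
    f = [[0] * (len(yb) + 1) for _ in range(len(ya) + 1)]
    m = 0
    for i in range(1, len(ya) + 1):
        for j in range(1, len(yb) + 1):
            same_initial = ya[i - 1] == yb[j - 1]
            # True when exactly one of the two fragments has length 1.
            char_vs_word = (len(xa[i - 1]) == 1) != (len(xb[j - 1]) == 1)
            if same_initial and char_vs_word:
                f[i][j] = f[i - 1][j - 1] + 1
                m = max(m, f[i][j])
    return m


xa = ["z", "g", "x", "xi", "dian", "1", "2", "3"]
xb = ["zhang", "guo", "xin", "0", "1", "2"]
print(longest_abbreviation_length(xa, xb))  # 3
```

The table has |Y_a| × |Y_b| cells and each cell is filled in constant time, which is where the saving over substring enumeration comes from.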
4. Identifying username abbreviation
Let the given threshold be ΔL. If m ≥ ΔL, abbreviation exists between usernames a and b; otherwise it does not.
Based on the above method, the following two embodiments apply it to concrete scenarios to show that the method is practical.
Embodiment 1:
Embodiment 1 identifies whether a pinyin abbreviation of a personal name exists between usernames. Chinese names have at least two characters, for example Zhang Wei or Shi Xiaoming. Taking Zhang Wei as an example, the pinyin of the name is Zhang Wei or Wei Zhang. Pinyin abbreviations vary somewhat, but statistically the abbreviation most likely takes the initial letter of each character of the name, giving zw or wz. Likewise, the pinyin of Shi Xiaoming is Shi Xiaoming or Xiaoming Shi, and its pinyin abbreviation is likely sxm or xms. From this analysis, W can be taken as the set of pinyin syllables of length at least 2, with ΔL = 2. Similarly, to identify abbreviations of English names: since the middle name is seldom used, an English name consists of at least a first name and a last name. For example, the English name Sheldon Lee Cooper is usually written Sheldon Cooper and abbreviated sc, so the minimum length of the abbreviated form to be identified can be set to 2, i.e. ΔL = 2.
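The initial-letter abbreviations mentioned above (zw/wz, sxm/xms) can be generated mechanically. The sketch below is only an illustration of that naming convention; the function name and the representation of a name as a surname plus a list of given-name syllables are assumptions made for the example.

```python
def name_abbreviations(surname: str, given: list[str]) -> tuple[str, str]:
    """Return the initial-letter abbreviation in surname-first and given-name-first order."""
    surname_initial = surname[0].lower()
    given_initials = "".join(syllable[0].lower() for syllable in given)
    return surname_initial + given_initials, given_initials + surname_initial


print(name_abbreviations("Zhang", ["Wei"]))        # ('zw', 'wz')
print(name_abbreviations("Shi", ["Xiao", "Ming"]))  # ('sxm', 'xms')
```

Both examples yield abbreviations of length at least 2, which is consistent with choosing ΔL = 2.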
Given two usernames a = zgxxidian123 and b = zhangguoxin012, this embodiment uses the method provided by the invention to identify whether abbreviation exists between them. Step 2 above gives the segmentation results X_a = {z, g, x, xi, dian, 1, 2, 3} and X_b = {zhang, guo, xin, 0, 1, 2}, with abbreviated forms Y_a = zgxxd123 and Y_b = zgx012. Step 3 then gives the longest abbreviation length of a and b as m = 3. Since ΔL = 2 from the preceding paragraph, m ≥ ΔL, so abbreviation exists between usernames a and b.
Embodiment 2:
Embodiment 2 also identifies whether a pinyin abbreviation of a personal name exists between usernames. As analyzed in Embodiment 1, W is the set of pinyin syllables of length at least 2, and ΔL = 2.
Given two usernames a = wanxia68 and b = wanter_123, step 2 above gives the segmentation results X_a = {wan, xia, 6, 8} and X_b = {wan, te, r, 1, 2, 3}, with abbreviated forms Y_a = wx68 and Y_b = wtr123. Step 3 then gives the longest abbreviation length of a and b as m = 0. Since m < ΔL, no abbreviation exists between usernames a and b.
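Both embodiments can be replayed end to end with a short script. This is a sketch under stated assumptions, not the patent's implementation: the dictionary `W` is a hypothetical subset of pinyin syllables, the segmentation always considers the single-character fragment, and the run-length recurrence for the longest abbreviation length is an assumed reading of f(i, j). All function names are illustrative.

```python
import re

# Hypothetical mini-dictionary W: a few pinyin syllables of length >= 2.
W = {"zhang", "guo", "xin", "xi", "dian", "wan", "xia", "te"}


def preprocess(name: str) -> str:
    """Step 1: keep ASCII letters and digits, lowercase the letters."""
    return re.sub(r"[^A-Za-z0-9]", "", name).lower()


def segment(u: str, words: set[str]) -> list[str]:
    """Step 2: fewest fragments, each a dictionary word or a single character."""
    best: list[list[str]] = [[] for _ in range(len(u) + 1)]
    for i in range(len(u) - 1, -1, -1):
        candidate = [u[i]] + best[i + 1]
        for w in sorted(words):
            if w and u.startswith(w, i) and 1 + len(best[i + len(w)]) < len(candidate):
                candidate = [w] + best[i + len(w)]
        best[i] = candidate
    return best[0]


def longest_abbreviation_length(xa: list[str], xb: list[str]) -> int:
    """Step 3: longest run of equal initials where exactly one fragment is a single character."""
    f = [[0] * (len(xb) + 1) for _ in range(len(xa) + 1)]
    m = 0
    for i in range(1, len(xa) + 1):
        for j in range(1, len(xb) + 1):
            same_initial = xa[i - 1][0] == xb[j - 1][0]
            char_vs_word = (len(xa[i - 1]) == 1) != (len(xb[j - 1]) == 1)
            if same_initial and char_vs_word:
                f[i][j] = f[i - 1][j - 1] + 1
                m = max(m, f[i][j])
    return m


def has_abbreviation(a: str, b: str, words: set[str], delta_l: int = 2) -> bool:
    """Step 4: compare the longest abbreviation length with the threshold."""
    xa = segment(preprocess(a), words)
    xb = segment(preprocess(b), words)
    return longest_abbreviation_length(xa, xb) >= delta_l


print(has_abbreviation("zgxxidian123", "zhangguoxin012", W))  # True  (m = 3)
print(has_abbreviation("wanxia68", "wanter_123", W))          # False (m = 0)
```

With a fuller pinyin dictionary the segmentations could differ where several minimal splits exist, so results on other usernames depend on the dictionary and on tie-breaking.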
The method provided by the invention identifies algorithmically whether abbreviation exists between usernames. It does not need to enumerate all substrings of a username as the prior art does, and it is simple and feasible. Before identification, the usernames are segmented and abbreviated, which shortens the strings handled during identification compared with the prior art and thus reduces computation. During identification, the decision follows directly from the value of the longest abbreviation length, which is obtained by a dynamic programming algorithm; compared with enumerating substrings one by one as in the prior art, this avoids a large amount of repeated computation.
Finally, it should be noted that although the invention is disclosed above by way of embodiments, these embodiments do not limit the invention. A person of ordinary skill in the art may modify or replace them without departing from the spirit and scope of the invention, and the scope of protection of the invention is therefore defined by the claims.

Claims (9)

1. A method for identifying username abbreviation, comprising the steps of:
1) filtering the characters of two or more usernames, retaining only English letters and digits;
2) splitting each filtered username into several consecutive fragments and taking the first character of each fragment to form a new character string;
3) obtaining the longest abbreviation length from the new character strings and, if this length is greater than or equal to a given threshold ΔL, determining that username abbreviation exists between the usernames.
2. The method according to claim 1, characterized in that the retained English letters are uniformly converted to lowercase or uppercase.
3. The method according to claim 1, characterized in that each fragment is a word or a single character.
4. The method according to claim 1, characterized in that the fragments are obtained by segmentation against a specified dictionary.
5. The method according to claim 4, characterized in that the dictionary includes personal names, place names, names of things, coined words, or other specified words, the specified words including nouns, verbs, adjectives, and adverbs.
6. The method according to claim 1, characterized in that a dynamic programming algorithm is used to obtain the longest abbreviation length from the new character strings.
7. The method according to claim 1, characterized in that the threshold ΔL is the minimum length of the username abbreviation form to be identified.
8. The method according to claim 7, characterized in that when pinyin abbreviations of Chinese names are to be identified, ΔL ≥ 2.
9. The method according to claim 7, characterized in that when abbreviations of English names are to be identified, ΔL = 2.

Priority Applications (1)

Application Number    Priority Date    Filing Date    Title
CN201610474472.XA     2016-06-24       2016-06-24     A method for identifying username abbreviation

Applications Claiming Priority (1)

Application Number    Priority Date    Filing Date    Title
CN201610474472.XA     2016-06-24       2016-06-24     A method for identifying username abbreviation

Publications (1)

Publication Number    Publication Date
CN106168946A          2016-11-30

Family

ID=58066025

Family Applications (1)

Application Number    Title                                                      Priority Date    Filing Date
CN201610474472.XA     A method for identifying username abbreviation (Pending)   2016-06-24       2016-06-24

Country Status (1)

Country    Link
CN (1)     CN106168946A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107870905A (en) * 2017-12-04 2018-04-03 语联网(武汉)信息技术有限公司 A kind of recognition methods of specific vocabulary
CN109240583A (en) * 2017-07-04 2019-01-18 优信数享(北京)信息技术有限公司 A kind of method, terminal and the data query system of focus input frame input data
CN109800332A (en) * 2018-12-04 2019-05-24 北京明略软件系统有限公司 Method, apparatus, computer storage medium and the terminal of processing field name
CN113419720A (en) * 2021-07-06 2021-09-21 北京理工大学 Automatic judgment method for necessity of abbreviation expansion for source code
CN113688614A (en) * 2020-05-19 2021-11-23 阿里巴巴集团控股有限公司 Method, device and storage medium for generating field annotation and understanding character string

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7962849B2 (en) * 2005-03-30 2011-06-14 International Business Machines Corporation Processing of user character inputs having whitespace
CN101561813A (en) * 2009-05-27 2009-10-21 东北大学 Method for analyzing similarity of character string under Web environment

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
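Philip Top et al.: "A Dynamic Programming Algorithm for Name Matching", 2007 IEEE Symposium on Computational Intelligence and Data Mining *
Yubin Wang et al.: "Identifying Users across Different Sites using Usernames", Procedia Computer Science *
Cui Qinghua et al.: "A dynamic-programming-based method for recognizing abbreviation definitions" (一种基于动态规划的缩写词定义识别方法), Journal of Anhui University (Natural Science Edition) *
Li Huayang et al.: "An abbreviation discovery algorithm based on dynamic programming" (基于动态规划的缩写发现算法), Journal of Wuhan University *
Xing Xiaohui: "Research on LCS-based matching of Chinese abbreviated fields" (基于LCS的中文缩写字段匹配问题的研究), Shandong Science *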

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20161130