CN106168946A

CN106168946A - A method for identifying the user name abbreviation phenomenon

Info

Publication number: CN106168946A
Application number: CN201610474472.XA
Authority: CN
Inventors: 亚静; 王玉斌; 柳厅文; 时金桥; 李全刚
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2016-06-24
Filing date: 2016-06-24
Publication date: 2016-11-30

Abstract

The present invention provides a method for identifying the user name abbreviation phenomenon, comprising the steps of: 1) filtering the characters in two or more user names, retaining only English letters and digits; 2) dividing each filtered user name into several contiguous fragments, and taking the first character of each fragment to form a new character string; 3) obtaining the longest abbreviation length from the new character strings, and, if this length is greater than or equal to a given threshold ΔL, determining that the user name abbreviation phenomenon exists between the user names. The retained English letters are uniformly converted to lower case or upper case; each fragment is a word or a single character; the fragments are obtained by segmentation according to a specified dictionary; and a dynamic programming algorithm is used to obtain the longest abbreviation length from the new character strings.

Description

A method for identifying the user name abbreviation phenomenon

Technical field

The present invention relates to the computer field, and in particular to a method for identifying the user name abbreviation phenomenon.

Background art

In recent years the Internet has developed rapidly and reached into every sector of society: people read news or watch videos on portal sites such as Sina, Sohu and Tencent, and exchange information on social networks such as microblogs and online communities. When using these networks, people log in to accounts and fill in user names. A user name is a character string, entered by the user when registering on a website, that conforms to certain rules and identifies the user; it generally consists of English letters, digits and a few special characters such as the underscore.

When registering on a website, a user may find that the usual user name has already been taken by someone else, because user names on the site must be unique; or the user may choose not to use the usual user name for other reasons, such as protecting personal privacy. To keep the name easy to remember, the user then abbreviates some of the words in the usual user name to form a new user name for registration. This is the user name abbreviation phenomenon. Identifying abbreviation between user names is important for Internet research. For example, when mining social network data for tasks such as user behavior analysis or personalized recommendation, researchers sometimes need to link the same user or similar users across different social networks, and some linking methods rely on identified abbreviations to measure how similar user names are.

Identifying the user name abbreviation phenomenon means: given two user names, determine whether some fragment of one user name is an abbreviation of some fragment of the other. The common approach to this problem is enumeration: enumerate a substring s₁ of one user name and a substring s₂ of the other, and check whether s₁ is an abbreviated form of s₂. If such s₁ and s₂ exist, the abbreviation phenomenon is considered to exist between the two user names, with s₁ being an abbreviation of s₂; otherwise, no abbreviation phenomenon is considered to exist between them.

However, enumeration has to go through all substrings of the user names, so its time overhead is large and it is ill-suited to large-scale user name abbreviation identification tasks.

Summary of the invention

In view of the above shortcomings, the present invention provides a method for identifying the user name abbreviation phenomenon that does not need to enumerate the substrings of user names. This reduces the amount of computation and the time overhead, so the phenomenon can be identified more efficiently.

To solve the above technical problem, the present invention adopts the following technical solution:

A method for identifying the user name abbreviation phenomenon, comprising the steps of:

1) filtering the characters in two or more user names, retaining only English letters and digits;

2) dividing each filtered user name into several contiguous fragments, and taking the first character of each fragment to form a new character string;

3) obtaining the longest abbreviation length from the new character strings, and, if this length is greater than or equal to a given threshold ΔL, determining that the user name abbreviation phenomenon exists between the user names.

Further, the retained English letters are uniformly converted to lower case or upper case.

Further, each fragment is a word or a single character.

Further, the fragments are obtained by segmentation according to a specified dictionary.

Further, the dictionary includes personal names, place names, names of things, coined words or other specified words, the specified words including nouns, verbs, adjectives and adverbs.

Further, a dynamic programming algorithm is used to obtain the longest abbreviation length from the new character strings.

Further, the threshold ΔL is the minimum length of the user name abbreviated form to be identified.

Further, when the pinyin abbreviated form of a Chinese personal name is to be identified, ΔL ≥ 2.

Further, when the abbreviated form of an English name is to be identified, ΔL = 2.

The beneficial effects of the present invention are as follows. The method does not need to enumerate all substrings of the user names and can automatically identify whether the abbreviation phenomenon exists between user names. Before identification, the user names are segmented and abbreviated in advance, which, compared with the prior art, shortens the character strings handled during identification and thus reduces the amount of computation. During identification, the method decides simply by checking the value of the longest abbreviation length, and this length is obtained with a dynamic programming algorithm, which avoids much of the repeated computation incurred by the prior art's one-by-one enumeration of substrings.

Brief description of the drawings

Fig. 1 is a flowchart of the method for identifying the user name abbreviation phenomenon provided in the embodiment.

Detailed description of the invention

To make the above features and advantages of the present invention clearer, embodiments are described in detail below with reference to the accompanying drawing.

This embodiment provides a method for identifying the user name abbreviation phenomenon. As shown in Fig. 1, given two user names a and b, determining whether the abbreviation phenomenon exists between a and b comprises the following steps:

1. User name preprocessing

A user name may contain English letters, digits and some special characters such as the underscore. This step removes the special characters from the user name, where "special characters" in the present invention means all characters other than English letters and digits. Only English letters and digits are retained, and the English letters are uniformly converted to lower case or upper case; this embodiment uses lower case.
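As a minimal sketch of this preprocessing step in Python (the function name `preprocess` is an illustrative assumption, not a name used by the patent), the special characters can be stripped with a regular expression and the remaining letters lowercased:

```python
import re


def preprocess(username):
    """Keep only English letters and digits, then convert letters to lower case."""
    # Every character outside A-Z, a-z and 0-9 counts as a special character.
    filtered = re.sub(r"[^A-Za-z0-9]", "", username)
    return filtered.lower()


print(preprocess("wanter_123"))    # wanter123
print(preprocess("Zhang.Wei-01"))  # zhangwei01
```

Upper case would serve equally well, as long as both user names being compared use the same convention.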

2. Segmenting and abbreviating the user name

According to a dictionary W of words with practical meaning, specified by a designator, the user name is divided into several contiguous fragments, each of which is either a word in W or a single character; the number of fragments after segmentation must also be kept as small as possible. The designator of dictionary W may be the user or another person. Words with practical meaning are, for example, personal names, place names, names of things, or other concepts that are meaningful to the designator; such a concept may be a noun, verb, adjective, adverb or a word of another part of speech, or a word coined by the designator.

The preprocessed user name can be segmented as follows. For ease of expression, let the given user name be u with length n, and let S_i denote the segmentation result of the substring of u running from the i-th character to the n-th character. The segmentation steps are:

(1) Initialize i = n+1 and S_{n+1} = {}.

(2) Let i = i-1 and check each word w in dictionary W in turn, where m is the length of w. If w = u_i u_{i+1} … u_{i+m-1}, i.e. w occurs in user name u starting at the i-th character, and S_i has not yet been assigned or |S_{i+m}| + 1 < |S_i|, then let S_i = {w} ∪ S_{i+m}. If S_i is still unassigned after all words in W have been checked, let S_i = {u[i]} ∪ S_{i+1}, where u[i] denotes the i-th character of u.

(3) Repeat step (2) until i = 1; S_1 is then the segmentation result of user name u.

After the user name has been segmented, the first character of each fragment in the segmentation result is taken to form a new character string, which serves as the abbreviated form of the original user name.
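The segmentation and abbreviation steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patent's own implementation: the names `segment`, `abbreviate` and `TOY_WORDS` are hypothetical, indices are 0-based instead of 1-based, and the dictionary is a tiny hand-picked stand-in for W. As in step (2), a single character is split off only when no dictionary word starts at the current position.

```python
# Toy stand-in for dictionary W (assumption): a few pinyin syllables.
TOY_WORDS = {"zhang", "guo", "xin", "xi", "dian"}


def segment(u, words):
    """Split u into dictionary words or single characters, scanning right to left."""
    n = len(u)
    # best[i] is the segmentation chosen for the suffix u[i:]; best[n] is empty.
    best = [[] for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        chosen = None
        for w in sorted(words):  # sorted only to make the scan order deterministic
            if w and u.startswith(w, i):
                rest = best[i + len(w)]
                # Take w if nothing is chosen yet or it yields fewer fragments.
                if chosen is None or len(rest) + 1 < len(chosen):
                    chosen = [w] + rest
        if chosen is None:
            # No word starts at position i: split off the single character.
            chosen = [u[i]] + best[i + 1]
        best[i] = chosen
    return best[0]


def abbreviate(fragments):
    """Form the abbreviated string from the first character of each fragment."""
    return "".join(fragment[0] for fragment in fragments)


x_a = segment("zgxxidian123", TOY_WORDS)
print(x_a)              # ['z', 'g', 'x', 'xi', 'dian', '1', '2', '3']
print(abbreviate(x_a))  # zgxxd123
```

Storing a full fragment list per position keeps the sketch close to the S_i notation; an implementation concerned with memory could store only the fragment count and a back-pointer.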

3. Computing the longest abbreviation length

Let the segmentation results of user names a and b obtained in step 2 be X_a and X_b, and let their abbreviated forms be Y_a and Y_b. The longest abbreviation length m of user names a and b is then to be obtained. The longest abbreviation length is the length of the longest common part of the two abbreviated forms that satisfies a specific condition: for every character of the common part, of the two strings in the users' segmentation results from which that character was abbreviated, one must be a single character and the other must be a word. To obtain this longest abbreviation length m, a dynamic programming algorithm was specially designed, with the following formula:

m = \max_{1 \le i \le |Y_a|,\; 1 \le j \le |Y_b|} f(i, j)

Here, Y_a[i] denotes the i-th character of string Y_a, Y_b[j] denotes the j-th character of string Y_b, |X_a[i]| denotes the length of the i-th string in set X_a, |X_b[j]| denotes the length of the j-th string in set X_b, |Y_a| denotes the length of string Y_a, and |Y_b| denotes the length of string Y_b.
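The text above gives the outer maximum over f(i, j) but not the recurrence for f itself, so the following Python sketch rests on assumptions: f(i, j) is taken to be the length of the qualifying common run ending at Y_a[i] and Y_b[j] (a contiguous common part, as the maximum over all i and j suggests), and a fragment of length 1 is treated as a "single character" while any longer fragment is treated as a "word". The function name `longest_abbreviation_length` is hypothetical. Under this reading the sketch reproduces the values m = 3 and m = 0 stated in the two embodiments.

```python
def longest_abbreviation_length(x_a, x_b):
    """Longest common run of initials where one fragment is a char and the other a word."""
    y_a = [fragment[0] for fragment in x_a]
    y_b = [fragment[0] for fragment in x_b]
    # f[i][j]: length of the qualifying run ending at y_a[i-1] and y_b[j-1].
    f = [[0] * (len(y_b) + 1) for _ in range(len(y_a) + 1)]
    m = 0
    for i in range(1, len(y_a) + 1):
        for j in range(1, len(y_b) + 1):
            same_initial = y_a[i - 1] == y_b[j - 1]
            # Exactly one of the two source fragments is a single character.
            char_vs_word = (len(x_a[i - 1]) == 1) != (len(x_b[j - 1]) == 1)
            if same_initial and char_vs_word:
                f[i][j] = f[i - 1][j - 1] + 1
                m = max(m, f[i][j])
    return m


# Segmentation results as given in Embodiment 1 and Embodiment 2.
print(longest_abbreviation_length(
    ["z", "g", "x", "xi", "dian", "1", "2", "3"],
    ["zhang", "guo", "xin", "0", "1", "2"]))  # 3
print(longest_abbreviation_length(
    ["wan", "xia", "6", "8"],
    ["wan", "te", "r", "1", "2", "3"]))       # 0
```

The table has |Y_a| × |Y_b| cells, so the cost depends on the number of fragments rather than on the number of substrings of the original user names.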

4. Identifying the user name abbreviation phenomenon

Let the given threshold be ΔL. If m ≥ ΔL, the abbreviation phenomenon exists between user names a and b; otherwise, it does not.

Based on the above method, two embodiments applied to concrete scenarios are given below to illustrate that the method is practical.
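The four steps can be combined into one self-contained Python sketch. All names here (`preprocess`, `segment`, `longest_abbreviation_length`, `has_abbreviation`, `TOY_PINYIN`) are illustrative assumptions, the dictionary is a small hand-picked stand-in for W, and the dynamic programming step uses one reading of the condition in step 3 (a contiguous common run, with length-1 fragments counted as single characters). The user names are those used in the embodiments.

```python
import re

# Toy stand-in for dictionary W (assumption): pinyin syllables of length >= 2.
TOY_PINYIN = {"zhang", "guo", "xin", "xi", "dian", "wan", "xia", "te"}


def preprocess(name):
    """Step 1: keep English letters and digits only, in lower case."""
    return re.sub(r"[^A-Za-z0-9]", "", name).lower()


def segment(u, words):
    """Step 2: split u into dictionary words or single characters."""
    best = [[] for _ in range(len(u) + 1)]
    for i in range(len(u) - 1, -1, -1):
        chosen = None
        for w in sorted(words):
            if w and u.startswith(w, i):
                rest = best[i + len(w)]
                if chosen is None or len(rest) + 1 < len(chosen):
                    chosen = [w] + rest
        best[i] = chosen if chosen is not None else [u[i]] + best[i + 1]
    return best[0]


def longest_abbreviation_length(x_a, x_b):
    """Step 3: longest common run of initials pairing a single char with a word."""
    f = [[0] * (len(x_b) + 1) for _ in range(len(x_a) + 1)]
    m = 0
    for i in range(1, len(x_a) + 1):
        for j in range(1, len(x_b) + 1):
            same_initial = x_a[i - 1][0] == x_b[j - 1][0]
            char_vs_word = (len(x_a[i - 1]) == 1) != (len(x_b[j - 1]) == 1)
            if same_initial and char_vs_word:
                f[i][j] = f[i - 1][j - 1] + 1
                m = max(m, f[i][j])
    return m


def has_abbreviation(a, b, words, delta_l=2):
    """Step 4: compare the longest abbreviation length with the threshold."""
    x_a = segment(preprocess(a), words)
    x_b = segment(preprocess(b), words)
    return longest_abbreviation_length(x_a, x_b) >= delta_l


print(has_abbreviation("zgxxidian123", "zhangguoxin012", TOY_PINYIN))  # True
print(has_abbreviation("wanxia68", "wanter_123", TOY_PINYIN))          # False
```

The default `delta_l=2` mirrors the threshold the embodiments choose for pinyin names; a different abbreviation type would call for a different dictionary and threshold.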

Embodiment 1:

Embodiment 1 identifies whether an abbreviation of a name written in pinyin exists between user names. A Chinese name has at least two characters, for example Zhang Wei or Shi Xiaoming. Taking Zhang Wei as an example, the pinyin of the name is Zhang Wei or Wei Zhang. Pinyin abbreviations are somewhat arbitrary, but statistically the abbreviation most likely takes the initial letters of the name, i.e. zw or wz. The pinyin of Shi Xiaoming is Shi Xiaoming or Xiaoming Shi, and its pinyin abbreviation is likely to be sxm or xms. From this analysis, W can be taken as the set of pinyin syllables whose string length is not less than 2, with ΔL = 2. Similarly, when abbreviations of English names are to be identified, the middle name is seldom used, so an English name consists of at least a first name and a last name. For example, the English name Sheldon Lee Cooper is usually written Sheldon Cooper and abbreviated as sc, so the minimum length of the abbreviated form to be identified can be set to 2, i.e. ΔL = 2.

Given two user names a = zgxxidian123 and b = zhangguoxin012, this embodiment uses the method provided by the present invention to identify whether the abbreviation phenomenon exists between them. Applying step 2 above, the segmentation results of a and b are X_a = {z, g, x, xi, dian, 1, 2, 3} and X_b = {zhang, guo, xin, 0, 1, 2}, and the abbreviated forms are Y_a = zgxxd123 and Y_b = zgx012. Applying step 3 then gives the longest abbreviation length of a and b as m = 3. Since ΔL = 2 from the preceding paragraph, m ≥ ΔL, so the abbreviation phenomenon exists between user names a and b.

Embodiment 2:

Embodiment 2 likewise identifies whether an abbreviation of a name written in pinyin exists between user names. As in the analysis of the preceding embodiment, W is the set of pinyin syllables whose string length is not less than 2, and ΔL = 2.

Given two user names a = wanxia68 and b = wanter_123, applying step 2 above gives the segmentation results X_a = {wan, xia, 6, 8} and X_b = {wan, te, r, 1, 2, 3}, and the abbreviated forms Y_a = wx68 and Y_b = wtr123. Applying step 3 then gives the longest abbreviation length of a and b as m = 0. Since m < ΔL, no abbreviation phenomenon exists between user names a and b.

The method provided by the present invention identifies algorithmically whether the abbreviation phenomenon exists between user names, without enumerating all substrings of the user names as the prior art does, and is simple and feasible. Before identification, the user names are segmented and abbreviated in advance, which, compared with the prior art, shortens the character strings handled during identification and thus reduces the amount of computation. During identification, the method decides simply by checking the value of the longest abbreviation length, and this length is obtained with a dynamic programming algorithm, which avoids much of the repeated computation incurred by the prior art's one-by-one enumeration of substrings.

Finally, it should be noted that although the present invention is disclosed above by way of embodiments, these embodiments are not intended to limit the present invention. Those of ordinary skill in the art may modify or replace them without departing from the spirit and scope of the present invention; the scope of protection of the present invention is therefore defined by the claims.

Claims

1. A method for identifying the user name abbreviation phenomenon, comprising the steps of:

1) filtering the characters in two or more user names, retaining only English letters and digits;

2) dividing each filtered user name into several contiguous fragments, and taking the first character of each fragment to form a new character string;

3) obtaining the longest abbreviation length from the new character strings, and, if this length is greater than or equal to a given threshold ΔL, determining that the user name abbreviation phenomenon exists between the user names.

2. The method according to claim 1, characterized in that the retained English letters are uniformly converted to lower case or upper case.

3. The method according to claim 1, characterized in that each fragment is a word or a single character.

4. The method according to claim 1, characterized in that the fragments are obtained by segmentation according to a specified dictionary.

5. The method according to claim 4, characterized in that the dictionary includes personal names, place names, names of things, coined words or other specified words, the specified words including nouns, verbs, adjectives and adverbs.

6. The method according to claim 1, characterized in that a dynamic programming algorithm is used to obtain the longest abbreviation length from the new character strings.

7. The method according to claim 1, characterized in that the threshold ΔL is the minimum length of the user name abbreviated form to be identified.

8. The method according to claim 7, characterized in that, when the pinyin abbreviated form of a Chinese personal name is to be identified, ΔL ≥ 2.

9. The method according to claim 7, characterized in that, when the abbreviated form of an English name is to be identified, ΔL = 2.