CN105608462A - Character similarity judgment method and device - Google Patents

Info

Publication number
CN105608462A
Authority
CN
China
Prior art keywords
character string
order
stroke
character
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510917453.5A
Other languages
Chinese (zh)
Inventor
汪平仄
张涛
侯文迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Technology Co Ltd
Xiaomi Inc
Original Assignee
Xiaomi Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiaomi Inc
Priority to CN201510917453.5A
Publication of CN105608462A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Character Discrimination (AREA)

Abstract

The invention provides a character similarity judgment method and device and belongs to the Internet technical field. The method includes the following steps that: a first character sequence to be detected and a second character sequence to be detected are acquired; a first stroke order sequence and a second stroke order sequence are obtained, wherein the first stroke order sequence is strokes in the first character sequence arranged according to a writing order, and the second stroke order sequence is strokes in the second character sequence arranged according to a writing order; a minimum edit distance between the first stroke order sequence and the second stroke order sequence is obtained, wherein the edit distance is the number of the times of transformation operation required by the transformation of the first stroke order sequence to the second stroke order sequence; and the similarity of the first character sequence and the second character sequence is obtained. According to the method and device of the invention, based on the similarity of stroke order sequences, the similarity of corresponding character sequences can be obtained, and therefore, the similarity of characters can be automatically judged, human resources can be saved, and the objectiveness of the judgment of character similarity can be enhanced.

Description

Character similarity judgment method and device
Technical field
The present disclosure relates to the field of Internet technology, and in particular to a character similarity judgment method and device.
Background
With the development of Internet technology and terminal technology, entering characters into a terminal by handwriting has become increasingly common. Such characters may be Chinese characters, Japanese characters, and so on. During handwriting input, however, some characters resemble one another, so a user may unintentionally enter a similar character, that is, a typo. In this case the terminal needs to apply writing error correction technology to correct the error. Writing error correction technology is also widely used in other fields such as scanned document recognition and document proofreading.
In the related art, writing error correction technology often relies on the similar character pairs defined in a typo dictionary to find typos. As the name suggests, a similar character pair comprises two characters with high similarity, for example "" and "巳". However, such typo dictionaries are generally compiled manually, and the degree of similarity of the characters in each similar character pair is also defined manually. This approach not only consumes a large amount of human resources but is also highly subjective. A method for automatically judging character similarity is therefore urgently needed, so that a typo dictionary can be compiled automatically with it.
Summary of the invention
To overcome the problems in the related art, the present disclosure provides a character similarity judgment method and device.
According to a first aspect of the embodiments of the present disclosure, a character similarity judgment method is provided, comprising:
acquiring a first character sequence and a second character sequence to be detected;
acquiring a first stroke order sequence and a second stroke order sequence, wherein the first stroke order sequence is the strokes in the first character sequence arranged in writing order, and the second stroke order sequence is the strokes in the second character sequence arranged in writing order;
acquiring a minimum edit distance between the first stroke order sequence and the second stroke order sequence, wherein the edit distance is the number of transformation operations required to transform the first stroke order sequence into the second stroke order sequence;
acquiring the similarity of the first character sequence and the second character sequence according to the minimum edit distance.
In a first possible implementation of the first aspect, acquiring the similarity of the first character sequence and the second character sequence according to the minimum edit distance comprises:
calculating the similarity of the first character sequence and the second character sequence according to a similarity calculation formula, the similarity calculation formula being:
Sim=1-Dmin/max(A,B)
where Sim is the similarity of the first character sequence and the second character sequence, Dmin is the minimum edit distance, A is the number of strokes in the first stroke order sequence, B is the number of strokes in the second stroke order sequence, and max() is the maximum operation.
In a second possible implementation of the first aspect, the transformation operations comprise at least one of a delete-stroke operation, an insert-stroke operation, and a replace-stroke operation.
In a third possible implementation of the first aspect, the first character sequence and the second character sequence each comprise one or more characters.
In a fourth possible implementation of the first aspect, the first character sequence and the second character sequence are split Chinese characters.
According to a second aspect of the embodiments of the present disclosure, a character similarity judgment device is provided, comprising:
a character sequence acquisition module, configured to acquire a first character sequence and a second character sequence to be detected;
a stroke order sequence acquisition module, configured to acquire a first stroke order sequence and a second stroke order sequence, wherein the first stroke order sequence is the strokes, arranged in writing order, in the first character sequence acquired by the character sequence acquisition module, and the second stroke order sequence is the strokes, arranged in writing order, in the second character sequence acquired by the character sequence acquisition module;
an edit distance acquisition module, configured to acquire a minimum edit distance between the first stroke order sequence and the second stroke order sequence acquired by the stroke order sequence acquisition module, wherein the edit distance is the number of transformation operations required to transform the first stroke order sequence into the second stroke order sequence;
a similarity acquisition module, configured to acquire the similarity of the first character sequence and the second character sequence according to the minimum edit distance acquired by the edit distance acquisition module.
In a first possible implementation of the second aspect, the similarity acquisition module is configured to:
calculate, according to a similarity calculation formula, the similarity of the first character sequence and the second character sequence acquired by the character sequence acquisition module, the similarity calculation formula being:
Sim=1-Dmin/max(A,B)
where Sim is the similarity of the first character sequence and the second character sequence, Dmin is the minimum edit distance, A is the number of strokes in the first stroke order sequence, B is the number of strokes in the second stroke order sequence, and max() is the maximum operation.
In a second possible implementation of the second aspect, the transformation operations comprise at least one of a delete-stroke operation, an insert-stroke operation, and a replace-stroke operation.
In a third possible implementation of the second aspect, the first character sequence and the second character sequence each comprise one or more characters.
In a fourth possible implementation of the second aspect, the first character sequence and the second character sequence are split Chinese characters.
According to a third aspect of the embodiments of the present disclosure, a character similarity judgment device is provided, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
acquire a first character sequence and a second character sequence to be detected;
acquire a first stroke order sequence and a second stroke order sequence, wherein the first stroke order sequence is the strokes in the first character sequence arranged in writing order, and the second stroke order sequence is the strokes in the second character sequence arranged in writing order;
acquire a minimum edit distance between the first stroke order sequence and the second stroke order sequence, wherein the edit distance is the number of transformation operations required to transform the first stroke order sequence into the second stroke order sequence;
acquire the similarity of the first character sequence and the second character sequence according to the minimum edit distance.
The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects:
The stroke order sequences corresponding to two character sequences to be detected are acquired, and the minimum edit distance between these two stroke order sequences is acquired. This minimum edit distance reflects the similarity between the two stroke order sequences. The similarity between the two character sequences to be detected is then judged from this minimum edit distance, that is, from the similarity between the stroke order sequences. Character similarity is thereby judged automatically, which saves human resources and also makes the judgment of character similarity more objective.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory and do not limit the present disclosure.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
Fig. 1 is a flowchart of a character similarity judgment method according to an exemplary embodiment.
Fig. 2 is a flowchart of a character similarity judgment method according to an exemplary embodiment.
Fig. 3 is a block diagram of a character similarity judgment device according to an exemplary embodiment.
Fig. 4 is a block diagram of a terminal 400 according to an exemplary embodiment.
Fig. 5 is a block diagram of a character similarity judgment device 500 according to an exemplary embodiment.
Detailed description
To make the objectives, technical solutions, and advantages of the present disclosure clearer, the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings.
Exemplary embodiments will be described in detail here, examples of which are shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of devices and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
Fig. 1 is a flowchart of a character similarity judgment method according to an exemplary embodiment. As shown in Fig. 1, the character similarity judgment method is used in a terminal or a server and comprises the following steps.
In step 110, a first character sequence and a second character sequence to be detected are acquired.
In step 120, a first stroke order sequence and a second stroke order sequence are acquired. The first stroke order sequence is the strokes in the first character sequence arranged in writing order, and the second stroke order sequence is the strokes in the second character sequence arranged in writing order.
In step 130, a minimum edit distance between the first stroke order sequence and the second stroke order sequence is acquired. The edit distance is the number of transformation operations required to transform the first stroke order sequence into the second stroke order sequence.
In step 140, the similarity of the first character sequence and the second character sequence is acquired according to the minimum edit distance.
In summary, the character similarity judgment method provided by this embodiment acquires the stroke order sequences corresponding to two character sequences to be detected and the minimum edit distance between these two stroke order sequences. This minimum edit distance reflects the similarity between the two stroke order sequences. The similarity of the two character sequences to be detected is then judged from this minimum edit distance, that is, from the similarity between the stroke order sequences. Character similarity is thereby judged automatically, which saves human resources and also makes the judgment of character similarity more objective.
In a first possible implementation, acquiring the similarity of the first character sequence and the second character sequence according to the minimum edit distance comprises:
calculating the similarity of the first character sequence and the second character sequence according to a similarity calculation formula, the similarity calculation formula being:
Sim=1-Dmin/max(A,B)
where Sim is the similarity of the first character sequence and the second character sequence, Dmin is the minimum edit distance, A is the number of strokes in the first stroke order sequence, B is the number of strokes in the second stroke order sequence, and max() is the maximum operation.
In a second possible implementation, the transformation operations comprise at least one of a delete-stroke operation, an insert-stroke operation, and a replace-stroke operation.
In a third possible implementation, the first character sequence and the second character sequence each comprise one or more characters.
In a fourth possible implementation, the first character sequence and the second character sequence are split Chinese characters.
All of the above optional technical solutions may be combined in any manner to form optional embodiments of the present disclosure, which are not described one by one here.
Fig. 2 is a flowchart of a character similarity judgment method according to an exemplary embodiment. As shown in Fig. 2, the character similarity judgment method is used in a terminal or a server and comprises the following steps.
In step 210, a first character sequence and a second character sequence to be detected are acquired.
First, it should be noted that step 210 may be performed by a server or by a terminal such as a computer, tablet computer, or mobile phone; the present disclosure does not specifically limit this. Likewise, steps 220 to 250 below may also be performed by a server or a terminal, which will not be repeated in the following description.
The technical solution of the present disclosure provides a method for automatically judging the glyph similarity of character sequences, so the character sequences to be detected, that is, the first character sequence and the second character sequence, must be acquired before the subsequent steps are performed. The first and second character sequences may be Chinese character sequences, split Chinese character sequences, or other character sequences whose characters have fixed strokes written in a fixed order, such as Japanese character sequences and Korean character sequences; the present disclosure does not specifically limit this. A split Chinese character refers to multiple characters formed by splitting one Chinese character. For example, "扌发" is a split Chinese character: it comprises two characters and is formed by splitting "拨". In addition, the first and second character sequences may each comprise one or more characters, and the numbers of characters in the two sequences may be equal. When the first or second character sequence comprises split Chinese characters, however, the numbers of characters in the two sequences may also be unequal; the present disclosure does not specifically limit this.
In step 210, the first character sequence and the second character sequence may be acquired at random from a character database, or may be acquired according to an operation by a user or technician. The present disclosure does not specifically limit how the first character sequence and the second character sequence are acquired.
In step 220, a first stroke order sequence and a second stroke order sequence are acquired. The first stroke order sequence is the strokes in the first character sequence arranged in writing order, and the second stroke order sequence is the strokes in the second character sequence arranged in writing order.
The inventors recognized that when writing Chinese characters, Japanese characters, and the like, people tend to follow the stroke order. That is, each character sequence corresponds to a specific stroke order sequence, which is the sequence obtained by arranging the strokes in the character sequence in writing order. The similarity between different character sequences is closely related to the similarity of their stroke order sequences: in general, the higher the similarity between stroke order sequences, the higher the glyph similarity between the corresponding character sequences, and vice versa. For example, the stroke order sequences of "干" and "于" are "一一丨" and "一一亅" respectively. These stroke order sequences are clearly very similar, and correspondingly "干" and "于" are also very similar. In addition, although a split Chinese character is formed by splitting a Chinese character, its strokes are written in the same order as those of the original Chinese character. The stroke order sequence of a split Chinese character can therefore equally be used to measure the similarity between a split Chinese character sequence and other character sequences.
As described above, the similarity between the stroke order sequences corresponding to different character sequences can be used to judge the similarity between those character sequences. For this purpose, step 220 acquires the first stroke order sequence and the second stroke order sequence corresponding to the first character sequence and the second character sequence respectively, so that the subsequent steps can be performed.
The present disclosure does not specifically limit how the stroke order sequences are acquired. In a specific implementation, they may be retrieved from a corresponding stroke order sequence database according to the first character sequence and the second character sequence, or they may be acquired from input by a user or technician.
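As a minimal sketch of the database lookup, the following Python snippet assumes a small in-memory table that maps each character to its strokes in writing order. The table `STROKE_DB`, the function name `stroke_order_sequence`, and the choice to represent each stroke as a single Unicode character are illustrative assumptions, not part of the disclosure.

```python
# Hypothetical stroke order table: character -> strokes in writing order.
# A real system would load this from a stroke order sequence database.
STROKE_DB = {
    "土": "一丨一",
    "甘": "一丨丨一一",
    "坩": "一丨一一丨丨一一",
}


def stroke_order_sequence(char_sequence: str) -> str:
    """Concatenate the strokes of each character, keeping writing order."""
    try:
        return "".join(STROKE_DB[ch] for ch in char_sequence)
    except KeyError as exc:
        # Fail loudly when the table has no entry for a character.
        raise KeyError(f"no stroke order data for {exc.args[0]!r}") from exc


# A split Chinese character sequence and the original character
# yield the same stroke order sequence in this table.
print(stroke_order_sequence("土甘") == stroke_order_sequence("坩"))
```

Because the lookup simply concatenates per-character strokes, the two character sequences may contain different numbers of characters and still be compared stroke by stroke.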
In step 230, a minimum edit distance between the first stroke order sequence and the second stroke order sequence is acquired. The edit distance is the number of transformation operations required to transform the first stroke order sequence into the second stroke order sequence.
Specifically, the transformation operations comprise at least one of a delete-stroke operation, an insert-stroke operation, and a replace-stroke operation. The inventors recognized that, whether or not two stroke order sequences are similar, one can always be transformed into the other through one or more of these operations. For example, "一一丨" can be changed into "一一亅" by a replace-stroke operation, that is, by replacing "丨" with "亅". As another example, the stroke order sequence "一一" of the Chinese character "二" can be changed into the stroke order sequence "一一丨" of "干" by an insert-stroke operation, that is, by inserting the stroke "丨" at the tail of "一一". As yet another example, the stroke order sequence "一一一" of "三" can be changed into the stroke order sequence "一一" of "二" by a delete-stroke operation, that is, by deleting the stroke "一" at the tail of "一一一". The number of transformation operations required to transform one stroke order sequence into another is the edit distance; in the examples above, the edit distance between "一一丨" and "一一亅" is 1. The edit distance reflects the similarity of stroke order sequences: the smaller the edit distance, the greater the similarity. In step 230, the edit distance is the number of transformation operations required to transform the first stroke order sequence into the second stroke order sequence.
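The three transformation operations can be sketched as small string functions. This is only an illustration under stated assumptions: each stroke is represented as one Unicode character (一 for a horizontal stroke, 丨 for a vertical stroke, 亅 for a hook), positions are 0-based, and the function names are hypothetical.

```python
def delete_stroke(seq: str, index: int) -> str:
    """Return seq with the stroke at position index removed."""
    return seq[:index] + seq[index + 1:]


def insert_stroke(seq: str, index: int, stroke: str) -> str:
    """Return seq with stroke inserted so that it sits at position index."""
    return seq[:index] + stroke + seq[index:]


def replace_stroke(seq: str, index: int, stroke: str) -> str:
    """Return seq with the stroke at position index replaced by stroke."""
    return seq[:index] + stroke + seq[index + 1:]


# One operation each: replace the last stroke, append a stroke, drop the last stroke.
print(replace_stroke("一一丨", 2, "亅"))
print(insert_stroke("一一", 2, "丨"))
print(delete_stroke("一一一", 2))
```

Each call above applies exactly one operation, so each pair of input and output sequences is at most one edit apart.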
However, there may be more than one way to transform one stroke order sequence into another. For example, "一一丨" can be changed into "一一亅" by a single replace-stroke operation, that is, by replacing "丨" with "亅". It can also be changed by a delete-stroke operation followed by an insert-stroke operation, that is, by first deleting the stroke "丨" from "一一丨" and then inserting the stroke "亅". With these two transformations, the edit distance between "一一丨" and "一一亅" is 1 and 2 respectively. To reflect the similarity between stroke order sequences accurately, the minimum edit distance between the stroke order sequences therefore needs to be calculated.
Specifically, the following recurrence formula may be used to calculate the minimum edit distance between the first stroke order sequence and the second stroke order sequence:
Dmin[i,j]=min(
editdistance[i-1,j]+1, // delete the stroke at position i of A
editdistance[i,j-1]+1, // insert a stroke at position i+1 of A
editdistance[i-1,j-1]+c // replace-stroke operation
)
where A is the first stroke order sequence, i is the stroke index in the first stroke order sequence, j is the stroke index in the second stroke order sequence, min() is the minimum operation, Dmin is the minimum edit distance, and editdistance is the edit distance. The cost c of the replace-stroke operation is 1 when the i-th stroke of A differs from the j-th stroke of the second stroke order sequence. When the two strokes are identical, no replacement is needed and c is taken as 0, which is the usual edit distance convention and is what allows two identical stroke order sequences to have an edit distance of 0.
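The recurrence can be sketched as a dynamic programming table in Python. This is a minimal sketch, not the patented implementation: it assumes the standard Levenshtein conventions that the text does not spell out, namely that transforming to or from an empty sequence costs one operation per stroke, and that "replacing" a stroke with an identical stroke costs nothing. The function name `min_edit_distance` is hypothetical, and strokes are assumed to be single Unicode characters.

```python
def min_edit_distance(a: str, b: str) -> int:
    """Minimum number of delete/insert/replace stroke operations turning a into b."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] is the minimum edit distance between a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i strokes of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all j strokes of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            # Assumed convention: identical strokes need no replacement.
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete a stroke from a
                d[i][j - 1] + 1,         # insert a stroke into a
                d[i - 1][j - 1] + cost,  # replace (or keep) a stroke
            )
    return d[-1][-1]


print(min_edit_distance("一一丨", "一一亅"))
```

The table takes time and memory proportional to the product of the two stroke counts, which is small for individual characters; a two-row variant would reduce memory if long sequences had to be compared.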
In step 240, the similarity of the first character sequence and the second character sequence is calculated according to a similarity calculation formula, the similarity calculation formula being:
Sim=1-Dmin/max(A,B)
where Sim is the similarity of the first character sequence and the second character sequence, Dmin is the minimum edit distance, A is the number of strokes in the first stroke order sequence, B is the number of strokes in the second stroke order sequence, and max() is the maximum operation.
max(A,B) is the theoretical maximum edit distance, that is, the edit distance between two stroke order sequences in the worst case, in which the strokes of the two sequences are completely different. It is easy to see that the theoretical maximum edit distance equals the number of strokes in whichever of the two stroke order sequences contains more strokes.
The above similarity calculation formula is suitable for calculating the similarity between Chinese character sequences. In addition, because the strokes of a split Chinese character are written in the same order as those of the original Chinese character before splitting, the formula is also suitable for calculating the similarity between split Chinese character sequences and between a split Chinese character sequence and a normal Chinese character sequence.
For example, the formula can be used to calculate the similarity of "干" and "于": their minimum edit distance is 1 and the theoretical maximum edit distance is 3, so the similarity is Sim=1-1/3≈0.67=67%. As another example, the formula can be used to calculate the similarity between "土甘" and "坩": the stroke order sequence of "土甘" is "一丨一一丨丨一一" and the stroke order sequence of "坩" is also "一丨一一丨丨一一", so their minimum edit distance is 0 and the theoretical maximum edit distance is 8, and the similarity is Sim=1-0/8=1=100%. The similarity is thus a number greater than or equal to 0 and less than or equal to 1.
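The arithmetic of the similarity formula can be checked with a few lines of Python. This is a sketch under stated assumptions: the function name `similarity` and its parameters are hypothetical, the minimum edit distance is supplied by the caller, and the guard against two empty stroke order sequences is an added assumption because the formula is undefined when max(A,B) is 0.

```python
def similarity(d_min: int, a_len: int, b_len: int) -> float:
    """Sim = 1 - Dmin / max(A, B), where A and B are stroke counts."""
    longest = max(a_len, b_len)
    if longest == 0:
        # Assumed behaviour: the formula has no value for two empty sequences.
        raise ValueError("at least one stroke order sequence must be non-empty")
    return 1 - d_min / longest


# Minimum edit distance 1 between two three-stroke sequences.
print(round(similarity(1, 3, 3), 2))
# Minimum edit distance 0 between two eight-stroke sequences.
print(similarity(0, 8, 8))
```

Since the minimum edit distance never exceeds the longer stroke count, the result stays within the range 0 to 1.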
In step 250, similar character sequence pairs whose similarity is greater than a preset threshold are acquired, and each similar character sequence pair is stored in a typo dictionary together with its similarity.
In practical applications, writing error correction technology generally relies on a typo dictionary, which usually contains many similar character sequence pairs. Because typos often resemble the intended characters, writing error correction technology can use these similar character sequence pairs to judge whether a user has written a typo, or whether a file to be corrected contains a typo. It is therefore worthwhile to use the technical solution of the present disclosure to compile the typo dictionary and so facilitate writing error correction.
Step 240 calculates the similarity between character sequences. Two character sequences whose similarity exceeds the preset threshold are easily confused in actual writing and lead to writing errors, so they can be determined to be a similar character sequence pair and saved in the typo dictionary to facilitate writing error correction. It should be noted that the preset threshold may be set by a technician; the present disclosure does not specifically limit this.
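The threshold filtering of step 250 can be sketched as follows. The data layout is an assumption made for illustration: the typo dictionary is modelled as a Python dict keyed by a pair of character sequences, the similarity of each pair is supplied by the caller, and the threshold value 0.6 and the listed similarity values are arbitrary illustrative inputs rather than values from the disclosure.

```python
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]


def build_typo_dictionary(
    scored_pairs: Iterable[Tuple[str, str, float]],
    threshold: float,
) -> Dict[Pair, float]:
    """Keep only pairs whose similarity is strictly greater than threshold."""
    return {
        (first, second): sim
        for first, second, sim in scored_pairs
        if sim > threshold
    }


# Illustrative input: (first sequence, second sequence, similarity).
scored = [("干", "于", 0.67), ("二", "坩", 0.25)]
typo_dictionary = build_typo_dictionary(scored, threshold=0.6)
print(typo_dictionary)
```

Storing the similarity alongside each pair lets a later error correction step rank candidate corrections instead of treating all stored pairs as equally confusable.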
In summary, the method for judging character similarity provided by this embodiment obtains the stroke-order sequences corresponding to two character sequences to be detected and obtains the minimum edit distance between the two stroke-order sequences. The minimum edit distance reflects the similarity between the two stroke-order sequences, and that similarity is in turn used to judge the similarity of the two character sequences to be detected. This achieves the technical effect of judging character similarity automatically, saves human resources, and makes the judgment of character similarity more objective. Further, the above method can be applied when building a wrong-word dictionary, which improves both the efficiency of building the dictionary and its objectivity.
Fig. 3 is a block diagram of an apparatus 300 for judging character similarity according to an exemplary embodiment. Referring to Fig. 3, the apparatus includes a character sequence acquisition module 310, a stroke-order sequence acquisition module 320, an edit distance acquisition module 330 and a similarity acquisition module 340.
The character sequence acquisition module 310 is configured to obtain a first character sequence and a second character sequence to be detected.
The technical solution of the present disclosure provides an apparatus that can automatically judge the glyph similarity of character sequences. Before the subsequent steps are carried out, the character sequences to be detected, that is, the first character sequence and the second character sequence, must therefore first be obtained. It should be noted that the first character sequence and the second character sequence can be Chinese character sequences, split Chinese character sequences, or other character sequences whose characters have fixed strokes written in a fixed order, such as Japanese character sequences or Korean character sequences; the present disclosure does not specifically limit this. In addition, the first character sequence and the second character sequence can each include one or more characters.
The character sequence acquisition module 310 can obtain the first character sequence and the second character sequence at random from a character database, or obtain them according to an operation by a user or technician; the present disclosure does not specifically limit the manner in which the first character sequence and the second character sequence are obtained.
The stroke-order sequence acquisition module 320 is configured to obtain a first stroke-order sequence and a second stroke-order sequence. The first stroke-order sequence consists of the strokes of the first character sequence obtained by the character sequence acquisition module 310, arranged in writing order; the second stroke-order sequence consists of the strokes of the second character sequence obtained by the character sequence acquisition module 310, arranged in writing order.
The present disclosure does not specifically limit how the stroke-order sequences are obtained. In a specific implementation, the stroke-order sequence acquisition module 320 can obtain them from a corresponding stroke-order sequence database according to the first character sequence and the second character sequence, or obtain them from input by a user or technician.
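Looking up stroke-order sequences in a database, as mentioned above, could look roughly like the following sketch. The table `STROKE_ORDER_DB`, its placeholder characters "P", "Q", "R" and the stroke labels are invented for illustration; a real stroke-order database and its stroke classification are outside what the disclosure specifies.

```python
# Hypothetical stroke-order table: placeholder characters mapped to stroke
# labels ("h" = horizontal, "v" = vertical), listed in writing order.
STROKE_ORDER_DB = {
    "P": ["h", "v", "h"],
    "Q": ["h", "v", "v", "h", "h"],
    "R": ["h", "v", "h", "h", "v", "v", "h", "h"],
}

def stroke_order_sequence(char_sequence, db=STROKE_ORDER_DB):
    """Concatenate the strokes of each character, in writing order."""
    strokes = []
    for char in char_sequence:
        if char not in db:
            raise KeyError(f"no stroke order recorded for {char!r}")
        strokes.extend(db[char])
    return strokes

# A two-character sequence can yield the same strokes as a single character.
print(stroke_order_sequence("PQ") == stroke_order_sequence("R"))  # True
```

Because a character sequence may contain several characters, the sketch simply concatenates the per-character strokes, which is one plausible reading of how a split character sequence and a whole character can end up with identical stroke-order sequences.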
The edit distance acquisition module 330 is configured to obtain the minimum edit distance between the first stroke-order sequence and the second stroke-order sequence obtained by the stroke-order sequence acquisition module 320, where the edit distance is the number of transformation operations needed to transform the first stroke-order sequence into the second stroke-order sequence.
In an embodiment of the present disclosure, the transformation operations include at least one of a delete-stroke operation, an insert-stroke operation and a replace-stroke operation. Specifically, the following recurrence can be used to calculate the minimum edit distance between the first stroke-order sequence and the second stroke-order sequence:
Dmin[i,j]=min(
editdistance[i-1,j]+1, // delete the stroke at position i of A
editdistance[i,j-1]+1, // insert a stroke at position i+1 of A
editdistance[i-1,j-1]+1 // replace the stroke at position i of A (in the usual formulation of this recurrence, nothing is added when the two strokes are already identical)
)
where A is the first stroke-order sequence, i is the stroke index in the first stroke-order sequence, j is the stroke index in the second stroke-order sequence, min() takes the minimum, Dmin is the minimum edit distance, and editdistance is the edit distance.
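As an illustration only, the recurrence above can be sketched in Python as a standard dynamic-programming edit distance. The function name `min_edit_distance` and the stroke labels in the example are assumptions of this sketch, and it adopts the usual convention that replacing a stroke with an identical stroke costs nothing, which is what makes the distance between two identical sequences 0.

```python
def min_edit_distance(a, b):
    """Minimum number of delete/insert/replace operations turning a into b."""
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] = minimum edit distance between a[:i] and b[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i strokes of a
    for j in range(cols):
        dist[0][j] = j  # insert all j strokes of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a stroke from a
                dist[i][j - 1] + 1,         # insert a stroke into a
                dist[i - 1][j - 1] + cost,  # replace (free if strokes match)
            )
    return dist[-1][-1]

# Hypothetical stroke labels: "h" = horizontal, "v" = vertical, "k" = hook.
print(min_edit_distance(["h", "h", "v"], ["h", "h", "k"]))  # 1
```

The full table is kept here for readability; only the previous row is actually needed, so a two-row variant would reduce memory if long sequences mattered.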
The similarity acquisition module 340 is configured to obtain the similarity between the first character sequence and the second character sequence according to the minimum edit distance obtained by the edit distance acquisition module 330.
In an embodiment of the present disclosure, the similarity acquisition module 340 is configured to:
calculate, according to a similarity calculation formula, the similarity between the first character sequence and the second character sequence obtained by the character sequence acquisition module 310, the similarity calculation formula being:
Sim=1-Dmin/max(A,B)
where Sim is the similarity between the first character sequence and the second character sequence, Dmin is the minimum edit distance, A is the number of strokes in the first stroke-order sequence, B is the number of strokes in the second stroke-order sequence, and max() takes the maximum.
max(A,B) is the theoretical maximum edit distance, that is, the edit distance in the worst case, in which the two stroke-order sequences have no strokes in common. It is easy to see that the theoretical maximum edit distance equals the number of strokes in whichever of the two stroke-order sequences has more strokes.
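The similarity formula can be checked against the two worked figures given earlier in the description (minimum edit distance 1 with 3 strokes each, and minimum edit distance 0 with 8 strokes each). The sketch below is a minimal illustration; the function name `similarity` and the handling of two empty sequences are assumptions, since the disclosure does not address that case.

```python
def similarity(d_min, strokes_a, strokes_b):
    """Sim = 1 - Dmin / max(A, B), with A and B the stroke counts."""
    longest = max(strokes_a, strokes_b)
    if longest == 0:
        # Assumption of this sketch: two empty sequences count as identical.
        return 1.0
    return 1.0 - d_min / longest

print(round(similarity(1, 3, 3), 2))  # 0.67
print(similarity(0, 8, 8))            # 1.0
```

Since the minimum edit distance can never exceed the longer stroke count, the result always falls between 0 and 1.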
In summary, the apparatus for judging character similarity provided by this embodiment obtains the stroke-order sequences corresponding to two character sequences to be detected and obtains the minimum edit distance between the two stroke-order sequences. The minimum edit distance reflects the similarity between the two stroke-order sequences, and that similarity is in turn used to judge the similarity of the two character sequences to be detected. This achieves the technical effect of judging character similarity automatically, saves human resources, and makes the judgment of character similarity more objective.
As for the apparatus in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.
Fig. 4 is a block diagram of an apparatus 400 for judging character similarity according to an exemplary embodiment. For example, the apparatus 400 can be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, and so on.
Referring to Fig. 4, the apparatus 400 can include one or more of the following components: a processing component 402, a memory 408, a power component 406, a multimedia component 408, an audio component 410, an input/output (I/O) interface 412, a sensor component 414 and a communication component 416.
The processing component 402 generally controls the overall operation of the apparatus 400, such as operations associated with display, telephone calls, data communication, camera operation and recording. The processing component 402 can include one or more processors 420 to execute instructions so as to complete all or part of the steps of the above method. In addition, the processing component 402 can include one or more modules that facilitate interaction between the processing component 402 and other components. For example, the processing component 402 can include a multimedia module to facilitate interaction between the multimedia component 408 and the processing component 402.
The memory 408 is configured to store various types of data to support operation of the apparatus 400. Examples of such data include instructions for any application or method operated on the apparatus 400, contact data, phone book data, messages, pictures, videos, and so on. The memory 408 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disc.
The power component 406 supplies power to the various components of the apparatus 400. The power component 406 can include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the apparatus 400.
The multimedia component 408 includes a screen that provides an output interface between the apparatus 400 and the user. In some embodiments, the screen can include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen can be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes and gestures on the touch panel. The touch sensors can sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 408 includes a front camera and/or a rear camera. When the apparatus 400 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 410 is configured to output and/or input audio signals. For example, the audio component 410 includes a microphone (MIC), which is configured to receive external audio signals when the apparatus 400 is in an operating mode, such as a call mode, a recording mode or a voice recognition mode. The received audio signal can be further stored in the memory 408 or sent via the communication component 416. In some embodiments, the audio component 410 also includes a loudspeaker for outputting audio signals.
The I/O interface 412 provides an interface between the processing component 402 and peripheral interface modules, which can be a keyboard, a click wheel, buttons, and so on. These buttons can include, but are not limited to, a home button, a volume button, a start button and a lock button.
The sensor component 414 includes one or more sensors for providing the apparatus 400 with status assessments of various aspects. For example, the sensor component 414 can detect the open/closed state of the apparatus 400 and the relative positioning of components, such as the display and keypad of the apparatus 400; the sensor component 414 can also detect a change in position of the apparatus 400 or of one of its components, the presence or absence of user contact with the apparatus 400, the orientation or acceleration/deceleration of the apparatus 400, and a change in its temperature. The sensor component 414 can include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 414 can also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 414 can also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
The communication component 416 is configured to facilitate wired or wireless communication between the apparatus 400 and other devices. The apparatus 400 can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 416 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 416 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio-frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
In an exemplary embodiment, the apparatus 400 can be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for performing the above method.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 408 including instructions, which can be executed by the processor 420 of the apparatus 400 to complete the above method. For example, the non-transitory computer-readable storage medium can be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and so on.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided; when instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to perform the following method: obtaining a first character sequence and a second character sequence to be detected; obtaining a first stroke-order sequence and a second stroke-order sequence, the first stroke-order sequence being the strokes of the first character sequence arranged in writing order, and the second stroke-order sequence being the strokes of the second character sequence arranged in writing order; obtaining the minimum edit distance between the first stroke-order sequence and the second stroke-order sequence, the edit distance being the number of transformation operations needed to transform the first stroke-order sequence into the second stroke-order sequence; and obtaining, according to the minimum edit distance, the similarity between the first character sequence and the second character sequence.
Fig. 5 is a block diagram of an apparatus 500 for judging character similarity according to an exemplary embodiment. For example, the apparatus 500 can be provided as a server. Referring to Fig. 5, the apparatus 500 includes a processing component 522, which further includes one or more processors, and memory resources represented by a memory 532 for storing instructions executable by the processing component 522, such as application programs. The application programs stored in the memory 532 can include one or more modules, each corresponding to a set of instructions. In addition, the processing component 522 is configured to execute the instructions so as to perform the following method: obtaining a first character sequence and a second character sequence to be detected; obtaining a first stroke-order sequence and a second stroke-order sequence, the first stroke-order sequence being the strokes of the first character sequence arranged in writing order, and the second stroke-order sequence being the strokes of the second character sequence arranged in writing order; obtaining the minimum edit distance between the first stroke-order sequence and the second stroke-order sequence, the edit distance being the number of transformation operations needed to transform the first stroke-order sequence into the second stroke-order sequence; and obtaining, according to the minimum edit distance, the similarity between the first character sequence and the second character sequence.
The apparatus 500 can also include a power component 526 configured to perform power management of the apparatus 500, a wired or wireless network interface 550 configured to connect the apparatus 500 to a network, and an input/output (I/O) interface 558. The apparatus 500 can operate based on an operating system stored in the memory 532, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
Other embodiments of the present disclosure will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed here. This application is intended to cover any variations, uses or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art not disclosed in the present disclosure. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (11)

1. A method for judging character similarity, characterized in that the method comprises:
obtaining a first character sequence and a second character sequence to be detected;
obtaining a first stroke-order sequence and a second stroke-order sequence, the first stroke-order sequence being the strokes of the first character sequence arranged in writing order, and the second stroke-order sequence being the strokes of the second character sequence arranged in writing order;
obtaining a minimum edit distance between the first stroke-order sequence and the second stroke-order sequence, the edit distance being the number of transformation operations needed to transform the first stroke-order sequence into the second stroke-order sequence; and
obtaining, according to the minimum edit distance, the similarity between the first character sequence and the second character sequence.
2. The method according to claim 1, characterized in that obtaining, according to the minimum edit distance, the similarity between the first character sequence and the second character sequence comprises:
calculating the similarity between the first character sequence and the second character sequence according to a similarity calculation formula, the similarity calculation formula being:
Sim=1-Dmin/max(A,B)
where Sim is the similarity between the first character sequence and the second character sequence, Dmin is the minimum edit distance, A is the number of strokes in the first stroke-order sequence, B is the number of strokes in the second stroke-order sequence, and max() takes the maximum.
3. The method according to claim 1, characterized in that the transformation operations comprise at least one of a delete-stroke operation, an insert-stroke operation and a replace-stroke operation.
4. The method according to claim 1, characterized in that the first character sequence and the second character sequence comprise one or more characters.
5. The method according to claim 1, characterized in that the first character sequence and the second character sequence are split Chinese characters.
6. An apparatus for judging character similarity, characterized in that the apparatus comprises:
a character sequence acquisition module, configured to obtain a first character sequence and a second character sequence to be detected;
a stroke-order sequence acquisition module, configured to obtain a first stroke-order sequence and a second stroke-order sequence, the first stroke-order sequence being the strokes, arranged in writing order, of the first character sequence obtained by the character sequence acquisition module, and the second stroke-order sequence being the strokes, arranged in writing order, of the second character sequence obtained by the character sequence acquisition module;
an edit distance acquisition module, configured to obtain a minimum edit distance between the first stroke-order sequence and the second stroke-order sequence obtained by the stroke-order sequence acquisition module, the edit distance being the number of transformation operations needed to transform the first stroke-order sequence into the second stroke-order sequence; and
a similarity acquisition module, configured to obtain, according to the minimum edit distance obtained by the edit distance acquisition module, the similarity between the first character sequence and the second character sequence.
7. The apparatus according to claim 6, characterized in that the similarity acquisition module is configured to:
calculate, according to a similarity calculation formula, the similarity between the first character sequence and the second character sequence obtained by the character sequence acquisition module, the similarity calculation formula being:
Sim=1-Dmin/max(A,B)
where Sim is the similarity between the first character sequence and the second character sequence, Dmin is the minimum edit distance, A is the number of strokes in the first stroke-order sequence, B is the number of strokes in the second stroke-order sequence, and max() takes the maximum.
8. The apparatus according to claim 6, characterized in that the transformation operations comprise at least one of a delete-stroke operation, an insert-stroke operation and a replace-stroke operation.
9. The apparatus according to claim 6, characterized in that the first character sequence and the second character sequence comprise one or more characters.
10. The apparatus according to claim 6, characterized in that the first character sequence and the second character sequence are split Chinese characters.
11. An apparatus for judging character similarity, characterized in that the apparatus comprises:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
obtain a first character sequence and a second character sequence to be detected;
obtain a first stroke-order sequence and a second stroke-order sequence, the first stroke-order sequence being the strokes of the first character sequence arranged in writing order, and the second stroke-order sequence being the strokes of the second character sequence arranged in writing order;
obtain a minimum edit distance between the first stroke-order sequence and the second stroke-order sequence, the edit distance being the number of transformation operations needed to transform the first stroke-order sequence into the second stroke-order sequence; and
obtain, according to the minimum edit distance, the similarity between the first character sequence and the second character sequence.
CN201510917453.5A 2015-12-10 2015-12-10 Character similarity judgment method and device Pending CN105608462A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510917453.5A CN105608462A (en) 2015-12-10 2015-12-10 Character similarity judgment method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510917453.5A CN105608462A (en) 2015-12-10 2015-12-10 Character similarity judgment method and device

Publications (1)

Publication Number Publication Date
CN105608462A true CN105608462A (en) 2016-05-25

Family

ID=55988386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510917453.5A Pending CN105608462A (en) 2015-12-10 2015-12-10 Character similarity judgment method and device

Country Status (1)

Country Link
CN (1) CN105608462A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108664957A (en) * 2017-03-31 2018-10-16 杭州海康威视数字技术股份有限公司 Number-plate number matching process and device, character information matching process and device
CN108763468A (en) * 2018-05-29 2018-11-06 周宇 Dictionary sequence processing method, device and e-learning equipment
CN110069753A (en) * 2018-01-24 2019-07-30 北京京东尚科信息技术有限公司 A kind of method and apparatus generating similarity information
CN110097002A (en) * 2019-04-30 2019-08-06 北京达佳互联信息技术有限公司 Nearly word form determines method, apparatus, computer equipment and storage medium
CN110287286A (en) * 2019-06-13 2019-09-27 北京百度网讯科技有限公司 The determination method, apparatus and storage medium of short text similarity
CN110377914A (en) * 2019-07-25 2019-10-25 腾讯科技(深圳)有限公司 Character identifying method, device and storage medium
CN110413990A (en) * 2019-06-20 2019-11-05 平安科技(深圳)有限公司 The configuration method of term vector, device, storage medium, electronic device
CN110717158A (en) * 2019-09-06 2020-01-21 平安普惠企业管理有限公司 Information verification method, device, equipment and computer readable storage medium
CN111222590A (en) * 2019-12-31 2020-06-02 咪咕文化科技有限公司 Font-near word determining method, electronic device and computer-readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298582A (en) * 2010-06-23 2011-12-28 商业对象软件有限公司 Data searching and matching method and system
CN102393850A (en) * 2011-07-22 2012-03-28 镇江诺尼基智能技术有限公司 Chinese character pattern cognition similarity computing method
CN103970798A (en) * 2013-02-04 2014-08-06 商业对象软件有限公司 Technology for searching and matching data


Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108664957A (en) * 2017-03-31 2018-10-16 杭州海康威视数字技术股份有限公司 Number-plate number matching process and device, character information matching process and device
CN108664957B (en) * 2017-03-31 2021-08-24 杭州海康威视数字技术股份有限公司 License plate number matching method and device, and character information matching method and device
US11093782B2 (en) 2017-03-31 2021-08-17 Hangzhou Hikvision Digital Technology Co., Ltd. Method for matching license plate number, and method and electronic device for matching character information
CN110069753A (en) * 2018-01-24 2019-07-30 北京京东尚科信息技术有限公司 A kind of method and apparatus generating similarity information
CN108763468B (en) * 2018-05-29 2021-06-22 周宇 Dictionary sorting processing method and device and electronic learning equipment
CN108763468A (en) * 2018-05-29 2018-11-06 周宇 Dictionary sequence processing method, device and e-learning equipment
CN110097002A (en) * 2019-04-30 2019-08-06 北京达佳互联信息技术有限公司 Nearly word form determines method, apparatus, computer equipment and storage medium
CN110287286A (en) * 2019-06-13 2019-09-27 北京百度网讯科技有限公司 The determination method, apparatus and storage medium of short text similarity
CN110413990A (en) * 2019-06-20 2019-11-05 平安科技(深圳)有限公司 The configuration method of term vector, device, storage medium, electronic device
CN110377914A (en) * 2019-07-25 2019-10-25 腾讯科技(深圳)有限公司 Character identifying method, device and storage medium
CN110717158A (en) * 2019-09-06 2020-01-21 平安普惠企业管理有限公司 Information verification method, device, equipment and computer readable storage medium
CN110717158B (en) * 2019-09-06 2024-03-01 冉维印 Information verification method, device, equipment and computer readable storage medium
CN111222590A (en) * 2019-12-31 2020-06-02 咪咕文化科技有限公司 Font-near word determining method, electronic device and computer-readable storage medium
CN111222590B (en) * 2019-12-31 2024-04-12 咪咕文化科技有限公司 Shape-near-word determining method, electronic device, and computer-readable storage medium


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160525
