CN106815593A

CN106815593A - Method and apparatus for determining Chinese text similarity

Info

Publication number: CN106815593A
Application number: CN201510850305.6A
Authority: CN
Inventors: 刘粉香
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2015-11-27
Filing date: 2015-11-27
Publication date: 2017-06-09
Anticipated expiration: 2035-11-27
Also published as: CN106815593B

Abstract

This application discloses a method and apparatus for determining Chinese text similarity. The method includes: converting the Chinese characters in a first Chinese text into pinyin to obtain a first pinyin text, and converting the Chinese characters in a second Chinese text into pinyin to obtain a second pinyin text; counting, according to the rules of Hanyu Pinyin, the number of each kind of pinyin unit in the first pinyin text and in the second pinyin text; generating a first feature vector from the counts for the first pinyin text and a second feature vector from the counts for the second pinyin text; calculating the distance between the first feature vector and the second feature vector; and determining the similarity between the first Chinese text and the second Chinese text according to the distance, where the smaller the distance, the higher the similarity. The application addresses the technical problem that the prior art has difficulty effectively identifying similar texts that arise from misspelling.

Description

Method and apparatus for determining Chinese text similarity

Technical field

This application relates to the field of text processing, and in particular to a method and apparatus for determining Chinese text similarity.

Background

When analyzing text, it is often necessary to perform error correction, that is, to correct wrongly written words that appear in the text. For example, given the user input 危险拉面 ("dangerous ramen"), the system should recognize that the word the user probably intended is the similar text 味千拉面 (Ajisen Ramen, literally "taste thousand ramen"). Existing methods for determining similar texts mainly count the number of matching characters between two character strings: the more matching characters, the higher the similarity of the texts.

However, the inventors found that prior-art schemes have difficulty effectively identifying similar texts that arise from misspelling. For example, such a scheme rates the similarity between 千叶拉面 ("Chiba ramen") and 味千拉面 ("taste thousand ramen") higher than the similarity between 危险拉面 ("dangerous ramen") and 味千拉面.

No effective solution to the above problem has yet been proposed.

Summary of the invention

The embodiments of this application provide a method and apparatus for determining Chinese text similarity, to at least solve the technical problem that the prior art has difficulty effectively identifying similar texts that arise from misspelling.

According to one aspect of the embodiments of this application, a method for determining Chinese text similarity is provided, including: converting the Chinese characters in a first Chinese text into pinyin to obtain a first pinyin text, and converting the Chinese characters in a second Chinese text into pinyin to obtain a second pinyin text; counting, according to the rules of Hanyu Pinyin, the number of each kind of pinyin unit in the first pinyin text and the number of each kind of pinyin unit in the second pinyin text; generating a first feature vector from the numbers of each kind of pinyin unit in the first pinyin text, and a second feature vector from the numbers of each kind of pinyin unit in the second pinyin text; calculating the distance between the first feature vector and the second feature vector; and determining the similarity between the first Chinese text and the second Chinese text according to the distance, where the smaller the distance, the higher the similarity between the first Chinese text and the second Chinese text.

Further, counting, according to the rules of Hanyu Pinyin, the number of each kind of pinyin unit in the first pinyin text and in the second pinyin text includes: taking each initial in the pinyin of a Chinese character as one pinyin unit and each final as one pinyin unit, and counting the number of each kind of initial and each kind of final in the first pinyin text and in the second pinyin text.

Further, counting, according to the rules of Hanyu Pinyin, the number of each kind of pinyin unit in the first pinyin text and in the second pinyin text includes: taking each whole syllable (a syllable read as a whole rather than spelled out) in the pinyin of a Chinese character as one pinyin unit, taking each initial of pinyin that is not a whole syllable as one pinyin unit, and taking each final of pinyin that is not a whole syllable as one pinyin unit; and counting the number of each kind of initial, each kind of final and each kind of whole syllable in the first pinyin text and in the second pinyin text.

Further, generating the first feature vector from the numbers of each kind of pinyin unit in the first pinyin text and the second feature vector from the numbers of each kind of pinyin unit in the second pinyin text includes: inserting the number of each kind of pinyin unit in the first pinyin text into the position of the corresponding dimension of a preset vector to obtain the first feature vector, and inserting the number of each kind of pinyin unit in the second pinyin text into the position of the corresponding dimension of the preset vector to obtain the second feature vector, where the preset vector is a vector having multiple dimensions in one-to-one correspondence with the kinds of pinyin unit arranged in a preset order.

Further, calculating the distance between the first feature vector and the second feature vector includes: calculating the difference between each pair of corresponding dimensions in the first feature vector and the second feature vector; and taking the absolute value of each difference and adding the absolute values to obtain the distance.

According to another aspect of the embodiments of this application, an apparatus for determining Chinese text similarity is also provided, including: a conversion unit, configured to convert the Chinese characters in a first Chinese text into pinyin to obtain a first pinyin text, and to convert the Chinese characters in a second Chinese text into pinyin to obtain a second pinyin text; a statistics unit, configured to count, according to the rules of Hanyu Pinyin, the number of each kind of pinyin unit in the first pinyin text and the number of each kind of pinyin unit in the second pinyin text; a generation unit, configured to generate a first feature vector from the numbers of each kind of pinyin unit in the first pinyin text and a second feature vector from the numbers of each kind of pinyin unit in the second pinyin text; a calculation unit, configured to calculate the distance between the first feature vector and the second feature vector; and a determination unit, configured to determine the similarity between the first Chinese text and the second Chinese text according to the distance, where the smaller the distance, the higher the similarity between the first Chinese text and the second Chinese text.

Further, the statistics unit is specifically configured to take each initial in the pinyin of a Chinese character as one pinyin unit and each final as one pinyin unit, and to count the number of each kind of initial and each kind of final in the first pinyin text and in the second pinyin text.

Further, the statistics unit is specifically configured to take each whole syllable in the pinyin of a Chinese character as one pinyin unit, each initial of pinyin that is not a whole syllable as one pinyin unit, and each final of pinyin that is not a whole syllable as one pinyin unit, and to count the number of each kind of initial, each kind of final and each kind of whole syllable in the first pinyin text and in the second pinyin text.

Further, the generation unit is specifically configured to insert the number of each kind of pinyin unit in the first pinyin text into the position of the corresponding dimension of a preset vector to obtain the first feature vector, and to insert the number of each kind of pinyin unit in the second pinyin text into the position of the corresponding dimension of the preset vector to obtain the second feature vector, where the preset vector is a vector having multiple dimensions in one-to-one correspondence with the kinds of pinyin unit arranged in a preset order.

Further, the calculation unit includes: a first calculation module, configured to calculate the difference between each pair of corresponding dimensions in the first feature vector and the second feature vector; and a second calculation module, configured to take the absolute value of each difference and add the absolute values to obtain the distance.

According to the embodiments of this application, the Chinese characters in the first Chinese text are converted into pinyin to obtain a first pinyin text, and the Chinese characters in the second Chinese text are converted into pinyin to obtain a second pinyin text; the number of each kind of pinyin unit in the first pinyin text and in the second pinyin text is counted according to the rules of Hanyu Pinyin; a first feature vector is generated from the counts for the first pinyin text and a second feature vector from the counts for the second pinyin text; the distance between the first feature vector and the second feature vector is calculated; and the similarity between the first Chinese text and the second Chinese text is determined according to the distance, where the smaller the distance, the higher the similarity. This solves the technical problem that the prior art has difficulty effectively identifying similar texts that arise from misspelling, and makes it possible to identify such texts.

Brief description of the drawings

The drawings described here are provided for a further understanding of this application and form a part of it. The illustrative embodiments of this application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:

Fig. 1 is a flowchart of the method for determining Chinese text similarity according to an embodiment of this application;

Fig. 2 is a schematic diagram of the apparatus for determining Chinese text similarity according to an embodiment of this application.

Detailed description

To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only some of the embodiments of this application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in this application, without creative effort, shall fall within the scope of protection of this application.

It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings of this application are used to distinguish similar objects, not to describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of this application described here can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product or device.

According to the embodiments of this application, a method embodiment of a method for determining Chinese text similarity is provided. It should be noted that the steps shown in the flowchart of the drawings may be executed in a computer system, such as one running a set of computer-executable instructions, and that although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order.

Fig. 1 is a flowchart of the method for determining Chinese text similarity according to an embodiment of this application. As shown in Fig. 1, the method includes the following steps:

Step S102: convert the Chinese characters in the first Chinese text into pinyin to obtain a first pinyin text, and convert the Chinese characters in the second Chinese text into pinyin to obtain a second pinyin text.

The first Chinese text and the second Chinese text may each be an article, a sentence, a phrase, and so on; they are the two texts whose similarity is to be determined. In this embodiment, the first Chinese text and the second Chinese text are each converted into a pinyin text: every character in the Chinese text is converted into its corresponding pinyin, forming the pinyin text. For example, 兴高采烈 ("in high spirits") is converted into "xing gao cai lie".
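Step S102 can be sketched as a per-character lookup. The sketch below is illustrative only: `TOY_PINYIN` and `to_pinyin_text` are hypothetical names, the table covers just the four characters assumed for the "xing gao cai lie" example, and a real system would need a full character-to-pinyin dictionary and a policy for characters with several readings.

```python
# Toy lookup table (an assumption for illustration): it covers only the
# characters of the example phrase, written without tone marks.
TOY_PINYIN = {"兴": "xing", "高": "gao", "采": "cai", "烈": "lie"}


def to_pinyin_text(chinese_text, table=TOY_PINYIN):
    """Convert each Chinese character to its pinyin and join with spaces."""
    syllables = []
    for char in chinese_text:
        if char not in table:
            # The source does not say how unknown characters are handled;
            # failing loudly is a choice made for this sketch.
            raise KeyError(f"no pinyin known for character {char!r}")
        syllables.append(table[char])
    return " ".join(syllables)


print(to_pinyin_text("兴高采烈"))  # xing gao cai lie
```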

Step S104: count, according to the rules of Hanyu Pinyin, the number of each kind of pinyin unit in the first pinyin text and the number of each kind of pinyin unit in the second pinyin text.

The spelling rule of Hanyu Pinyin is initial plus final, that is, the pinyin of each Chinese character consists of one or more pinyin units, where initials and finals can serve as pinyin units. Because Hanyu Pinyin also includes whole syllables (syllables read as a whole rather than spelled out), these whole syllables can also serve as pinyin units.

For example, "xing gao cai lie" above can be split into the pinyin units "x", "ing", "g", "ao", "c", "ai", "l" and "ie", each with a count of 1. For the pinyin text "gao gao xing xing", the counts of "g", "ao", "x" and "ing" are each 2.
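The counting in step S104 can be sketched as follows. This is a minimal sketch, not the patent's implementation: `INITIALS`, `split_syllable` and `count_units` are hypothetical names, the initial list is the commonly taught set of 23 (including "y" and "w"), and the final is simply assumed to be whatever remains after the initial.

```python
from collections import Counter

# Assumed initial inventory; two-letter initials come first so that "zh"
# is matched before "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]


def split_syllable(syllable):
    """Split one pinyin syllable into [initial, final], or [final] alone."""
    for initial in INITIALS:
        if syllable.startswith(initial) and len(syllable) > len(initial):
            return [initial, syllable[len(initial):]]
    return [syllable]  # final-only syllable such as "ai"


def count_units(pinyin_text):
    """Count every pinyin unit in a space-separated pinyin text."""
    counts = Counter()
    for syllable in pinyin_text.lower().split():
        counts.update(split_syllable(syllable))
    return counts


print(count_units("xing gao cai lie"))   # every unit occurs once
print(count_units("gao gao xing xing"))  # g, ao, x, ing occur twice each
```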

Step S106: generate a first feature vector from the numbers of each kind of pinyin unit in the first pinyin text, and a second feature vector from the numbers of each kind of pinyin unit in the second pinyin text.

After the number of each kind of pinyin unit in the two pinyin texts has been counted, a corresponding feature vector is generated from those counts. The feature vector may be a vector with multiple dimensions, and the first feature vector and the second feature vector have the same number of dimensions.

Optionally, the feature vectors may be generated in either of two ways. In the first, all kinds of pinyin unit in current Hanyu Pinyin are sorted in a preset order, each kind of pinyin unit corresponds to one dimension of the feature vector, and the count of each kind of pinyin unit in a pinyin text is the value of the corresponding dimension. In the second, the kinds of pinyin unit that appear in the two pinyin texts are collected and feature vectors are generated with as many dimensions as there are kinds; the count of each kind of pinyin unit in each pinyin text is the value of the corresponding dimension in that text's feature vector. For example, for the two pinyin texts "gao gao xing xing" and "gao gao xin xin", the kinds of pinyin unit are "g", "ao", "x", "ing" and "in", so the generated feature vectors have 5 dimensions. With the order ("g", "ao", "x", "ing", "in"), the feature vector of the first pinyin text (the first feature vector) is [2,2,2,2,0], and the feature vector of the second pinyin text (the second feature vector) is [2,2,2,0,2].
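The second generating mode (dimensions taken from the units that actually occur in the two texts) can be sketched as below. `feature_vectors` is a hypothetical helper, and ordering the dimensions by first appearance is an assumption; the text only requires that both vectors share one order.

```python
def feature_vectors(units_a, units_b):
    """Build two count vectors over the unit kinds seen in either text."""
    order = []
    for unit in units_a + units_b:
        if unit not in order:
            order.append(unit)  # first-appearance order (an assumption)
    vector_a = [units_a.count(unit) for unit in order]
    vector_b = [units_b.count(unit) for unit in order]
    return order, vector_a, vector_b


# Units of "gao gao xing xing" and "gao gao xin xin", already split.
first = ["g", "ao", "g", "ao", "x", "ing", "x", "ing"]
second = ["g", "ao", "g", "ao", "x", "in", "x", "in"]

order, v1, v2 = feature_vectors(first, second)
print(order)  # ['g', 'ao', 'x', 'ing', 'in']
print(v1)     # [2, 2, 2, 2, 0]
print(v2)     # [2, 2, 2, 0, 2]
```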

Step S108: calculate the distance between the first feature vector and the second feature vector.

Step S110: determine the similarity between the first Chinese text and the second Chinese text according to the distance, where the smaller the distance, the higher the similarity between the first Chinese text and the second Chinese text.

After the first feature vector and the second feature vector are generated, the distance between the two vectors is calculated; the distance may be, for example, the Euclidean distance. The similarity between the two Chinese texts is then determined from the calculated distance: the larger the distance, the lower the similarity, and the smaller the distance, the higher the similarity. For example, the similarity determined between 千叶拉面 ("Chiba ramen") and 味千拉面 ("taste thousand ramen") is lower than the similarity between 危险拉面 ("dangerous ramen") and 味千拉面, so the similar text of a misspelled text can be identified.
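Steps S108 and S110 with the Euclidean distance can be sketched as follows. The text only states that a smaller distance means a higher similarity; the mapping `1 / (1 + distance)` used in `similarity` is an assumption chosen for illustration, and both function names are hypothetical.

```python
import math


def euclidean_distance(v1, v2):
    """Euclidean distance between two equal-length feature vectors."""
    if len(v1) != len(v2):
        raise ValueError("feature vectors must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


def similarity(v1, v2):
    """Hypothetical score in (0, 1]: decreases as the distance grows."""
    return 1.0 / (1.0 + euclidean_distance(v1, v2))


# Feature vectors of "gao gao xing xing" and "gao gao xin xin".
first = [2, 2, 2, 2, 0]
second = [2, 2, 2, 0, 2]

print(euclidean_distance(first, second))  # sqrt(8), about 2.83
print(similarity(first, first))           # 1.0 for identical vectors
```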

Preferably, counting, according to the rules of Hanyu Pinyin, the number of each kind of pinyin unit in the first pinyin text and in the second pinyin text includes: taking each initial in the pinyin of a Chinese character as one pinyin unit and each final as one pinyin unit, and counting the number of each kind of initial and each kind of final in the first pinyin text and in the second pinyin text.

Existing Hanyu Pinyin uses the Latin alphabet and divides syllables into initials and finals, so the pinyin of each Chinese character can be split into an initial and a final (some characters have only a final, such as 爱, "love", pronounced "ai"). In this embodiment, each initial is one pinyin unit and each final is one pinyin unit; the pinyin of each character in the pinyin text is split into an initial and a final, and the number of each kind of initial and each kind of final is counted.

Optionally, counting, according to the rules of Hanyu Pinyin, the number of each kind of pinyin unit in the first pinyin text and in the second pinyin text includes: taking each whole syllable in the pinyin of a Chinese character as one pinyin unit, each initial of pinyin that is not a whole syllable as one pinyin unit, and each final of pinyin that is not a whole syllable as one pinyin unit; and counting the number of each kind of initial, each kind of final and each kind of whole syllable in the first pinyin text and in the second pinyin text.

Hanyu Pinyin contains syllables that are still pronounced like the initial after a final is added (or still pronounced like the final after an initial is added); these are the whole syllables, which are read as a whole rather than spelled out. In this embodiment, a whole syllable is treated as one pinyin unit, while for pinyin that is not a whole syllable the initial and the final are each treated as pinyin units, and the number of each kind of pinyin unit is counted. For example, Hanyu Pinyin includes 23 initials, 24 finals and 16 whole syllables, so there are 63 kinds of pinyin unit.
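The whole-syllable variant can be sketched as a check that runs before the initial/final split. The source gives only the totals (23, 24, 16); the concrete lists below are the inventory commonly taught for Hanyu Pinyin and are an assumption here, as are the names `WHOLE_SYLLABLES`, `INITIALS` and `split_units`.

```python
# Assumed inventory of the 16 whole syllables.
WHOLE_SYLLABLES = {"zhi", "chi", "shi", "ri", "zi", "ci", "si", "yi",
                   "wu", "yu", "ye", "yue", "yuan", "yin", "yun", "ying"}

# Assumed inventory of the 23 initials, two-letter ones first.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]


def split_units(syllable):
    """Return the pinyin units of one syllable, keeping whole syllables intact."""
    if syllable in WHOLE_SYLLABLES:
        return [syllable]  # counted as a single unit
    for initial in INITIALS:
        if syllable.startswith(initial) and len(syllable) > len(initial):
            return [initial, syllable[len(initial):]]
    return [syllable]  # final-only syllable


print(split_units("shi"))    # ['shi']
print(split_units("shang"))  # ['sh', 'ang']
print(split_units("ye"))     # ['ye']
```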

Preferably, generating the first feature vector from the numbers of each kind of pinyin unit in the first pinyin text and the second feature vector from the numbers of each kind of pinyin unit in the second pinyin text includes: inserting the number of each kind of pinyin unit in the first pinyin text into the position of the corresponding dimension of a preset vector to obtain the first feature vector, and inserting the number of each kind of pinyin unit in the second pinyin text into the position of the corresponding dimension of the preset vector to obtain the second feature vector, where the preset vector is a vector having multiple dimensions in one-to-one correspondence with the kinds of pinyin unit arranged in a preset order.

In this embodiment, each dimension of the preset vector represents one kind of pinyin unit, and in a generated feature vector the value of each dimension is the counted number of times the corresponding pinyin unit occurs in the respective pinyin text. All pinyin units are sorted in a preset order and correspond to the dimensions of the preset vector; the preset order may be chosen arbitrarily.

For example, in the above embodiment in which pinyin units are counted by initial, final and whole syllable, the counts of all initials, finals and whole syllables in the two pinyin texts are inserted into the 63-dimensional preset vector to generate the feature vectors of the two pinyin texts, where the 63 dimensions are the sum of the numbers of all initials, finals and whole syllables in pinyin. For instance, the pinyin of 高高兴兴 ("happy") is "gao gao xing xing", and the counts of "g", "ao", "x" and "ing" are each 2. In the 63-dimensional pronunciation feature vector of 高高兴兴, the positions of those initials and finals therefore hold 2 and all other positions hold 0, giving the feature vector [..., 2, ..., 2, ..., 2, ..., 2, ...] (the omitted entries are 0).
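A sketch of the 63-dimensional preset vector follows. The source fixes only the totals and says the preset order is arbitrary; the concrete unit lists (the commonly taught inventory) and the order initials, finals, whole syllables are assumptions, and `to_feature_vector` is a hypothetical name.

```python
INITIALS = ["b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h", "j",
            "q", "x", "zh", "ch", "sh", "r", "z", "c", "s", "y", "w"]
FINALS = ["a", "o", "e", "i", "u", "ü", "ai", "ei", "ui", "ao", "ou", "iu",
          "ie", "üe", "er", "an", "en", "in", "un", "ün", "ang", "eng",
          "ing", "ong"]
WHOLE_SYLLABLES = ["zhi", "chi", "shi", "ri", "zi", "ci", "si", "yi",
                   "wu", "yu", "ye", "yue", "yuan", "yin", "yun", "ying"]

# Preset order: one dimension per kind of pinyin unit (23 + 24 + 16 = 63).
PRESET_ORDER = INITIALS + FINALS + WHOLE_SYLLABLES
INDEX = {unit: position for position, unit in enumerate(PRESET_ORDER)}


def to_feature_vector(unit_counts):
    """Insert the counts into a zero-filled vector laid out by PRESET_ORDER."""
    vector = [0] * len(PRESET_ORDER)
    for unit, count in unit_counts.items():
        # A unit outside the assumed inventory raises KeyError; the source
        # does not say how such units should be handled.
        vector[INDEX[unit]] = count
    return vector


# Counts for "gao gao xing xing".
vector = to_feature_vector({"g": 2, "ao": 2, "x": 2, "ing": 2})
print(len(vector))                                   # 63
print([PRESET_ORDER[i] for i, v in enumerate(vector) if v])  # ['g', 'x', 'ao', 'ing']
```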

In the embodiments of this application, a predefined preset vector is used, so generating a feature vector only requires inserting the counted numbers of pinyin units into the preset vector, which keeps the generation simple.

Preferably, calculating the distance between the first feature vector and the second feature vector includes: calculating the difference between each pair of corresponding dimensions in the first feature vector and the second feature vector; and taking the absolute value of each difference and adding the absolute values to obtain the distance.

The distance between the two feature vectors can be calculated with, for example, the 1-norm: take the absolute value of the difference at each corresponding position (the values of corresponding dimensions) of the two vectors and add the absolute values. The resulting number is the distance between the two pinyin texts; the smaller the number, the higher the similarity. For example, the similarity between 危险拉面 ("dangerous ramen") and 味千拉面 ("taste thousand ramen") is higher than the similarity between 千叶拉面 ("Chiba ramen") and 味千拉面.
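The 1-norm distance can be sketched as below. The ramen vectors are an illustration under stated assumptions: the three phrases are assumed to read "wei qian la mian", "wei xian la mian" and "qian ye la mian" in pinyin, "ye" is treated as a whole syllable, the final is whatever follows the initial, and the dimension order is chosen arbitrarily. `l1_distance` is a hypothetical name.

```python
def l1_distance(v1, v2):
    """Sum of absolute differences over corresponding dimensions (1-norm)."""
    if len(v1) != len(v2):
        raise ValueError("feature vectors must have the same number of dimensions")
    return sum(abs(a - b) for a, b in zip(v1, v2))


# Vectors from the "gao gao xing xing" / "gao gao xin xin" example.
print(l1_distance([2, 2, 2, 2, 0], [2, 2, 2, 0, 2]))  # 4

# Assumed dimension order: w, ei, q, ian, l, a, m, x, ye
target = [1, 1, 1, 2, 1, 1, 1, 0, 0]  # "wei qian la mian" (intended text)
typo = [1, 1, 0, 2, 1, 1, 1, 1, 0]    # "wei xian la mian" (misspelled input)
other = [0, 0, 1, 2, 1, 1, 1, 0, 1]   # "qian ye la mian" (shares more characters)

print(l1_distance(target, typo))   # 2
print(l1_distance(target, other))  # 3
```

Under these assumptions the misspelled input is closer to the intended text than the candidate that merely shares more characters with it, which is the ordering the paragraph above describes.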

In the embodiments of this application, determining the similarity of two Chinese texts is converted into judging the distance between two vectors, which improves the accuracy and speed of identifying similar texts.

The embodiments of this application also provide an apparatus for determining Chinese text similarity, which can be used to perform the method for determining Chinese text similarity of the embodiments of this application. As shown in Fig. 2, the apparatus includes: a conversion unit 10, a statistics unit 20, a generation unit 30, a calculation unit 40 and a determination unit 50.

The conversion unit 10 is configured to convert the Chinese characters in the first Chinese text into pinyin to obtain a first pinyin text, and to convert the Chinese characters in the second Chinese text into pinyin to obtain a second pinyin text.

The statistics unit 20 is configured to count, according to the rules of Hanyu Pinyin, the number of each kind of pinyin unit in the first pinyin text and the number of each kind of pinyin unit in the second pinyin text.

The generation unit 30 is configured to generate a first feature vector from the numbers of each kind of pinyin unit in the first pinyin text, and a second feature vector from the numbers of each kind of pinyin unit in the second pinyin text.

The calculation unit 40 is configured to calculate the distance between the first feature vector and the second feature vector.

The determination unit 50 is configured to determine the similarity between the first Chinese text and the second Chinese text according to the distance, where the smaller the distance, the higher the similarity between the first Chinese text and the second Chinese text.

Preferably, the statistics unit is specifically configured to take each initial in the pinyin of a Chinese character as one pinyin unit and each final as one pinyin unit, and to count the number of each kind of initial and each kind of final in the first pinyin text and in the second pinyin text.

Preferably, the statistics unit is specifically configured to take each whole syllable in the pinyin of a Chinese character as one pinyin unit, each initial of pinyin that is not a whole syllable as one pinyin unit, and each final of pinyin that is not a whole syllable as one pinyin unit, and to count the number of each kind of initial, each kind of final and each kind of whole syllable in the first pinyin text and in the second pinyin text.

Preferably, the generation unit is specifically configured to insert the number of each kind of pinyin unit in the first pinyin text into the position of the corresponding dimension of a preset vector to obtain the first feature vector, and to insert the number of each kind of pinyin unit in the second pinyin text into the position of the corresponding dimension of the preset vector to obtain the second feature vector, where the preset vector is a vector having multiple dimensions in one-to-one correspondence with the kinds of pinyin unit arranged in a preset order.

Preferably, the calculation unit includes: a first calculation module, configured to calculate the difference between each pair of corresponding dimensions in the first feature vector and the second feature vector; and a second calculation module, configured to take the absolute value of each difference and add the absolute values to obtain the distance.
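One way to picture the five units as program units is a single class with one method per unit. This is only a sketch under assumptions: the class name `SimilarityApparatus`, the toy pinyin table, the sorted dimension order and the choice to return the raw 1-norm distance as the result are all illustrative and not taken from the source.

```python
from collections import Counter


class SimilarityApparatus:
    """Hypothetical software layout of units 10 to 50; names are illustrative."""

    def __init__(self, pinyin_table, initials):
        self.pinyin_table = pinyin_table
        # Longest initials first so that "zh" is tried before "z".
        self.initials = sorted(initials, key=len, reverse=True)

    def convert(self, text):  # conversion unit 10
        return [self.pinyin_table[char] for char in text]

    def count(self, syllables):  # statistics unit 20
        counts = Counter()
        for syllable in syllables:
            for initial in self.initials:
                if syllable.startswith(initial) and len(syllable) > len(initial):
                    counts.update([initial, syllable[len(initial):]])
                    break
            else:
                counts[syllable] += 1  # final-only syllable
        return counts

    def generate(self, counts_a, counts_b):  # generation unit 30
        order = sorted(set(counts_a) | set(counts_b))
        return [counts_a[u] for u in order], [counts_b[u] for u in order]

    def distance(self, v1, v2):  # calculation unit 40 (1-norm)
        return sum(abs(a - b) for a, b in zip(v1, v2))

    def determine(self, text_a, text_b):  # determination unit 50
        v1, v2 = self.generate(self.count(self.convert(text_a)),
                               self.count(self.convert(text_b)))
        return self.distance(v1, v2)  # smaller distance, higher similarity


# Toy table (an assumption) whose readings match the pinyin examples above.
table = {"高": "gao", "兴": "xing", "心": "xin"}
apparatus = SimilarityApparatus(table, ["g", "x"])
print(apparatus.determine("高高兴兴", "高高心心"))  # 4
```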

The apparatus for determining Chinese text similarity includes a processor and a memory. The conversion unit 10, statistics unit 20, generation unit 30, calculation unit 40, determination unit 50 and so on are stored in the memory as program units, and the processor executes the program units stored in the memory. The above may all be stored in the memory.

The processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the similarity of text content is determined by adjusting kernel parameters.

The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one storage chip.

This application also provides an embodiment of a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: converting the Chinese characters in a first Chinese text into pinyin to obtain a first pinyin text, and converting the Chinese characters in a second Chinese text into pinyin to obtain a second pinyin text; counting, according to the rules of Hanyu Pinyin, the number of each kind of pinyin unit in the first pinyin text and in the second pinyin text; generating a first feature vector from the counts for the first pinyin text and a second feature vector from the counts for the second pinyin text; calculating the distance between the first feature vector and the second feature vector; and determining the similarity between the first Chinese text and the second Chinese text according to the distance, where the smaller the distance, the higher the similarity between the first Chinese text and the second Chinese text.

The serial numbers of the above embodiments of the present application are for description only and do not represent the relative merits of the embodiments.

In the above embodiments of the present application, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, or an optical disc.

The above are only preferred embodiments of the present application. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present application, and these improvements and refinements shall also be regarded as falling within the scope of protection of the present application.

Claims

1. A method for determining Chinese text similarity, characterized by comprising:

converting the Chinese characters in a first Chinese text into pinyin to obtain a first pinyin text, and converting the Chinese characters in a second Chinese text into pinyin to obtain a second pinyin text;

counting, according to the rules of Hanyu Pinyin, the number of each kind of pinyin unit in the first pinyin text and the number of each kind of pinyin unit in the second pinyin text;

generating a first feature vector from the number of each kind of pinyin unit in the first pinyin text, and generating a second feature vector from the number of each kind of pinyin unit in the second pinyin text;

calculating the distance between the first feature vector and the second feature vector; and

determining the similarity between the first Chinese text and the second Chinese text according to the distance, wherein the smaller the distance, the higher the similarity between the first Chinese text and the second Chinese text.

2. The method according to claim 1, characterized in that counting, according to the rules of Hanyu Pinyin, the number of each kind of pinyin unit in the first pinyin text and the number of each kind of pinyin unit in the second pinyin text comprises:

taking one initial of a Chinese character as one pinyin unit and one final as one pinyin unit, and counting the number of each kind of initial and each kind of final in the first pinyin text and the number of each kind of initial and each kind of final in the second pinyin text.
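As a non-limiting illustration of counting with initials (声母) and finals (韵母) as the pinyin units, the following sketch splits toneless pinyin syllables and tallies each kind of unit. The names `INITIALS`, `split_syllable`, and `count_initials_and_finals` are hypothetical, and treating `y` and `w` as initials is an assumption of this sketch rather than something the claim specifies.

```python
from collections import Counter
from typing import List, Tuple

# Two-letter initials come first so that "zh" is matched before "z".
# Treating "y" and "w" as initials is an assumption of this sketch.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_syllable(syllable: str) -> Tuple[str, str]:
    """Split a toneless pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # zero-initial syllable such as "an"


def count_initials_and_finals(syllables: List[str]) -> Counter:
    """Count every kind of initial and every kind of final in a pinyin text."""
    counts = Counter()
    for syllable in syllables:
        initial, final = split_syllable(syllable)
        if initial:
            counts[initial] += 1
        counts[final] += 1
    return counts
```

For instance, `split_syllable("zhang")` gives `("zh", "ang")`, and a syllable without an initial, such as `"an"`, contributes only its final to the counts.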

3. The method according to claim 1, characterized in that counting, according to the rules of Hanyu Pinyin, the number of each kind of pinyin unit in the first pinyin text and the number of each kind of pinyin unit in the second pinyin text comprises:

taking one whole-read syllable of a Chinese character as one pinyin unit, taking one initial of a pinyin syllable that is not a whole-read syllable as one pinyin unit, and taking one final of a pinyin syllable that is not a whole-read syllable as one pinyin unit, and counting the number of each kind of initial, each kind of final, and each kind of whole-read syllable in the first pinyin text and the number of each kind of initial, each kind of final, and each kind of whole-read syllable in the second pinyin text.
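The variant with whole-read syllables (整体认读音节) can be illustrated in the same hedged manner. In the sketch below, a syllable from the conventional list of 16 whole-read syllables is kept as a single unit, and any other syllable is split into its initial and final. The names `WHOLE_SYLLABLES`, `INITIALS`, and `pinyin_units` are hypothetical, tones are ignored, and treating `y` and `w` as initials is again an assumption.

```python
from typing import List

# The 16 whole-read syllables conventionally taught for Hanyu Pinyin.
WHOLE_SYLLABLES = frozenset({
    "zhi", "chi", "shi", "ri", "zi", "ci", "si",
    "yi", "wu", "yu", "ye", "yue", "yuan", "yin", "yun", "ying",
})

# Two-letter initials first; "y" and "w" as initials is an assumption.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def pinyin_units(syllable: str) -> List[str]:
    """Return the pinyin units of one toneless syllable."""
    if syllable in WHOLE_SYLLABLES:
        return [syllable]  # a whole-read syllable counts as one unit
    for initial in INITIALS:
        if syllable.startswith(initial):
            return [initial, syllable[len(initial):]]
    return [syllable]  # zero-initial syllable: the final alone
```

Under this scheme `"shi"` yields the single unit `shi`, whereas `"shang"` yields the two units `sh` and `ang`.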

4. The method according to any one of claims 1 to 3, characterized in that generating a first feature vector from the number of each kind of pinyin unit in the first pinyin text, and generating a second feature vector from the number of each kind of pinyin unit in the second pinyin text comprises:

inserting the number of each kind of pinyin unit in the first pinyin text into the position of the corresponding dimension of a preset vector to obtain the first feature vector, and inserting the number of each kind of pinyin unit in the second pinyin text into the position of the corresponding dimension of the preset vector to obtain the second feature vector, wherein the preset vector is a vector having a plurality of dimensions that correspond one-to-one to the kinds of pinyin unit and are arranged in a preset order.
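As a non-limiting illustration of filling a preset vector, the sketch below fixes an order of unit kinds and places each count at the dimension reserved for that kind. `PRESET_ORDER` is a hypothetical and deliberately tiny order chosen for readability; an actual preset vector would have one dimension for every kind of pinyin unit, and the function name `to_feature_vector` is likewise an assumption.

```python
from typing import List, Mapping, Sequence

# Hypothetical, deliberately tiny preset order of unit kinds.
PRESET_ORDER = ("l", "m", "q", "w", "x", "a", "ei", "ian")


def to_feature_vector(counts: Mapping[str, int],
                      order: Sequence[str] = PRESET_ORDER) -> List[int]:
    """Place each unit's count at the dimension reserved for that unit."""
    # Kinds that do not occur in the text keep the value 0.
    return [counts.get(unit, 0) for unit in order]
```

Because both texts are mapped through the same order, the i-th dimension of each vector always refers to the same kind of unit, which is what makes a per-dimension comparison meaningful.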

5. The method according to claim 1, characterized in that calculating the distance between the first feature vector and the second feature vector comprises:

calculating the difference between the first feature vector and the second feature vector in each corresponding dimension; and

taking the absolute value of the difference in each corresponding dimension, and adding the absolute values together to obtain the distance.
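Written as a formula, the two steps above amount to the following, where the symbols are notation assumed here for illustration: a and b denote the first and second feature vectors, n the number of dimensions, and d the distance. The second line is a worked example with n = 3.

```latex
d(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^{n} \lvert a_i - b_i \rvert

% Example: a = (1, 0, 2), b = (0, 1, 2)
d = \lvert 1 - 0 \rvert + \lvert 0 - 1 \rvert + \lvert 2 - 2 \rvert = 1 + 1 + 0 = 2
```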

6. An apparatus for determining Chinese text similarity, characterized by comprising:

a conversion unit, configured to convert the Chinese characters in a first Chinese text into pinyin to obtain a first pinyin text, and to convert the Chinese characters in a second Chinese text into pinyin to obtain a second pinyin text;

a statistics unit, configured to count, according to the rules of Hanyu Pinyin, the number of each kind of pinyin unit in the first pinyin text and the number of each kind of pinyin unit in the second pinyin text;

a generation unit, configured to generate a first feature vector from the number of each kind of pinyin unit in the first pinyin text, and to generate a second feature vector from the number of each kind of pinyin unit in the second pinyin text;

a computing unit, configured to calculate the distance between the first feature vector and the second feature vector; and

a determining unit, configured to determine the similarity between the first Chinese text and the second Chinese text according to the distance, wherein the smaller the distance, the higher the similarity between the first Chinese text and the second Chinese text.

7. The apparatus according to claim 6, characterized in that the statistics unit is specifically configured to take one initial of a Chinese character as one pinyin unit and one final as one pinyin unit, and to count the number of each kind of initial and each kind of final in the first pinyin text and the number of each kind of initial and each kind of final in the second pinyin text.

8. The apparatus according to claim 6, characterized in that the statistics unit is specifically configured to take one whole-read syllable of a Chinese character as one pinyin unit, to take one initial of a pinyin syllable that is not a whole-read syllable as one pinyin unit, and to take one final of a pinyin syllable that is not a whole-read syllable as one pinyin unit, and to count the number of each kind of initial, each kind of final, and each kind of whole-read syllable in the first pinyin text and the number of each kind of initial, each kind of final, and each kind of whole-read syllable in the second pinyin text.

9. The apparatus according to any one of claims 6 to 8, characterized in that the generation unit is specifically configured to insert the number of each kind of pinyin unit in the first pinyin text into the position of the corresponding dimension of a preset vector to obtain the first feature vector, and to insert the number of each kind of pinyin unit in the second pinyin text into the position of the corresponding dimension of the preset vector to obtain the second feature vector, wherein the preset vector is a vector having a plurality of dimensions that correspond one-to-one to the kinds of pinyin unit and are arranged in a preset order.

10. The apparatus according to claim 6, characterized in that the computing unit comprises:

a first computing module, configured to calculate the difference between the first feature vector and the second feature vector in each corresponding dimension; and

a second computing module, configured to take the absolute value of the difference in each corresponding dimension and add the absolute values together to obtain the distance.