CN109377980A - Syllable splitting method and apparatus - Google Patents

Syllable splitting method and apparatus

Info

Publication number
CN109377980A
Authority
CN
China
Prior art keywords
syllable
base
array
sequence
check
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811009619.3A
Other languages
Chinese (zh)
Other versions
CN109377980B (en)
Inventor
马龙
倪博溢
雷画雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zhongan Information Technology Service Co ltd
Original Assignee
Zhongan Information Technology Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongan Information Technology Service Co Ltd
Priority to CN201811009619.3A
Publication of CN109377980A
Application granted
Publication of CN109377980B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/027 Syllables being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a syllable splitting method and apparatus, belonging to the technical field of natural language processing. The method includes: constructing a double-array Trie structure of a syllable table in advance; matching valid syllables from an input pinyin sequence based on the double-array Trie structure; splitting the pinyin sequence, based on the matched valid syllables, according to a strategy of equal syllable weight and syllable priority, to obtain multiple syllable splitting schemes; and storing the multiple syllable splitting schemes. The method provided by the embodiments of the invention achieves fast and reasonable syllable splitting without affecting the accuracy of the result.

Description

Syllable splitting method and apparatus
Technical field
The present invention relates to the technical field of natural language processing, and in particular to a syllable splitting method and apparatus.
Background
The syllable is the most natural structural unit of speech. More precisely, a syllable is the smallest phonetic structural unit formed by combining phonemes: it consists of one or several phonemes combined according to certain rules, and here each letter is treated as one phoneme. In Chinese, the pronunciation of one Chinese character is generally one syllable. For example, the syllables beginning with a are a, ai ("sorrow"), ao ("to endure"), an ("peace") and ang ("dirty"), five syllables in total. The commonly used pinyin syllable table contains roughly 405 toneless syllables.
At present, in the field of information input, keyboard input remains a mainstream input method on both mobile devices and PCs, and for the large population of Chinese Internet users, pinyin input is the most popular form of keyboard input. Because the input is a string of pinyin letters while the output is Chinese characters, the input string must be decoded correctly; this decoding process is called syllable splitting. Syllable splitting plays a vital role in Chinese pinyin input as a whole. Chinese pinyin has many syllables, input may be abbreviated (jianpin) or spelled in full (quanpin), and fuzzy pinyin must also be supported, while applications place strict requirements on decoding performance and accuracy. Syllable splitting is therefore a well-known difficult problem.
The prior art offers several syllable splitting schemes. (1) Forward maximum splitting: "forward" means from left to right, and "maximum" means that the longest matching syllable is retained in preference. For example, forward maximum splitting of the string "fangan" yields the syllables "fang'an" and outputs the Chinese word for "scheme" (方案), but it cannot produce "aversion" (反感), which the user may have intended. (2) Reverse maximum splitting: "reverse" means from right to left, and "maximum" again means that the longest matching syllable is retained in preference. For the string "fangan", reverse maximum splitting yields "fan'gan" and outputs "aversion", but it cannot produce "scheme", so the same problem arises. (3) Bidirectional maximum splitting: forward maximum splitting is performed first, then reverse maximum splitting, and the results of both are retained. For the string "fangan", bidirectional maximum splitting yields both "scheme" and "aversion", which appears to solve the problem, but exceptions remain. For an input such as "suiyueran", bidirectional maximum splitting yields only "sui'yue'ran" and cannot produce the desired result "sui'yu'er'an" (随遇而安, "to take things as they come"). Bidirectional maximum splitting therefore still produces unreasonable splits in some cases.
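The limitation described above can be reproduced with a minimal Python sketch. The syllable set below is a small hypothetical subset chosen only to cover "fangan", and the function names forward_max and reverse_max are illustrative; none of this is taken from the patent.

```python
# Hypothetical mini syllable table; a real table has roughly 405 entries.
SYLLABLES = {"a", "an", "ang", "fa", "fan", "fang", "ga", "gan"}

def forward_max(text, syllables):
    """Left-to-right greedy split that always keeps the longest match."""
    max_len = max(map(len, syllables))
    out, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            # size == 1 is a fallback so the loop always makes progress
            if text[i:i + size] in syllables or size == 1:
                out.append(text[i:i + size])
                i += size
                break
    return out

def reverse_max(text, syllables):
    """Right-to-left greedy split that always keeps the longest match."""
    max_len = max(map(len, syllables))
    out, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            if text[j - size:j] in syllables or size == 1:
                out.append(text[j - size:j])
                j -= size
                break
    return out[::-1]

print(forward_max("fangan", SYLLABLES))  # ['fang', 'an']
print(reverse_max("fangan", SYLLABLES))  # ['fan', 'gan']
```

Each direction commits to a single split, which is why neither alone can offer both readings of "fangan".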
In summary, Chinese pinyin has many syllables, words combine in highly variable ways, and there are no separators between words. How to split pinyin syllables quickly and reasonably without affecting the accuracy of the result is therefore a technical problem in urgent need of a solution.
Summary of the invention
To solve the problems in the prior art, the embodiments of the present invention provide a syllable splitting method and apparatus that split pinyin syllables quickly and reasonably without affecting the accuracy of the result.
The technical solutions provided by the embodiments of the present invention are as follows:
In a first aspect, a syllable splitting method is provided, the method including:
Constructing a double-array Trie structure of a syllable table in advance;
Matching valid syllables from an input pinyin sequence based on the double-array Trie structure;
Splitting the pinyin sequence, based on the matched valid syllables, according to a strategy of equal syllable weight and syllable priority, to obtain multiple syllable splitting schemes;
Storing the multiple syllable splitting schemes.
With reference to the first aspect, in a first possible implementation, the double-array Trie structure includes a base array and a check array, and constructing the double-array Trie structure of the syllable table in advance includes:
Constructing the Trie structure of the syllable table;
Encoding each of the distinct letters contained in the syllable table, to obtain the condition sequence code of each state transition condition of the Trie structure;
Initializing the base array and the check array, and computing the base array and the check array according to a preset calculation method, so as to construct the double-array Trie structure;
Wherein the preset calculation method is expressed as follows:
(1) $\mathrm{base}[s] = \min\{\, k \mid \mathrm{base}[s_1 + k] = \mathrm{check}[s_1 + k] = \mathrm{base}[s_2 + k] = \mathrm{check}[s_2 + k] = \dots = 0,\ k \ge 1 \,\}$;
(2) if $s$ is a terminable node (a node at which a syllable can end) but not a leaf node, then $\mathrm{base}[s] = -\mathrm{base}[s]$;
(3) if $s$ is a leaf node, then $\mathrm{base}[s] = -\infty$;
(4) $\mathrm{check}[t] = s$;
where $s_1, s_2, \dots, s_n$ are the condition sequence codes of the $n$ child nodes of state $s$, and $t$ is a child state of $s$.
With reference to the first possible implementation of the first aspect, in a second possible implementation, matching valid syllables from the input pinyin sequence based on the double-array Trie structure includes:
Determining the condition sequence code of each letter in the pinyin sequence;
Computing the base array and the check array corresponding to the pinyin sequence, based on the condition sequence code of each letter in the pinyin sequence and the preset calculation method, in the order in which the letters of the pinyin sequence were input;
Comparing the base array and the check array corresponding to the pinyin sequence with the base array and the check array in the double-array Trie structure;
When the comparison succeeds, determining the valid syllables contained in the pinyin sequence.
With reference to the first aspect, in a third possible implementation, splitting the pinyin sequence, based on the matched valid syllables, according to the strategy of equal syllable weight and syllable priority to obtain multiple syllable splitting schemes includes:
Denoting the pinyin sequence as $s_n$ and the subsequence of $s_n$ from position $i$ to position $j-1$ as $s[i, j]$, where $n$ is the length of $s_n$ and the index positions of $s_n$ run from 0 to $n$;
If $s[i, j]$ is a valid syllable, retaining $s[i, j]$;
If $s[i, j]$ is not a valid syllable and there is a valid syllable $s[k, m]$ with $0 \le k \le i$ and $j \le m \le n$, not retaining $s[i, j]$;
If $s[i, j]$ is not a valid syllable and there are no $k$ and $m$ satisfying $0 \le k \le i$ and $j \le m \le n$ such that $s[k, m]$ is a valid syllable, retaining $s[i, j]$.
With reference to the first aspect or any one of the first to third possible implementations of the first aspect, in a fourth possible implementation, storing the multiple syllable splitting schemes includes:
Storing the multiple syllable splitting schemes in a graph data structure.
In a second aspect, a syllable splitting apparatus is further provided, the apparatus including:
A construction module, configured to construct a double-array Trie structure of a syllable table in advance;
A matching module, configured to match valid syllables from an input pinyin sequence based on the double-array Trie structure;
A splitting module, configured to split the pinyin sequence, based on the matched valid syllables, according to a strategy of equal syllable weight and syllable priority, to obtain multiple syllable splitting schemes;
A storage module, configured to store the multiple syllable splitting schemes.
With reference to the second aspect, in a first possible implementation, the double-array Trie structure includes a base array and a check array, and the construction module includes:
A first construction submodule, configured to construct the Trie structure of the syllable table;
An encoding submodule, configured to encode each of the distinct letters contained in the syllable table, to obtain the condition sequence code of each state transition condition of the Trie structure;
A second construction submodule, configured to initialize the base array and the check array and to compute the base array and the check array according to a preset calculation method, so as to construct the double-array Trie structure;
Wherein the preset calculation method is expressed as follows:
(1) $\mathrm{base}[s] = \min\{\, k \mid \mathrm{base}[s_1 + k] = \mathrm{check}[s_1 + k] = \mathrm{base}[s_2 + k] = \mathrm{check}[s_2 + k] = \dots = 0,\ k \ge 1 \,\}$;
(2) if $s$ is a terminable node (a node at which a syllable can end) but not a leaf node, then $\mathrm{base}[s] = -\mathrm{base}[s]$;
(3) if $s$ is a leaf node, then $\mathrm{base}[s] = -\infty$;
(4) $\mathrm{check}[t] = s$;
where $s_1, s_2, \dots, s_n$ are the condition sequence codes of the $n$ child nodes of state $s$, and $t$ is a child state of $s$.
With reference to the first possible implementation of the second aspect, in a second possible implementation, the matching module includes:
A first determining submodule, configured to determine the condition sequence code of each letter in the pinyin sequence;
A computing submodule, configured to compute the base array and the check array corresponding to the pinyin sequence, based on the condition sequence code of each letter in the pinyin sequence and the preset calculation method, in the order in which the letters of the pinyin sequence were input;
A second determining submodule, configured to compare the base array and the check array corresponding to the pinyin sequence with the base array and the check array in the double-array Trie structure and, when the comparison succeeds, to determine the valid syllables contained in the pinyin sequence.
With reference to the second aspect, in a third possible implementation, the splitting module is specifically configured to:
Denote the pinyin sequence as $s_n$ and the subsequence of $s_n$ from position $i$ to position $j-1$ as $s[i, j]$, where $n$ is the length of $s_n$ and the index positions of $s_n$ run from 0 to $n$;
If $s[i, j]$ is a valid syllable, retain $s[i, j]$;
If $s[i, j]$ is not a valid syllable and there is a valid syllable $s[k, m]$ with $0 \le k \le i$ and $j \le m \le n$, not retain $s[i, j]$;
If $s[i, j]$ is not a valid syllable and there are no $k$ and $m$ satisfying $0 \le k \le i$ and $j \le m \le n$ such that $s[k, m]$ is a valid syllable, retain $s[i, j]$.
With reference to the second aspect or any one of the first to third possible implementations of the second aspect, in a fourth possible implementation, the storage module is specifically configured to:
Store the multiple syllable splitting schemes in a graph data structure.
In a third aspect, an electronic device is provided, including:
One or more processors;
A storage device, configured to store one or more programs;
When the one or more programs are executed by the one or more processors, the one or more processors implement the syllable splitting method according to the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the syllable splitting method according to the first aspect is implemented.
In the syllable splitting method and apparatus provided by the embodiments of the present invention, a double-array Trie structure of the syllable table is first constructed in advance; valid syllables are then matched from the input pinyin sequence based on the double-array Trie structure, and the pinyin sequence is split, based on the matched valid syllables, according to the strategy of equal syllable weight and syllable priority to obtain multiple syllable splitting schemes; finally, the multiple syllable splitting schemes are stored. Syllable splitting of a pinyin string is thus completed in a manner that is efficient in both time and space. This avoids the unusable or unreasonable results that may arise when a pinyin sequence is split by forward maximum splitting, reverse maximum splitting or bidirectional maximum splitting, and achieves fast and reasonable syllable splitting without affecting the accuracy of the result.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings required for the description of the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the syllable splitting method provided by Embodiment 1 of the present invention;
Fig. 2 is a partial schematic diagram of the Trie structure of the syllable table provided by Embodiment 1 of the present invention;
Fig. 3 is a schematic diagram of the double-array Trie structure provided by Embodiment 1 of the present invention;
Figs. 4a to 4c are schematic diagrams of storing multiple syllable splitting schemes in a graph data structure, as provided by Embodiment 1 of the present invention;
Fig. 5 is a block diagram of the syllable splitting apparatus provided by Embodiment 2 of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment 1
Fig. 1 is a flowchart of the syllable splitting method provided by Embodiment 1 of the present invention. Referring to Fig. 1, the method includes the following steps:
Step S1: construct a double-array Trie structure of the syllable table in advance.
Specifically, the process of step S1 may include:
Constructing the Trie structure of the syllable table;
Encoding each of the distinct letters contained in the syllable table, to obtain the condition sequence code of each state transition condition of the Trie structure;
Initializing the base array and the check array, and computing the base array and the check array according to a preset calculation method, so as to construct the double-array Trie structure.
Specifically, a Trie, also called a dictionary tree, is a kind of search tree. It is essentially a deterministic finite automaton in which each node represents one state of the automaton. In the Trie, the root node represents the initial state, leaf nodes and intermediate nodes can each represent a valid syllable, and the transition conditions between nodes are letters.
Fig. 2 is a partial schematic diagram of the Trie structure of the syllable table provided by Embodiment 1 of the present invention. In Fig. 2, the pinyin syllables beginning with a (a, ai, ao, an, ang) form part of the Trie, where a filled circle indicates a node that can be a valid syllable, the numbers 0, 1, 2, 3, 4 and 5 denote different states, and the pinyin letters a, i, o, n and g are the transition conditions between states.
The state transition function for Fig. 2 can be written as:
$g(s, c) = t$ (1)
where $s$ is the current state, $c$ is the transition condition and $t$ is the next state. For example, the transition from state 0 to state 1 in Fig. 2 is described by the state equation $g(0, \text{'a'}) = 1$.
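As an illustration of the state transition function g(s, c) = t, the partial Trie described for Fig. 2 can be sketched in Python as a small transition table. The assignment of states 2 to 5 to the syllables ai, ao, an and ang is inferred from the worked example later in this embodiment, and the names TRANSITIONS, TERMINAL and is_syllable are hypothetical.

```python
# States as described for Fig. 2: 0 is the root, 1 = "a",
# 2 = "ai", 3 = "ao", 4 = "an", 5 = "ang" (assumed numbering).
TRANSITIONS = {
    (0, "a"): 1,
    (1, "i"): 2,
    (1, "o"): 3,
    (1, "n"): 4,
    (4, "g"): 5,
}
TERMINAL = {1, 2, 3, 4, 5}  # states at which a valid syllable ends

def g(s, c):
    """State transition function g(s, c) = t; None if no such transition."""
    return TRANSITIONS.get((s, c))

def is_syllable(text):
    """Walk the automaton from the root and test for a terminal state."""
    state = 0
    for ch in text:
        state = g(state, ch)
        if state is None:
            return False
    return state in TERMINAL

print(g(0, "a"))           # 1
print(is_syllable("ang"))  # True
print(is_syllable("agn"))  # False
```

A dictionary keyed by (state, letter) is the simplest representation; the double array introduced next stores the same transitions in two flat integer arrays.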
A DAT (Double-Array Trie) represents a Trie with two one-dimensional integer arrays, the base array and the check array. The base array determines state transitions; the check array verifies that a transition is correct, that is, whether the target state exists. State transition equation (1) can therefore be expressed by the following two equations:
$t = \mathrm{base}[s] + c$ (2)
$\mathrm{check}[t] = s$ (3)
When the double array is constructed, it must be ensured that the array position to be occupied by the next state $t$ is not already in use. The distinct letters contained in the syllable table are therefore first encoded according to a preset encoding rule, to obtain the condition sequence code of each state transition condition of the Trie structure. One letter corresponds to one transition condition; if there are m letters in total, the state transition conditions can be encoded as 1, 2, ..., m. Since the syllable table contains 26 English letters, the preset encoding rule may encode the letters in forward, reverse or arbitrary order of the English alphabet; for example, in forward alphabetical order, a, b, c, ..., z are encoded as 1, 2, 3, ..., 26. Alternatively, the preset encoding rule may encode the letters in forward, reverse or arbitrary order of the syllable table; for example, in forward syllable-table order, a, o, e, ..., w are encoded as 1, 2, 3, ..., 26. The embodiments of the present invention do not limit the specific preset encoding rule.
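One of the encoding rules mentioned above, forward alphabetical order, can be sketched in a few lines of Python. The names CODE and encode are hypothetical, and any other fixed one-to-one mapping from letters to 1..m would serve equally well.

```python
import string

# Forward alphabetical order: a -> 1, b -> 2, ..., z -> 26.
CODE = {letter: i for i, letter in enumerate(string.ascii_lowercase, start=1)}

def encode(pinyin):
    """Map each letter of a pinyin string to its condition sequence code."""
    return [CODE[ch] for ch in pinyin]

print(encode("ang"))  # [1, 14, 7]
```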
In this embodiment, the base array and the check array are initialized. By convention, a state whose base and check values are both 0 is an empty state. The values of base[s] and check[t] are determined by the following preset calculation method:
(1) $\mathrm{base}[s] = \min\{\, k \mid \mathrm{base}[s_1 + k] = \mathrm{check}[s_1 + k] = \mathrm{base}[s_2 + k] = \mathrm{check}[s_2 + k] = \dots = 0,\ k \ge 1 \,\}$;
(2) if $s$ is a terminable node (a node at which a syllable can end) but not a leaf node, then $\mathrm{base}[s] = -\mathrm{base}[s]$;
(3) if $s$ is a leaf node, then $\mathrm{base}[s] = -\infty$;
(4) $\mathrm{check}[t] = s$;
where $s_1, s_2, \dots, s_n$ are the condition sequence codes of the $n$ child nodes of state $s$, and $t$ is a child state of $s$.
The process of constructing the double-array Trie structure in this embodiment is further described below, taking Fig. 2 as an example:
1) Let the condition sequence codes of the transition conditions a, i, o, n, g be 1, 2, 3, 4, 5, and initialize the base and check arrays to 0;
2) Node 0 accepts the transition condition a, whose condition sequence code is 1. By the preset calculation method:
$\mathrm{base}[0] = \min\{\, k \mid \mathrm{base}[1 + k] = \mathrm{check}[1 + k] = 0,\ k \ge 1 \,\} = 1$.
3) By the preset calculation method, $\mathrm{check}[\mathrm{base}[0] + 1] = 0$, that is, $\mathrm{check}[2] = 0$.
4) Node 1 accepts the transition conditions i, o, n, whose condition sequence codes are 2, 3, 4. By the preset calculation method:
$\mathrm{base}[2] = \min\{\, k \mid \mathrm{base}[2 + k] = \mathrm{check}[2 + k] = \mathrm{base}[3 + k] = \mathrm{check}[3 + k] = \mathrm{base}[4 + k] = \mathrm{check}[4 + k] = 0,\ k \ge 1 \,\} = 1$.
5) Since node 1 is a terminable node: $\mathrm{base}[2] = -\mathrm{base}[2] = -1$.
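6) Node 1 occupies array position 2, so by $\mathrm{check}[t] = s$: $\mathrm{check}[|\mathrm{base}[2]| + 2] = \mathrm{check}[|\mathrm{base}[2]| + 3] = \mathrm{check}[|\mathrm{base}[2]| + 4] = 2$, that is, $\mathrm{check}[3] = \mathrm{check}[4] = \mathrm{check}[5] = 2$.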
7) Since node 2 and node 3 are leaf nodes, $\mathrm{base}[3] = \mathrm{base}[4] = -\infty$.
8) Node 4 accepts the transition condition g, whose condition sequence code is 5, so that
$\mathrm{base}[5] = \min\{\, k \mid \mathrm{base}[k + 5] = \mathrm{check}[k + 5] = 0,\ k \ge 1 \,\} = 1$.
9) Since node 4 is a terminable node: $\mathrm{base}[5] = -\mathrm{base}[5] = -1$.
10) $\mathrm{check}[|\mathrm{base}[5]| + 5] = 5$, that is, $\mathrm{check}[6] = 5$.
11) Since node 5 is a leaf node: $\mathrm{base}[6] = -\infty$.
The resulting double array is shown in Fig. 3, in which the first column gives the state values of the double array, the second column the values of the base array, and the third column the values of the check array.
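The construction steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: check[t] stores the array position of the parent state, occupied slots are tracked in an explicit set (because check = 0 alone cannot distinguish an empty slot from a child of the root), the arrays have a fixed hypothetical size of 16, and -∞ is represented by -math.inf. The names build_trie and build_double_array are illustrative.

```python
import math
from collections import deque

CODES = {"a": 1, "i": 2, "o": 3, "n": 4, "g": 5}  # codes from step 1)
SYLLABLES = ["a", "ai", "ao", "an", "ang"]
LEAF = -math.inf  # stands in for the -infinity of rule (3)

def build_trie(words):
    """Plain nested-dict Trie; 'end' marks a terminable node."""
    root = {"children": {}, "end": False}
    for word in words:
        node = root
        for ch in word:
            node = node["children"].setdefault(ch, {"children": {}, "end": False})
        node["end"] = True
    return root

def build_double_array(words, codes, size=16):
    base, check = [0] * size, [0] * size
    used = {0}  # array positions already occupied by a state
    queue = deque([(build_trie(words), 0)])  # (trie node, array position)
    while queue:
        node, s = queue.popleft()
        kids = node["children"]
        if not kids:  # rule (3): leaf node
            base[s] = LEAF
            continue
        k = 1  # rule (1): smallest k >= 1 whose child slots are all free
        while any(codes[ch] + k in used for ch in kids):
            k += 1
        base[s] = -k if node["end"] else k  # rule (2): negate if terminable
        for ch, child in kids.items():
            t = k + codes[ch]
            check[t] = s  # rule (4)
            used.add(t)
            queue.append((child, t))
    return base, check

base, check = build_double_array(SYLLABLES, CODES)
print(base[:7])   # [1, 0, -1, -inf, -inf, -1, -inf]
print(check[:7])  # [0, 0, 0, 2, 2, 2, 5]
```

Position 1 stays empty because the root's only child lands at position base[0] + 1 = 2. A production version would grow the arrays on demand rather than rely on a fixed size.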
It should be noted that the embodiments of the present invention use the syllables beginning with a (a, ai, ao, an, ang) only as an example to illustrate how the double-array Trie structure is constructed and how syllables are matched against it. For pinyin syllable splitting in practice, the Trie formed from all Chinese pinyin syllables must be constructed into DAT form through the process of step S1, after which pinyin syllables are retrieved and matched in the subsequent step.
It should also be noted that if mixed Chinese-English input needs to be handled, an English word only has to be treated as a syllable and added to the syllable table. For orange, for example, it suffices to add the word orange to the syllable table; the Trie is then formed from all Chinese pinyin syllables together with the English words, and constructed into DAT form through the process of step S1.
In the embodiments of the present invention, constructing the double-array Trie structure of the syllable table in advance stores the syllables of the syllable table in a DAT, so that subsequent retrieval and matching can use the DAT data structure to match syllables efficiently.
Step S2: match valid syllables from the input pinyin sequence based on the double-array Trie structure.
Specifically, the process of step S2 may include:
Determining the condition sequence code of each letter in the pinyin sequence;
Computing the base array and the check array corresponding to the pinyin sequence, based on the condition sequence code of each letter in the pinyin sequence and the preset calculation method, in the order in which the letters of the pinyin sequence were input;
Comparing the base array and the check array corresponding to the pinyin sequence with the base array and the check array in the double-array Trie structure;
When the comparison succeeds, determining the valid syllables contained in the pinyin sequence.
In a specific implementation, the condition sequence code of each letter in the pinyin sequence is determined according to the preset encoding rule. Assuming that the condition sequence codes of a, i, o, n, g are 1, 2, 3, 4, 5, when "ang" is input, the condition sequence codes of its letters are 1, 4 and 5 respectively.
The base array and the check array corresponding to the pinyin sequence are initialized, and the condition sequence code of each letter in the pinyin sequence is substituted into the preset calculation method in the order in which the letters were input, to obtain the base array and the check array corresponding to the pinyin sequence. These are compared with the base array and the check array in the double-array Trie structure; when the comparison succeeds, the valid syllables contained in the pinyin sequence are determined.
The preset calculation method in step S2 is the same as that in step S1 and is not described again here.
As an example, the syllable matching process is further described using the base array and the check array shown in Fig. 3, for the input pinyin sequence "agn":
1) First, the state is initialized, and base[0] = 1 is obtained from the preset calculation formula. The transition condition read from the initial (empty) state is the first letter a of the pinyin sequence, whose condition sequence code is 1, so the next state is t = base[0] + 1 = 2, and check[t] = check[2] = 0, which confirms that a is in the double-array Trie.
2) The state is now 2 with base[2] = -1, which indicates that the node is not a leaf node, so the next state for the input g is computed. The condition sequence code of g is 5, so the next state is t = |base[2]| + 5 = 1 + 5 = 6, and check[t] = check[6] = 5 ≠ 2. Node a therefore cannot have a child node g, and it can be concluded that agn is not in the double-array Trie.
In this embodiment, the input pinyin sequence is matched against the double-array Trie structure of the syllable table. Each state transition takes constant time, so matching is efficient, and matching can also terminate early as soon as a transition fails.
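The matching walk-through for "agn" can be sketched in Python as follows. The BASE and CHECK arrays are hard-coded from the worked construction example under the assumption that check holds the array position of the parent state; they stand in for Fig. 3, which is not reproduced here. The function name match and the use of -math.inf for -∞ are illustrative.

```python
import math

LEAF = -math.inf  # stands in for -infinity (leaf node)
CODES = {"a": 1, "i": 2, "o": 3, "n": 4, "g": 5}
# Index:  0  1   2    3     4    5    6
BASE  = [1, 0, -1, LEAF, LEAF, -1, LEAF]
CHECK = [0, 0,  0,    2,    2,  2,    5]

def match(pinyin):
    """Return True if pinyin is a valid syllable stored in the double array."""
    s = 0
    for ch in pinyin:
        if ch not in CODES or BASE[s] == LEAF:
            return False  # unknown letter, or a leaf has no children
        t = abs(BASE[s]) + CODES[ch]  # equation (2), with the sign stripped
        # An empty slot has base = 0; equation (3) requires check[t] == s.
        if t >= len(BASE) or BASE[t] == 0 or CHECK[t] != s:
            return False  # early termination
        s = t
    return BASE[s] < 0  # negative base marks a terminable node or a leaf

print(match("ang"))  # True
print(match("agn"))  # False
```

For "agn" the loop stops at the second letter, where check[6] = 5 does not equal the current state 2, mirroring step 2) above.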
Step S3: split the pinyin sequence, based on the matched valid syllables, according to the strategy of equal syllable weight and syllable priority, to obtain multiple syllable splitting schemes.
In this embodiment, equal syllable weight means that all syllables in the Chinese syllable table are treated equally: the longest syllable is not preferred, and syllables are not filtered in any other way. For example:
xiao can be split as [xi'ao, xiao, xi'a'o, xia'o]. Since a, ao, xia, xi and o are all syllables, they are retained; the other syllables are not discarded merely because xiao is the longest syllable.
Syllable priority means that when a syllable is compared with a non-syllable, the syllable is retained in preference and the non-syllable is discarded. For example:
long is a valid Chinese pinyin syllable, so it is not split further into l'o'n'g.
xi is a valid Chinese pinyin syllable, so it is not split further into x'i.
For xim, xi is a valid Chinese pinyin syllable and is retained, but xim is not a valid syllable, so m is retained as an abbreviated spelling (jianpin), and the splitting result is xi'm.
Specifically, the process of step S3 may include:
Denoting the pinyin sequence as $s_n$ and the subsequence of $s_n$ from position $i$ to position $j-1$ as $s[i, j]$, where $n$ is the length of $s_n$ and the index positions of $s_n$ run from 0 to $n$;
If $s[i, j]$ is a valid syllable, retaining $s[i, j]$;
If $s[i, j]$ is not a valid syllable and there is a valid syllable $s[k, m]$ with $0 \le k \le i$ and $j \le m \le n$, not retaining $s[i, j]$;
If $s[i, j]$ is not a valid syllable and there are no $k$ and $m$ satisfying $0 \le k \le i$ and $j \le m \le n$ such that $s[k, m]$ is a valid syllable, retaining $s[i, j]$.
As an example, when the pinyin string "suiyueran" is split, sui, yu, er, an, yue and ran are all valid syllables and must be retained, so two splitting schemes are obtained: "sui, yu, er, an" and "sui, yue, ran".
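The splitting strategy can be sketched in Python as follows. This is a simplified sketch under stated assumptions: the syllable set contains only the syllables named in the examples above (a full pinyin table would admit further splits), non-syllable spans are restricted to single letters not covered by any valid syllable, and the function name split_all is hypothetical.

```python
# Hypothetical mini syllable table built from the examples in the text.
SYLLABLES = {"sui", "yu", "er", "an", "yue", "ran",
             "xi", "xia", "xiao", "a", "ao", "o"}

def split_all(text, syllables):
    """Enumerate every split made of retained spans s[i, j]."""
    n = len(text)
    edges = {i: [] for i in range(n)}  # start position -> end positions
    covered = [False] * n
    # Equal weight: every valid syllable s[i, j] is retained, whatever its length.
    for i in range(n):
        for j in range(i + 1, n + 1):
            if text[i:j] in syllables:
                edges[i].append(j)
                for p in range(i, j):
                    covered[p] = True
    # Syllable priority: a lone letter is kept only if no valid syllable covers it.
    for p in range(n):
        if not covered[p]:
            edges[p].append(p + 1)
    results = []
    def walk(i, path):
        if i == n:
            results.append(path)
            return
        for j in edges[i]:
            walk(j, path + [text[i:j]])
    walk(0, [])
    return results

print(split_all("suiyueran", SYLLABLES))
# [['sui', 'yu', 'er', 'an'], ['sui', 'yue', 'ran']]
print(split_all("xim", SYLLABLES))  # [['xi', 'm']]
```

With the same table, split_all("xiao", SYLLABLES) yields the four splits listed earlier for xiao.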
In this embodiment, the pinyin sequence is split, based on the matched valid syllables, according to the strategy of equal syllable weight and syllable priority, to obtain multiple syllable splitting schemes. This avoids the unusable or unreasonable results that may arise when a pinyin sequence is split by forward maximum splitting, reverse maximum splitting or bidirectional maximum splitting, and decodes the pinyin sequence into reasonable combinations of Chinese pinyin syllables.
Step S4: a variety of syllable splitting schemes are stored.
Specifically, the process of step S4 may include:
The multiple syllable splitting schemes are stored in a graph data structure.
Step S4 of this embodiment of the present invention is further described below with reference to Figs. 4a to 4c.
Because step S3 applies the syllable equal-weight and syllable-priority splitting strategy, a single pinyin string may yield a large number of syllable splitting schemes, and the longer the input pinyin sequence, the more schemes there are. Take a pinyin string of length 64 entered by a user as an example, where the input pinyin sequence is:
xiaoxiaoxiaoxiaoxiaoxiaoxiaoxiaoxiaoxiaoxiaoxiaoxiaoxiaoxiaoxiao, that is, 16 repetitions of xiao. Each xiao has 4 splitting schemes [xiao, xia'o, xi'ao, xi'a'o], so the 16 repetitions have 4^16 combined splitting schemes. Storing these schemes with a chain structure or by other means would occupy a very large amount of space. If a graph is used instead, a large number of nodes are shared, so only a small amount of storage is needed for all of these schemes, which greatly saves storage space. The splitting result stored as a graph is shown in Fig. 4a, from which it can be seen that a graph data structure with a limited number of nodes can express thousands of splitting schemes.
As another example, splitting the pinyin string "suiyueran" into syllables yields two splitting schemes, "sui, yu, er, an" and "sui, yue, ran"; these schemes stored as a graph are shown in Fig. 4b.
As a further example, for mixed Chinese and English input such as juzishiorange, three splitting schemes are obtained: "ju, zi, shi, o, ran, ge", "ju, zi, shi, o, rang, e", and "ju, zi, shi, orange"; these schemes stored as a graph are shown in Fig. 4c.
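The storage saving from shared nodes in the repeated-xiao example can be illustrated with a short Python sketch. It assumes one plausible graph layout that the description does not spell out: the nodes are the character positions 0 to n, and each retained syllable s[i:j] is an edge from node i to node j, so every splitting scheme is a path from node 0 to node n. The names `build_graph` and `count_schemes` and the toy table `XIAO_SYLLABLES` are hypothetical.

```python
from collections import defaultdict

XIAO_SYLLABLES = {"xiao", "xia", "xi", "ao", "a", "o"}  # toy table (assumption)

def build_graph(s, syllables):
    """Adjacency list: node i -> list of (j, syllable) for each syllable s[i:j]."""
    max_len = max(map(len, syllables))
    graph = defaultdict(list)
    for i in range(len(s)):
        for j in range(i + 1, min(len(s), i + max_len) + 1):
            if s[i:j] in syllables:
                graph[i].append((j, s[i:j]))
    return graph

def count_schemes(graph, n):
    """Count the distinct paths from node 0 to node n by dynamic programming."""
    paths = [0] * (n + 1)
    paths[n] = 1
    for i in range(n - 1, -1, -1):
        paths[i] = sum(paths[j] for j, _ in graph.get(i, []))
    return paths[0]

s = "xiao" * 16
graph = build_graph(s, XIAO_SYLLABLES)
edge_count = sum(len(edges) for edges in graph.values())
scheme_count = count_schemes(graph, len(s))
```

Under these assumptions the 64-letter input needs only 96 edges, while the number of paths, and therefore of splitting schemes, is 4^16 = 4,294,967,296.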
In the syllable splitting method provided by this embodiment of the present invention, a double-array Trie tree structure of a syllable table is first constructed in advance. Then, based on the double-array Trie tree structure, legal syllables are matched from the input pinyin sequence, and the pinyin sequence is split, based on the matched legal syllables, according to the syllable equal-weight and syllable-priority strategy, to obtain multiple syllable splitting schemes. Finally, the multiple syllable splitting schemes are stored. The syllable splitting of a pinyin string is thus completed in a manner that is efficient in both time and space. This avoids the unusable or unreasonable results that may arise when a pinyin sequence is split by forward maximum matching, reverse maximum matching, or bidirectional maximum matching, and allows syllable splitting to be performed quickly and reasonably without affecting the accuracy of the result.
Embodiment two
Fig. 5 is a block diagram of the syllable splitting device provided by Embodiment two of the present invention. As shown in Fig. 5, the splitting device includes:
Construction module 51, configured to construct a double-array Trie tree structure of a syllable table in advance;
Matching module 52, configured to match legal syllables from an input pinyin sequence based on the double-array Trie tree structure;
Splitting module 53, configured to split the pinyin sequence, based on the matched legal syllables, according to the syllable equal-weight and syllable-priority strategy, to obtain multiple syllable splitting schemes;
Storage module 54, configured to store the multiple syllable splitting schemes.
Further, the double-array Trie tree structure includes a base array and a check array, and construction module 51 includes:
First construction submodule 511, configured to construct a Trie tree structure of the syllable table;
Encoding submodule 512, configured to encode each of the distinct letters contained in the syllable table, to obtain the condition sequence code of each state transition condition of the Trie tree structure;
Second construction submodule 513, configured to initialize the base array and the check array and to compute the base array and the check array according to a preset calculation method, to construct the double-array Trie tree structure;
The preset calculation method is expressed as follows:
(1) base[s] = min{ k | base[s_1 + k] = check[s_1 + k] = base[s_2 + k] = check[s_2 + k] = … = 0 and k ≥ 1 };
(2) if s is a node at which a syllable can terminate but is not a leaf node, then base[s] = -base[s];
(3) if s is a leaf node, then base[s] = -∞;
(4) check[t] = s;
where s_1, s_2, …, s_n are the condition sequence codes corresponding to the n child nodes of state s.
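Rules (1) to (4) can be sketched in Python as follows. This is an illustration under assumptions that this section does not confirm: letters are coded a=1 through z=26, the root is state 1, the child reached from state s by a letter with code c is t = |base[s]| + c (the usual double-array convention), a slot is treated as free when its check entry is 0, and `LEAF` stands in for the -∞ marker. The names `code`, `build_double_array`, and `contains` are hypothetical.

```python
from collections import deque

LEAF = float("-inf")  # stand-in for the "-infinity" leaf marker (assumption)

def code(ch):
    """Letter code: a=1 ... z=26 (an assumed encoding)."""
    return ord(ch) - ord("a") + 1

def build_double_array(words):
    # Build an ordinary trie first: nested dicts, None marks the end of a word.
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node[None] = True

    base, check = [0, 0], [0, 0]      # index 0 unused; state 1 is the root
    queue = deque([(1, trie)])
    while queue:
        s, node = queue.popleft()
        children = sorted(ch for ch in node if ch is not None)
        if not children:              # rule (3): leaf node
            base[s] = LEAF
            continue
        k = 1                         # rule (1): smallest k >= 1 with all slots free
        while any(k + code(ch) < len(check) and check[k + code(ch)] != 0
                  for ch in children):
            k += 1
        grow = k + code(children[-1]) + 1 - len(base)
        if grow > 0:
            base.extend([0] * grow)
            check.extend([0] * grow)
        for ch in children:
            t = k + code(ch)
            check[t] = s              # rule (4): the child slot records its parent
            queue.append((t, node[ch]))
        base[s] = -k if None in node else k   # rule (2): negative marks a word end
    return base, check

def contains(base, check, word):
    """Return True if word was inserted as a complete entry."""
    s = 1
    for ch in word:
        if base[s] == LEAF or not "a" <= ch <= "z":
            return False
        t = abs(base[s]) + code(ch)
        if t >= len(check) or check[t] != s:
            return False
        s = t
    return bool(word) and base[s] < 0
```

For instance, after `base, check = build_double_array(["xi", "xia", "xiao", "a", "ao", "o"])`, the call `contains(base, check, "xia")` succeeds because state xi carries a negative base (a syllable ends there but the node has a child), while `contains(base, check, "xim")` fails at the check comparison.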
Further, matching module 52 includes:
First determining submodule 521, configured to determine the condition sequence code corresponding to each letter in the pinyin sequence;
Calculation submodule 522, configured to calculate the base array and check array corresponding to the pinyin sequence according to the condition sequence code corresponding to each letter in the pinyin sequence and the calculation formula;
Second determining submodule 523, configured to compare the base array and check array corresponding to the pinyin sequence with the base array and check array in the double-array Trie tree structure and, when the comparison succeeds, to determine the legal syllables contained in the pinyin sequence.
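The effect of the matching module, finding every legal syllable contained in the input, can be sketched with a prefix walk from each start position. For brevity this sketch substitutes a nested-dictionary trie for the base and check arrays; the traversal order is the same, but the array comparison described above is not reproduced. The names `build_trie` and `match_legal_syllables` and the toy syllable set in the usage lines are hypothetical.

```python
def build_trie(syllables):
    """Nested-dict trie; the None key marks the end of a legal syllable."""
    root = {}
    for syl in syllables:
        node = root
        for ch in syl:
            node = node.setdefault(ch, {})
        node[None] = True
    return root

def match_legal_syllables(pinyin, root):
    """Return (start, end, syllable) for every legal syllable found in pinyin."""
    matches = []
    for start in range(len(pinyin)):
        node = root
        for end in range(start, len(pinyin)):
            node = node.get(pinyin[end])
            if node is None:          # no transition on this letter: stop the walk
                break
            if None in node:          # a legal syllable ends here
                matches.append((start, end + 1, pinyin[start:end + 1]))
    return matches

toy_root = build_trie({"sui", "yu", "er", "an", "yue", "ran"})  # toy table (assumption)
found = match_legal_syllables("suiyueran", toy_root)
```

With this toy table, `found` contains the six syllables sui, yu, yue, er, ran, and an together with their positions, which are the inputs the splitting module needs.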
Further, splitting module 53 is specifically configured to:
Denote the pinyin sequence as s_n, and denote the subsequence of s_n from position i to position j-1 as s[i, j], where n is the length of s_n and the index positions of s_n run from 0 to n;
If s[i, j] is a legal syllable, retain s[i, j];
If s[i, j] is not a legal syllable and some s[k, m] (0 ≤ k ≤ i, j ≤ m ≤ n) is a legal syllable, do not retain s[i, j];
If s[i, j] is not a legal syllable and there are no k and m satisfying 0 ≤ k ≤ i, j ≤ m ≤ n such that s[k, m] is a legal syllable, retain s[i, j].
Further, storage module 54 is specifically configured to:
Store the multiple syllable splitting schemes in a graph data structure.
The syllable splitting device provided by this embodiment of the present invention belongs to the same inventive concept as the syllable splitting method provided by the embodiments of the present invention, can execute the syllable splitting method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to executing the syllable splitting method. For technical details not described in detail in this embodiment, refer to the syllable splitting method provided by the embodiments of the present invention; they are not repeated here.
In addition, another embodiment of the present invention further provides an electronic device, comprising:
One or more processors;
A storage device for storing one or more programs;
When the one or more programs are executed by the one or more processors, the one or more processors implement the syllable splitting method described in Embodiment one.
In addition, another embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the syllable splitting method described in the above embodiments is implemented.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical memory, and the like) that contain computer-usable program code.
The embodiments of the present invention are described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art may make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present invention.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from the spirit and scope of the present invention. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include these modifications and variations.

Claims (12)

1. A syllable splitting method, characterized in that the method comprises:
constructing a double-array Trie tree structure of a syllable table in advance;
matching legal syllables from an input pinyin sequence based on the double-array Trie tree structure;
splitting the pinyin sequence, based on the matched legal syllables, according to a syllable equal-weight and syllable-priority strategy, to obtain multiple syllable splitting schemes;
storing the multiple syllable splitting schemes.
2. The method according to claim 1, characterized in that the double-array Trie tree structure includes a base array and a check array, and constructing the double-array Trie tree structure of the syllable table in advance comprises:
constructing a Trie tree structure of the syllable table;
encoding each of the distinct letters contained in the syllable table, to obtain the condition sequence code of each state transition condition of the Trie tree structure;
initializing the base array and the check array, and computing the base array and the check array according to a preset calculation method, to construct the double-array Trie tree structure;
wherein the preset calculation method is expressed as follows:
(1) base[s] = min{ k | base[s_1 + k] = check[s_1 + k] = base[s_2 + k] = check[s_2 + k] = … = 0 and k ≥ 1 };
(2) if s is a node at which a syllable can terminate but is not a leaf node, then base[s] = -base[s];
(3) if s is a leaf node, then base[s] = -∞;
(4) check[t] = s;
where s_1, s_2, …, s_n are the condition sequence codes corresponding to the n child nodes of state s.
3. The method according to claim 2, characterized in that matching legal syllables from the input pinyin sequence based on the double-array Trie tree structure comprises:
determining the condition sequence code corresponding to each letter in the pinyin sequence;
calculating, in the letter input order of the pinyin sequence, the base array and check array corresponding to the pinyin sequence based on the condition sequence code corresponding to each letter in the pinyin sequence and the preset calculation method;
comparing the base array and check array corresponding to the pinyin sequence with the base array and check array in the double-array Trie tree structure;
when the comparison succeeds, determining the legal syllables contained in the pinyin sequence.
4. The method according to claim 1, characterized in that splitting the pinyin sequence, based on the matched legal syllables, according to the syllable equal-weight and syllable-priority strategy, to obtain multiple syllable splitting schemes comprises:
denoting the pinyin sequence as s_n, and denoting the subsequence of s_n from position i to position j-1 as s[i, j], where n is the length of s_n and the index positions of s_n run from 0 to n;
if s[i, j] is a legal syllable, retaining s[i, j];
if s[i, j] is not a legal syllable and some s[k, m] (0 ≤ k ≤ i, j ≤ m ≤ n) is a legal syllable, not retaining s[i, j];
if s[i, j] is not a legal syllable and there are no k and m satisfying 0 ≤ k ≤ i, j ≤ m ≤ n such that s[k, m] is a legal syllable, retaining s[i, j].
5. The method according to any one of claims 1 to 4, characterized in that storing the multiple syllable splitting schemes comprises:
storing the multiple syllable splitting schemes in a graph data structure.
6. A syllable splitting device, characterized in that the device comprises:
a construction module, configured to construct a double-array Trie tree structure of a syllable table in advance;
a matching module, configured to match legal syllables from an input pinyin sequence based on the double-array Trie tree structure;
a splitting module, configured to split the pinyin sequence, based on the matched legal syllables, according to a syllable equal-weight and syllable-priority strategy, to obtain multiple syllable splitting schemes;
a storage module, configured to store the multiple syllable splitting schemes.
7. The device according to claim 6, characterized in that the double-array Trie tree structure includes a base array and a check array, and the construction module comprises:
a first construction submodule, configured to construct a Trie tree structure of the syllable table;
an encoding submodule, configured to encode each of the distinct letters contained in the syllable table, to obtain the condition sequence code of each state transition condition of the Trie tree structure;
a second construction submodule, configured to initialize the base array and the check array and to compute the base array and the check array according to a preset calculation method, to construct the double-array Trie tree structure;
wherein the preset calculation method is expressed as follows:
(1) base[s] = min{ k | base[s_1 + k] = check[s_1 + k] = base[s_2 + k] = check[s_2 + k] = … = 0 and k ≥ 1 };
(2) if s is a node at which a syllable can terminate but is not a leaf node, then base[s] = -base[s];
(3) if s is a leaf node, then base[s] = -∞;
(4) check[t] = s;
where s_1, s_2, …, s_n are the condition sequence codes corresponding to the n child nodes of state s.
8. The device according to claim 7, characterized in that the matching module comprises:
a first determining submodule, configured to determine the condition sequence code corresponding to each letter in the pinyin sequence;
a calculation submodule, configured to calculate the base array and check array corresponding to the pinyin sequence according to the condition sequence code corresponding to each letter in the pinyin sequence and the calculation formula;
a second determining submodule, configured to compare the base array and check array corresponding to the pinyin sequence with the base array and check array in the double-array Trie tree structure and, when the comparison succeeds, to determine the legal syllables contained in the pinyin sequence.
9. The device according to claim 6, characterized in that the splitting module is specifically configured to:
denote the pinyin sequence as s_n, and denote the subsequence of s_n from position i to position j-1 as s[i, j], where n is the length of s_n and the index positions of s_n run from 0 to n;
if s[i, j] is a legal syllable, retain s[i, j];
if s[i, j] is not a legal syllable and some s[k, m] (0 ≤ k ≤ i, j ≤ m ≤ n) is a legal syllable, not retain s[i, j];
if s[i, j] is not a legal syllable and there are no k and m satisfying 0 ≤ k ≤ i, j ≤ m ≤ n such that s[k, m] is a legal syllable, retain s[i, j].
10. The device according to any one of claims 6 to 9, characterized in that the storage module is specifically configured to:
store the multiple syllable splitting schemes in a graph data structure.
11. An electronic device, characterized by comprising:
one or more processors;
a storage device for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the syllable splitting method according to any one of claims 1 to 5.
12. A computer-readable storage medium on which a computer program is stored, characterized in that the syllable splitting method according to any one of claims 1 to 5 is implemented when the program is executed by a processor.
CN201811009619.3A 2018-08-31 2018-08-31 Syllable segmentation method and device Active CN109377980B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811009619.3A CN109377980B (en) 2018-08-31 2018-08-31 Syllable segmentation method and device

Publications (2)

Publication Number Publication Date
CN109377980A true CN109377980A (en) 2019-02-22
CN109377980B CN109377980B (en) 2022-06-07

Family

ID=65404121

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811009619.3A Active CN109377980B (en) 2018-08-31 2018-08-31 Syllable segmentation method and device

Country Status (1)

Country Link
CN (1) CN109377980B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185356A (en) * 2020-09-29 2021-01-05 北京百度网讯科技有限公司 Speech recognition method, speech recognition device, electronic device and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0289097A (en) * 1988-09-26 1990-03-29 Sharp Corp Syllable pattern segmenting system
CN102651026A (en) * 2012-04-01 2012-08-29 百度在线网络技术(北京)有限公司 Method for optimizing word segmentation of search engine through precomputation and word segmenting device of search engine
CN102866783A (en) * 2011-07-06 2013-01-09 哈尔滨工业大学 Syncopation method of Chinese phonetic string and system thereof
CN102955770A (en) * 2011-08-17 2013-03-06 腾讯科技(深圳)有限公司 Method and system for automatic recognition of pinyin
CN103823814A (en) * 2012-11-19 2014-05-28 腾讯科技(深圳)有限公司 Information processing method and information processing device
CN104239289A (en) * 2013-06-24 2014-12-24 富士通株式会社 Syllabication method and syllabication device
CN104423621A (en) * 2013-08-22 2015-03-18 北京搜狗科技发展有限公司 Pinyin string processing method and device

Also Published As

Publication number Publication date
CN109377980B (en) 2022-06-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240306

Address after: Room 1179, W Zone, 11th Floor, Building 1, No. 158 Shuanglian Road, Qingpu District, Shanghai, 201702

Patentee after: Shanghai Zhongan Information Technology Service Co.,Ltd.

Country or region after: China

Address before: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Patentee before: ZHONGAN INFORMATION TECHNOLOGY SERVICE Co.,Ltd.

Country or region before: China

TR01 Transfer of patent right

Effective date of registration: 20240415

Address after: Room 1179, W Zone, 11th Floor, Building 1, No. 158 Shuanglian Road, Qingpu District, Shanghai, 201702

Patentee after: Shanghai Zhongan Information Technology Service Co.,Ltd.

Country or region after: China

Address before: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Patentee before: ZHONGAN INFORMATION TECHNOLOGY SERVICE Co.,Ltd.

Country or region before: China