CN109255106A - A kind of text handling method and terminal - Google Patents
A kind of text handling method and terminal
- Publication number
- CN109255106A (application number CN201710574188.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- sequence
- word
- text sequence
- pinyin
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The embodiment of the present invention provides a text processing method and terminal, relating to the field of computer technology. The method includes: segmenting the text to be processed into text sequences to obtain a text sequence set; converting each text sequence in the text sequence set into a pinyin sequence; converting each pinyin sequence into a new text sequence using a pre-trained hidden Markov model, generating a new text sequence set; and generating new text according to the new text sequence set. The embodiment of the present invention can effectively resolve the problem of erroneous homophones in text, avoid the time consumption, labor, and dictionary-completeness problems of existing methods that handle homophones in text with manually built custom dictionaries, and broaden the application scenarios and scope of homophone semantic disambiguation.
Description
Technical field
The present invention relates to the field of computer technology, and more particularly to a text processing method and terminal.
Background art
With the popularization of the Internet and the rapid development of information technology, the volume of text data on the network has grown explosively, and mining valuable information from massive text data has become one of the hot topics of current research. Unlike traditional text data, network text data is riddled with homophone errors owing to the randomness of network users' modes of expression and the uneven educational levels of users: a user habitually types a wrong word to express a word with the same pronunciation. For example, a user intends to input the word "culture" (Chinese "文化", pinyin "wenhua") but, through carelessness, enters a different word with the same pinyin. Since the semantic difference between the vast majority of homophones is very large, failing to semantically disambiguate these erroneous homophones in network text often renders all subsequent work on the text futile. Therefore, in view of these problems in network text data, applying a suitable semantic disambiguation technique in the data preprocessing stage is both necessary and valuable, and lays the foundation for text data analysis and mining.

In the prior art, the erroneous homophones appearing in network text are generally semantically disambiguated by building dictionaries of homophones and synonyms. The drawback of this approach is that such custom dictionaries take a great deal of manual time to construct, and at the same time the completeness of the dictionary severely constrains its practical application.
Summary of the invention
In view of this, the embodiment of the present invention provides a text processing method and terminal, intended to solve the above problems that a custom dictionary takes a great deal of manual time to construct and that the completeness of the dictionary severely constrains its practical application.

A first aspect of the embodiment of the present invention provides a text processing method, comprising:

segmenting the text to be processed into text sequences to obtain a text sequence set;

converting each text sequence in the text sequence set into a pinyin sequence;

converting each pinyin sequence into a new text sequence using a pre-trained hidden Markov model, generating a new text sequence set;

generating new text according to the new text sequence set.
A second aspect of the embodiment of the present invention provides a terminal, comprising:

a text segmentation unit, configured to segment the text to be processed into text sequences to obtain a text sequence set;

a pinyin sequence acquisition unit, configured to convert each text sequence in the text sequence set into a pinyin sequence;

a text sequence acquisition unit, configured to convert each pinyin sequence into a new text sequence using a pre-trained hidden Markov model, generating a new text sequence set;

a text generation unit, configured to generate new text according to the new text sequence set.
A third aspect of the embodiment of the present invention provides a terminal comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the steps of the text processing method described in the first aspect above.

A fourth aspect of the embodiment of the present invention provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the text processing method described in the first aspect above.
Compared with the prior art, the embodiment of the present invention has the following beneficial effects:

because the text is first segmented into text sequences, the text sequences are then converted into pinyin sequences, the pinyin sequences are converted by a hidden Markov model into new text sequences, and new text is synthesized from the new text sequences, the embodiment of the present invention can effectively resolve erroneous homophones in text, avoid the time consumption, labor, and dictionary-completeness problems of existing methods that handle homophones in text with manually built custom dictionaries, and broaden the application scenarios and scope of homophone semantic disambiguation.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for the embodiments or the description of the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without any creative effort.
Fig. 1 is a schematic flow diagram of a text processing method provided by Embodiment 1 of the present invention;
Fig. 2 is a schematic flow diagram of a text processing method provided by Embodiment 2 of the present invention;
Fig. 3 is a specific implementation flowchart of step S200 in the text processing method provided by Embodiment 2 of the present invention;
Fig. 4 is a schematic block diagram of a terminal provided by Embodiment 3 of the present invention;
Fig. 5 is a schematic block diagram of a terminal provided by Embodiment 4 of the present invention;
Fig. 6 is a schematic block diagram of a terminal provided by Embodiment 5 of the present invention.
Detailed description of the embodiments
In the following description, specific details such as particular system structures and techniques are set forth for purposes of illustration rather than limitation, in order to provide a thorough understanding of the embodiments of the present invention. However, it will be apparent to those skilled in the art that the present invention may also be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so that unnecessary detail does not obscure the description of the present invention.

To illustrate the technical solutions of the present invention, specific embodiments are described below.
Fig. 1 is a schematic flow diagram of a text processing method provided by Embodiment 1 of the present invention. Referring to Fig. 1, the text processing method provided by this embodiment may comprise the following steps:

Step S101: segment the text to be processed into text sequences to obtain a text sequence set.

In this embodiment, the text to be processed is text that a network user has entered by voice or by a pinyin input method. Step S101 specifically includes:

splitting the text to be processed into short text sequences at preset punctuation marks to obtain the text sequence set. The preset punctuation marks include, but are not limited to, commas, full stops, and the like.
Step S102: convert each text sequence in the text sequence set into a pinyin sequence.

In this embodiment, text and pinyin have a many-to-one mapping relationship, and the characters and words that make up text are finite in number, the exact count being determined by a character or word dictionary. Converting a text sequence into a pinyin sequence is therefore easy to implement and highly accurate.

Preferably, in this embodiment, the text sequence is converted into a pinyin sequence with the help of a text segmentation tool. Such tools include, but are not limited to, the open-source jieba segmentation tool and the ICTCLAS segmentation tool of the Chinese Academy of Sciences.
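The many-to-one character-to-pinyin mapping can be illustrated with a toy lookup table (a real system would use a full pinyin dictionary or a library; the four-character table here is an assumption for illustration only):

```python
# Toy character-to-pinyin table standing in for a real pinyin dictionary.
CHAR_TO_PINYIN = {"文": "wen", "蚊": "wen", "化": "hua", "我": "wo"}

def to_pinyin_sequence(text_sequence):
    """Map each character of a text sequence to its pinyin syllable."""
    return [CHAR_TO_PINYIN[ch] for ch in text_sequence]

# Both the intended word and its erroneous homophone map to the same pinyin,
# which is exactly why the reverse (pinyin -> text) step needs the HMM:
assert to_pinyin_sequence("文化") == to_pinyin_sequence("蚊化") == ["wen", "hua"]
```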
Step S103: convert each pinyin sequence into a new text sequence using a pre-trained hidden Markov model, generating a new text sequence set.

Here, the hidden Markov model is a statistical model that describes a Markov process with hidden, unknown parameters. The hidden state sequence of the model cannot be observed directly, but can be derived from the associated observation sequence. A hidden Markov model can usually be described by five elements, comprising two state sets and three probability matrices, as follows:

1) Hidden states S

The hidden states satisfy the Markov property and are the states actually implied by the Markov model. These states usually cannot be obtained by direct observation, e.g. S1, S2, S3.

2) Observable states O

The observable states are associated with the hidden states in the model and can be obtained by direct observation, e.g. O1, O2, O3. The number of observable states need not equal the number of hidden states.

3) Initial state probability matrix F

This matrix gives the probabilities of the hidden states at the initial time t=1. For example, if at t=1, P(S1)=p1, P(S2)=p2, and P(S3)=p3, then the initial state probability matrix is F=[p1, p2, p3].

4) Hidden state transition probability matrix M

This matrix describes the transition probabilities between the states of the HMM:

Mij = P(Sj | Si), 1 ≤ i, j ≤ X,

i.e. the probability that the state at time t+1 is Sj, given that the state at time t is Si.

5) Observation state transition probability matrix C

Let X denote the number of hidden states and Y the number of observable states; then

Cij = P(Oi | Sj), 1 ≤ i ≤ Y, 1 ≤ j ≤ X,

i.e. the probability of observing Oi at time t, given that the hidden state at time t is Sj.

In this embodiment, once the parameters of the hidden Markov model are known, the optimal text sequence can be obtained for any given pinyin sequence, thereby eliminating the erroneous homophones present in the text.
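The five elements above can be written down concretely for the pinyin-decoding task. The following minimal sketch uses tiny illustrative probabilities (the characters and numbers are assumptions, not values from the patent); real values would be estimated from a corpus as described in Embodiment 2:

```python
# Hidden states S: candidate characters; observable states O: pinyin syllables.
S = ["文", "蚊", "化"]
O = ["wen", "hua"]

# Initial state probability matrix F: probability each character starts a sequence.
F = {"文": 0.6, "蚊": 0.1, "化": 0.3}

# Hidden state transition probability matrix M: M[si][sj] = P(sj | si).
M = {"文": {"文": 0.05, "蚊": 0.05, "化": 0.90},
     "蚊": {"文": 0.40, "蚊": 0.40, "化": 0.20},
     "化": {"文": 0.50, "蚊": 0.10, "化": 0.40}}

# Observation state transition probability matrix C: C[s][o] = P(o | s).
# Each character emits exactly its own pinyin, so each row is deterministic.
C = {"文": {"wen": 1.0}, "蚊": {"wen": 1.0}, "化": {"hua": 1.0}}

# Sanity checks: every probability distribution must sum to 1.
assert abs(sum(F.values()) - 1.0) < 1e-9
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in M.values())
```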
Preferably, in this embodiment, step S103 specifically includes:

using the Viterbi algorithm to solve, for each pinyin sequence, the optimal hidden state sequence corresponding to that pinyin sequence under the hidden Markov model, the optimal hidden state sequence being the new text sequence corresponding to the pinyin sequence.

The Viterbi algorithm uses dynamic programming to solve for the optimal path: the global optimum is composed of local optima, and because the hidden state sequence has the Markov property, the correlation between local optima ensures that the sequence finally obtained is globally optimal. The problem is thus converted into finding the optimal solution at each step of the pinyin sequence.

For example, suppose the observation sequence A = [A1, A2, A3, ..., An] is known, the hidden state set is B = [B1, B2, B3, ..., Bm], the hidden state transition probability matrix is M, the observation state transition probability matrix is C, and the initial state matrix is F. The specific solution process is as follows:

1) When solving for the hidden state corresponding to A1 in the observation sequence, only the initial state probability matrix F and the observation state transition probability matrix C need be considered:

b1 = argmax over b in B of F(b) · C(A1 | b),

i.e. b1 is the hidden state that maximizes this product.

2) From the second element onward, the hidden state transition probability matrix M and the observation state transition probability matrix C must both be considered. For the current observation Ax and the previously selected state b(x-1):

P(Bi) = M(Bi | b(x-1)) · C(Ax | Bi), for each Bi in [B1, B2, ..., Bm],

and bx is the hidden state Bi that maximizes P(Bi).

3) Step 2 is repeated, iteratively solving for each subsequent hidden state in turn.

The final result [b1, b2, ..., bn] is the optimal hidden state sequence for the given observation sequence A = [A1, A2, A3, ..., An] under the model parameters (hidden state transition probability matrix M, observation state transition probability matrix C, and initial state matrix F).
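The decoding above can be sketched in code. Note that the stepwise description greedily fixes one state per step; the sketch below implements the standard dynamic-programming Viterbi decoder, which keeps a best score for every candidate state at each step before selecting the best final path (the toy states and probabilities are assumptions for illustration):

```python
def viterbi(obs, states, F, M, C):
    """Return the most probable hidden state sequence for observations obs."""
    # delta[s] = best score of any path ending in state s; path[s] = that path.
    delta = {s: F.get(s, 0.0) * C.get(s, {}).get(obs[0], 0.0) for s in states}
    path = {s: [s] for s in states}
    for o in obs[1:]:
        new_delta, new_path = {}, {}
        for s in states:
            # Pick the predecessor that maximizes (score so far) * transition.
            prev = max(states, key=lambda p: delta[p] * M.get(p, {}).get(s, 0.0))
            new_delta[s] = (delta[prev] * M.get(prev, {}).get(s, 0.0)
                            * C.get(s, {}).get(o, 0.0))
            new_path[s] = path[prev] + [s]
        delta, path = new_delta, new_path
    best = max(states, key=lambda s: delta[s])
    return path[best]

# Toy model in which the pinyin "wen hua" should decode to 文化, not 蚊化:
states = ["文", "蚊", "化"]
F = {"文": 0.6, "蚊": 0.1, "化": 0.3}
M = {"文": {"化": 0.9}, "蚊": {"化": 0.2}, "化": {}}
C = {"文": {"wen": 1.0}, "蚊": {"wen": 1.0}, "化": {"hua": 1.0}}
print("".join(viterbi(["wen", "hua"], states, F, M, C)))  # -> 文化
```

The transition probability P(化 | 文) = 0.9 versus P(化 | 蚊) = 0.2 is what lets the decoder prefer the correct character 文 over its homophone 蚊.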
Therefore, given the parameters of the hidden Markov model, the optimal text sequence can be obtained for any given pinyin sequence, achieving semantic disambiguation of network text. Of course, in the process of converting a pinyin sequence into a new text sequence, the model parameters are extremely important: for the same pinyin sequence, the optimal text sequences obtained with different model parameters may differ. In that case, one need only train a corresponding set of model parameters on text data from each field to obtain the optimal text sequence for that field, which also gives the method more application scenarios and a wider scope.
Step S104: generate new text according to the new text sequence set.

In this embodiment, after the new text sequence set is obtained, a preset text synthesis tool recombines the text sequences in the new text sequence set into new text, which then replaces the original text. In this way, the erroneous homophones present in the original text are eliminated, laying a solid foundation for subsequent text data analysis and mining.
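The patent does not specify the text synthesis tool or how punctuation is restored. One minimal approach, assuming the punctuation marks removed in step S101 are recorded at splitting time, is:

```python
import re

def split_keep_punct(text, puncts="，。！？"):
    """Split text at punctuation while remembering each separator."""
    pieces = re.split("([" + puncts + "])", text)
    seqs = [s for s in pieces[0::2] if s]  # text sequences
    seps = pieces[1::2]                    # punctuation between/after them
    return seqs, seps

def rejoin(new_seqs, seps):
    """Recombine corrected sequences with the original punctuation."""
    out = []
    for i, s in enumerate(new_seqs):
        out.append(s)
        if i < len(seps):
            out.append(seps[i])
    return "".join(out)

seqs, seps = split_keep_punct("蚊化很重要，我们要学习。")
# ...each sequence would be corrected via pinyin + HMM decoding here...
corrected = ["文化很重要", "我们要学习"]
print(rejoin(corrected, seps))  # -> 文化很重要，我们要学习。
```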
As can be seen from the above, this embodiment converts the input text along the path "text -> pinyin sequence -> text", exploiting the many-to-one mapping between characters and pinyin and introducing a hidden Markov model into the "pinyin sequence -> text" conversion. It thereby effectively resolves the homophone problem in network text data; moreover, since the model parameters in this process can be trained automatically, it avoids the dictionary-completeness and manual-time problems of existing manual-dictionary methods and can effectively improve the working efficiency of the homophone semantic disambiguation stage for network text.
Fig. 2 is a schematic flow diagram of a text processing method provided by Embodiment 2 of the present invention. Referring to Fig. 2, relative to the previous embodiment, the text processing method provided by this embodiment further includes, before segmenting the text to be processed into text sequences to obtain the text sequence set:

Step S200: train the hidden Markov model on a preset text corpus and a pinyin dictionary. Referring to Fig. 3, step S200 specifically includes:
Step S301: determine the observable states O and hidden states S of the hidden Markov model, where the observable states O are the set of all pinyin syllables in the text corpus and the hidden states S are the set of all characters and words in the text corpus.

Step S302: split the preset text corpus at specific punctuation marks into short text sequences T, forming a text sequence set D.

Step S303: check in a loop whether the text sequence set D is empty; if not empty, go to step S304; if empty, go to step S311.

Step S304: take the text sequences T out of the text sequence set D one by one for further processing.

Step S305: perform word segmentation on the text sequence T to form a word set U.

Step S306: check whether the word set U is empty; if not empty, go to step S307; if empty, return to step S303.

Step S307: read the words I in the word set U one by one.

Step S308: judge whether the word I is the first word of the text sequence; if it is the first word, go to step S309; if not, go to step S310.

Step S309: add the word I to the first-word set R.

Step S310: form a word pair (I, K) from the word I and its preceding word K in the text sequence, and add the pair (I, K) to the pair set N.

Step S311: count the number of times each word I occurs in the first-word set R and the number of times it occurs in the text corpus; from these counts, compute the probability that the word I appears at the start of a text sequence T, obtaining the initial state probability matrix F of the hidden Markov model.

Step S312: count the number of times each pair (I, K) and each word I occur in the text corpus; from these counts, compute the probability that word I appears after word K, obtaining the hidden state transition probability matrix M of the hidden Markov model.

Step S313: obtain from the pinyin dictionary the characters corresponding to each pinyin syllable, forming a "character-pinyin" relation matrix and thereby the observation state transition probability matrix C of the hidden Markov model.
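Steps S309 through S312 amount to simple counting over the segmented corpus. A minimal sketch follows (the two-sentence toy corpus and the pre-segmented word lists are assumptions; a real implementation would run a segmentation tool such as jieba over a large corpus):

```python
from collections import Counter

# Toy corpus: each inner list is one pre-segmented text sequence T.
corpus = [["文化", "很", "重要"], ["文化", "源远流长"]]

first_words = Counter(seq[0] for seq in corpus)            # first-word set R
word_counts = Counter(w for seq in corpus for w in seq)    # corpus occurrences
pair_counts = Counter((seq[i], seq[i - 1])                 # word pairs (I, K)
                      for seq in corpus for i in range(1, len(seq)))

# Initial state probabilities F: P(word starts a sequence).
F = {w: first_words[w] / len(corpus) for w in word_counts}
# Transition probabilities M: P(I follows K) = count(I, K) / count(K).
M = {(i, k): c / word_counts[k] for (i, k), c in pair_counts.items()}

assert F["文化"] == 1.0               # "文化" starts both sequences
assert M[("很", "文化")] == 0.5       # "很" follows "文化" in 1 of its 2 uses
```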
It should be noted that in the process of converting pinyin sequences into new text sequences, the parameters of the hidden Markov model are extremely important: for the same pinyin sequence, the optimal text sequences obtained with different hidden Markov model parameters may differ. In that case, one need only train a corresponding hidden Markov model in advance on text data from each field to obtain the optimal text sequence for that field, which also gives the method more application scenarios and a wider scope.

In addition, since the implementations of steps S201 to S204 in this embodiment are respectively identical to those of steps S101 to S104 in the previous embodiment, they are not described again here.
It can thus be seen that the text processing method provided by this embodiment likewise converts input text along the path "text -> pinyin sequence -> text", exploiting the many-to-one mapping between characters and pinyin and introducing a hidden Markov model into the "pinyin sequence -> text" conversion, thereby effectively resolving the homophone problem in network text data. Moreover, since the model parameters in this process can be trained automatically, it avoids the dictionary-completeness and manual-time problems of existing manual-dictionary methods and can effectively improve the working efficiency of the homophone semantic disambiguation stage for network text.
Fig. 4 is a schematic block diagram of a terminal provided by Embodiment 3 of the present invention. For ease of description, only the parts relevant to this embodiment are shown.

Referring to Fig. 4, the terminal 4 provided by this embodiment comprises:

a text segmentation unit 41, configured to segment the text to be processed into text sequences to obtain a text sequence set;

a pinyin sequence acquisition unit 42, configured to convert each text sequence in the text sequence set into a pinyin sequence;

a text sequence acquisition unit 43, configured to convert each pinyin sequence into a new text sequence using a pre-trained hidden Markov model, generating a new text sequence set;

a text generation unit 44, configured to generate new text according to the new text sequence set.

Optionally, the text sequence acquisition unit 43 is specifically configured to:

use the Viterbi algorithm to solve, for each pinyin sequence, the optimal hidden state sequence corresponding to that pinyin sequence under the hidden Markov model, the optimal hidden state sequence being the new text sequence corresponding to the pinyin sequence.
Optionally, referring to Fig. 5, in another embodiment the terminal 4 further comprises:

a model training unit 45, configured to train the hidden Markov model on a preset text corpus and a pinyin dictionary.

Optionally, the model training unit 45 is specifically configured to:

determine the observable states O and hidden states S of the hidden Markov model, the observable states O being the set of all pinyin syllables in the text corpus and the hidden states S being the set of all characters and words in the text corpus;

split the preset text corpus at specific punctuation marks into short text sequences T, forming a text sequence set D;

check in a loop whether the text sequence set D is empty;

if not empty, take the text sequences T out of the text sequence set D one by one for further processing;

perform word segmentation on the text sequence T to form a word set U;

check whether the word set U is empty;

if not empty, read the words I in the word set U one by one;

judge whether the word I is the first word of the text sequence;

if it is the first word, add the word I to the first-word set R;

if not, form a word pair (I, K) from the word I and its preceding word K in the text sequence, and add the pair (I, K) to the pair set N;

count the number of times each word I occurs in the first-word set R and the number of times it occurs in the text corpus; from these counts, compute the probability that the word I appears at the start of a text sequence T, obtaining the initial state probability matrix F of the hidden Markov model;

count the number of times each pair (I, K) and each word I occur in the text corpus; from these counts, compute the probability that word I appears after word K, obtaining the hidden state transition probability matrix M of the hidden Markov model;

obtain from the pinyin dictionary the characters corresponding to each pinyin syllable, forming a "character-pinyin" relation matrix and thereby the observation state transition probability matrix C of the hidden Markov model.
It should be noted that since the units of the terminal provided by the embodiment of the present invention are based on the same conception as the method embodiments of the present invention, their technical effects are the same as those of the method embodiments; for details, refer to the description in the method embodiments of the present invention, which is not repeated here.

It can therefore be seen that the terminal provided by the embodiment of the present invention likewise converts input text along the path "text -> pinyin sequence -> text", exploiting the many-to-one mapping between characters and pinyin and introducing a hidden Markov model into the "pinyin sequence -> text" conversion, thereby effectively resolving the homophone problem in network text data. Moreover, since the model parameters in this process can be trained automatically, it avoids the dictionary-completeness and manual-time problems of existing manual-dictionary methods and can effectively improve the working efficiency of the homophone semantic disambiguation stage for network text.
Fig. 6 is a schematic diagram of a terminal provided by Embodiment 5 of the present invention. As shown in Fig. 6, the terminal 6 of this embodiment comprises: a processor 60, a memory 61, and a computer program 62 stored in the memory 61 and runnable on the processor 60. When executing the computer program 62, the processor 60 implements the steps in each of the above method embodiments, for example steps S101 to S104 shown in Fig. 1. Alternatively, when executing the computer program 62, the processor 60 implements the functions of the modules/units in each of the above device embodiments, for example the functions of units 41 to 44 shown in Fig. 4.

Illustratively, the computer program 62 may be divided into one or more modules/units, which are stored in the memory 61 and executed by the processor 60 to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, the instruction segments being used to describe the execution of the computer program 62 in the terminal 6. For example, the computer program 62 may be divided into a text segmentation unit 41, a pinyin sequence acquisition unit 42, a text sequence acquisition unit 43, and a text generation unit 44, the specific functions of which are as follows:

the text segmentation unit 41 is configured to segment the text to be processed into text sequences to obtain a text sequence set;

the pinyin sequence acquisition unit 42 is configured to convert each text sequence in the text sequence set into a pinyin sequence;

the text sequence acquisition unit 43 is configured to convert each pinyin sequence into a new text sequence using a pre-trained hidden Markov model, generating a new text sequence set;

the text generation unit 44 is configured to generate new text according to the new text sequence set.

The terminal 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The terminal 6 may include, but is not limited to, the processor 60 and the memory 61. Those skilled in the art will understand that Fig. 6 is only an example of the terminal 6 and does not limit the terminal 6, which may include more or fewer components than illustrated, a combination of certain components, or different components; for example, the terminal may also include input/output devices, network access devices, buses, and so on.
The processor 60 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 61 may be an internal storage unit of the terminal 6, such as a hard disk or memory of the terminal 6. The memory 61 may also be an external storage device of the terminal 6, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the terminal 6. Further, the memory 61 may include both an internal storage unit and an external storage device of the terminal 6. The memory 61 is used to store the computer program and the other programs and data needed by the terminal. The memory 61 may also be used to temporarily store data that has been output or will be output.
It will be clear to those skilled in the art that, for convenience and brevity of description, the division into the above functional units and modules is merely illustrative. In practical applications, the above functions may be assigned to different functional units and modules as needed; that is, the internal structure of the device may be divided into different functional units or modules to complete all or part of the functions described above. The functional units in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing them from each other and are not intended to limit the protection scope of the present application. For the specific working processes of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the above embodiments, each embodiment is described with its own emphasis; for parts not detailed in one embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed terminal and method may be implemented in other ways. For example, the terminal device embodiments described above are merely illustrative; the division into modules or units is only a logical functional division, and there may be other division manners in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the present invention may implement all or part of the processes in the methods of the above embodiments by instructing the relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, may implement the steps of each of the above method embodiments. The computer program includes computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately added to or removed from according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
It should be understood that the serial numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or replace some of the technical features with equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention.
Claims (10)
1. A text processing method, comprising:
segmenting text to be processed into text sequences, to obtain a text sequence set;
converting each text sequence in the text sequence set into a pinyin sequence;
converting each pinyin sequence into a new text sequence using a pre-trained hidden Markov model, to generate a new text sequence set; and
generating new text according to the new text sequence set.
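A minimal sketch of the claimed pipeline. The character-to-pinyin table and the `hmm_decode` stand-in below are illustrative assumptions, not the patent's implementation: a real system would use a full pinyin dictionary and a trained hidden Markov model in place of the toy lookup tables.

```python
import re

# Toy character-to-pinyin table (a real system would use a full pinyin dictionary).
PINYIN = {"他": "ta", "在": "zai", "再": "zai", "见": "jian"}

def split_text(text):
    """Step 1: split the text at punctuation into a set of text sequences."""
    return [s for s in re.split(r"[,。,.!?!?]", text) if s]

def to_pinyin(seq):
    """Step 2: convert a text sequence into its pinyin sequence."""
    return [PINYIN[ch] for ch in seq]

def hmm_decode(pinyin_seq):
    """Step 3 (placeholder): a trained HMM would pick the most likely
    character for each syllable by context via Viterbi decoding; here a
    fixed lookup stands in for that model."""
    best = {"ta": "他", "zai": "在", "jian": "见"}
    return "".join(best[p] for p in pinyin_seq)

def process(text):
    """Steps 1-4: segment, convert to pinyin, decode, and rejoin."""
    sequences = split_text(text)
    return ",".join(hmm_decode(to_pinyin(s)) for s in sequences)

# "他再" contains the homophone error 再 (zai); decoding through pinyin
# recovers the intended 在.
print(process("他再"))  # → 他在
```

Note how the round trip through pinyin erases the original (possibly wrong) character choice, so the decoder is free to pick the homophone the model considers most likely.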
2. The text processing method according to claim 1, wherein, before segmenting the text to be processed into text sequences to obtain the text sequence set, the method further comprises:
training the hidden Markov model on a preset text corpus and a pinyin dictionary.
3. The text processing method according to claim 2, wherein training the hidden Markov model on the preset text corpus and the pinyin dictionary comprises:
determining the observable states O and the hidden states S of the hidden Markov model, wherein the observable states O are the set of all pinyin syllables in the text corpus, and the hidden states S are the set of all characters and words in the text corpus;
splitting the preset text corpus into text sequences T at specified punctuation marks, to form a text sequence set D;
checking in a loop whether the text sequence set D is empty;
if it is not empty, taking the text sequences T out of the text sequence set D one by one for further processing;
performing word segmentation on each text sequence T, to form a text word set U;
checking whether the text word set U is empty;
if it is not empty, reading the words I in the text word set U one by one;
judging whether the word I is the first word of the text sequence;
if it is the first word, adding the word I to a head-word set R;
if it is not the first word, forming a word pair (I, K) from the word I and its preceding word K in the text sequence, and adding the word pair (I, K) to a word-pair set N;
counting the number of times the word I occurs in the head-word set R and the number of times it occurs in the text corpus, and calculating from these counts the probability that the word I appears at the initial position of a text sequence T, to obtain the initial state probability matrix F of the hidden Markov model;
counting the number of times the word pair (I, K) and the word I occur in the text corpus, and calculating from these counts the probability that the word I follows the word K, to obtain the hidden state transition probability matrix M of the hidden Markov model; and
obtaining the characters corresponding to each pinyin syllable from the pinyin dictionary to form a "character-pinyin" relation matrix, to obtain the observation state transition probability matrix C of the hidden Markov model.
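The counting procedure described above can be sketched as follows. The three-sequence corpus, the pre-segmented word lists, and the one-pronunciation-per-word pinyin table are toy assumptions; the three dictionaries built correspond to the initial state probabilities F, the hidden transition probabilities M, and the "character-pinyin" observation matrix C.

```python
from collections import Counter, defaultdict

# Toy corpus already split at punctuation into text sequences (set D),
# with each sequence pre-segmented into words (set U).
corpus = [["他", "在", "家"], ["他", "再", "来"], ["在", "家"]]
pinyin_dict = {"他": "ta", "在": "zai", "再": "zai", "家": "jia", "来": "lai"}

head = Counter()  # head-word set R: how often each word opens a sequence
word = Counter()  # occurrences of each word I anywhere in the corpus
pair = Counter()  # word pairs (K, I): I immediately follows K

for seq in corpus:
    head[seq[0]] += 1
    for i, w in enumerate(seq):
        word[w] += 1
        if i > 0:
            pair[(seq[i - 1], w)] += 1

# Initial state probability matrix F: P(a sequence starts with I).
F = {w: head[w] / len(corpus) for w in word}

# Hidden state transition probability matrix M: P(I follows K),
# estimated as count(K, I) / count(K).
M = {(k, i): n / word[k] for (k, i), n in pair.items()}

# Observation matrix C: the "character-pinyin" relation from the dictionary.
C = defaultdict(dict)
for w, py in pinyin_dict.items():
    C[w][py] = 1.0  # one pronunciation per word in this toy table

print(F["他"])          # 2/3: "他" opens two of the three sequences
print(M[("他", "在")])  # 1/2: "在" follows "他" in one of "他"'s two occurrences
```

Normalizing the pair counts by the total count of K is a simplification; a production system would also smooth these maximum-likelihood estimates so unseen transitions are not assigned zero probability.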
4. The text processing method according to claim 1, wherein converting each pinyin sequence into a new text sequence using the pre-trained hidden Markov model comprises:
solving, with the Viterbi algorithm and according to the hidden Markov model, the optimal hidden state sequence corresponding to each pinyin sequence, the optimal hidden state sequence being the new text sequence corresponding to that pinyin sequence.
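The Viterbi decoding step can be sketched as below. The probability tables are hypothetical toy values (in a real system they come from the training step); the algorithm itself is the standard dynamic program, run in log space to avoid floating-point underflow on long sequences.

```python
import math

def viterbi(pinyin_seq, states, F, M, C):
    """Find the most likely hidden word sequence for an observed pinyin
    sequence, given initial (F), transition (M), and emission (C)
    probabilities. Missing table entries count as probability zero."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    # Initialization: start in each state and emit the first syllable.
    V = [{s: lg(F.get(s, 0)) + lg(C.get(s, {}).get(pinyin_seq[0], 0))
          for s in states}]
    path = {s: [s] for s in states}

    # Recursion: extend the best path into each state at every step.
    for py in pinyin_seq[1:]:
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max(
                (V[-2][k] + lg(M.get((k, s), 0)) + lg(C.get(s, {}).get(py, 0)), k)
                for k in states)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path

    # Termination: follow the back-pointers from the best final state.
    best = max(V[-1], key=V[-1].get)
    return path[best]

# Toy model: "zai" is ambiguous between 在/再, but 在 is likelier after 他.
states = ["他", "在", "再"]
F = {"他": 0.8, "在": 0.1, "再": 0.1}
M = {("他", "在"): 0.7, ("他", "再"): 0.3}
C = {"他": {"ta": 1.0}, "在": {"zai": 1.0}, "再": {"zai": 1.0}}

print("".join(viterbi(["ta", "zai"], states, F, M, C)))  # → 他在
```

This O(len(sequence) × |states|²) dynamic program is what makes per-sequence decoding tractable even when every pinyin syllable has many candidate characters.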
5. A terminal, comprising:
a text segmentation unit, configured to segment text to be processed into text sequences, to obtain a text sequence set;
a pinyin sequence acquiring unit, configured to convert each text sequence in the text sequence set into a pinyin sequence;
a text sequence acquiring unit, configured to convert each pinyin sequence into a new text sequence using a pre-trained hidden Markov model, to generate a new text sequence set; and
a text generation unit, configured to generate new text according to the new text sequence set.
6. The terminal according to claim 5, further comprising:
a model training unit, configured to train the hidden Markov model on a preset text corpus and a pinyin dictionary.
7. The terminal according to claim 6, wherein the model training unit is specifically configured to:
determine the observable states O and the hidden states S of the hidden Markov model, wherein the observable states O are the set of all pinyin syllables in the text corpus, and the hidden states S are the set of all characters and words in the text corpus;
split the preset text corpus into text sequences T at specified punctuation marks, to form a text sequence set D;
check in a loop whether the text sequence set D is empty;
if it is not empty, take the text sequences T out of the text sequence set D one by one for further processing;
perform word segmentation on each text sequence T, to form a text word set U;
check whether the text word set U is empty;
if it is not empty, read the words I in the text word set U one by one;
judge whether the word I is the first word of the text sequence;
if it is the first word, add the word I to a head-word set R;
if it is not the first word, form a word pair (I, K) from the word I and its preceding word K in the text sequence, and add the word pair (I, K) to a word-pair set N;
count the number of times the word I occurs in the head-word set R and the number of times it occurs in the text corpus, and calculate from these counts the probability that the word I appears at the initial position of a text sequence T, to obtain the initial state probability matrix F of the hidden Markov model;
count the number of times the word pair (I, K) and the word I occur in the text corpus, and calculate from these counts the probability that the word I follows the word K, to obtain the hidden state transition probability matrix M of the hidden Markov model; and
obtain the characters corresponding to each pinyin syllable from the pinyin dictionary to form a "character-pinyin" relation matrix, to obtain the observation state transition probability matrix C of the hidden Markov model.
8. The terminal according to claim 5, wherein the text sequence acquiring unit is specifically configured to:
solve, with the Viterbi algorithm and according to the hidden Markov model, the optimal hidden state sequence corresponding to each pinyin sequence, the optimal hidden state sequence being the new text sequence corresponding to that pinyin sequence.
9. A terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 4.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 4.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710574188.4A (published as CN109255106A) | 2017-07-13 | 2017-07-13 | A kind of text handling method and terminal |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN109255106A | 2019-01-22 |
Family ID: 65051020

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201710574188.4A (CN109255106A, pending) | A kind of text handling method and terminal | 2017-07-13 | 2017-07-13 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN109255106A (en) |
Citations (3)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1293428A | 2000-11-10 | 2001-05-02 | Tsinghua University | Information check method based on speech recognition |
| CN102789779A | 2012-07-12 | 2012-11-21 | Guangdong University of Foreign Studies | Speech recognition system and recognition method thereof |
| CN104882139A | 2015-05-28 | 2015-09-02 | Baidu Online Network Technology (Beijing) Co., Ltd. | Voice synthesis method and device |
2017-07-13: application CN201710574188.4A filed in China; published as CN109255106A (en); status: Pending
Non-Patent Citations (1)

| Title |
|---|
| Zhang Jun (张俊), "Pinyin-to-Chinese-Character Conversion Based on Neural Networks" (基于神经网络的拼音汉字转换), China Master's Theses Full-text Database, Information Science and Technology Series |
Cited By (5)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114201958A | 2020-09-02 | 2022-03-18 | China Mobile Group Guangdong Co., Ltd. | Network resource data processing method and system, and electronic device |
| CN112149411A | 2020-09-22 | 2020-12-29 | Changzhou University | Ontology construction method in the field of clinical antibiotic use |
| CN112149411B | 2020-09-22 | 2024-06-04 | Changzhou University | Ontology construction method in the field of clinical antibiotic use (granted) |
| CN116013278A | 2023-01-06 | 2023-04-25 | Hangzhou Jianhai Technology Co., Ltd. | Speech recognition multi-model result merging method and device based on a pinyin alignment algorithm |
| CN116013278B | 2023-01-06 | 2023-08-08 | Hangzhou Jianhai Technology Co., Ltd. | Speech recognition multi-model result merging method and device based on a pinyin alignment algorithm (granted) |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| 2019-01-22 | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 2019-01-22 |