CN109255106A - A kind of text handling method and terminal - Google Patents
A kind of text handling method and terminal
- Publication number
- CN109255106A (application number CN201710574188.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- sequence
- word
- text sequence
- pinyin
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The embodiment of the present invention provides a text processing method and terminal, relating to the field of computer technology. The method includes: segmenting the text to be processed into text sequences to obtain a text sequence set; converting each text sequence in the text sequence set into a pinyin sequence; converting each pinyin sequence into a new text sequence using a pre-trained hidden Markov model, generating a new text sequence set; and generating new text according to the new text sequence set. The embodiment of the present invention can effectively resolve the problem of erroneous homophones in text, avoid the time consumption, labor, and dictionary-completeness problems of existing methods that handle homophones in text with manually built custom dictionaries, and broaden the application scenarios and scope of homophone semantic disambiguation.
Description
Technical field
The present invention relates to the field of computer technology, and more particularly to a text processing method and terminal.
Background art
With the popularization of the Internet and the rapid development of information technology, the volume of text data on the network has grown explosively, and mining valuable information from massive text data has become one of the hot topics of current research. Unlike traditional text data, network text data is riddled with homophone errors owing to the randomness of network users' modes of expression and the uneven educational levels of users: a user habitually types a wrong word to express a word with the same pronunciation. For example, a user intends to input the word "culture" (Chinese "文化", pinyin "wenhua") but, through carelessness, enters a different word with the same pinyin. Since the semantic difference between the vast majority of homophones is very large, failing to semantically disambiguate these erroneous homophones in network text often renders all subsequent work on the text futile. Therefore, in view of these problems in network text data, applying a suitable semantic disambiguation technique in the data preprocessing stage is both necessary and valuable, and lays the foundation for text data analysis and mining.

In the prior art, the erroneous homophones appearing in network text are generally semantically disambiguated by building dictionaries of homophones and synonyms. The drawback of this approach is that such custom dictionaries take a great deal of manual time to construct, and at the same time the completeness of the dictionary severely constrains its practical application.
Summary of the invention
In view of this, the embodiment of the present invention provides a text processing method and terminal, intended to solve the above problems that a custom dictionary takes a great deal of manual time to construct and that the completeness of the dictionary severely constrains its practical application.

A first aspect of the embodiment of the present invention provides a text processing method, comprising:

segmenting the text to be processed into text sequences to obtain a text sequence set;

converting each text sequence in the text sequence set into a pinyin sequence;

converting each pinyin sequence into a new text sequence using a pre-trained hidden Markov model, generating a new text sequence set;

generating new text according to the new text sequence set.
A second aspect of the embodiment of the present invention provides a terminal, comprising:

a text segmentation unit, configured to segment the text to be processed into text sequences to obtain a text sequence set;

a pinyin sequence acquisition unit, configured to convert each text sequence in the text sequence set into a pinyin sequence;

a text sequence acquisition unit, configured to convert each pinyin sequence into a new text sequence using a pre-trained hidden Markov model, generating a new text sequence set;

a text generation unit, configured to generate new text according to the new text sequence set.
A third aspect of the embodiment of the present invention provides a terminal comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the steps of the text processing method described in the first aspect above.

A fourth aspect of the embodiment of the present invention provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the text processing method described in the first aspect above.
Compared with the prior art, the embodiment of the present invention has the following beneficial effects:

because the text is first segmented into text sequences, the text sequences are then converted into pinyin sequences, the pinyin sequences are converted by a hidden Markov model into new text sequences, and new text is synthesized from the new text sequences, the embodiment of the present invention can effectively resolve erroneous homophones in text, avoid the time consumption, labor, and dictionary-completeness problems of existing methods that handle homophones in text with manually built custom dictionaries, and broaden the application scenarios and scope of homophone semantic disambiguation.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for the embodiments or the description of the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without any creative effort.
Fig. 1 is a schematic flow diagram of a text processing method provided by Embodiment 1 of the present invention;
Fig. 2 is a schematic flow diagram of a text processing method provided by Embodiment 2 of the present invention;
Fig. 3 is a specific implementation flowchart of step S200 in the text processing method provided by Embodiment 2 of the present invention;
Fig. 4 is a schematic block diagram of a terminal provided by Embodiment 3 of the present invention;
Fig. 5 is a schematic block diagram of a terminal provided by Embodiment 4 of the present invention;
Fig. 6 is a schematic block diagram of a terminal provided by Embodiment 5 of the present invention.
Detailed description of the embodiments
In the following description, specific details such as particular system structures and techniques are set forth for purposes of illustration rather than limitation, in order to provide a thorough understanding of the embodiments of the present invention. However, it will be apparent to those skilled in the art that the present invention may also be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so that unnecessary detail does not obscure the description of the present invention.

To illustrate the technical solutions of the present invention, specific embodiments are described below.
Fig. 1 is a schematic flow diagram of a text processing method provided by Embodiment 1 of the present invention. Referring to Fig. 1, the text processing method provided by this embodiment may comprise the following steps:

Step S101: segment the text to be processed into text sequences to obtain a text sequence set.

In this embodiment, the text to be processed is text that a network user has entered by voice or by a pinyin input method. Step S101 specifically includes:

splitting the text to be processed into short text sequences at preset punctuation marks to obtain the text sequence set. The preset punctuation marks include, but are not limited to, commas, full stops, and the like.
Step S102: convert each text sequence in the text sequence set into a pinyin sequence.

In this embodiment, text and pinyin have a many-to-one mapping relationship, and the characters and words that make up text are finite in number, the exact count being determined by a character or word dictionary. Converting a text sequence into a pinyin sequence is therefore easy to implement and highly accurate.

Preferably, in this embodiment, the text sequence is converted into a pinyin sequence with the help of a text segmentation tool. Such tools include, but are not limited to, the open-source jieba segmentation tool and the ICTCLAS segmentation tool of the Chinese Academy of Sciences.
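The many-to-one character-to-pinyin mapping can be illustrated with a toy lookup table (a real system would use a full pinyin dictionary or a library; the four-character table here is an assumption for illustration only):

```python
# Toy character-to-pinyin table standing in for a real pinyin dictionary.
CHAR_TO_PINYIN = {"文": "wen", "蚊": "wen", "化": "hua", "我": "wo"}

def to_pinyin_sequence(text_sequence):
    """Map each character of a text sequence to its pinyin syllable."""
    return [CHAR_TO_PINYIN[ch] for ch in text_sequence]

# Both the intended word and its erroneous homophone map to the same pinyin,
# which is exactly why the reverse (pinyin -> text) step needs the HMM:
assert to_pinyin_sequence("文化") == to_pinyin_sequence("蚊化") == ["wen", "hua"]
```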
Step S103: convert each pinyin sequence into a new text sequence using a pre-trained hidden Markov model, generating a new text sequence set.

Here, the hidden Markov model is a statistical model that describes a Markov process with hidden, unknown parameters. The hidden state sequence of the model cannot be observed directly, but can be derived from the associated observation sequence. A hidden Markov model can usually be described by five elements, comprising two state sets and three probability matrices, as follows:

1) Hidden states S

The hidden states satisfy the Markov property and are the states actually implied by the Markov model. These states usually cannot be obtained by direct observation, e.g. S1, S2, S3.

2) Observable states O

The observable states are associated with the hidden states in the model and can be obtained by direct observation, e.g. O1, O2, O3. The number of observable states need not equal the number of hidden states.

3) Initial state probability matrix F

This matrix gives the probabilities of the hidden states at the initial time t=1. For example, if at t=1, P(S1)=p1, P(S2)=p2, and P(S3)=p3, then the initial state probability matrix is F=[p1, p2, p3].

4) Hidden state transition probability matrix M

This matrix describes the transition probabilities between the states of the HMM:

Mij = P(Sj | Si), 1 ≤ i, j ≤ X,

i.e. the probability that the state at time t+1 is Sj, given that the state at time t is Si.

5) Observation state transition probability matrix C

Let X denote the number of hidden states and Y the number of observable states; then

Cij = P(Oi | Sj), 1 ≤ i ≤ Y, 1 ≤ j ≤ X,

i.e. the probability of observing Oi at time t, given that the hidden state at time t is Sj.

In this embodiment, once the parameters of the hidden Markov model are known, the optimal text sequence can be obtained for any given pinyin sequence, thereby eliminating the erroneous homophones present in the text.
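The five elements above can be written down concretely for the pinyin-decoding task. The following minimal sketch uses tiny illustrative probabilities (the characters and numbers are assumptions, not values from the patent); real values would be estimated from a corpus as described in Embodiment 2:

```python
# Hidden states S: candidate characters; observable states O: pinyin syllables.
S = ["文", "蚊", "化"]
O = ["wen", "hua"]

# Initial state probability matrix F: probability each character starts a sequence.
F = {"文": 0.6, "蚊": 0.1, "化": 0.3}

# Hidden state transition probability matrix M: M[si][sj] = P(sj | si).
M = {"文": {"文": 0.05, "蚊": 0.05, "化": 0.90},
     "蚊": {"文": 0.40, "蚊": 0.40, "化": 0.20},
     "化": {"文": 0.50, "蚊": 0.10, "化": 0.40}}

# Observation state transition probability matrix C: C[s][o] = P(o | s).
# Each character emits exactly its own pinyin, so each row is deterministic.
C = {"文": {"wen": 1.0}, "蚊": {"wen": 1.0}, "化": {"hua": 1.0}}

# Sanity checks: every probability distribution must sum to 1.
assert abs(sum(F.values()) - 1.0) < 1e-9
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in M.values())
```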
Preferably, in this embodiment, step S103 specifically includes:

using the Viterbi algorithm to solve, for each pinyin sequence, the optimal hidden state sequence corresponding to that pinyin sequence under the hidden Markov model, the optimal hidden state sequence being the new text sequence corresponding to the pinyin sequence.

The Viterbi algorithm uses dynamic programming to solve for the optimal path: the global optimum is composed of local optima, and because the hidden state sequence has the Markov property, the correlation between local optima ensures that the sequence finally obtained is globally optimal. The problem is thus converted into finding the optimal solution at each step of the pinyin sequence.

For example, suppose the observation sequence A = [A1, A2, A3, ..., An] is known, the hidden state set is B = [B1, B2, B3, ..., Bm], the hidden state transition probability matrix is M, the observation state transition probability matrix is C, and the initial state matrix is F. The specific solution process is as follows:

1) When solving for the hidden state corresponding to A1 in the observation sequence, only the initial state probability matrix F and the observation state transition probability matrix C need be considered:

b1 = argmax over b in B of F(b) · C(A1 | b),

i.e. b1 is the hidden state that maximizes this product.

2) From the second element onward, the hidden state transition probability matrix M and the observation state transition probability matrix C must both be considered. For the current observation Ax and the previously selected state b(x-1):

P(Bi) = M(Bi | b(x-1)) · C(Ax | Bi), for each Bi in [B1, B2, ..., Bm],

and bx is the hidden state Bi that maximizes P(Bi).

3) Step 2 is repeated, iteratively solving for each subsequent hidden state in turn.

The final result [b1, b2, ..., bn] is the optimal hidden state sequence for the given observation sequence A = [A1, A2, A3, ..., An] under the model parameters (hidden state transition probability matrix M, observation state transition probability matrix C, and initial state matrix F).
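The decoding above can be sketched in code. Note that the stepwise description greedily fixes one state per step; the sketch below implements the standard dynamic-programming Viterbi decoder, which keeps a best score for every candidate state at each step before selecting the best final path (the toy states and probabilities are assumptions for illustration):

```python
def viterbi(obs, states, F, M, C):
    """Return the most probable hidden state sequence for observations obs."""
    # delta[s] = best score of any path ending in state s; path[s] = that path.
    delta = {s: F.get(s, 0.0) * C.get(s, {}).get(obs[0], 0.0) for s in states}
    path = {s: [s] for s in states}
    for o in obs[1:]:
        new_delta, new_path = {}, {}
        for s in states:
            # Pick the predecessor that maximizes (score so far) * transition.
            prev = max(states, key=lambda p: delta[p] * M.get(p, {}).get(s, 0.0))
            new_delta[s] = (delta[prev] * M.get(prev, {}).get(s, 0.0)
                            * C.get(s, {}).get(o, 0.0))
            new_path[s] = path[prev] + [s]
        delta, path = new_delta, new_path
    best = max(states, key=lambda s: delta[s])
    return path[best]

# Toy model in which the pinyin "wen hua" should decode to 文化, not 蚊化:
states = ["文", "蚊", "化"]
F = {"文": 0.6, "蚊": 0.1, "化": 0.3}
M = {"文": {"化": 0.9}, "蚊": {"化": 0.2}, "化": {}}
C = {"文": {"wen": 1.0}, "蚊": {"wen": 1.0}, "化": {"hua": 1.0}}
print("".join(viterbi(["wen", "hua"], states, F, M, C)))  # -> 文化
```

The transition probability P(化 | 文) = 0.9 versus P(化 | 蚊) = 0.2 is what lets the decoder prefer the correct character 文 over its homophone 蚊.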
Therefore, given the parameters of the hidden Markov model, the optimal text sequence can be obtained for any given pinyin sequence, achieving semantic disambiguation of network text. Of course, in the process of converting a pinyin sequence into a new text sequence, the model parameters are extremely important: for the same pinyin sequence, the optimal text sequences obtained with different model parameters may differ. In that case, one need only train a corresponding set of model parameters on text data from each field to obtain the optimal text sequence for that field, which also gives the method more application scenarios and a wider scope.
Step S104: generate new text according to the new text sequence set.

In this embodiment, after the new text sequence set is obtained, a preset text synthesis tool recombines the text sequences in the new text sequence set into new text, which then replaces the original text. In this way, the erroneous homophones present in the original text are eliminated, laying a solid foundation for subsequent text data analysis and mining.
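The patent does not specify the text synthesis tool or how punctuation is restored. One minimal approach, assuming the punctuation marks removed in step S101 are recorded at splitting time, is:

```python
import re

def split_keep_punct(text, puncts="，。！？"):
    """Split text at punctuation while remembering each separator."""
    pieces = re.split("([" + puncts + "])", text)
    seqs = [s for s in pieces[0::2] if s]  # text sequences
    seps = pieces[1::2]                    # punctuation between/after them
    return seqs, seps

def rejoin(new_seqs, seps):
    """Recombine corrected sequences with the original punctuation."""
    out = []
    for i, s in enumerate(new_seqs):
        out.append(s)
        if i < len(seps):
            out.append(seps[i])
    return "".join(out)

seqs, seps = split_keep_punct("蚊化很重要，我们要学习。")
# ...each sequence would be corrected via pinyin + HMM decoding here...
corrected = ["文化很重要", "我们要学习"]
print(rejoin(corrected, seps))  # -> 文化很重要，我们要学习。
```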
As can be seen from the above, this embodiment converts the input text along the path "text -> pinyin sequence -> text", exploiting the many-to-one mapping between characters and pinyin and introducing a hidden Markov model into the "pinyin sequence -> text" conversion. It thereby effectively resolves the homophone problem in network text data; moreover, since the model parameters in this process can be trained automatically, it avoids the dictionary-completeness and manual-time problems of existing manual-dictionary methods and can effectively improve the working efficiency of the homophone semantic disambiguation stage for network text.
Fig. 2 is a schematic flow diagram of a text processing method provided by Embodiment 2 of the present invention. Referring to Fig. 2, relative to the previous embodiment, the text processing method provided by this embodiment further includes, before segmenting the text to be processed into text sequences to obtain the text sequence set:

Step S200: train the hidden Markov model on a preset text corpus and a pinyin dictionary. Referring to Fig. 3, step S200 specifically includes:
Step S301: determine the observable states O and hidden states S of the hidden Markov model, where the observable states O are the set of all pinyin syllables in the text corpus and the hidden states S are the set of all characters and words in the text corpus.

Step S302: split the preset text corpus at specific punctuation marks into short text sequences T, forming a text sequence set D.

Step S303: check in a loop whether the text sequence set D is empty; if not empty, go to step S304; if empty, go to step S311.

Step S304: take the text sequences T out of the text sequence set D one by one for further processing.

Step S305: perform word segmentation on the text sequence T to form a word set U.

Step S306: check whether the word set U is empty; if not empty, go to step S307; if empty, return to step S303.

Step S307: read the words I in the word set U one by one.

Step S308: judge whether the word I is the first word of the text sequence; if it is the first word, go to step S309; if not, go to step S310.

Step S309: add the word I to the first-word set R.

Step S310: form a word pair (I, K) from the word I and its preceding word K in the text sequence, and add the pair (I, K) to the pair set N.

Step S311: count the number of times each word I occurs in the first-word set R and the number of times it occurs in the text corpus; from these counts, compute the probability that the word I appears at the start of a text sequence T, obtaining the initial state probability matrix F of the hidden Markov model.

Step S312: count the number of times each pair (I, K) and each word I occur in the text corpus; from these counts, compute the probability that word I appears after word K, obtaining the hidden state transition probability matrix M of the hidden Markov model.

Step S313: obtain from the pinyin dictionary the characters corresponding to each pinyin syllable, forming a "character-pinyin" relation matrix and thereby the observation state transition probability matrix C of the hidden Markov model.
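Steps S309 through S312 amount to simple counting over the segmented corpus. A minimal sketch follows (the two-sentence toy corpus and the pre-segmented word lists are assumptions; a real implementation would run a segmentation tool such as jieba over a large corpus):

```python
from collections import Counter

# Toy corpus: each inner list is one pre-segmented text sequence T.
corpus = [["文化", "很", "重要"], ["文化", "源远流长"]]

first_words = Counter(seq[0] for seq in corpus)            # first-word set R
word_counts = Counter(w for seq in corpus for w in seq)    # corpus occurrences
pair_counts = Counter((seq[i], seq[i - 1])                 # word pairs (I, K)
                      for seq in corpus for i in range(1, len(seq)))

# Initial state probabilities F: P(word starts a sequence).
F = {w: first_words[w] / len(corpus) for w in word_counts}
# Transition probabilities M: P(I follows K) = count(I, K) / count(K).
M = {(i, k): c / word_counts[k] for (i, k), c in pair_counts.items()}

assert F["文化"] == 1.0               # "文化" starts both sequences
assert M[("很", "文化")] == 0.5       # "很" follows "文化" in 1 of its 2 uses
```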
It should be noted that in the process of converting pinyin sequences into new text sequences, the parameters of the hidden Markov model are extremely important: for the same pinyin sequence, the optimal text sequences obtained with different hidden Markov model parameters may differ. In that case, one need only train a corresponding hidden Markov model in advance on text data from each field to obtain the optimal text sequence for that field, which also gives the method more application scenarios and a wider scope.

In addition, since the implementations of steps S201 to S204 in this embodiment are respectively identical to those of steps S101 to S104 in the previous embodiment, they are not described again here.
It can thus be seen that the text processing method provided by this embodiment likewise converts input text along the path "text -> pinyin sequence -> text", exploiting the many-to-one mapping between characters and pinyin and introducing a hidden Markov model into the "pinyin sequence -> text" conversion, thereby effectively resolving the homophone problem in network text data. Moreover, since the model parameters in this process can be trained automatically, it avoids the dictionary-completeness and manual-time problems of existing manual-dictionary methods and can effectively improve the working efficiency of the homophone semantic disambiguation stage for network text.
Fig. 4 is a schematic block diagram of a terminal provided by Embodiment 3 of the present invention. For ease of description, only the parts relevant to this embodiment are shown.

Referring to Fig. 4, the terminal 4 provided by this embodiment comprises:

a text segmentation unit 41, configured to segment the text to be processed into text sequences to obtain a text sequence set;

a pinyin sequence acquisition unit 42, configured to convert each text sequence in the text sequence set into a pinyin sequence;

a text sequence acquisition unit 43, configured to convert each pinyin sequence into a new text sequence using a pre-trained hidden Markov model, generating a new text sequence set;

a text generation unit 44, configured to generate new text according to the new text sequence set.

Optionally, the text sequence acquisition unit 43 is specifically configured to:

use the Viterbi algorithm to solve, for each pinyin sequence, the optimal hidden state sequence corresponding to that pinyin sequence under the hidden Markov model, the optimal hidden state sequence being the new text sequence corresponding to the pinyin sequence.
Optionally, referring to Fig. 5, in another embodiment the terminal 4 further comprises:

a model training unit 45, configured to train the hidden Markov model on a preset text corpus and a pinyin dictionary.

Optionally, the model training unit 45 is specifically configured to:

determine the observable states O and hidden states S of the hidden Markov model, the observable states O being the set of all pinyin syllables in the text corpus and the hidden states S being the set of all characters and words in the text corpus;

split the preset text corpus at specific punctuation marks into short text sequences T, forming a text sequence set D;

check in a loop whether the text sequence set D is empty;

if not empty, take the text sequences T out of the text sequence set D one by one for further processing;

perform word segmentation on the text sequence T to form a word set U;

check whether the word set U is empty;

if not empty, read the words I in the word set U one by one;

judge whether the word I is the first word of the text sequence;

if it is the first word, add the word I to the first-word set R;

if not, form a word pair (I, K) from the word I and its preceding word K in the text sequence, and add the pair (I, K) to the pair set N;

count the number of times each word I occurs in the first-word set R and the number of times it occurs in the text corpus; from these counts, compute the probability that the word I appears at the start of a text sequence T, obtaining the initial state probability matrix F of the hidden Markov model;

count the number of times each pair (I, K) and each word I occur in the text corpus; from these counts, compute the probability that word I appears after word K, obtaining the hidden state transition probability matrix M of the hidden Markov model;

obtain from the pinyin dictionary the characters corresponding to each pinyin syllable, forming a "character-pinyin" relation matrix and thereby the observation state transition probability matrix C of the hidden Markov model.
It should be noted that since the units of the terminal provided by the embodiment of the present invention are based on the same conception as the method embodiments of the present invention, their technical effects are the same as those of the method embodiments; for details, refer to the description in the method embodiments of the present invention, which is not repeated here.

It can therefore be seen that the terminal provided by the embodiment of the present invention likewise converts input text along the path "text -> pinyin sequence -> text", exploiting the many-to-one mapping between characters and pinyin and introducing a hidden Markov model into the "pinyin sequence -> text" conversion, thereby effectively resolving the homophone problem in network text data. Moreover, since the model parameters in this process can be trained automatically, it avoids the dictionary-completeness and manual-time problems of existing manual-dictionary methods and can effectively improve the working efficiency of the homophone semantic disambiguation stage for network text.
Fig. 6 is a schematic diagram of a terminal provided by Embodiment 5 of the present invention. As shown in Fig. 6, the terminal 6 of this embodiment comprises: a processor 60, a memory 61, and a computer program 62 stored in the memory 61 and runnable on the processor 60. When executing the computer program 62, the processor 60 implements the steps in each of the above method embodiments, for example steps S101 to S104 shown in Fig. 1. Alternatively, when executing the computer program 62, the processor 60 implements the functions of the modules/units in each of the above device embodiments, for example the functions of units 41 to 44 shown in Fig. 4.

Illustratively, the computer program 62 may be divided into one or more modules/units, which are stored in the memory 61 and executed by the processor 60 to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of completing specific functions, the instruction segments being used to describe the execution of the computer program 62 in the terminal 6. For example, the computer program 62 may be divided into a text segmentation unit 41, a pinyin sequence acquisition unit 42, a text sequence acquisition unit 43, and a text generation unit 44, the specific functions of which are as follows:

the text segmentation unit 41 is configured to segment the text to be processed into text sequences to obtain a text sequence set;

the pinyin sequence acquisition unit 42 is configured to convert each text sequence in the text sequence set into a pinyin sequence;

the text sequence acquisition unit 43 is configured to convert each pinyin sequence into a new text sequence using a pre-trained hidden Markov model, generating a new text sequence set;

the text generation unit 44 is configured to generate new text according to the new text sequence set.

The terminal 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The terminal 6 may include, but is not limited to, the processor 60 and the memory 61. Those skilled in the art will understand that Fig. 6 is only an example of the terminal 6 and does not limit the terminal 6, which may include more or fewer components than illustrated, a combination of certain components, or different components; for example, the terminal may also include input/output devices, network access devices, buses, and so on.
The processor 60 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 61 may be an internal storage unit of the terminal 6, such as a hard disk or memory of the terminal 6. The memory 61 may also be an external storage device of the terminal 6, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the terminal 6. Further, the memory 61 may include both an internal storage unit and an external storage device of the terminal 6. The memory 61 is used to store the computer program and the other programs and data needed by the terminal. The memory 61 may also be used to temporarily store data that has been output or will be output.
It will be clear to those skilled in the art that, for convenience and brevity of description, the division into the above functional units and modules is merely illustrative. In practical applications, the above functions may be assigned to different functional units and modules as needed; that is, the internal structure of the device may be divided into different functional units or modules to complete all or part of the functions described above. The functional units in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing them from each other and are not intended to limit the protection scope of the present application. For the specific working processes of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the above embodiments, each embodiment is described with its own emphasis; for parts not detailed in one embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed terminal and method may be implemented in other ways. For example, the terminal device embodiments described above are merely illustrative; the division into modules or units is only a logical functional division, and there may be other division manners in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the present invention may implement all or part of the processes in the methods of the above embodiments by instructing the relevant hardware through a computer program. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, may implement the steps of each of the above method embodiments. The computer program includes computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately added to or removed from according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
It should be understood that the serial numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or replace some of the technical features with equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention.
Claims (10)
1. A text processing method, comprising:
segmenting text to be processed into text sequences, to obtain a text sequence set;
converting each text sequence in the text sequence set into a pinyin sequence;
converting each pinyin sequence into a new text sequence using a pre-trained hidden Markov model, to generate a new text sequence set; and
generating new text according to the new text sequence set.
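A minimal sketch of the claimed pipeline. The character-to-pinyin table and the `hmm_decode` stand-in below are illustrative assumptions, not the patent's implementation: a real system would use a full pinyin dictionary and a trained hidden Markov model in place of the toy lookup tables.

```python
import re

# Toy character-to-pinyin table (a real system would use a full pinyin dictionary).
PINYIN = {"他": "ta", "在": "zai", "再": "zai", "见": "jian"}

def split_text(text):
    """Step 1: split the text at punctuation into a set of text sequences."""
    return [s for s in re.split(r"[,。,.!?!?]", text) if s]

def to_pinyin(seq):
    """Step 2: convert a text sequence into its pinyin sequence."""
    return [PINYIN[ch] for ch in seq]

def hmm_decode(pinyin_seq):
    """Step 3 (placeholder): a trained HMM would pick the most likely
    character for each syllable by context via Viterbi decoding; here a
    fixed lookup stands in for that model."""
    best = {"ta": "他", "zai": "在", "jian": "见"}
    return "".join(best[p] for p in pinyin_seq)

def process(text):
    """Steps 1-4: segment, convert to pinyin, decode, and rejoin."""
    sequences = split_text(text)
    return ",".join(hmm_decode(to_pinyin(s)) for s in sequences)

# "他再" contains the homophone error 再 (zai); decoding through pinyin
# recovers the intended 在.
print(process("他再"))  # → 他在
```

Note how the round trip through pinyin erases the original (possibly wrong) character choice, so the decoder is free to pick the homophone the model considers most likely.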
2. The text processing method according to claim 1, wherein, before segmenting the text to be processed into text sequences to obtain the text sequence set, the method further comprises:
training the hidden Markov model on a preset text corpus and a pinyin dictionary.
3. The text processing method according to claim 2, wherein training the hidden Markov model on the preset text corpus and the pinyin dictionary comprises:
determining the observable states O and the hidden states S of the hidden Markov model, wherein the observable states O are the set of all pinyin syllables in the text corpus, and the hidden states S are the set of all characters and words in the text corpus;
splitting the preset text corpus into text sequences T at specified punctuation marks, to form a text sequence set D;
checking in a loop whether the text sequence set D is empty;
if it is not empty, taking the text sequences T out of the text sequence set D one by one for further processing;
performing word segmentation on each text sequence T, to form a text word set U;
checking whether the text word set U is empty;
if it is not empty, reading the words I in the text word set U one by one;
judging whether the word I is the first word of the text sequence;
if it is the first word, adding the word I to a head-word set R;
if it is not the first word, forming a word pair (I, K) from the word I and its preceding word K in the text sequence, and adding the word pair (I, K) to a word-pair set N;
counting the number of times the word I occurs in the head-word set R and the number of times it occurs in the text corpus, and calculating from these counts the probability that the word I appears at the initial position of a text sequence T, to obtain the initial state probability matrix F of the hidden Markov model;
counting the number of times the word pair (I, K) and the word I occur in the text corpus, and calculating from these counts the probability that the word I follows the word K, to obtain the hidden state transition probability matrix M of the hidden Markov model; and
obtaining the characters corresponding to each pinyin syllable from the pinyin dictionary to form a "character-pinyin" relation matrix, to obtain the observation state transition probability matrix C of the hidden Markov model.
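The counting procedure described above can be sketched as follows. The three-sequence corpus, the pre-segmented word lists, and the one-pronunciation-per-word pinyin table are toy assumptions; the three dictionaries built correspond to the initial state probabilities F, the hidden transition probabilities M, and the "character-pinyin" observation matrix C.

```python
from collections import Counter, defaultdict

# Toy corpus already split at punctuation into text sequences (set D),
# with each sequence pre-segmented into words (set U).
corpus = [["他", "在", "家"], ["他", "再", "来"], ["在", "家"]]
pinyin_dict = {"他": "ta", "在": "zai", "再": "zai", "家": "jia", "来": "lai"}

head = Counter()  # head-word set R: how often each word opens a sequence
word = Counter()  # occurrences of each word I anywhere in the corpus
pair = Counter()  # word pairs (K, I): I immediately follows K

for seq in corpus:
    head[seq[0]] += 1
    for i, w in enumerate(seq):
        word[w] += 1
        if i > 0:
            pair[(seq[i - 1], w)] += 1

# Initial state probability matrix F: P(a sequence starts with I).
F = {w: head[w] / len(corpus) for w in word}

# Hidden state transition probability matrix M: P(I follows K),
# estimated as count(K, I) / count(K).
M = {(k, i): n / word[k] for (k, i), n in pair.items()}

# Observation matrix C: the "character-pinyin" relation from the dictionary.
C = defaultdict(dict)
for w, py in pinyin_dict.items():
    C[w][py] = 1.0  # one pronunciation per word in this toy table

print(F["他"])          # 2/3: "他" opens two of the three sequences
print(M[("他", "在")])  # 1/2: "在" follows "他" in one of "他"'s two occurrences
```

Normalizing the pair counts by the total count of K is a simplification; a production system would also smooth these maximum-likelihood estimates so unseen transitions are not assigned zero probability.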
4. The text processing method according to claim 1, wherein converting each pinyin sequence into a new text sequence using the pre-trained hidden Markov model comprises:
solving, with the Viterbi algorithm and according to the hidden Markov model, the optimal hidden state sequence corresponding to each pinyin sequence, the optimal hidden state sequence being the new text sequence corresponding to that pinyin sequence.
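The Viterbi decoding step can be sketched as below. The probability tables are hypothetical toy values (in a real system they come from the training step); the algorithm itself is the standard dynamic program, run in log space to avoid floating-point underflow on long sequences.

```python
import math

def viterbi(pinyin_seq, states, F, M, C):
    """Find the most likely hidden word sequence for an observed pinyin
    sequence, given initial (F), transition (M), and emission (C)
    probabilities. Missing table entries count as probability zero."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    # Initialization: start in each state and emit the first syllable.
    V = [{s: lg(F.get(s, 0)) + lg(C.get(s, {}).get(pinyin_seq[0], 0))
          for s in states}]
    path = {s: [s] for s in states}

    # Recursion: extend the best path into each state at every step.
    for py in pinyin_seq[1:]:
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max(
                (V[-2][k] + lg(M.get((k, s), 0)) + lg(C.get(s, {}).get(py, 0)), k)
                for k in states)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path

    # Termination: follow the back-pointers from the best final state.
    best = max(V[-1], key=V[-1].get)
    return path[best]

# Toy model: "zai" is ambiguous between 在/再, but 在 is likelier after 他.
states = ["他", "在", "再"]
F = {"他": 0.8, "在": 0.1, "再": 0.1}
M = {("他", "在"): 0.7, ("他", "再"): 0.3}
C = {"他": {"ta": 1.0}, "在": {"zai": 1.0}, "再": {"zai": 1.0}}

print("".join(viterbi(["ta", "zai"], states, F, M, C)))  # → 他在
```

This O(len(sequence) × |states|²) dynamic program is what makes per-sequence decoding tractable even when every pinyin syllable has many candidate characters.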
5. A terminal, comprising:
a text segmentation unit, configured to segment text to be processed into text sequences, to obtain a text sequence set;
a pinyin sequence acquiring unit, configured to convert each text sequence in the text sequence set into a pinyin sequence;
a text sequence acquiring unit, configured to convert each pinyin sequence into a new text sequence using a pre-trained hidden Markov model, to generate a new text sequence set; and
a text generation unit, configured to generate new text according to the new text sequence set.
6. The terminal according to claim 5, further comprising:
a model training unit, configured to train the hidden Markov model on a preset text corpus and a pinyin dictionary.
7. The terminal according to claim 6, wherein the model training unit is specifically configured to:
determine the observable states O and the hidden states S of the hidden Markov model, wherein the observable states O are the set of all pinyin syllables in the text corpus, and the hidden states S are the set of all characters and words in the text corpus;
split the preset text corpus into text sequences T at specified punctuation marks, to form a text sequence set D;
check in a loop whether the text sequence set D is empty;
if it is not empty, take the text sequences T out of the text sequence set D one by one for further processing;
perform word segmentation on each text sequence T, to form a text word set U;
check whether the text word set U is empty;
if it is not empty, read the words I in the text word set U one by one;
judge whether the word I is the first word of the text sequence;
if it is the first word, add the word I to a head-word set R;
if it is not the first word, form a word pair (I, K) from the word I and its preceding word K in the text sequence, and add the word pair (I, K) to a word-pair set N;
count the number of times the word I occurs in the head-word set R and the number of times it occurs in the text corpus, and calculate from these counts the probability that the word I appears at the initial position of a text sequence T, to obtain the initial state probability matrix F of the hidden Markov model;
count the number of times the word pair (I, K) and the word I occur in the text corpus, and calculate from these counts the probability that the word I follows the word K, to obtain the hidden state transition probability matrix M of the hidden Markov model; and
obtain the characters corresponding to each pinyin syllable from the pinyin dictionary to form a "character-pinyin" relation matrix, to obtain the observation state transition probability matrix C of the hidden Markov model.
8. The terminal according to claim 5, wherein the text sequence acquiring unit is specifically configured to:
solve, with the Viterbi algorithm and according to the hidden Markov model, the optimal hidden state sequence corresponding to each pinyin sequence, the optimal hidden state sequence being the new text sequence corresponding to that pinyin sequence.
9. A terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 4.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 4.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710574188.4A (published as CN109255106A) | 2017-07-13 | 2017-07-13 | A kind of text handling method and terminal |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN109255106A | 2019-01-22 |
Family ID: 65051020

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201710574188.4A (CN109255106A, pending) | A kind of text handling method and terminal | 2017-07-13 | 2017-07-13 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN109255106A (en) |
Citations (3)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1293428A | 2000-11-10 | 2001-05-02 | Tsinghua University | Information check method based on speech recognition |
| CN102789779A | 2012-07-12 | 2012-11-21 | Guangdong University of Foreign Studies | Speech recognition system and recognition method thereof |
| CN104882139A | 2015-05-28 | 2015-09-02 | Baidu Online Network Technology (Beijing) Co., Ltd. | Voice synthesis method and device |
2017-07-13: application CN201710574188.4A filed in China; published as CN109255106A (en); status: Pending
Non-Patent Citations (1)

| Title |
|---|
| Zhang Jun (张俊), "Pinyin-to-Chinese-Character Conversion Based on Neural Networks" (基于神经网络的拼音汉字转换), China Master's Theses Full-text Database, Information Science and Technology Series |
Cited By (5)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114201958A | 2020-09-02 | 2022-03-18 | China Mobile Group Guangdong Co., Ltd. | Network resource data processing method and system, and electronic device |
| CN112149411A | 2020-09-22 | 2020-12-29 | Changzhou University | Ontology construction method in the field of clinical antibiotic use |
| CN112149411B | 2020-09-22 | 2024-06-04 | Changzhou University | Ontology construction method in the field of clinical antibiotic use (granted) |
| CN116013278A | 2023-01-06 | 2023-04-25 | Hangzhou Jianhai Technology Co., Ltd. | Speech recognition multi-model result merging method and device based on a pinyin alignment algorithm |
| CN116013278B | 2023-01-06 | 2023-08-08 | Hangzhou Jianhai Technology Co., Ltd. | Speech recognition multi-model result merging method and device based on a pinyin alignment algorithm (granted) |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| 2019-01-22 | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 2019-01-22 |