CN107807918A - The method and device of Thai words recognition - Google Patents
- Publication number
- CN107807918A (application CN201710982841.0A)
- Authority
- CN
- China
- Prior art keywords
- character string
- thai
- slices
- language character
- information entropy
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a method and device for Thai word recognition, belonging to the technical field of information retrieval. The method includes: filtering and segmenting a Thai document to be recognized according to a set step length, to obtain a slice set containing at least one sliced Thai character string; screening the slice set according to the information entropy processing parameter values of each sliced Thai character string, to form a word output slice set; and determining a set number of sliced Thai character strings from the word output slice set as the recognized Thai words. Thai words can thus be identified in a Thai document through information entropy processing, which improves the efficiency of Thai word recognition and can also increase the browsing and reading speed of Thai documents.
Description
Technical field
The present invention relates to the technical field of information retrieval, and in particular to a method and device for Thai word recognition.
Background
Thai, also called the Dai language, is the language of the Thai and Dai peoples and is described here as belonging to the East Asian / Sino-Tibetan language family. About 68 million people worldwide use Thai. In Thai documents there are no punctuation marks or spaces between words; a sentence is written continuously from beginning to end, and typically a gap of two letters or a pause in the sentence marks the end of a sentence. It is therefore difficult for Thai learners, translators, or other users of Thai to identify Thai words in a Thai document using existing word recognition methods based on word frequency, word length, spaces, or punctuation marks.
Summary of the invention
The embodiments of the present invention provide a method and device for Thai word recognition. A brief summary is given below to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview, nor is it intended to identify key or critical elements or to delineate the scope of protection of these embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the detailed description that follows.
According to a first aspect of the embodiments of the present invention, a method for Thai word recognition is provided, including:
filtering and segmenting a Thai document to be recognized according to a set step length, to obtain a slice set containing at least one sliced Thai character string;
screening the slice set according to the information entropy processing parameter values of each sliced Thai character string, to form a word output slice set;
determining a set number of sliced Thai character strings from the word output slice set as the recognized Thai words.
In one embodiment of the present invention, when the information entropy processing parameter values include the occurrence frequency, the cohesion degree value, and the information entropy degree-of-freedom value, screening the slice set according to the information entropy processing parameter values of each sliced Thai character string to form the word output slice set includes:
forming a first to-be-output slice set from the sliced Thai character strings whose occurrence frequency exceeds a set frequency;
determining the cohesion degree value of each sliced Thai character string in the first to-be-output slice set, and forming a second to-be-output slice set from the sliced Thai character strings whose cohesion degree value is greater than a first set value;
determining the information entropy degree-of-freedom value of each sliced Thai character string in the second to-be-output slice set, and forming the word output slice set from the sliced Thai character strings whose information entropy degree-of-freedom value is greater than a second set value.
In one embodiment of the present invention, determining the cohesion degree value of each sliced Thai character string in the first to-be-output slice set includes:
determining, according to formula (2), the cohesion degree value of the current sliced Thai character string in the first to-be-output slice set;
where Pi is the occurrence frequency of the current sliced Thai character string, Pij is the occurrence frequency of the corresponding sub-slice Thai character strings within the current sliced Thai character string, and co is the cohesion degree value.
In one embodiment of the present invention, determining the information entropy degree-of-freedom value of each sliced Thai character string in the second to-be-output slice set includes:
determining, according to formula (3), the left-neighbor information entropy and the right-neighbor information entropy of the current sliced Thai character string;
determining, according to formula (4), the smaller of the left-neighbor information entropy and the right-neighbor information entropy as the information entropy degree-of-freedom value of the current sliced Thai character string;
$$H(U) = E[-\log P_i] = -\sum_{i=1}^{n} P_i \log P_i \qquad \text{formula (3)}$$
where Pi is the occurrence frequency of each sliced Thai character string, and H(U) is the information entropy;
$$\mathrm{free} = \min\{H(U)_1, H(U)_2, \ldots, H(U)_n\} \qquad \text{formula (4)}$$
where H(U) is the information entropy, and free is the information entropy degree-of-freedom value.
In one embodiment of the present invention, determining a set number of sliced Thai character strings from the word output slice set as the recognized Thai words includes:
sorting the sliced Thai character strings in the word output slice set by occurrence frequency, from highest to lowest;
determining the set number of sliced Thai character strings at the front of the ranking as the recognized Thai words.
According to a second aspect of the embodiments of the present invention, a device for Thai word recognition is provided, including:
a filtering and segmentation unit, configured to filter and segment a Thai document to be recognized according to a set step length, to obtain a slice set containing at least one sliced Thai character string;
an information entropy screening unit, configured to screen the slice set according to the information entropy processing parameter values of each sliced Thai character string, to form a word output slice set;
a word determination unit, configured to determine a set number of sliced Thai character strings from the word output slice set as the recognized Thai words.
In one embodiment of the present invention, the information entropy screening unit includes:
a frequency screening module, configured to form a first to-be-output slice set from the sliced Thai character strings whose occurrence frequency exceeds a set frequency;
a cohesion screening module, configured to determine the cohesion degree value of each sliced Thai character string in the first to-be-output slice set, and to form a second to-be-output slice set from the sliced Thai character strings whose cohesion degree value is greater than a first set value;
a degree-of-freedom screening module, configured to determine the information entropy degree-of-freedom value of each sliced Thai character string in the second to-be-output slice set, and to form the word output slice set from the sliced Thai character strings whose information entropy degree-of-freedom value is greater than a second set value.
In one embodiment of the present invention, the cohesion screening module is specifically configured to determine, according to formula (2), the cohesion degree value of the current sliced Thai character string in the first to-be-output slice set;
where Pi is the occurrence frequency of the current sliced Thai character string, Pij is the occurrence frequency of the corresponding sub-slice Thai character strings within the current sliced Thai character string, and co is the cohesion degree value.
In one embodiment of the present invention, the degree-of-freedom screening module is specifically configured to determine, according to formula (3), the left-neighbor information entropy and the right-neighbor information entropy of the current sliced Thai character string, and to determine, according to formula (4), the smaller of the two as the information entropy degree-of-freedom value of the current sliced Thai character string;
$$H(U) = E[-\log P_i] = -\sum_{i=1}^{n} P_i \log P_i \qquad \text{formula (3)}$$
where Pi is the occurrence frequency of each sliced Thai character string, and H(U) is the information entropy;
$$\mathrm{free} = \min\{H(U)_1, H(U)_2, \ldots, H(U)_n\} \qquad \text{formula (4)}$$
where H(U) is the information entropy, and free is the information entropy degree-of-freedom value.
In one embodiment of the present invention, the word determination unit is specifically configured to sort the sliced Thai character strings in the word output slice set by occurrence frequency, from highest to lowest, and to determine the set number of sliced Thai character strings at the front of the ranking as the recognized Thai words.
The technical solutions provided by the embodiments of the present invention can have the following beneficial effects:
In the embodiments of the present invention, Thai words can be identified in a Thai document through information entropy processing, which improves the efficiency of Thai word recognition and can also increase the browsing and reading speed of Thai documents.
It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present invention.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present invention and, together with the specification, serve to explain the principles of the invention.
Fig. 1 is a flow chart of a Thai word recognition method according to an exemplary embodiment;
Fig. 2 is a flow chart of a Thai word recognition method according to an exemplary embodiment;
Fig. 3 is a block diagram of a Thai word recognition device according to an exemplary embodiment;
Fig. 4 is a block diagram of a Thai word recognition device according to an exemplary embodiment.
Detailed description
The following description and drawings sufficiently illustrate specific embodiments of the present invention to enable those skilled in the art to practice them. The embodiments represent only possible variations. Unless explicitly required, individual components and functions are optional, and the order of operations may vary. Portions and features of some embodiments may be included in, or substituted for, those of other embodiments. The scope of the embodiments of the present invention encompasses the full scope of the claims and all available equivalents of the claims. Herein, the embodiments may be referred to, individually or collectively, by the term "invention" merely for convenience; if more than one invention is in fact disclosed, this is not intended to automatically limit the scope of this application to any single invention or inventive concept. Herein, relational terms such as first and second are used only to distinguish one entity or operation from another, without requiring or implying any actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, or device that comprises a list of elements includes not only those elements but also other elements not expressly listed. The embodiments herein are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be cross-referenced. Because the structures, products, and the like disclosed in the embodiments correspond to the parts disclosed in the method embodiments, they are described relatively briefly; for relevant details, refer to the description of the method.
In a Thai document there are no punctuation marks or spaces between words, and each sentence is written continuously from beginning to end, so it is difficult to identify Thai words in the document. In the embodiments of the present invention, Thai words can be identified in a Thai document through information entropy processing, which improves the efficiency of Thai word recognition and can also increase the browsing and reading speed of Thai documents.
Fig. 1 is a flow chart of a Thai word recognition method according to an exemplary embodiment. As shown in Fig. 1, the process of Thai word recognition includes:
Step 101: According to a set step length, filter and segment the Thai document to be recognized, obtaining a slice set containing at least one sliced Thai character string.
When a user obtains information from a Thai document, the words in that document need to be recognized; the Thai document from which information is extracted is the Thai document to be recognized. Most characters in the Thai document to be recognized are Thai characters, but there may also be digits, website addresses, e-mail addresses, English characters, and the like, which need to be filtered out. The Thai document to be recognized therefore undergoes filtering to form a first Thai document containing only Thai characters.
Because a Thai document has no punctuation marks or spaces between words and each sentence is written continuously from beginning to end, the Thai document can be divided into several short passages and, on further division, into several short sentences, each consisting of consecutive Thai characters. At least one Thai short sentence in the first Thai document can therefore be segmented according to the set step length, forming a slice set containing at least one sliced Thai character string.
For example, after the Thai document to be recognized is filtered, a first Thai document D1 is formed, containing Thai short sentences Si, i = 1, 2, ..., n. Each Thai short sentence Si can be segmented according to the set step length into one, two, or more sliced Thai character strings. If Si is the Thai string […], then segmenting with step length step = 1 gives the corresponding slice set […]; segmenting with step = 2 gives the corresponding slice set […]; and segmenting with step = 3 gives the corresponding slice set […]. Each Thai short sentence Si can be sliced in turn in this way, forming the corresponding slice set M, which contains one, two, or more sliced Thai character strings.
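The segmentation of a short sentence can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: it assumes the step length is the length of each slice and that the window advances one character at a time (the text does not say whether slices overlap). The function name `slice_sentence` is hypothetical.

```python
def slice_sentence(sentence: str, step: int) -> list[str]:
    """Return every substring of length `step`, sliding one character at a time.

    Assumption: `step` is read as the slice length. Slicing works on Unicode
    code points, so Thai combining vowels and tone marks may be separated
    from their base consonant.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    return [sentence[i:i + step] for i in range(len(sentence) - step + 1)]


# Example: slices of a four-character Thai string for step = 1, 2, 3.
short_sentence = "กขคง"
slice_set_m = [s for step in (1, 2, 3) for s in slice_sentence(short_sentence, step)]
```

Collecting the slices for several step lengths into one list mirrors the slice set M described above; a non-overlapping variant would advance the window by `step` instead of 1.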
Step 102: Screen the slice set according to the information entropy processing parameter values of each sliced Thai character string, forming the word output slice set.
In the embodiments of the present invention, information entropy processing is performed on each sliced Thai character string in the slice set, and the slice set is then screened according to the corresponding information entropy processing parameter values to form the word output slice set. The information entropy processing parameter values include at least one of the occurrence frequency, the cohesion degree value, and the information entropy degree-of-freedom value. The occurrence frequency indicates how often a sliced Thai character string occurs: the higher the occurrence frequency, the more often that string appears. A sliced Thai character string may contain a single word, or a phrase made up of two or more words; the cohesion degree value therefore indicates the probability that a sliced Thai character string is a single word, and the larger the cohesion degree value, the higher that probability. Information entropy describes the uncertainty of an information source. In general, which symbol a source emits is uncertain, and this uncertainty can be measured by the symbol's probability of occurrence: a high probability means many chances of occurring and low uncertainty, and vice versa. Suppose the source symbol takes n values U1, ..., Ui, ..., Un with corresponding probabilities P1, ..., Pi, ..., Pn, and the symbols occur independently of one another. The average uncertainty of the source is then the statistical average (E) of the single-symbol uncertainty -log Pi, which is called the information entropy. Here, when a sliced Thai character string has corresponding left-neighbor and right-neighbor information, the information entropy degree-of-freedom value can be used to indicate the source uncertainty associated with that string.
In the embodiments of the present invention, the slice set can be screened using one, two, or more information entropy processing parameter values to form the word output slice set. For example, the word output slice set can be formed from the sliced Thai character strings whose occurrence frequency exceeds the set frequency, or from those whose information entropy degree-of-freedom value is greater than the second set value, and so on. To further improve the precision of Thai word recognition, the slice set can be screened according to the occurrence frequency, the cohesion degree value, and the information entropy degree-of-freedom value together to form the word output slice set.
Specifically, this may include the following. The slice set M contains one, two, or more sliced Thai character strings. The occurrence frequency of each sliced Thai character string is computed, and the first to-be-output slice set is then formed from the sliced Thai character strings whose occurrence frequency exceeds the set frequency.
The occurrence frequency of each sliced Thai character string can be determined according to formula (1):
$$P_i = \frac{W_i}{\sum_{M} W_i} \qquad \text{formula (1)}$$
where Wi is the occurrence count of each sliced Thai character string, Pi is the occurrence frequency of each sliced Thai character string, and M is the slice set.
Wi is the occurrence count of each sliced Thai character string, that is, the number of times the string appears during segmentation. With the set frequency denoted A, the occurrence frequency Pi of each sliced Thai character string is compared with A; if the occurrence frequency Pi of the current sliced Thai character string is greater than A, the current sliced Thai character string is put into the first to-be-output slice set. After this first screening by occurrence frequency, the first to-be-output slice set is formed.
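The frequency screening can be sketched in a few lines. This is a minimal illustration of formula (1) and the comparison with the set frequency A; the function names `occurrence_frequencies` and `first_pending_set` are hypothetical, and Latin placeholder strings stand in for sliced Thai character strings.

```python
from collections import Counter


def occurrence_frequencies(slices: list[str]) -> dict[str, float]:
    """Formula (1): P_i = W_i / sum of W over the slice set M."""
    counts = Counter(slices)          # W_i for each distinct slice
    total = sum(counts.values())      # sum of W_i over M
    return {s: w / total for s, w in counts.items()}


def first_pending_set(freqs: dict[str, float], set_frequency: float) -> dict[str, float]:
    """Keep only slices whose occurrence frequency exceeds the set frequency A."""
    return {s: p for s, p in freqs.items() if p > set_frequency}


freqs = occurrence_frequencies(["ab", "bc", "ab", "cd"])
kept = first_pending_set(freqs, set_frequency=0.3)
```

Here `ab` occurs twice out of four slices, so it is the only entry that passes a set frequency of 0.3.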
A sliced Thai character string with a high occurrence frequency may be a single word, or a phrase made up of two or more words. The first to-be-output slice set therefore needs further screening. In the embodiments of the present invention, the cohesion degree value of each sliced Thai character string in the first to-be-output slice set can be determined, and the second to-be-output slice set is formed from the sliced Thai character strings whose cohesion degree value is greater than the first set value.
The cohesion degree value of the current sliced Thai character string in the first to-be-output slice set can be determined according to formula (2);
where Pi is the occurrence frequency of the current sliced Thai character string, Pij is the occurrence frequency of the corresponding sub-slice Thai character strings within the current sliced Thai character string, and co is the cohesion degree value.
In this embodiment, the occurrence frequency indicates how often a sliced Thai character string occurs; that is, Pi can be expressed as the probability of the current sliced Thai character string. For example, suppose the current sliced Thai character string […] has the sub-slice Thai character strings […] and […], where the probability of the current sliced Thai character string is P = 0.0005, the probability of the first sub-slice Thai character string is P11 = 0.0002, the probability of the second sub-slice Thai character string is P12 = 0.0003, and so on. The cohesion degree value co of the current sliced Thai character string can then be determined according to formula (2).
The cohesion degree value of each sliced Thai character string is then compared with the first set value; if the cohesion degree value of the current sliced Thai character string is greater than the first set value, the current sliced Thai character string is put into the second to-be-output slice set. In other words, the second to-be-output slice set is formed from the sliced Thai character strings whose cohesion degree value is greater than the first set value.
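Formula (2) itself is not reproduced in this text, so the following is only a sketch under a loudly stated assumption: it uses a common form of cohesion from new-word discovery, the frequency of the string divided by the product of the frequencies of its two parts, minimized over all two-way splits. The function name `cohesion` and the Latin placeholder strings are hypothetical, and the patent's actual formula may differ.

```python
from typing import Optional


def cohesion(s: str, freq: dict[str, float]) -> Optional[float]:
    """ASSUMED cohesion: min over splits of P(s) / (P(left) * P(right)).

    This is a stand-in for formula (2), which is not available in the text.
    Returns None when no split of `s` has both parts in `freq`.
    """
    p = freq[s]
    ratios = []
    for k in range(1, len(s)):
        left, right = s[:k], s[k:]
        if left in freq and right in freq:
            ratios.append(p / (freq[left] * freq[right]))
    return min(ratios) if ratios else None


# Probabilities taken from the example in the text: P, P11, P12.
freq = {"ab": 0.0005, "a": 0.0002, "b": 0.0003}
co = cohesion("ab", freq)
```

Under this assumed form, a string that occurs far more often than its parts would predict by chance gets a large value, which matches the text's statement that a larger cohesion degree value means the string is more likely a single word.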
The second to-be-output slice set also needs further screening according to the information entropy degree-of-freedom value of each sliced Thai character string in it. In the embodiments of the present invention, the information entropy degree-of-freedom value of each sliced Thai character string in the second to-be-output slice set is determined, and the word output slice set is formed from the sliced Thai character strings whose information entropy degree-of-freedom value is greater than the second set value.
The left-neighbor information entropy and the right-neighbor information entropy of the current sliced Thai character string can be determined according to formula (3); then, according to formula (4), the smaller of the two is determined as the information entropy degree-of-freedom value of the current sliced Thai character string.
$$H(U) = E[-\log P_i] = -\sum_{i=1}^{n} P_i \log P_i \qquad \text{formula (3)}$$
where Pi is the occurrence frequency of each sliced Thai character string, and H(U) is the information entropy;
$$\mathrm{free} = \min\{H(U)_1, H(U)_2, \ldots, H(U)_n\} \qquad \text{formula (4)}$$
where H(U) is the information entropy, and free is the information entropy degree-of-freedom value.
Formula (3) yields multiple left-neighbor and right-neighbor information entropies H(U), namely H(U)1, H(U)2, H(U)3, ..., H(U)n; formula (4) then takes the minimum of these entropies, giving the information entropy degree-of-freedom value. After the information entropy degree-of-freedom value of each sliced Thai character string in the second to-be-output slice set has been determined, it is compared with the second set value; if the information entropy degree-of-freedom value of the current sliced Thai character string is greater than the second set value, the current sliced Thai character string is added to the word output slice set. In other words, the word output slice set is formed from the sliced Thai character strings whose information entropy degree-of-freedom value is greater than the second set value.
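The degree-of-freedom computation can be sketched as follows. This is a minimal illustration, assuming natural logarithms (the text does not name a base) and single-character neighbors; the function names `entropy` and `freedom` are hypothetical, and Latin placeholder characters stand in for Thai text.

```python
import math
from collections import Counter


def entropy(neighbors: list[str]) -> float:
    """Shannon entropy of the neighbor distribution, H = -sum(P_i * log P_i)."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def freedom(text: str, s: str) -> float:
    """Smaller of the left-neighbor and right-neighbor entropies of `s` in `text`."""
    left, right = [], []
    start = text.find(s)
    while start != -1:
        if start > 0:
            left.append(text[start - 1])   # character just before this occurrence
        end = start + len(s)
        if end < len(text):
            right.append(text[end])        # character just after this occurrence
        start = text.find(s, start + 1)
    return min(entropy(left), entropy(right))


# "AB" occurs four times: left neighbors x, y, x, y; right neighbors p, q, p, r.
sample = "xABpyABqxABpyABr"
free_value = freedom(sample, "AB")
```

Taking the minimum means a string counts as "free" only if it is followed and preceded by varied context; a string that always appears after the same character scores low even if its right side varies.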
The above screens the slice set in the order of occurrence frequency, cohesion degree value, and information entropy degree-of-freedom value of each sliced Thai character string to form the word output slice set. The embodiments of the present invention are of course not limited to this order: the slice set can also be screened in the order of cohesion degree value, occurrence frequency, and information entropy degree-of-freedom value, or in the order of occurrence frequency, information entropy degree-of-freedom value, and cohesion degree value, and so on. These variants are not enumerated one by one here.
Step 103: From the word output slice set, determine a set number of sliced Thai character strings as the recognized Thai words.
Here, the set number of sliced Thai character strings can be selected at random from the word output slice set and determined as the recognized Thai words. Alternatively, the set number of sliced Thai character strings can be selected from the word output slice set according to the information entropy processing parameter values, including the occurrence frequency, the cohesion degree value, or the information entropy degree-of-freedom value, and determined as the recognized Thai words.
For example, the sliced Thai character strings in the word output slice set can be sorted by occurrence frequency, from highest to lowest, and the set number of sliced Thai character strings at the front of the ranking are determined as the recognized Thai words.
It can be seen that the sliced Thai character strings in a Thai document can be screened by information entropy processing parameter values, so that Thai words are finally identified in the Thai document. This improves the efficiency of Thai word recognition and can also increase the browsing and reading speed of Thai documents.
The method provided by the embodiments of the present disclosure is illustrated below by gathering the operating flow into a specific embodiment.
In this embodiment, the information entropy processing parameter values include the occurrence frequency, the cohesion degree value, and the information entropy degree-of-freedom value. The set frequency, the first set value, and the second set value can therefore be configured in advance.
Fig. 2 is a flow chart of a Thai word recognition method according to an exemplary embodiment. As shown in Fig. 2, the Thai word recognition process includes:
Step 201: Filter the Thai document to be recognized to form a first Thai document containing only Thai characters.
Full-width and half-width characters, English letters, mathematical symbols, and other non-Thai characters in the Thai text are filtered out, and only characters in the Thai range [0x0E00, 0x0E7F] are retained. This yields a complete, pure Thai document, that is, the first Thai document containing only Thai characters.
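The range filter in step 201 can be sketched as follows. This is a minimal illustration: the range [0x0E00, 0x0E7F] comes from the text, while the function name `keep_thai` and the choice to treat each maximal run of Thai characters as one short sentence are assumptions made for the example.

```python
THAI_START, THAI_END = 0x0E00, 0x0E7F  # Thai character range cited in the text


def keep_thai(text: str) -> list[str]:
    """Drop every non-Thai character and return the remaining runs of Thai text.

    Assumption: each maximal run of consecutive Thai characters is treated
    as one Thai short sentence for later segmentation.
    """
    runs, current = [], []
    for ch in text:
        if THAI_START <= ord(ch) <= THAI_END:
            current.append(ch)
        elif current:
            runs.append("".join(current))  # a non-Thai character ends the run
            current = []
    if current:
        runs.append("".join(current))
    return runs


runs = keep_thai("abc ภาษาไทย 123 ไทย")
```

Digits, Latin letters, spaces, and full-width characters all fall outside the range, so they act as separators and never reach the slice set.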
Step 202: According to the set step length, segment at least one Thai short sentence in the first Thai document to form a slice set containing at least one sliced Thai character string.
For example, a Thai short sentence […] of length N = 8 in the first Thai document is segmented with step = 2, forming the slice set […].
Step 203: According to formula (1), determine the occurrence frequency of each sliced Thai character string in the slice set.
Step 204: Form the first to-be-output slice set from the sliced Thai character strings whose occurrence frequency exceeds the set frequency.
Step 205: According to formula (2), determine the cohesion degree value of each sliced Thai character string in the first to-be-output slice set.
Step 206: Form the second to-be-output slice set from the sliced Thai character strings whose cohesion degree value is greater than the first set value.
Step 207: According to formula (3) and formula (4), determine the information entropy degree-of-freedom value of each sliced Thai character string in the second to-be-output slice set.
For example, in the Thai short sentence […], the sliced Thai character string […] occurs four times; its left neighbors are […] and its right neighbors are […]. According to formula (3), the information entropy of the left neighbors of this sliced Thai character string is -(1/2) log(1/2) - (1/2) log(1/2) ≈ 0.69, and the information entropy of its right neighbors is -(1/2) log(1/2) - (1/4) log(1/4) - (1/4) log(1/4) ≈ 1.04, taking natural logarithms. The corresponding information entropy degree-of-freedom value is therefore 0.69.
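The two entropy expressions in this example can be evaluated directly. The sketch below assumes natural logarithms, since the text does not name a base; the variable names are illustrative only.

```python
import math

# Left neighbors: two distinct characters, each with probability 1/2.
left_entropy = -(1 / 2) * math.log(1 / 2) - (1 / 2) * math.log(1 / 2)

# Right neighbors: probabilities 1/2, 1/4, 1/4.
right_entropy = (
    -(1 / 2) * math.log(1 / 2)
    - (1 / 4) * math.log(1 / 4)
    - (1 / 4) * math.log(1 / 4)
)

# Formula (4): the degree of freedom is the smaller of the two entropies.
free_value = min(left_entropy, right_entropy)
```

With natural logarithms the left entropy equals ln 2 (about 0.693) and the right entropy equals 1.5 ln 2 (about 1.040). A different base rescales both values by the same factor, so the left side remains the minimum.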
Step 208: Form the word output slice set from the sliced Thai character strings whose information entropy degree-of-freedom value is greater than the second set value.
Step 209: Sort the sliced Thai character strings in the word output slice set by occurrence frequency, from highest to lowest.
For example, the word output slice set contains 50 sliced Thai character strings whose occurrence counts, in descending order, are 25, 23, 19, 15, 10, 7, 5, 4, 4, 4, 3, 3, 2, 2, .... The corresponding sliced Thai character strings can be sorted accordingly.
Step 210: Determine the set number of sliced Thai character strings at the front of the ranking as the recognized Thai words.
If the set number is 5, the sliced Thai character strings with occurrence counts 25, 23, 19, 15, and 10 are determined as the recognized Thai words.
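Steps 209 and 210 amount to a sort followed by a cut. A minimal sketch, using the occurrence counts from the example; the function name `top_words` and the placeholder keys `w1` to `w7` (standing in for sliced Thai character strings) are hypothetical.

```python
def top_words(count_by_slice: dict[str, int], set_number: int) -> list[str]:
    """Sort slices by occurrence count, highest first, and keep the first `set_number`."""
    ranked = sorted(count_by_slice.items(), key=lambda item: item[1], reverse=True)
    return [s for s, _ in ranked[:set_number]]


# First seven counts from the example; the keys are placeholders.
counts = {"w3": 19, "w1": 25, "w6": 7, "w2": 23, "w5": 10, "w4": 15, "w7": 5}
recognized = top_words(counts, set_number=5)
```

Python's `sorted` is stable, so slices with equal counts keep their original relative order; the text does not say how ties should be broken.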
It can be seen that in this embodiment the sliced Thai character strings in a Thai document can be screened by occurrence frequency, cohesion degree value, and information entropy degree-of-freedom value, so that Thai words are finally identified in the Thai document more accurately. This improves the efficiency and accuracy of Thai word recognition and can also increase the browsing and reading speed of Thai documents.
The following are device embodiments of the present disclosure, which can be used to carry out the method embodiments of the present disclosure.
Based on the Thai word recognition process described above, a Thai word recognition device can be constructed.
Fig. 3 is a block diagram of a Thai word recognition device according to an exemplary embodiment. As shown in Fig. 3, the device includes a filter segmentation unit 310, an information-entropy screening unit 320 and a word determining unit 330, wherein:
The filter segmentation unit 310 is configured to perform filter segmentation processing on the Thai document to be identified according to a set step length, obtaining a slice set that includes at least one slice Thai character string.
The information-entropy screening unit 320 is configured to screen the slice set according to the information-entropy processing parameter values of each slice Thai character string, forming a word output slice set.
The word determining unit 330 is configured to determine a set number of slice Thai character strings from the word output slice set as the identified Thai words.
In one embodiment of the invention, the information-entropy screening unit 320 includes:
A frequency screening module, configured to form a first to-be-output slice set from the slice Thai character strings whose appearance frequency exceeds a set frequency.
A solidification-degree screening module, configured to determine the solidification degree value of each slice Thai character string in the first to-be-output slice set, and to form a second to-be-output slice set from the slice Thai character strings whose solidification degree value is greater than a first set value.
A freedom-degree screening module, configured to determine the information-entropy freedom value of each slice Thai character string in the second to-be-output slice set, and to form the word output slice set from the slice Thai character strings whose information-entropy freedom value is greater than a second set value.
In one embodiment of the invention, the solidification-degree screening module is specifically configured to determine, according to formula (2), the solidification degree value of the current slice Thai character string in the first to-be-output slice set.
Here, Pi is the appearance frequency of the current slice Thai character string, Pij is the appearance frequency of the corresponding sub-slice Thai character string within the current slice Thai character string, and co is the solidification degree value.
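Formula (2) itself is not reproduced in this text, so the following is only a sketch under a stated assumption: the solidification degree is taken to be the minimum, over every two-way split of a slice, of the slice's frequency divided by the product of the frequencies of its two sub-slices. The patent's actual formula may differ. The function name, the placeholder strings and the frequency values are all hypothetical.

```python
def solidification(slice_, freq):
    """Assumed definition: min over two-way splits of P(s) / (P(left) * P(right)).

    `freq` maps each slice and sub-slice to its appearance frequency.
    """
    if len(slice_) < 2:
        raise ValueError("a slice needs at least two characters to be split")
    p = freq[slice_]
    ratios = [
        p / (freq[slice_[:k]] * freq[slice_[k:]])
        for k in range(1, len(slice_))
    ]
    return min(ratios)

# Hypothetical frequencies for placeholder slices.
freq = {"a": 0.05, "b": 0.04, "c": 0.1, "ab": 0.02, "bc": 0.02, "abc": 0.01}
co_ab = solidification("ab", freq)    # single split: a | b
co_abc = solidification("abc", freq)  # splits: a | bc and ab | c
```

Under this assumption a high value means the sub-slices occur together far more often than chance would suggest, which is why slices are kept only when the value exceeds the first set value.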
In one embodiment of the invention, the freedom-degree screening module is specifically configured to determine, according to formula (3), the left-adjacent-character information entropy and the right-adjacent-character information entropy of the current slice Thai character string, and, according to formula (4), to take the smaller of the two as the information-entropy freedom value of the current slice Thai character string.
$$H(U) = -\sum_{i} P_i \log P_i \qquad \text{formula (3)}$$
where Pi is the appearance frequency of each slice Thai character string, and H(U) is the information entropy;
$$\mathrm{free} = \min\{H(U)_1, H(U)_2, \ldots, H(U)_n\} \qquad \text{formula (4)}$$
where H(U) is the information entropy, and free is the information-entropy freedom value.
In one embodiment of the invention, the word determining unit 330 is specifically configured to sort the slice Thai character strings in the word output slice set in descending order of occurrence frequency, and to determine the set number of slice Thai character strings at the front of the ordering as the identified Thai words.
The device provided by the embodiments of the present disclosure is described below by way of example.
Fig. 4 is a block diagram of a Thai word recognition device according to an exemplary embodiment. As shown in Fig. 4, the device includes a filter segmentation unit 310, an information-entropy screening unit 320 and a word determining unit 330, wherein the information-entropy screening unit 320 includes a frequency screening module 321, a solidification-degree screening module 322 and a freedom-degree screening module 323.
The filter segmentation unit 310 can filter the Thai document to be identified to form a first Thai document that contains only Thai characters, and then, according to the set step length, segment at least one Thai short sentence in the first Thai document to form a slice set including at least one slice character string.
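The filtering and slicing performed by the filter segmentation unit can be sketched as follows. Two points are assumptions rather than statements from the text: a "Thai short sentence" is treated as a maximal run of characters from the Unicode Thai block (U+0E00 to U+0E7F), and the set step length is read as the maximum slice length, so every substring of length 1 up to that length is emitted. The function names are hypothetical.

```python
import re

# Maximal runs of characters in the Unicode Thai block.
THAI_RUN = re.compile(r"[\u0E00-\u0E7F]+")

def filter_thai(document):
    """Drop non-Thai characters; each remaining run is one short sentence."""
    return THAI_RUN.findall(document)

def slice_sentences(sentences, step):
    """Return every substring of length 1..step from each short sentence."""
    slices = []
    for sentence in sentences:
        for start in range(len(sentence)):
            for length in range(1, step + 1):
                if start + length <= len(sentence):
                    slices.append(sentence[start:start + length])
    return slices

sentences = filter_thai("abc กขค 123 งจ")
slices = slice_sentences(sentences, 2)
```

The Thai block also contains Thai digits and punctuation marks, so a stricter filter may be wanted in practice; the resulting `slices` list keeps duplicates so that appearance frequencies can be counted afterwards.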
In this way, the frequency screening module 321 in the information-entropy screening unit 320 can determine, according to formula (1), the appearance frequency of each slice Thai character string in the slice set, and form the first to-be-output slice set from the slice Thai character strings whose appearance frequency exceeds the set frequency. The solidification-degree screening module 322 in the information-entropy screening unit 320 can determine, according to formula (2), the solidification degree value of each slice Thai character string in the first to-be-output slice set, and form the second to-be-output slice set from the slice Thai character strings whose solidification degree value is greater than the first set value. The freedom-degree screening module 323 in the information-entropy screening unit 320 can determine, according to formula (3) and formula (4), the information-entropy freedom value of each slice Thai character string in the second to-be-output slice set, and form the word output slice set from the slice Thai character strings whose information-entropy freedom value is greater than the second set value.
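The three modules act as successive threshold filters, each working only on the survivors of the previous one. The sketch below shows that staging. It is illustrative only: formula (1) is not reproduced in this text and is assumed here to be a relative frequency, and the function names, placeholder slices, scores and thresholds are all hypothetical.

```python
from collections import Counter

def appearance_frequency(slices):
    """Relative frequency of each distinct slice (assumed reading of formula (1))."""
    counts = Counter(slices)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def staged_filter(candidates, stages):
    """Apply (score_fn, threshold) stages in order, keeping scores above the threshold."""
    kept = list(candidates)
    for score_fn, threshold in stages:
        kept = [s for s in kept if score_fn(s) > threshold]
    return kept

# Placeholder slices and hypothetical scores for the later stages.
freq = appearance_frequency(["ab", "ab", "ab", "cd", "cd", "e"])
solidification_scores = {"ab": 12.0, "cd": 3.0}
freedom_scores = {"ab": 0.9}

word_output = staged_filter(
    list(freq),
    [
        (freq.get, 0.2),                   # set frequency
        (solidification_scores.get, 5.0),  # first set value
        (freedom_scores.get, 0.5),         # second set value
    ],
)
```

Ordering the stages this way means the more expensive scores only need to be computed for slices that already passed the cheaper frequency check.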
The word determining unit 330 can then sort the slice Thai character strings in the word output slice set in descending order of occurrence frequency, and determine the set number of slice Thai character strings at the front of the ordering as the identified Thai words.
It can be seen that, in this embodiment, the slice Thai character strings in a Thai document are screened by appearance frequency, solidification degree value and information-entropy freedom value, so that Thai words are identified from the Thai document more accurately. This improves the efficiency and accuracy of Thai word recognition, and can also increase the speed of browsing and reading Thai documents.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
It should be understood that the invention is not limited to the flows and structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.
Claims (10)
- 1. A method of Thai word recognition, characterized by comprising: performing filter segmentation processing on a Thai document to be identified according to a set step length, to obtain a slice set including at least one slice Thai character string; screening the slice set according to the information-entropy processing parameter values of each slice Thai character string, to form a word output slice set; and determining a set number of slice Thai character strings from the word output slice set as the identified Thai words.
- 2. The method of claim 1, characterized in that, when the information-entropy processing parameter values include an appearance frequency, a solidification degree value and an information-entropy freedom value, screening the slice set according to the information-entropy processing parameter values of each slice Thai character string to form the word output slice set comprises: forming a first to-be-output slice set from the slice Thai character strings whose appearance frequency exceeds a set frequency; determining the solidification degree value of each slice Thai character string in the first to-be-output slice set, and forming a second to-be-output slice set from the slice Thai character strings whose solidification degree value is greater than a first set value; and determining the information-entropy freedom value of each slice Thai character string in the second to-be-output slice set, and forming the word output slice set from the slice Thai character strings whose information-entropy freedom value is greater than a second set value.
- 3. The method of claim 2, characterized in that determining the solidification degree value of each slice Thai character string in the first to-be-output slice set comprises: determining, according to formula (2), the solidification degree value of the current slice Thai character string in the first to-be-output slice set; where Pi is the appearance frequency of the current slice Thai character string, Pij is the appearance frequency of the corresponding sub-slice Thai character string within the current slice Thai character string, and co is the solidification degree value.
- 4. The method of claim 2, characterized in that determining the information-entropy freedom value of each slice Thai character string in the second to-be-output slice set comprises: determining, according to formula (3), the left-adjacent-character information entropy and the right-adjacent-character information entropy of the current slice Thai character string; and, according to formula (4), taking the smaller of the left-adjacent-character information entropy and the right-adjacent-character information entropy as the information-entropy freedom value of the current slice Thai character string; $H(U) = -\sum_{i} P_i \log P_i$ (formula (3)), where Pi is the appearance frequency of each slice Thai character string and H(U) is the information entropy; $\mathrm{free} = \min\{H(U)_1, H(U)_2, \ldots, H(U)_n\}$ (formula (4)), where H(U) is the information entropy and free is the information-entropy freedom value.
- 5. The method of claim 1, characterized in that determining a set number of slice Thai character strings from the word output slice set as the identified Thai words comprises: sorting the slice Thai character strings in the word output slice set in descending order of occurrence frequency; and determining the set number of slice Thai character strings at the front of the ordering as the identified Thai words.
- 6. A device for Thai word recognition, characterized by comprising: a filter segmentation unit, configured to perform filter segmentation processing on a Thai document to be identified according to a set step length, to obtain a slice set including at least one slice Thai character string; an information-entropy screening unit, configured to screen the slice set according to the information-entropy processing parameter values of each slice Thai character string, to form a word output slice set; and a word determining unit, configured to determine a set number of slice Thai character strings from the word output slice set as the identified Thai words.
- 7. The device of claim 6, characterized in that the information-entropy screening unit comprises: a frequency screening module, configured to form a first to-be-output slice set from the slice Thai character strings whose appearance frequency exceeds a set frequency; a solidification-degree screening module, configured to determine the solidification degree value of each slice Thai character string in the first to-be-output slice set, and to form a second to-be-output slice set from the slice Thai character strings whose solidification degree value is greater than a first set value; and a freedom-degree screening module, configured to determine the information-entropy freedom value of each slice Thai character string in the second to-be-output slice set, and to form the word output slice set from the slice Thai character strings whose information-entropy freedom value is greater than a second set value.
- 8. The device of claim 7, characterized in that the solidification-degree screening module is specifically configured to determine, according to formula (2), the solidification degree value of the current slice Thai character string in the first to-be-output slice set; where Pi is the appearance frequency of the current slice Thai character string, Pij is the appearance frequency of the corresponding sub-slice Thai character string within the current slice Thai character string, and co is the solidification degree value.
- 9. The device of claim 7, characterized in that the freedom-degree screening module is specifically configured to determine, according to formula (3), the left-adjacent-character information entropy and the right-adjacent-character information entropy of the current slice Thai character string; and, according to formula (4), to take the smaller of the left-adjacent-character information entropy and the right-adjacent-character information entropy as the information-entropy freedom value of the current slice Thai character string; $H(U) = -\sum_{i} P_i \log P_i$ (formula (3)), where Pi is the appearance frequency of each slice Thai character string and H(U) is the information entropy; $\mathrm{free} = \min\{H(U)_1, H(U)_2, \ldots, H(U)_n\}$ (formula (4)), where H(U) is the information entropy and free is the information-entropy freedom value.
- 10. The device of claim 6, characterized in that the word determining unit is specifically configured to sort the slice Thai character strings in the word output slice set in descending order of occurrence frequency, and to determine the set number of slice Thai character strings at the front of the ordering as the identified Thai words.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710982841.0A CN107807918A (en) | 2017-10-20 | 2017-10-20 | The method and device of Thai words recognition |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107807918A true CN107807918A (en) | 2018-03-16 |
Family
ID=61592904
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710982841.0A Pending CN107807918A (en) | 2017-10-20 | 2017-10-20 | The method and device of Thai words recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107807918A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111209946A (en) * | 2019-12-31 | 2020-05-29 | 上海联影智能医疗科技有限公司 | Three-dimensional image processing method, image processing model training method, and medium |
WO2021051600A1 (en) * | 2019-09-19 | 2021-03-25 | 平安科技(深圳)有限公司 | Method, apparatus and device for identifying new word based on information entropy, and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110137642A1 (en) * | 2007-08-23 | 2011-06-09 | Google Inc. | Word Detection |
CN105320960A (en) * | 2015-10-14 | 2016-02-10 | 北京航空航天大学 | Voting based classification method for cross-language subjective and objective sentiments |
CN106815190A (en) * | 2015-11-27 | 2017-06-09 | 阿里巴巴集团控股有限公司 | A kind of words recognition method, device and server |
CN107180025A (en) * | 2017-03-31 | 2017-09-19 | 北京奇艺世纪科技有限公司 | A kind of recognition methods of neologisms and device |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021051600A1 (en) * | 2019-09-19 | 2021-03-25 | 平安科技(深圳)有限公司 | Method, apparatus and device for identifying new word based on information entropy, and storage medium |
CN111209946A (en) * | 2019-12-31 | 2020-05-29 | 上海联影智能医疗科技有限公司 | Three-dimensional image processing method, image processing model training method, and medium |
CN111209946B (en) * | 2019-12-31 | 2024-04-30 | 上海联影智能医疗科技有限公司 | Three-dimensional image processing method, image processing model training method and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180316 |