CN107807918A - Method and device for Thai word recognition - Google Patents

Method and device for Thai word recognition

Info

Publication number
CN107807918A
Authority
CN
China
Prior art keywords
character string
Thai
slices
language character
information entropy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710982841.0A
Other languages
Chinese (zh)
Inventor
张凯
闫昊
车双武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TRANSN (BEIJING) INFORMATION TECHNOLOGY Co Ltd
Original Assignee
TRANSN (BEIJING) INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TRANSN (BEIJING) INFORMATION TECHNOLOGY Co Ltd
Priority to CN201710982841.0A
Publication of CN107807918A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and device for Thai word recognition, belonging to the technical field of information retrieval. The method includes: filtering and segmenting a Thai document to be recognized according to a set step size, to obtain a slice set containing at least one Thai slice string; screening the slice set according to the information-entropy processing parameter values of each Thai slice string, to form a vocabulary output slice set; and determining a set number of Thai slice strings from the vocabulary output slice set as the recognized Thai words. In this way, Thai words can be recognized from a Thai document through information-entropy processing, which improves the efficiency of Thai word recognition and can also increase the speed of browsing and reading Thai documents.

Description

Method and device for Thai word recognition
Technical field
The present invention relates to the technical field of information retrieval, and in particular to a method and device for Thai word recognition.
Background
Thai, also referred to as the Dai language, is the language of the Dai-Thai peoples and is described here as a language of the East Asian/Sino-Tibetan family. About 68 million people worldwide use Thai. In a Thai document there are no punctuation marks or spaces between words, and a sentence is written continuously from beginning to end; typically, a gap of two letters or a pause within the text marks the end of a sentence. As a result, Thai learners, translators and other Thai users find it difficult to identify Thai words in a Thai document using existing word-recognition cues such as word frequency, word length, spaces or punctuation marks.
Summary of the invention
The embodiments of the present invention provide a method and device for Thai word recognition. A brief summary is given below to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview, nor is it intended to identify key or critical elements or to delineate the scope of protection of these embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the detailed description that follows.
According to a first aspect of the embodiments of the present invention, a method for Thai word recognition is provided, including:
Filtering and segmenting a Thai document to be recognized according to a set step size, to obtain a slice set containing at least one Thai slice string;
Screening the slice set according to the information-entropy processing parameter values of each Thai slice string, to form a vocabulary output slice set;
Determining a set number of Thai slice strings from the vocabulary output slice set as the recognized Thai words.
In one embodiment of the invention, when the information-entropy processing parameter values include the occurrence frequency, the cohesion value and the information-entropy freedom value, screening the slice set according to the information-entropy processing parameter values of each Thai slice string to form the vocabulary output slice set includes:
Forming a first to-be-output slice set from the Thai slice strings whose occurrence frequency exceeds a set frequency;
Determining the cohesion value of each Thai slice string in the first to-be-output slice set, and forming a second to-be-output slice set from the Thai slice strings whose cohesion value is greater than a first set value;
Determining the information-entropy freedom value of each Thai slice string in the second to-be-output slice set, and forming the vocabulary output slice set from the Thai slice strings whose information-entropy freedom value is greater than a second set value.
In one embodiment of the invention, determining the cohesion value of each Thai slice string in the first to-be-output slice set includes:
Determining, according to formula (2), the cohesion value of the current Thai slice string in the first to-be-output slice set;
Where Pi is the occurrence frequency of the current Thai slice string, Pij is the occurrence frequency of the corresponding sub-slice Thai strings within the current Thai slice string, and co is the cohesion value.
In one embodiment of the invention, determining the information-entropy freedom value of each Thai slice string in the second to-be-output slice set includes:
Determining, according to formula (3), the left-neighbor information entropy and the right-neighbor information entropy of the current Thai slice string;
Determining, according to formula (4), the smaller of the left-neighbor and right-neighbor information entropies as the information-entropy freedom value of the current Thai slice string;
$$H(U) = -\sum_{i=1}^{n} P_i \log P_i \tag{3}$$
Where Pi is the occurrence frequency of each Thai slice string, and H(U) is the information entropy;
$$\mathrm{free} = \min\{H(U)_1, H(U)_2, \ldots, H(U)_n\} \tag{4}$$
Where H(U) is the information entropy, and free is the information-entropy freedom value.
In one embodiment of the invention, determining a set number of Thai slice strings from the vocabulary output slice set as the recognized Thai words includes:
Sorting the Thai slice strings in the vocabulary output slice set by occurrence frequency, from high to low;
Determining the set number of top-ranked Thai slice strings as the recognized Thai words.
According to a second aspect of the embodiments of the present invention, a device for Thai word recognition is provided, including:
A filtering and segmentation unit, configured to filter and segment a Thai document to be recognized according to a set step size, to obtain a slice set containing at least one Thai slice string;
An information-entropy screening unit, configured to screen the slice set according to the information-entropy processing parameter values of each Thai slice string, to form a vocabulary output slice set;
A word determining unit, configured to determine a set number of Thai slice strings from the vocabulary output slice set as Thai words.
In one embodiment of the invention, the information-entropy screening unit includes:
A frequency screening module, configured to form a first to-be-output slice set from the Thai slice strings whose occurrence frequency exceeds a set frequency;
A cohesion screening module, configured to determine the cohesion value of each Thai slice string in the first to-be-output slice set, and to form a second to-be-output slice set from the Thai slice strings whose cohesion value is greater than a first set value;
A freedom screening module, configured to determine the information-entropy freedom value of each Thai slice string in the second to-be-output slice set, and to form the vocabulary output slice set from the Thai slice strings whose information-entropy freedom value is greater than a second set value.
In one embodiment of the invention, the cohesion screening module is specifically configured to determine, according to formula (2), the cohesion value of the current Thai slice string in the first to-be-output slice set;
Where Pi is the occurrence frequency of the current Thai slice string, Pij is the occurrence frequency of the corresponding sub-slice Thai strings within the current Thai slice string, and co is the cohesion value.
In one embodiment of the invention, the freedom screening module is specifically configured to determine, according to formula (3), the left-neighbor information entropy and the right-neighbor information entropy of the current Thai slice string, and to determine, according to formula (4), the smaller of the left-neighbor and right-neighbor information entropies as the information-entropy freedom value of the current Thai slice string;
$$H(U) = -\sum_{i=1}^{n} P_i \log P_i \tag{3}$$
Where Pi is the occurrence frequency of each Thai slice string, and H(U) is the information entropy;
$$\mathrm{free} = \min\{H(U)_1, H(U)_2, \ldots, H(U)_n\} \tag{4}$$
Where H(U) is the information entropy, and free is the information-entropy freedom value.
In one embodiment of the invention, the word determining unit is specifically configured to sort the Thai slice strings in the vocabulary output slice set by occurrence frequency, from high to low, and to determine the set number of top-ranked Thai slice strings as the recognized Thai words.
The technical solutions provided by the embodiments of the present invention can have the following beneficial effects:
In the embodiments of the present invention, Thai words can be recognized from a Thai document through information-entropy processing, which improves the efficiency of Thai word recognition and can also increase the speed of browsing and reading Thai documents.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present invention.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present invention and, together with the specification, serve to explain the principles of the invention.
Fig. 1 is a flow chart of a Thai word recognition method according to an exemplary embodiment;
Fig. 2 is a flow chart of a Thai word recognition method according to an exemplary embodiment;
Fig. 3 is a block diagram of a Thai word recognition device according to an exemplary embodiment;
Fig. 4 is a block diagram of a Thai word recognition device according to an exemplary embodiment.
Detailed description
The following description and drawings sufficiently illustrate specific embodiments of the invention to enable those skilled in the art to practice them. The embodiments represent only possible variations. Unless explicitly required, individual components and functions are optional, and the order of operations may vary. Portions and features of some embodiments may be included in, or substituted for, those of other embodiments. The scope of the embodiments of the invention encompasses the full scope of the claims and all available equivalents of the claims. Herein, the embodiments may be referred to, individually or collectively, by the term "invention" merely for convenience, and without intending to limit the scope of this application to any single invention or inventive concept if more than one is in fact disclosed. Herein, relational terms such as first and second are used only to distinguish one entity or operation from another, without requiring or implying any actual relationship or order between these entities or operations. Moreover, the terms "comprising", "including" or any other variant are intended to cover a non-exclusive inclusion, so that a process, method or device comprising a series of elements includes not only those elements but also other elements not expressly listed. The embodiments herein are described in a progressive manner, each focusing on its differences from the others; for the same or similar parts, the embodiments may be referred to one another. Since the structures, products and the like disclosed in the embodiments correspond to the parts disclosed in the method embodiments, their description is relatively brief; for relevant details, refer to the description of the method.
In a Thai document there are no punctuation marks or spaces between words, and a sentence is written continuously from beginning to end, so identifying Thai words in a Thai document is difficult. In the embodiments of the present invention, Thai words can be recognized from a Thai document through information-entropy processing, which improves the efficiency of Thai word recognition and can also increase the speed of browsing and reading Thai documents.
Fig. 1 is a flow chart of a Thai word recognition method according to an exemplary embodiment. As shown in Fig. 1, the process of Thai word recognition includes:
Step 101: According to a set step size, filter and segment the Thai document to be recognized, to obtain a slice set containing at least one Thai slice string.
To obtain the information in a Thai document, a user needs to recognize the words in that document; the Thai document from which information is to be extracted is the Thai document to be recognized. Most characters in the Thai document to be recognized are Thai characters, but there may also be digits, website addresses, e-mail addresses, English characters and the like. This information needs to be filtered out, so the Thai document to be recognized is filtered to form a first Thai document containing only Thai characters.
In a Thai document there are no punctuation marks or spaces between words, and a sentence is written continuously from beginning to end. A Thai document can therefore be divided into segments, and further division yields short sentences, each consisting of consecutive Thai characters. At least one Thai short sentence in the first Thai document can thus be segmented according to the set step size, forming a slice set containing at least one Thai slice string.
For example, after the Thai document to be recognized is filtered, a first Thai document D1 is formed, containing Thai short sentences Si, i = 1, 2, ..., n. Each Thai short sentence Si can be segmented according to the set step size, forming one, two or more Thai slice strings. A given Si yields one corresponding slice set when cut with step = 1, another when cut with step = 2, and another when cut with step = 3. Each Thai short sentence Si can be sliced in turn in this way, forming the corresponding slice set M, which contains one, two or more Thai slice strings.
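The segmentation in Step 101 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that cutting with a given step means taking every contiguous substring of that length (a sliding window), and the function name `slice_sentence` is hypothetical.

```python
def slice_sentence(sentence, step):
    """Return every contiguous substring of length `step`.

    Assumption: the "set step size" is read as the slice length,
    moved one character at a time over the short sentence.
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    return [sentence[i:i + step] for i in range(len(sentence) - step + 1)]


def slice_set(sentences, steps):
    """Collect the slices of all short sentences for several step sizes."""
    slices = []
    for sentence in sentences:
        for step in steps:
            slices.extend(slice_sentence(sentence, step))
    return slices
```

Calling `slice_set(short_sentences, steps=(1, 2, 3))` would give the combined slice set M for the three step sizes mentioned above, under this reading.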
Step 102: Screen the slice set according to the information-entropy processing parameter values of each Thai slice string, to form a vocabulary output slice set.
In the embodiments of the present invention, information-entropy processing is performed on each Thai slice string in the slice set, and the slice set is then screened according to the corresponding information-entropy processing parameter values to form the vocabulary output slice set. The information-entropy processing parameter values include at least one of the occurrence frequency, the cohesion value and the information-entropy freedom value. The occurrence frequency indicates how often a Thai slice string occurs: the higher the value, the more often the string occurs. A Thai slice string may be a single word, or a phrase formed of two or more words; the cohesion value therefore indicates the probability that the Thai slice string is a single word, and the larger the cohesion value, the higher that probability. Information entropy describes the uncertainty of an information source. In general, which symbol a source emits is uncertain, and this uncertainty can be measured by the probability of the symbol's occurrence: a high probability means the symbol occurs often and its uncertainty is small, and conversely a low probability means large uncertainty. Suppose the source symbols take n values U1, ..., Ui, ..., Un with corresponding probabilities P1, ..., Pi, ..., Pn, and that the symbols occur independently of one another. The average uncertainty of the source is then the statistical mean (E) of the single-symbol uncertainty -log Pi, which is called the information entropy. Here, when a Thai slice string has corresponding left-neighbor and right-neighbor information, the information-entropy freedom value can be used to indicate the source uncertainty corresponding to the Thai slice string.
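The verbal definition above corresponds to the standard Shannon entropy. Written out with the same symbols, and leaving the logarithm base unspecified as the text does, it reads:

```latex
\begin{align}
  I(U_i) &= -\log P_i, \qquad \sum_{i=1}^{n} P_i = 1 \\
  H(U)   &= E\left[-\log P_i\right] = -\sum_{i=1}^{n} P_i \log P_i
\end{align}
```

Here I(U_i) is a name introduced only for this illustration, denoting the uncertainty of a single symbol; H(U) is largest when all n symbols are equally likely and zero when one symbol has probability 1.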
In the embodiments of the present invention, the slice set can be screened using one, two or more information-entropy processing parameter values to form the vocabulary output slice set. For example, the vocabulary output slice set can be formed from the Thai slice strings whose occurrence frequency exceeds the set frequency, or from the Thai slice strings whose information-entropy freedom value is greater than the second set value, and so on. To further improve the precision of Thai word recognition, the slice set can be screened according to the occurrence frequency, the cohesion value and the information-entropy freedom value together to form the vocabulary output slice set.
Specifically, this may include the following. The slice set M contains one, two or more Thai slice strings; the occurrence frequency of each Thai slice string can be counted, and the first to-be-output slice set is then formed from the Thai slice strings whose occurrence frequency exceeds the set frequency.
The occurrence frequency of each Thai slice string can be determined according to formula (1).
$$P_i = \frac{W_i}{\sum_{M} W_i} \tag{1}$$
Where Wi is the count of each Thai slice string, Pi is the occurrence frequency of each Thai slice string, and M is the slice set.
The count Wi of a Thai slice string is the number of times that string occurs during segmentation. With the set frequency denoted A, the occurrence frequency Pi of each Thai slice string is compared with A; if the occurrence frequency Pi of the current Thai slice string is greater than A, the current Thai slice string is placed into the first to-be-output slice set. After this first round of screening by occurrence frequency, the first to-be-output slice set is formed.
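Formula (1) and the comparison with the set frequency A can be sketched as follows. This is a minimal illustration; the function names and the threshold argument `min_freq` (standing in for A) are hypothetical.

```python
from collections import Counter


def occurrence_frequencies(slices):
    """Formula (1): P_i = W_i / (sum of W_i over the slice set M)."""
    counts = Counter(slices)        # W_i: how many times each slice occurs
    total = sum(counts.values())    # sum of W_i over M
    return {s: w / total for s, w in counts.items()}


def first_output_set(freqs, min_freq):
    """Keep the slices whose occurrence frequency exceeds the set frequency A."""
    return {s for s, p in freqs.items() if p > min_freq}
```

The value of A is a tuning parameter; the text does not give a recommended setting.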
A Thai slice string with a high occurrence frequency may be a single word, or a phrase formed of two or more words. The first to-be-output slice set therefore needs further screening. In the embodiments of the present invention, the cohesion value of each Thai slice string in the first to-be-output slice set can be determined, and the second to-be-output slice set is formed from the Thai slice strings whose cohesion value is greater than the first set value.
The cohesion value of the current Thai slice string in the first to-be-output slice set can be determined according to formula (2);
Where Pi is the occurrence frequency of the current Thai slice string, Pij is the occurrence frequency of the corresponding sub-slice Thai strings within the current Thai slice string, and co is the cohesion value.
In this embodiment, the occurrence frequency indicates how often a Thai slice string occurs; that is, Pi can be expressed as the probability of the current Thai slice string. For example, suppose the current Thai slice string has probability P = 0.0005 and its corresponding sub-slice Thai strings have probabilities P11 = 0.0002, P12 = 0.0003, and so on; substituting these values into formula (2) gives the cohesion value co of the current Thai slice string.
The cohesion value of each Thai slice string is then compared with the first set value. If the cohesion value of the current Thai slice string is greater than the first set value, the current Thai slice string is placed into the second to-be-output slice set; that is, the second to-be-output slice set is formed from the Thai slice strings whose cohesion value is greater than the first set value.
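For orientation only, the sketch below shows one commonly used cohesion measure: the smallest ratio, over all two-part splits of a string, between the probability of the whole string and the product of the probabilities of its two parts. This measure is an assumption chosen for illustration and may differ from formula (2); the function names and the threshold argument are hypothetical.

```python
def cohesion(s, freq):
    """Illustrative cohesion: min over splits of P(s) / (P(left) * P(right)).

    `freq` maps strings to occurrence frequencies and must contain `s`
    and every prefix/suffix produced by splitting it; a missing entry
    raises KeyError. This is NOT claimed to be the patent's formula (2).
    """
    if len(s) < 2:
        raise ValueError("cohesion needs a string of at least two characters")
    return min(
        freq[s] / (freq[s[:k]] * freq[s[k:]])
        for k in range(1, len(s))
    )


def second_output_set(candidates, freq, first_set_value):
    """Keep candidates whose cohesion exceeds the first set value."""
    return {s for s in candidates if cohesion(s, freq) > first_set_value}
```

A ratio well above 1 suggests the parts occur together far more often than chance would predict, which is the intuition behind treating the string as a single word.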
The second to-be-output slice set also needs further screening according to the information-entropy freedom value of each Thai slice string in it. In the embodiments of the present invention, the information-entropy freedom value of each Thai slice string in the second to-be-output slice set is determined, and the vocabulary output slice set is formed from the Thai slice strings whose information-entropy freedom value is greater than the second set value.
The left-neighbor information entropy and the right-neighbor information entropy of the current Thai slice string can be determined according to formula (3); then, according to formula (4), the smaller of the left-neighbor and right-neighbor information entropies is determined as the information-entropy freedom value of the current Thai slice string.
$$H(U) = -\sum_{i=1}^{n} P_i \log P_i \tag{3}$$
Where Pi is the occurrence frequency of each Thai slice string, and H(U) is the information entropy;
$$\mathrm{free} = \min\{H(U)_1, H(U)_2, \ldots, H(U)_n\} \tag{4}$$
Where H(U) is the information entropy, and free is the information-entropy freedom value.
Formula (3) yields multiple left-neighbor and right-neighbor information entropies H(U), namely H(U)1, H(U)2, H(U)3, ..., H(U)n; formula (4) then gives the minimum of these, which is the information-entropy freedom value. After the information-entropy freedom value of each Thai slice string in the second to-be-output slice set has been determined, it is compared with the second set value. If the information-entropy freedom value of the current Thai slice string is greater than the second set value, the current Thai slice string is added to the vocabulary output slice set; that is, the vocabulary output slice set is formed from the Thai slice strings whose information-entropy freedom value is greater than the second set value.
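The freedom calculation can be sketched as follows: collect the characters that appear immediately to the left and right of each occurrence of a slice, compute the entropy of each neighbor distribution, and take the smaller one. This is a minimal sketch; the natural logarithm and the function names are assumptions, and occurrences at the very start or end of the sentence simply contribute no neighbor on that side.

```python
import math
from collections import Counter


def entropy(counter):
    """Entropy of the distribution given by neighbor counts (natural log)."""
    total = sum(counter.values())
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def freedom(sentence, s):
    """min(left-neighbor entropy, right-neighbor entropy) of slice `s`."""
    left, right = Counter(), Counter()
    start = sentence.find(s)
    while start != -1:
        if start > 0:
            left[sentence[start - 1]] += 1      # character before the slice
        end = start + len(s)
        if end < len(sentence):
            right[sentence[end]] += 1           # character after the slice
        start = sentence.find(s, start + 1)     # next (possibly overlapping) hit
    return min(entropy(left), entropy(right))
```

A slice that is always preceded or followed by the same character gets a freedom value of zero, which marks it as a likely fragment of a longer word rather than a word in its own right.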
The above screens the slice set in the order of occurrence frequency, cohesion value and information-entropy freedom value of each Thai slice string to form the vocabulary output slice set. The embodiments of the present invention are of course not limited to this order: the slice set can also be screened in the order of cohesion value, occurrence frequency and information-entropy freedom value, or in the order of occurrence frequency, information-entropy freedom value and cohesion value, and so on, to form the vocabulary output slice set. These variants are not enumerated one by one here.
Step 103: From the vocabulary output slice set, determine a set number of Thai slice strings as the recognized Thai words.
Here, a set number of Thai slice strings can be selected at random from the vocabulary output slice set and determined as the recognized Thai words. Alternatively, a set number of Thai slice strings can be selected from the vocabulary output slice set according to the information-entropy processing parameter values, which include the occurrence frequency, the cohesion value or the information-entropy freedom value, and determined as the recognized Thai words.
For example, the Thai slice strings in the vocabulary output slice set are sorted by occurrence frequency, from high to low, and the set number of top-ranked Thai slice strings are determined as the recognized Thai words.
It can be seen that the Thai slice strings in a Thai document can be screened by the information-entropy processing parameter values so that Thai words are finally recognized from the Thai document. This improves the efficiency of Thai word recognition and can also increase the speed of browsing and reading Thai documents.
The method provided by the embodiments of the present disclosure is illustrated below by combining the operating steps into a specific embodiment.
In this embodiment, the information-entropy processing parameter values include the occurrence frequency, the cohesion value and the information-entropy freedom value. The set frequency, the first set value and the second set value can therefore be configured in advance.
Fig. 2 is a flow chart of a Thai word recognition method according to an exemplary embodiment. As shown in Fig. 2, the Thai word recognition process includes:
Step 201: Filter the Thai document to be recognized, to form a first Thai document containing only Thai characters.
Full-width and half-width characters in the Thai text, including English letters, digits and other non-Thai characters, are filtered out, and only characters in the Thai range [0x0E00, 0x0E7F] are retained. This produces a complete, pure Thai document, that is, the first Thai document containing only Thai characters.
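The filtering in Step 201 can be sketched with a regular expression over the range [0x0E00, 0x0E7F]. This is a minimal illustration; the helper names are hypothetical, and treating each run of non-Thai characters as a separator between short sentences is an assumption rather than something the text specifies.

```python
import re

# The Thai range [0x0E00, 0x0E7F] named in the text.
THAI_RUN = re.compile(r"[\u0E00-\u0E7F]+")


def keep_thai(text):
    """Drop every character outside the Thai range."""
    return "".join(THAI_RUN.findall(text))


def thai_runs(text):
    """Return maximal runs of Thai characters.

    Assumption: non-Thai characters (spaces, digits, Latin letters,
    full-width forms) act as separators between short sentences.
    """
    return THAI_RUN.findall(text)
```

`keep_thai` yields the pure Thai document as one string, while `thai_runs` keeps the boundaries that the removed characters implied, which may be a convenient starting point for segmentation.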
Step 202: According to the set step size, segment at least one Thai short sentence in the first Thai document, to form a slice set containing at least one Thai slice string.
For example, a Thai short sentence of length N = 8 in the first Thai document is segmented with step = 2 to form the corresponding slice set.
Step 203: According to formula (1), determine the occurrence frequency of each Thai slice string in the slice set.
Step 204: Form the first to-be-output slice set from the Thai slice strings whose occurrence frequency exceeds the set frequency.
Step 205: According to formula (2), determine the cohesion value of each Thai slice string in the first to-be-output slice set.
Step 206: Form the second to-be-output slice set from the Thai slice strings whose cohesion value is greater than the first set value.
Step 207: According to formula (3) and formula (4), determine the information-entropy freedom value of each Thai slice string in the second to-be-output slice set.
For example, suppose a Thai slice string occurs four times in a Thai short sentence, with two distinct left-adjacent characters that each occur twice, and three distinct right-adjacent characters that occur twice, once and once. According to formula (3), taking the natural logarithm, the left-neighbor information entropy of this Thai slice string is -(1/2) log(1/2) - (1/2) log(1/2) ≈ 0.69, and its right-neighbor information entropy is -(1/2) log(1/2) - (1/4) log(1/4) - (1/4) log(1/4) ≈ 1.04. The corresponding information-entropy freedom value is therefore 0.69.
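The two entropy expressions in this example can be evaluated directly. A minimal sketch, assuming the natural logarithm (the base is not stated in the text); the variable names are illustrative only.

```python
import math


def entropy(probs):
    """H = -sum(p * log p) over a probability distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs)


left = entropy([1 / 2, 1 / 2])          # two left neighbors, each seen twice in four
right = entropy([1 / 2, 1 / 4, 1 / 4])  # three right neighbors seen 2, 1 and 1 times
free = min(left, right)                 # formula (4): the smaller of the two

print(round(left, 3), round(right, 3), round(free, 3))
```

With the natural logarithm the left value equals ln 2 and the right value equals 1.5 ln 2; a different base rescales both values by the same factor and does not change which side is smaller.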
Step 208: Form a vocabulary output slice set from the slice Thai character strings whose information-entropy freedom value is greater than a second set value.
Step 209: Sort the slice Thai character strings in the vocabulary output slice set in descending order of occurrence frequency.
For example, the vocabulary output slice set includes 50 slice Thai character strings whose occurrence frequencies, from high to low, are 25, 23, 19, 15, 10, 7, 5, 4, 4, 4, 3, 3, 2, 2, .... The corresponding slice Thai character strings are sorted in that order.
Step 210: Determine the set number of slice Thai character strings at the front of the sorted order as the recognized Thai words.
If the set number is 5, the slice Thai character strings with occurrence frequencies 25, 23, 19, 15 and 10 are determined as the recognized Thai words.
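Steps 209 and 210 amount to a descending sort followed by taking the first entries. The sketch below uses placeholder slice names with the first seven frequencies from the example; `top_words` is a hypothetical helper.

```python
def top_words(freq: dict[str, int], n: int) -> list[str]:
    """Sort slices by occurrence frequency, highest first, and keep the first n."""
    ranked = sorted(freq.items(), key=lambda item: item[1], reverse=True)
    return [slice_ for slice_, _ in ranked[:n]]

# Placeholder slice names; the counts follow the example in the description.
freq = {"w1": 25, "w2": 23, "w3": 19, "w4": 15, "w5": 10, "w6": 7, "w7": 5}
print(top_words(freq, 5))
```

Python's sort is stable, so slices with equal frequency keep their original relative order; the source does not say how ties should be broken.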
It can be seen that in this embodiment, the slice Thai character strings in a Thai document are screened by occurrence frequency, coagulation degree value and information-entropy freedom value, so that Thai words are identified from the Thai document more accurately. This improves the efficiency and accuracy of Thai word recognition and can also increase the speed of browsing and reading Thai documents.
The following are device embodiments of the present disclosure, which can be used to carry out the method embodiments of the present disclosure.
Based on the Thai word recognition process described above, a Thai word recognition device can be constructed.
Fig. 3 is a block diagram of a Thai word recognition device according to an exemplary embodiment. As shown in Fig. 3, the device includes a filtering and segmentation unit 310, an information-entropy screening unit 320 and a word determining unit 330, wherein:
The filtering and segmentation unit 310 is configured to perform, according to a set step size, filtering and segmentation processing on a Thai document to be recognized, to obtain a slice set including at least one slice Thai character string.
The information-entropy screening unit 320 is configured to screen the slice set according to the information-entropy processing parameter values of each slice Thai character string, to form a vocabulary output slice set.
The word determining unit 330 is configured to determine a set number of slice Thai character strings from the vocabulary output slice set as the recognized Thai words.
In one embodiment of the invention, the information-entropy screening unit 320 includes:
A frequency screening module, configured to form a first to-be-output slice set from the slice Thai character strings whose occurrence frequency exceeds a set frequency.
A coagulation degree screening module, configured to determine the coagulation degree value of each slice Thai character string in the first to-be-output slice set, and to form a second to-be-output slice set from the slice Thai character strings whose coagulation degree value is greater than a first set value.
A freedom screening module, configured to determine the information-entropy freedom value of each slice Thai character string in the second to-be-output slice set, and to form the vocabulary output slice set from the slice Thai character strings whose information-entropy freedom value is greater than a second set value.
In one embodiment of the invention, the coagulation degree screening module is specifically configured to determine, according to formula (2), the coagulation degree value of the current slice Thai character string in the first to-be-output slice set.
In formula (2), Pi is the occurrence frequency of the current slice Thai character string, Pij is the occurrence frequency of the corresponding sub-slice Thai character string within the current slice Thai character string, and co is the coagulation degree value.
In one embodiment of the invention, the freedom screening module is specifically configured to determine, according to formula (3), the left-adjacent word information entropy and the right-adjacent word information entropy of the current slice Thai character string, and to determine, according to formula (4), the smaller of the left-adjacent word information entropy and the right-adjacent word information entropy as the information-entropy freedom value of the current slice Thai character string.
$$H(U) = -\sum_{i} P_i \log P_i \qquad \text{formula (3)}$$
Wherein, Pi is the occurrence frequency of each slice Thai character string, and H(U) is the information entropy;
$$\mathrm{free} = \min\{H(U)_1, H(U)_2, \ldots, H(U)_n\} \qquad \text{formula (4)}$$
Wherein, H(U) is the information entropy, and free is the information-entropy freedom value.
In one embodiment of the invention, the word determining unit 330 is specifically configured to sort the slice Thai character strings in the vocabulary output slice set in descending order of occurrence frequency, and to determine the set number of slice Thai character strings at the front of the sorted order as the recognized Thai words.
The device provided by the embodiments of the present disclosure is illustrated below by way of example.
Fig. 4 is a block diagram of a Thai word recognition device according to an exemplary embodiment. As shown in Fig. 4, the device includes a filtering and segmentation unit 310, an information-entropy screening unit 320 and a word determining unit 330, wherein the information-entropy screening unit 320 includes a frequency screening module 321, a coagulation degree screening module 322 and a freedom screening module 323.
The filtering and segmentation unit 310 can filter the Thai document to be recognized to form a first Thai document containing only Thai characters, and then, according to a set step size, split at least one Thai short sentence in the first Thai document to form a slice set containing at least one slice character string.
The frequency screening module 321 in the information-entropy screening unit 320 can then determine, according to formula (1), the occurrence frequency of each slice Thai character string in the slice set, and form a first to-be-output slice set from the slice Thai character strings whose occurrence frequency exceeds a set frequency. The coagulation degree screening module 322 in the information-entropy screening unit 320 can determine, according to formula (2), the coagulation degree value of each slice Thai character string in the first to-be-output slice set, and form a second to-be-output slice set from the slice Thai character strings whose coagulation degree value is greater than a first set value. The freedom screening module 323 in the information-entropy screening unit 320 can determine, according to formula (3) and formula (4), the information-entropy freedom value of each slice Thai character string in the second to-be-output slice set, and form the vocabulary output slice set from the slice Thai character strings whose information-entropy freedom value is greater than a second set value.
The word determining unit 330 can then sort the slice Thai character strings in the vocabulary output slice set in descending order of occurrence frequency, and determine the set number of slice Thai character strings at the front of the sorted order as the recognized Thai words.
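The cooperation of units 310, 320 and 330 can be sketched end to end as follows. This is an illustrative sketch under several assumptions, not the patented implementation: slices are all substrings of length 1 up to the step, frequency is a raw count, the coagulation measure is the slice's relative frequency divided by the largest product of its parts' relative frequencies, entropy uses natural logarithms, and single-character slices are not treated as candidates. The function `recognize` and all of its parameters are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

THAI_RUN = re.compile(r"[\u0E00-\u0E7F]+")

def entropy(counter: Counter) -> float:
    total = sum(counter.values())
    return -sum(c / total * math.log(c / total) for c in counter.values())

def recognize(text, step=2, min_count=1, min_co=1.0, min_free=0.0, top_n=5):
    # Unit 310: keep Thai runs only, then slice them and record neighbours.
    counts, left, right = Counter(), defaultdict(Counter), defaultdict(Counter)
    for run in THAI_RUN.findall(text):
        for i in range(len(run)):
            for n in range(1, step + 1):
                if i + n > len(run):
                    break
                s = run[i:i + n]
                counts[s] += 1
                if i > 0:
                    left[s][run[i - 1]] += 1
                if i + n < len(run):
                    right[s][run[i + n]] += 1
    total = sum(counts.values())

    def co(s):  # assumed coagulation measure
        parts = max(counts[s[:k]] / total * counts[s[k:]] / total
                    for k in range(1, len(s)))
        return counts[s] / total / parts

    def free(s):  # smaller of left and right neighbour entropy
        if not left[s] or not right[s]:
            return 0.0
        return min(entropy(left[s]), entropy(right[s]))

    # Unit 320: frequency, coagulation degree and freedom screening in sequence.
    kept = [s for s, c in counts.items() if len(s) >= 2 and c > min_count]
    kept = [s for s in kept if co(s) > min_co]
    kept = [s for s in kept if free(s) > min_free]
    # Unit 330: sort by frequency, highest first, and keep the set number.
    kept.sort(key=lambda s: counts[s], reverse=True)
    return kept[:top_n]

print(recognize("abc กขคกขงจกขฉ 123"))
```

Applying the cheapest test (frequency) first shrinks the candidate set before the coagulation and entropy computations, which matches the order of modules 321, 322 and 323.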
It can be seen that in this embodiment, the slice Thai character strings in a Thai document are screened by occurrence frequency, coagulation degree value and information-entropy freedom value, so that Thai words are identified from the Thai document more accurately. This improves the efficiency and accuracy of Thai word recognition and can also increase the speed of browsing and reading Thai documents.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
It should be understood that the present invention is not limited to the flows and structures described above and shown in the drawings, and that various modifications and changes can be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (10)

  1. A method of Thai word recognition, characterized by comprising:
    performing, according to a set step size, filtering and segmentation processing on a Thai document to be recognized, to obtain a slice set including at least one slice Thai character string;
    screening the slice set according to the information-entropy processing parameter values of each slice Thai character string, to form a vocabulary output slice set;
    determining a set number of slice Thai character strings from the vocabulary output slice set as the recognized Thai words.
  2. The method according to claim 1, characterized in that, when the information-entropy processing parameter values include an occurrence frequency, a coagulation degree value and an information-entropy freedom value, the screening of the slice set according to the information-entropy processing parameter values of each slice Thai character string to form a vocabulary output slice set comprises:
    forming a first to-be-output slice set from the slice Thai character strings whose occurrence frequency exceeds a set frequency;
    determining the coagulation degree value of each slice Thai character string in the first to-be-output slice set, and forming a second to-be-output slice set from the slice Thai character strings whose coagulation degree value is greater than a first set value;
    determining the information-entropy freedom value of each slice Thai character string in the second to-be-output slice set, and forming the vocabulary output slice set from the slice Thai character strings whose information-entropy freedom value is greater than a second set value.
  3. The method according to claim 2, characterized in that the determining of the coagulation degree value of each slice Thai character string in the first to-be-output slice set comprises:
    determining, according to formula (2), the coagulation degree value of the current slice Thai character string in the first to-be-output slice set;
    wherein Pi is the occurrence frequency of the current slice Thai character string, Pij is the occurrence frequency of the corresponding sub-slice Thai character string within the current slice Thai character string, and co is the coagulation degree value.
  4. The method according to claim 2, characterized in that the determining of the information-entropy freedom value of each slice Thai character string in the second to-be-output slice set comprises:
    determining, according to formula (3), the left-adjacent word information entropy and the right-adjacent word information entropy of the current slice Thai character string;
    determining, according to formula (4), the smaller of the left-adjacent word information entropy and the right-adjacent word information entropy as the information-entropy freedom value of the current slice Thai character string;
    $$H(U) = -\sum_{i} P_i \log P_i \qquad \text{formula (3)}$$
    wherein Pi is the occurrence frequency of each slice Thai character string, and H(U) is the information entropy;
    $$\mathrm{free} = \min\{H(U)_1, H(U)_2, \ldots, H(U)_n\} \qquad \text{formula (4)}$$
    wherein H(U) is the information entropy, and free is the information-entropy freedom value.
  5. The method according to claim 1, characterized in that the determining of a set number of slice Thai character strings from the vocabulary output slice set as the recognized Thai words comprises:
    sorting the slice Thai character strings in the vocabulary output slice set in descending order of occurrence frequency;
    determining the set number of slice Thai character strings at the front of the sorted order as the recognized Thai words.
  6. A device for Thai word recognition, characterized by comprising:
    a filtering and segmentation unit, configured to perform, according to a set step size, filtering and segmentation processing on a Thai document to be recognized, to obtain a slice set including at least one slice Thai character string;
    an information-entropy screening unit, configured to screen the slice set according to the information-entropy processing parameter values of each slice Thai character string, to form a vocabulary output slice set;
    a word determining unit, configured to determine a set number of slice Thai character strings from the vocabulary output slice set as the recognized Thai words.
  7. The device according to claim 6, characterized in that the information-entropy screening unit comprises:
    a frequency screening module, configured to form a first to-be-output slice set from the slice Thai character strings whose occurrence frequency exceeds a set frequency;
    a coagulation degree screening module, configured to determine the coagulation degree value of each slice Thai character string in the first to-be-output slice set, and to form a second to-be-output slice set from the slice Thai character strings whose coagulation degree value is greater than a first set value;
    a freedom screening module, configured to determine the information-entropy freedom value of each slice Thai character string in the second to-be-output slice set, and to form the vocabulary output slice set from the slice Thai character strings whose information-entropy freedom value is greater than a second set value.
  8. The device according to claim 7, characterized in that
    the coagulation degree screening module is specifically configured to determine, according to formula (2), the coagulation degree value of the current slice Thai character string in the first to-be-output slice set;
    wherein Pi is the occurrence frequency of the current slice Thai character string, Pij is the occurrence frequency of the corresponding sub-slice Thai character string within the current slice Thai character string, and co is the coagulation degree value.
  9. The device according to claim 7, characterized in that
    the freedom screening module is specifically configured to determine, according to formula (3), the left-adjacent word information entropy and the right-adjacent word information entropy of the current slice Thai character string, and to determine, according to formula (4), the smaller of the left-adjacent word information entropy and the right-adjacent word information entropy as the information-entropy freedom value of the current slice Thai character string;
    $$H(U) = -\sum_{i} P_i \log P_i \qquad \text{formula (3)}$$
    wherein Pi is the occurrence frequency of each slice Thai character string, and H(U) is the information entropy;
    $$\mathrm{free} = \min\{H(U)_1, H(U)_2, \ldots, H(U)_n\} \qquad \text{formula (4)}$$
    wherein H(U) is the information entropy, and free is the information-entropy freedom value.
  10. The device according to claim 6, characterized in that
    the word determining unit is specifically configured to sort the slice Thai character strings in the vocabulary output slice set in descending order of occurrence frequency, and to determine the set number of slice Thai character strings at the front of the sorted order as the recognized Thai words.
CN201710982841.0A 2017-10-20 2017-10-20 The method and device of Thai words recognition Pending CN107807918A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710982841.0A CN107807918A (en) 2017-10-20 2017-10-20 The method and device of Thai words recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710982841.0A CN107807918A (en) 2017-10-20 2017-10-20 The method and device of Thai words recognition

Publications (1)

Publication Number Publication Date
CN107807918A true CN107807918A (en) 2018-03-16

Family

ID=61592904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710982841.0A Pending CN107807918A (en) 2017-10-20 2017-10-20 The method and device of Thai words recognition

Country Status (1)

Country Link
CN (1) CN107807918A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110137642A1 (en) * 2007-08-23 2011-06-09 Google Inc. Word Detection
CN105320960A (en) * 2015-10-14 2016-02-10 北京航空航天大学 Voting based classification method for cross-language subjective and objective sentiments
CN106815190A (en) * 2015-11-27 2017-06-09 阿里巴巴集团控股有限公司 A kind of words recognition method, device and server
CN107180025A (en) * 2017-03-31 2017-09-19 北京奇艺世纪科技有限公司 A kind of recognition methods of neologisms and device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021051600A1 (en) * 2019-09-19 2021-03-25 平安科技(深圳)有限公司 Method, apparatus and device for identifying new word based on information entropy, and storage medium
CN111209946A (en) * 2019-12-31 2020-05-29 上海联影智能医疗科技有限公司 Three-dimensional image processing method, image processing model training method, and medium
CN111209946B (en) * 2019-12-31 2024-04-30 上海联影智能医疗科技有限公司 Three-dimensional image processing method, image processing model training method and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180316