CN105630770A - Word segmentation phonetic transcription and ligature writing method and device based on SC grammar - Google Patents

Info

Publication number
CN105630770A
Authority
CN
China
Prior art keywords
word
syllables
write
ambiguity
mark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510994505.9A
Other languages
Chinese (zh)
Inventor
黄河燕
黄静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ETONG LANGUAGE TECHNOLOGY (BEIJING) Co Ltd
Beijing Institute of Technology BIT
Original Assignee
ETONG LANGUAGE TECHNOLOGY (BEIJING) Co Ltd
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ETONG LANGUAGE TECHNOLOGY (BEIJING) Co Ltd, Beijing Institute of Technology BIT filed Critical ETONG LANGUAGE TECHNOLOGY (BEIJING) Co Ltd
Priority to CN201510994505.9A priority Critical patent/CN105630770A/en
Publication of CN105630770A publication Critical patent/CN105630770A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/42Data-driven translation
    • G06F40/47Machine-assisted translation, e.g. using translation memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a word segmentation phonetic transcription and ligature writing method and device based on an SC grammar and belongs to the technical field of computer translation in computer science. Firstly, based on a word segmentation ambiguity rule of the SC grammar, an ambiguity segmentation rule library is built by means of abutment constraint conditions in natural language, and illegal segmentation is eliminated so that the word segmentation precision can be improved; secondly, based on a word segmentation ligature writing rule library of the SC grammar and a ligature writing corpora statistical library, the ligature writing corpora statistical library is used for performing ligature writing on ligature writing knowledge which cannot be presented as rules; finally, based on a dictionary library of the SC grammar, a dictionary is used for performing maximum matching to perform word segmentation, the word segmentation ambiguity rule is called for fields where ambiguity happens so that a correct segmentation result can be acquired, and the context of a word is analyzed so that correct part-of-speech tagging and phonetic transcription can be acquired. Compared with the prior art, word segmentation accuracy is improved, and the word segmentation ambiguity rule library, a combined ambiguity word library, the ligature writing rule library, the dictionary library and the ligature writing corpora statistical library are easy to expand and maintain.

Description

A word segmentation, phonetic transcription and ligature writing method and device based on SC grammar
Technical field
The present invention relates to a word segmentation, phonetic transcription and ligature writing method and device, and in particular to such a method and device based on SC grammar for use in a Chinese-to-braille translation system. It belongs to the field of machine translation technology in computer science.
Background
Machine translation is the process of using a computer to convert text in one natural language into another natural language. A Chinese-to-braille translation system automatically translates Chinese text into braille, which is of great help to the education and daily life of blind people. Braille is a special form of alphabetic writing. To translate Chinese characters into braille, the Chinese text must first be segmented into words with ligature writing applied, then converted to pinyin, and finally converted from pinyin to braille. The accuracy of Chinese word segmentation and phonetic transcription therefore largely determines the accuracy of Chinese-to-braille translation. Word segmentation with ligature writing is an important rule specific to Chinese braille. Word segmentation writes words separately, one by one. Ligature writing joins certain words together, in keeping with the particular nature of braille, so that the syllable structure is not too loose and the text is easy to read by touch. Word segmentation with ligature writing must follow the basic rules of Chinese grammar, the logic of the language, usage conventions, and syllable length. When Chinese is converted to pinyin, polyphonic characters pose a problem. However, words are polyphonic far less often than single characters are, and words of three or more characters are rarely polyphonic, so correct word segmentation and ligature writing greatly reduce polyphony. Isolated polyphonic characters still remain, and transcribing them correctly requires natural language analysis of the surrounding context.
The conversion from Chinese characters to braille therefore has two difficulties: 1. improving the correctness of Chinese word segmentation and ligature writing; 2. transcribing polyphonic characters correctly through contextual analysis. At present, Chinese-to-braille translation in China is still done manually. The need to provide more and better educational material to blind people creates a heavy translation workload, which lowers accuracy. A high-accuracy word segmentation, phonetic transcription and ligature writing method for Chinese-to-braille conversion is therefore urgently needed, to lay a solid foundation for Chinese-to-braille translation.
Summary of the invention
The object of the invention is to solve the problem of realizing Chinese-to-braille machine translation by proposing a word segmentation, phonetic transcription and ligature writing method and device based on SC grammar that performs these steps quickly and accurately.
The idea of the invention is: 1. based on the segmentation ambiguity rules of SC grammar, adjacency constraints in natural language are used to build a segmentation ambiguity rule library that excludes illegal segmentations and so improves segmentation precision; 2. based on the SC grammar ligature writing rule library and the ligature writing corpus statistics library, certain words are joined together, in keeping with the particular nature of braille, so that the syllable structure is not too loose and blind readers can read by touch easily. The ligature writing corpus statistics library handles ligature writing knowledge that cannot be expressed as rules; 3. based on the SC grammar dictionary library, the dictionary is used for forward maximum matching to segment the text, the segmentation ambiguity rules are invoked for fields where ambiguity occurs to obtain the correct segmentation, and the context of each word is analyzed to obtain the correct part-of-speech tag and phonetic transcription.
The object of the invention is achieved through the following technical solutions:
A word segmentation, phonetic transcription and ligature writing method based on SC grammar, which relies on a dictionary library, a combinational ambiguity lexicon, a segmentation ambiguity rule library, a ligature writing rule library and a ligature writing corpus statistics library, comprises the following steps:
(1) receive the Chinese character string to be processed and the text genre;
The character string is a pure Chinese character string, i.e. one that contains no special symbols such as digits, punctuation marks or ASCII characters. If the string contains non-Chinese characters, it is split. The non-Chinese substrings are handled separately, for example by directly generating word nodes and assigning them the corresponding types, while the Chinese substrings proceed to step (2). After segmentation, transcription and ligature writing, they are merged with the already processed non-Chinese substrings and output.
(2) segment the Chinese character string based on the dictionary library, and apply part-of-speech tagging and phonetic transcription to the resulting word blocks;
(3) according to the text genre, invoke the corresponding ligature writing rule library, and perform combined ligature writing on the word blocks of step (2) based on the braille segmentation and ligature writing rules in that library;
(4) based on the ligature writing corpus statistics library, perform a second round of combined ligature writing on the combined word blocks;
(5) output the resulting Chinese character string after word segmentation, phonetic transcription and ligature writing.
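The splitting described under step (1) can be illustrated with a minimal Python sketch. The function name `split_runs` and the use of the Unicode range U+4E00 to U+9FFF as the test for "Chinese character" are assumptions made for illustration; the patent does not specify how the split is implemented, and the further typing of non-Chinese runs (digits, punctuation and so on) is omitted here.

```python
import re

# A run is either consecutive CJK ideographs or consecutive other characters.
# The range U+4E00..U+9FFF is an assumed, simplified test for "Chinese character".
_RUN = re.compile(r"[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+")
_HAN = re.compile(r"[\u4e00-\u9fff]")


def split_runs(text):
    """Return (is_chinese, run) pairs in their original order."""
    return [(bool(_HAN.match(m.group())), m.group()) for m in _RUN.finditer(text)]


if __name__ == "__main__":
    for is_chinese, run in split_runs("2008年，小李晋升为这个项目的总工程师。"):
        print("segment" if is_chinese else "pass through", repr(run))
```

Only the runs flagged as Chinese would go on to step (2); the others would be turned into word nodes directly and merged back in their original positions.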
The dictionary library is used for Chinese word segmentation, part-of-speech tagging and phonetic transcription. Each entry comprises a Chinese word symbol, a grammatical-semantic attribute identifier, a context discrimination function, and the pinyin of the word.
The dictionary library is built as follows: a grammatical-semantic attribute classification system is defined, entries are included on the basis of Chinese dictionary knowledge, and language engineers refine them further while debugging corpora.
Word segmentation based on the dictionary library is completed by the following process:
a. with reference to the dictionary library, split the sentence into word blocks using the forward maximum matching algorithm;
b. judge overlapping ambiguity according to the overlap feature of each word block;
c. judge combinational ambiguity of each word block based on the combinational ambiguity lexicon;
d. resolve the ambiguity by inference according to the ambiguity rules;
e. output the segmentation result.
Overlapping ambiguity arises in a string of the form AXB in which AX forms a word and XB also forms a word. A, X and B are each at least one character long. For example, "if having time", "different situations" and "big head" all have overlapping ambiguity.
The combinational ambiguity lexicon is used to identify word blocks that have combinational ambiguity. It contains two-character words with combinational ambiguity. A combinational ambiguity word is a string of the form AB in which A and B can each also stand as a word on their own. For example, in the sentence 他将来上海。 ("He will come to Shanghai."), 将来 ("in the future") is a combinational ambiguity word.
The combinational ambiguity lexicon is built as follows: language engineers add entries step by step while debugging large batches of corpora.
The segmentation ambiguity rule library is used to resolve ambiguous word blocks by inference and obtain the correct segmentation. Each rule comprises an ambiguous word block, a conditional function, and a correct segmentation operation.
The segmentation ambiguity rule library is built as follows: language engineers summarize and refine the rules step by step while debugging large batches of corpora. The library is subdivided into overlapping ambiguity rules and combinational ambiguity rules. Word blocks with overlapping ambiguity are resolved by inference with the overlapping ambiguity rules, and word blocks with combinational ambiguity with the combinational ambiguity rules.
Judging the combinational ambiguity of a word block based on the combinational ambiguity lexicon is completed by the following process:
a. for the current word block, query the combinational ambiguity lexicon using a binary search algorithm;
b. output a combinational ambiguity tag according to the query result.
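A minimal sketch of this lookup, assuming the combinational ambiguity lexicon is held in memory as a sorted list of strings, could look as follows. The function name and the lexicon entries other than 将来 are illustrative assumptions and are not taken from the patent.

```python
from bisect import bisect_left


def has_combinational_ambiguity(word, sorted_lexicon):
    """Binary-search a sorted list of ambiguous words for `word`."""
    i = bisect_left(sorted_lexicon, word)
    return i < len(sorted_lexicon) and sorted_lexicon[i] == word


# Illustrative lexicon; it must be sorted for binary search to be valid.
LEXICON = sorted(["将来", "才能", "马上"])

if __name__ == "__main__":
    print(has_combinational_ambiguity("将来", LEXICON))
    print(has_combinational_ambiguity("项目", LEXICON))
```

The boolean result stands in for the combinational ambiguity tag of step b.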
Resolving ambiguity by inference according to the ambiguity rules is completed by the following process:
a. for the current word block carrying an ambiguity tag, match the ambiguous word block part of the ambiguity rules;
b. if the match succeeds, check the conditional function;
c. if the condition is satisfied, perform the correct segmentation operation;
d. output the correct segmentation result.
Part-of-speech tagging and phonetic transcription of the segmented word blocks is completed by the following process:
a. for the current word block, retrieve its dictionary information from the dictionary library;
b. check the context functions one by one;
c. if a context check is satisfied, take the corresponding part of speech and pinyin.
The ligature writing rule library is used to perform combined ligature writing on word blocks that have been segmented and tagged. Each rule comprises a rule word block part, a conditional function, and a ligature writing operation. According to text genre, the library is subdivided into a classical Chinese rule library and a modern Chinese rule library.
The ligature writing rule library is built as follows: rules are included one by one according to the ligature writing rules defined in braille publications, and language engineers refine them further while debugging corpora.
Combined ligature writing of word blocks based on the ligature writing rules is completed by the following process:
a. for the current several word blocks, match the word block part of the ligature writing rules;
b. if the match succeeds, check the conditional function;
c. if the condition is satisfied, perform the ligature writing operation;
d. output the segmentation result after ligature writing.
The ligature writing corpus statistics library is used to perform a second round of combined ligature writing on word blocks already combined by the ligature writing rules. It contains word blocks that need to be written together, such as "Three Main Rules of Discipline". The library is subdivided into a base lexicon and a user lexicon: the base lexicon contains general ligature word blocks, and the user lexicon contains user-defined word blocks that need to be written together.
The ligature writing corpus statistics library is built as follows: entries are included according to specific ligature word blocks defined in braille publications, and language engineers refine them further while debugging corpora.
The second round of combined ligature writing based on the ligature writing corpus statistics library is completed by the following process:
a. for the current word block, match against the user lexicon first and then the base lexicon;
b. if the match succeeds, perform the ligature combination;
c. output the resulting word blocks after ligature writing.
A word segmentation, phonetic transcription and ligature writing device based on SC grammar, which relies on a dictionary library, a combinational ambiguity lexicon, a ligature writing corpus statistics library, a ligature writing rule library and a segmentation ambiguity rule library, comprises a word segmentation module, a part-of-speech tagging and phonetic transcription module, a primary ligature writing module and a secondary ligature writing module connected in sequence. The word segmentation module and the part-of-speech tagging and phonetic transcription module are each connected to the dictionary library; the word segmentation module is also connected to the combinational ambiguity lexicon and to the segmentation ambiguity rule library; the primary ligature writing module is connected to the ligature writing rule library; and the secondary ligature writing module is connected to the ligature writing corpus statistics library;
The word segmentation module splits the input Chinese character string into independent word blocks based on the dictionary library. During segmentation it judges whether each word block is ambiguous, based on the overlapping ambiguity feature and the combinational ambiguity lexicon, and resolves any ambiguity based on the segmentation ambiguity rule library to obtain correct word blocks;
The part-of-speech tagging and phonetic transcription module takes the word blocks produced by the word segmentation module and, based on the dictionary library, checks the context functions to obtain the correct part of speech and pinyin of each word block;
The primary ligature writing module performs combined ligature writing on the tagged word blocks. Based on the ligature writing rule library, it checks the conditional functions to obtain the word blocks after ligature combination;
The secondary ligature writing module queries the ligature writing corpus statistics library for the word blocks produced by primary ligature writing to obtain the word blocks after ligature combination, and outputs the word blocks with their part-of-speech tags and phonetic transcription.
Preferably, the dictionary library, combinational ambiguity lexicon, ligature writing corpus statistics library, ligature writing rule library and segmentation ambiguity rule library can all be revised and refined continually as the language develops over time, thereby improving segmentation accuracy.
Beneficial effects
Braille is a special form of alphabetic writing, so the accuracy of Chinese word segmentation and phonetic transcription largely determines the accuracy of Chinese-to-braille translation. The SC grammar based dictionary structure designed in the invention improves the accuracy of transcribing polyphonic characters; the SC grammar based segmentation and ligature writing rules improve segmentation accuracy; and the segmentation ambiguity rule library, combinational ambiguity lexicon, ligature writing rule library, dictionary library and ligature writing corpus statistics library are easy to extend and maintain.
Brief description of the drawings
The invention is described in detail below with reference to the drawings and embodiments:
Fig. 1 is a flow diagram of a word segmentation, phonetic transcription and ligature writing method based on SC grammar according to an embodiment of the invention;
Fig. 2 is a flowchart of the word segmentation process;
Fig. 3 is a flowchart of the part-of-speech tagging and phonetic transcription process;
Fig. 4 is a flowchart of the segmentation and ligature writing process;
Fig. 5 is a structural diagram of a word segmentation, phonetic transcription and ligature writing device based on SC grammar according to an embodiment of the invention.
Embodiments
The invention is described in detail below with reference to the drawings and embodiments.
A word segmentation, phonetic transcription and ligature writing method based on SC grammar, whose flow is shown in Fig. 1, comprises the following steps:
(1) receive the Chinese character string to be processed and the text genre;
The implementation of the method is described below, taking as an example a received text genre of modern Chinese and the Chinese character string "2008年，小李晋升为这个项目的总工程师。" ("In 2008, Xiao Li was promoted to chief engineer of this project.").
(2) segment the Chinese character string based on the dictionary library, and apply part-of-speech tagging and phonetic transcription to the resulting word blocks. As shown in Fig. 2, this is implemented by the following process:
2.1 Perform forward maximum matching on the Chinese character string based on the dictionary to cut out word blocks.
Combining the longest-word information in the dictionary with the maximum possible length in the sentence, an optimal maximum length N is determined and the dictionary is searched. For example, in the sentence "2008年，小李晋升为这个项目的总工程师。", the longest word length for 年 in the dictionary is 3, because the longest dictionary word beginning with 年 has 3 characters. The maximum possible length of 年 in the sentence is 1, because it is followed by a non-Chinese character symbol, so the optimal maximum length N for 年 in this sentence is 1. If the dictionary contains such an N-character word, the match succeeds and the matched field is split off as a word. If no such N-character word is found, the match fails: the last character of the field is removed, the remaining N-1 characters become the new match field, and matching is retried, and so on until a cut succeeds. This completes one round of matching and cuts out one word. The process repeats until all words have been cut out.
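The matching loop of step 2.1 can be sketched in Python as below. This is a minimal illustration under stated assumptions: the dictionary is a plain set of strings, `forward_max_match` is a hypothetical name, the toy dictionary is invented for the example sentence, and unknown single characters are emitted as one-character words. The patent's actual dictionary structure and lookup are not specified at this level of detail.

```python
def forward_max_match(text, dictionary):
    """Segment a pure-Chinese string by forward maximum matching."""
    # Longest dictionary word starting with each first character.
    longest = {}
    for w in dictionary:
        longest[w[0]] = max(longest.get(w[0], 1), len(w))

    words, i = [], 0
    while i < len(text):
        # Optimal maximum length N: bounded by the dictionary and by the
        # number of characters remaining in the string.
        n = min(longest.get(text[i], 1), len(text) - i)
        # On failure drop the last character and retry with N-1.
        while n > 1 and text[i:i + n] not in dictionary:
            n -= 1
        words.append(text[i:i + n])
        i += n
    return words


# Toy dictionary (an assumption) covering the Chinese run of the example.
TOY_DICT = {"小", "李", "晋升", "为", "这", "个", "项目", "目的", "的", "总", "工程师"}

if __name__ == "__main__":
    print("/".join(forward_max_match("小李晋升为这个项目的总工程师", TOY_DICT)))
```

Precomputing the longest word per first character mirrors the "longest-word information in the dictionary" used to bound N, so the inner loop never tries lengths that cannot match.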
2.2 Word block ambiguity judgment
If the word just cut has more than one Chinese character, i.e. N > 1, overlapping ambiguity is checked: the second character of the word is taken as the first character of a candidate word, and the word-cutting operation above is performed with a candidate length >= N. If such a word is found, overlapping ambiguity exists, and the segmentation ambiguity rules are invoked to resolve it by inference. For example, when 项目 ("project") is cut from the sentence above, taking 目 as the first character with word length 2 finds that 目的 ("purpose") is also a word, which shows that 项目 has overlapping ambiguity.
If the current word is 2 characters long, it may have combinational ambiguity, and the combinational ambiguity lexicon is queried to judge whether it does. In the example string, 项目 is not in the combinational ambiguity lexicon, so 项目 has only overlapping ambiguity. If 项目 were in the combinational ambiguity lexicon, it would have both overlapping and combinational ambiguity.
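The overlapping ambiguity check described above can be sketched as follows. The sketch follows the text literally in starting the candidate at the second character of the current word and trying lengths from N upward; the function name, the `max_len` bound and the toy dictionary are assumptions made for illustration.

```python
def has_overlap_ambiguity(text, start, word, dictionary, max_len=4):
    """Check whether `word`, cut at index `start` of `text`, overlaps another word.

    A candidate begins at the second character of `word` and has length >= N,
    so any match necessarily extends past the end of `word`.
    """
    n = len(word)
    if n < 2:
        return False
    s = start + 1
    for length in range(n, max_len + 1):
        candidate = text[s:s + length]
        if len(candidate) == length and candidate in dictionary:
            return True
    return False


TOY_DICT = {"这", "个", "项目", "目的", "的", "总", "工程师"}
TEXT = "这个项目的总工程师"

if __name__ == "__main__":
    print(has_overlap_ambiguity(TEXT, 2, "项目", TOY_DICT))    # 目的 overlaps 项目
    print(has_overlap_ambiguity(TEXT, 6, "工程师", TOY_DICT))
```

A fuller implementation might also test candidates starting at later characters of a long word; the text only describes the second character, so that is all the sketch does.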
2.3 Disambiguation by inference
The segmentation ambiguity rules corresponding to the ambiguity tag type of the current word are invoked to resolve the ambiguity by inference. The ambiguity rule library contains rules for ambiguous segmentation under particular words, parts of speech or attributes, for example the combinational ambiguity rule "NP(将来), NP(PLA) → DWD(A)". Here "NP(将来), NP(PLA)" is the first part of the ambiguity rule, the ambiguous word block part. "DWD(A)" is the third part, the correct segmentation function part. The second part, the conditional function part, is empty in this rule. The rule states that when word block A, 将来 ("in the future"), is followed by a word block B that is a noun (NP) denoting a place (PLA), word block A is to be split, "DWD(A)". For example, in the sentence 他将来上海。 ("He will come to Shanghai."), steps 2.1 and 2.2 find that 将来 has combinational ambiguity; the rule "NP(将来), NP(PLA) → DWD(A)" matches, and the correct segmentation of 将来 is 将/来 ("will / come"). Overlapping ambiguity rules have the same representation as combinational ambiguity rules and differ only in content. For the example sentence, in which 项目 has overlapping ambiguity, the overlapping ambiguity rules are invoked. No rule in the ambiguity rule library matches, but because the segmentation algorithm of the invention is based on forward maximum matching, the forward longest-match priority principle gives the correct cut 项目.
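The three-part rule format (ambiguous word block, conditional function, segmentation operation) might be represented as in the sketch below. Everything about the encoding is an assumption for illustration: the `Rule` class, the attribute table, and the reading of "DWD(A)" as "split block A into its single characters", which coincides with 将/来 for a two-character word.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Rule:
    word: str                           # ambiguous word block A
    next_attr: str                      # attribute required of the following block B
    condition: Optional[Callable] = None  # conditional function; None means empty


def disambiguate(blocks: List[str], attrs: Dict[str, Set[str]], rules: List[Rule]) -> List[str]:
    """Apply combinational ambiguity rules to a list of word blocks."""
    out = []
    for i, blk in enumerate(blocks):
        nxt = blocks[i + 1] if i + 1 < len(blocks) else None
        split = any(
            r.word == blk
            and nxt is not None
            and r.next_attr in attrs.get(nxt, set())
            and (r.condition is None or r.condition(blocks, i))
            for r in rules
        )
        # Assumed meaning of DWD(A): cut A into single characters.
        out.extend(list(blk) if split else [blk])
    return out


RULES = [Rule(word="将来", next_attr="PLA")]
ATTRS = {"上海": {"NP", "PLA"}, "计划": {"NP"}}

if __name__ == "__main__":
    print("/".join(disambiguate(["他", "将来", "上海", "。"], ATTRS, RULES)))
    print("/".join(disambiguate(["他", "的", "将来", "计划"], ATTRS, RULES)))
```

Keeping the condition as an optional callable reflects the empty second part of the example rule while leaving room for rules that do carry a conditional function.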
The steps above are repeated on the remaining Chinese character string until all words have been cut out. The word blocks of the example sentence after segmentation are:
2008/年/，/小/李/晋升/为/这/个/项目/的/总/工程师/。/
2.4 Part-of-speech tagging and phonetic transcription
Fig. 3 shows the process of part-of-speech tagging and phonetic transcription of word blocks, which is as follows:
For each Chinese word block, the dictionary is queried and the dictionary information of the word is retrieved. For example, the first Chinese word block of the current sentence, 年, is represented in the dictionary as follows:
$
TIM:(NCGEN, nian) S (L, (1,1), [AP; Q; WH; R]) "nian2"
AP:(AGEN) "nian2"
Here "$" is the first part of the Chinese entry, the Chinese word symbol part. "TIM:(NCGEN, nian)" is the second part, the grammatical-semantic attribute identifier part; it indicates that 年 can act as a time word (TIM) in a sentence. "S (L, (1,1), [AP; Q; WH; R])" is the third part, the context discrimination function part. It indicates that if 年 acts as a time word (TIM) in the sentence, the first word to its left must be an adjective (AP), a numeral (Q), an interrogative (WH) or a pronoun (R). "nian2" is the fourth part, the pinyin of the word.
In the sentence above, "2008" is a numeral (Q), which satisfies the first item for 年, so the part of speech TIM and the pinyin "nian2" are taken. Continuing in this way, the part-of-speech tagging and phonetic transcription result for the sentence is:
2008/Q/2008 年/TIM/nian2 ，/BD/， 小/AP/xiao3 李/R/li3 晋升/VP/jin4sheng1 为/SV/wei2 这/R/zhe4 个/L/ge4 项目/NP/xiang4mu4 的/DEF/de0 总/AP/zong3 工程师/NP/gong1cheng2shi1 。/BD/。
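The context check on the 年 entry can be illustrated with the following sketch. The data layout is an assumption: each sense is a dictionary with a part of speech, an optional context function and a pinyin, and "(1,1)" is read as "from the 1st to the 1st word on that side", which is consistent with the explanation above but is not spelled out in the text.

```python
# Assumed in-memory form of the dictionary entry for 年 shown above.
ENTRY_NIAN = [
    {"pos": "TIM", "ctx": ("L", (1, 1), {"AP", "Q", "WH", "R"}), "pinyin": "nian2"},
    {"pos": "AP", "ctx": None, "pinyin": "nian2"},
]


def check_context(ctx, pos_tags, i):
    """Return True if the context function holds for the word at index i."""
    if ctx is None:
        return True  # no context restriction
    side, (lo, hi), allowed = ctx
    step = -1 if side == "L" else 1
    for distance in range(lo, hi + 1):
        j = i + step * distance
        if 0 <= j < len(pos_tags) and pos_tags[j] in allowed:
            return True
    return False


def tag_word(entry, pos_tags, i):
    """Check the senses one by one and return the first (pos, pinyin) that fits."""
    for sense in entry:
        if check_context(sense["ctx"], pos_tags, i):
            return sense["pos"], sense["pinyin"]
    return None


if __name__ == "__main__":
    # Parts of speech known so far; index 1 is 年 itself, not yet tagged.
    print(tag_word(ENTRY_NIAN, ["Q", None], 1))
    print(tag_word(ENTRY_NIAN, ["VP", None], 1))
```

Because the senses are tried in order, an entry's more restrictive readings would be listed before its unrestricted fallback, as in the 年 entry.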
After part-of-speech tagging and phonetic transcription of the word blocks, segmentation ligature writing is performed by the process shown in Fig. 4, as follows:
(3) according to the text genre, invoke the corresponding ligature writing rule library, and perform combined ligature writing on the word blocks of step (2) based on the braille segmentation and ligature writing rules in that library;
The text in this example is of the modern genre, so the modern Chinese ligature writing rules are invoked. The tagged word blocks are taken in turn from left to right. When the current word block is "2008/Q/2008", the following rule matches:
S1{label:Q}S2{label:NP/L/TIM,length:1}||S1,S2
Here "S1{label:Q}S2{label:NP/L/TIM, length:1}" is the first part of the rule, the rule word block part. It states that the first word block in the rule is a numeral (Q) and the second is a noun (NP), measure word (L) or time word (TIM) of word length (length) 1. This rule has no conditional function. "S1,S2" is the third part of the rule, the ligature writing function part, which indicates that word blocks S1 and S2 are to be written together. The word blocks "2008/Q/2008" and "年/TIM/nian2" are therefore written together. The new word block after ligature writing is "2008年/QCH/2008nian2", where the QCH tag indicates a word block produced by ligature writing. The next candidate word block, "小/AP/xiao3", is then taken and matched against the ligature writing rules, and the steps above are repeated in turn, giving the word blocks after primary ligature writing:
2008年/QCH/2008nian2 ，/BD/， 小李/QCH/xiao3li3 晋升/VP/jin4sheng1 为/SV/wei2 这个/QCH/zhe4ge4 项目/NP/xiang4mu4 的/DEF/de0 总工程师/QCH/zong3gong1cheng2shi1 。/BD/。
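Applying the rule "S1{label:Q}S2{label:NP/L/TIM,length:1}||S1,S2" to tagged word blocks can be sketched as below. The representation of a rule as a list of slot dictionaries and of a word block as a (word, tag, pinyin) tuple is an assumption for illustration, and the conditional function is left out because the example rule has none.

```python
# Assumed encoding of: S1{label:Q}S2{label:NP/L/TIM,length:1}||S1,S2
RULE_Q_UNIT = [
    {"labels": {"Q"}, "length": None},
    {"labels": {"NP", "L", "TIM"}, "length": 1},
]


def slot_matches(block, slot):
    """A block is a (word, tag, pinyin) tuple; a slot constrains tag and length."""
    word, tag, _ = block
    return tag in slot["labels"] and (slot["length"] is None or len(word) == slot["length"])


def combine_once(blocks, rule):
    """Scan left to right and write together every window that matches the rule."""
    out, i, k = [], 0, len(rule)
    while i < len(blocks):
        window = blocks[i:i + k]
        if len(window) == k and all(slot_matches(b, s) for b, s in zip(window, rule)):
            word = "".join(b[0] for b in window)
            pinyin = "".join(b[2] for b in window)
            out.append((word, "QCH", pinyin))  # QCH marks a ligature-written block
            i += k
        else:
            out.append(blocks[i])
            i += 1
    return out


if __name__ == "__main__":
    tagged = [("2008", "Q", "2008"), ("年", "TIM", "nian2"), ("，", "BD", "，")]
    for word, tag, pinyin in combine_once(tagged, RULE_Q_UNIT):
        print(f"{word}/{tag}/{pinyin}")
```

A full rule library would hold many such rules per genre and try them in turn at each position; the sketch applies a single rule to keep the matching step visible.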
(4) based on the ligature writing corpus statistics library, perform a second round of combined ligature writing on the combined word blocks;
The word blocks after ligature combination are taken in turn from left to right and matched, by the longest-match principle, against the word blocks in the user lexicon and then the base lexicon. When a match succeeds, the blocks are written together, giving the word blocks after secondary ligature writing:
2008年/QCH/2008nian2 ，/BD/， 小李/QCH/xiao3li3 晋升/VP/jin4sheng1 为/SV/wei2 这个/QCH/zhe4ge4 项目/NP/xiang4mu4 的/DEF/de0 总工程师/QCH/zong3gong1cheng2shi1 。/BD/。
(5) Output the Chinese character string generated after word segmentation, phonetic transcription and ligature writing.
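The secondary combination in step (4) can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the implementation described here: word blocks are modelled as `(text, label, pinyin)` tuples, both lexicons are plain sets of surface strings, the user lexicon is tried (longest span first) before the basic lexicon, and `max_span` is a hypothetical limit on how many adjacent blocks may be merged. The lexicon entry used in the example is invented for illustration.

```python
def secondary_combine(blocks, user_lexicon, basic_lexicon, max_span=4):
    """Merge adjacent word blocks whose joined text is listed in a lexicon."""
    out, i = [], 0
    while i < len(blocks):
        merged = None
        # The user lexicon takes priority over the basic lexicon.
        for lexicon in (user_lexicon, basic_lexicon):
            # Longest match first: try the widest span, then shrink to 2 blocks.
            for span in range(min(max_span, len(blocks) - i), 1, -1):
                window = blocks[i:i + span]
                text = "".join(b[0] for b in window)
                if text in lexicon:
                    merged = (text, "QCH", "".join(b[2] for b in window))
                    i += span
                    break
            if merged:
                break
        if merged:
            out.append(merged)
        else:
            out.append(blocks[i])
            i += 1
    return out


blocks = [("这个", "QCH", "zhe4ge4"), ("项目", "NP", "xiang4mu4"), ("的", "DEF", "de0")]
user_lexicon = {"这个项目"}   # hypothetical user-defined entry
basic_lexicon = set()
result = secondary_combine(blocks, user_lexicon, basic_lexicon)
print(result)
```

With both lexicons empty the blocks pass through unchanged, which matches the example sentence above, where the secondary step leaves the word blocks as they were.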
Based on the above word segmentation, phonetic transcription and ligature writing method based on SC grammar, a corresponding device is realized, as shown in Figure 5. As the figure shows, the device is based on a dictionary library, a ligature corpus statistics library, a combination ligature rule base, a combinational ambiguity dictionary and a segmentation ambiguity rule base, and comprises a word segmentation module, a part-of-speech tagging and phonetic transcription module, a primary combination ligature module and a secondary combination ligature module. The word segmentation module and the part-of-speech tagging and phonetic transcription module are each connected to the dictionary library; the word segmentation module is also connected to the combinational ambiguity dictionary and to the segmentation ambiguity rule base; the primary combination ligature module is connected to the combination ligature rule base; and the secondary combination ligature module is connected to the ligature corpus statistics library.
The word segmentation module splits the input Chinese character string into independent word blocks based on the dictionary library. During segmentation it judges, from the overlapping ambiguity feature and the combinational ambiguity dictionary, whether the resulting word blocks are ambiguous, and resolves any ambiguity using the segmentation ambiguity rule base to obtain the correct word blocks.
The part-of-speech tagging and phonetic transcription module applies context function checks, based on the dictionary library, to the word blocks produced by the word segmentation module, and assigns each word block its correct part of speech and pinyin.
The primary combination ligature module performs combination ligature writing on the part-of-speech-tagged word blocks: based on the combination ligature rule base, it checks the conditional functions and obtains the combined word blocks.
The secondary combination ligature module looks up the word blocks produced by primary combination ligature writing in the ligature corpus statistics library, obtains the combined word blocks when a match is found, and outputs the word blocks together with their part-of-speech tags and phonetic transcription.
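The way the four modules are chained can be pictured with a small Python sketch. It only shows the data flow between the modules; the class name `LigaturePipeline`, its field names and the toy stand-in functions are all assumptions made for illustration, and the real modules would consult the libraries and rule bases listed above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Block = Tuple[str, str, str]  # (text, part-of-speech label, pinyin)


@dataclass
class LigaturePipeline:
    segment: Callable[[str], List[str]]                      # word segmentation module
    tag: Callable[[List[str]], List[Block]]                  # tagging and transcription module
    combine_primary: Callable[[List[Block]], List[Block]]    # rule-based ligature module
    combine_secondary: Callable[[List[Block]], List[Block]]  # lexicon-based ligature module

    def run(self, text: str) -> List[Block]:
        words = self.segment(text)
        tagged = self.tag(words)
        once = self.combine_primary(tagged)
        return self.combine_secondary(once)


# Toy stand-ins: whitespace splitting and a two-entry dictionary (hypothetical data).
TOY_DICT = {"小": ("AP", "xiao3"), "李": ("NP", "li3")}

pipeline = LigaturePipeline(
    segment=lambda s: s.split(),
    tag=lambda ws: [(w, *TOY_DICT[w]) for w in ws],
    combine_primary=lambda bs: bs,    # no rules applied in this toy setup
    combine_secondary=lambda bs: bs,  # no lexicon entries in this toy setup
)
output = pipeline.run("小 李")
print(output)
```

Passing each stage in as a callable keeps the libraries and rule bases outside the pipeline object, which fits the statement that they are maintained separately.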
Over time, people continually change existing usage and coin new words. The dictionary library, combinational ambiguity dictionary, ligature corpus statistics library, combination ligature rule base and segmentation ambiguity rule base can therefore all be maintained, so that they are continually updated and improved as the language evolves, which improves the accuracy of word segmentation.
Experimental result
The word segmentation, phonetic transcription and ligature writing method based on SC grammar effectively solves the problems of Chinese word segmentation ambiguity, ligature writing and correct phonetic transcription of polyphonic characters in Chinese-to-braille conversion, achieving efficient, intelligent translation of Chinese into braille. The translation accuracy is higher than 90%.
The present invention uses artificial intelligence techniques and integrates multiple analysis and processing strategies, such as rules and examples, to perform word segmentation, phonetic transcription and ligature writing on Chinese sentences efficiently and accurately, improving the accuracy of Chinese-to-braille translation. The invention designs a rule representation language based on SC grammar that is extensible, efficient in representation and human-friendly. This rule representation is general and can be extended to other natural language processing problems.

Claims (10)

1. A word segmentation, phonetic transcription and ligature writing method based on SC grammar, characterized in that it is based on a dictionary library, a combinational ambiguity dictionary, a segmentation ambiguity rule base, a combination ligature rule base and a ligature corpus statistics library, and comprises the following steps:
Step 1: receive the Chinese character string to be segmented and transcribed, together with its text genre type;
Step 2: segment the Chinese character string based on the dictionary library, and perform part-of-speech tagging and phonetic transcription on the segmented word blocks;
Step 3: according to the text genre type, call the corresponding combination ligature rule base and, based on the braille word ligature rules in that rule base, perform combination ligature writing on the word blocks from Step 2;
Step 4: based on the ligature corpus statistics library, perform secondary combination ligature writing on the combined word blocks;
Step 5: output the Chinese character string generated after word segmentation, phonetic transcription and ligature writing.
2. The word segmentation, phonetic transcription and ligature writing method based on SC grammar according to claim 1, characterized in that the dictionary library is used for Chinese word segmentation, part-of-speech tagging and phonetic transcription, and comprises Chinese word entries, grammatical and semantic attribute identifiers, context discrimination functions and the pinyin of each word.
3. The word segmentation, phonetic transcription and ligature writing method based on SC grammar according to claim 1, characterized in that the segmentation based on the dictionary library is completed by the following process:
a. with reference to the dictionary library, split the sentence into word blocks using the forward maximum matching algorithm;
b. judge overlapping ambiguity according to the intersection feature of the word blocks;
c. judge whether the word blocks are ambiguous based on the combinational ambiguity dictionary;
d. eliminate the ambiguity by inference according to the ambiguity rules;
e. output the word segmentation result.
4. The word segmentation, phonetic transcription and ligature writing method based on SC grammar according to claim 3, characterized in that the combinational ambiguity dictionary is used to identify word blocks that have combinational ambiguity, and the library contains the words that have combinational ambiguity.
5. The word segmentation, phonetic transcription and ligature writing method based on SC grammar according to any one of claims 3-4, characterized in that the segmentation ambiguity rule base is used to disambiguate word blocks by inference and obtain the correct word segmentation result, and comprises an ambiguous word block, a conditional function and a correct segmentation operation; the elimination of ambiguity by inference according to the ambiguity rules is completed by the following process:
a. for the current word block carrying an ambiguity mark, match the ambiguous word block part of the ambiguity rules;
b. if the match succeeds, check the conditional function;
c. if the condition check is satisfied, perform the correct segmentation operation;
d. output the correct word segmentation result.
6. The word segmentation, phonetic transcription and ligature writing method based on SC grammar according to claim 1, characterized in that the part-of-speech tagging and phonetic transcription of the segmented word blocks is completed by the following process:
a. for the current word block, retrieve the dictionary information of the word block from the dictionary library;
b. check the context functions one by one;
c. if a context check is satisfied, take the corresponding part of speech and pinyin.
7. The word segmentation, phonetic transcription and ligature writing method based on SC grammar according to claim 1, characterized in that the combination ligature rule base is used to perform combination ligature writing on the word blocks after segmentation and tagging, and comprises a rule word block part, a conditional function and a ligature operation; according to the text type, the combination ligature rule base is subdivided into a classical Chinese rule base and a modern Chinese rule base; performing combination ligature writing on the word blocks based on the combination ligature rules is completed by the following process:
a. for the current several word blocks, match the word block part of the combination ligature rules;
b. if the match succeeds, check the conditional function;
c. if the condition check is satisfied, perform the correct ligature operation;
d. output the word segmentation result after ligature writing.
8. The word segmentation, phonetic transcription and ligature writing method based on SC grammar according to claim 1, characterized in that the ligature corpus statistics library is used to perform secondary combination ligature writing on the word blocks already combined according to the combination ligature rules, and the library contains the word blocks that need to be combined and written together; the ligature corpus statistics library is subdivided into a basic lexicon and a user lexicon, wherein the basic lexicon contains general ligature word blocks and the user lexicon contains user-defined word blocks that need ligature writing; performing secondary combination ligature writing on the combined word blocks based on the ligature corpus statistics library is completed by the following process:
a. for the current word block, match against the user lexicon and then the basic lexicon, in that order;
b. if the match succeeds, perform the ligature combination;
c. output the word block result after ligature writing.
9. A word segmentation, phonetic transcription and ligature writing device based on SC grammar, characterized in that it is based on a dictionary library, a combinational ambiguity dictionary, a ligature corpus statistics library, a combination ligature rule base and a segmentation ambiguity rule base, and comprises, connected in sequence, a word segmentation module, a part-of-speech tagging and phonetic transcription module, a primary combination ligature module and a secondary combination ligature module; the word segmentation module and the part-of-speech tagging and phonetic transcription module are each connected to the dictionary library; the word segmentation module is also connected to the combinational ambiguity dictionary and to the segmentation ambiguity rule base; the primary combination ligature module is connected to the combination ligature rule base; and the secondary combination ligature module is connected to the ligature corpus statistics library;
the word segmentation module splits the input Chinese character string into independent word blocks based on the dictionary library, judges during segmentation, from the overlapping ambiguity feature and the combinational ambiguity dictionary, whether the resulting word blocks are ambiguous, and resolves any ambiguity using the segmentation ambiguity rule base to obtain the correct word blocks;
the part-of-speech tagging and phonetic transcription module applies context function checks, based on the dictionary library, to the word blocks produced by the word segmentation module, and assigns each word block its correct part of speech and pinyin;
the primary combination ligature module performs combination ligature writing on the part-of-speech-tagged word blocks and, based on the combination ligature rule base, checks the conditional functions to obtain the combined word blocks;
the secondary combination ligature module looks up the word blocks produced by primary combination ligature writing in the ligature corpus statistics library, obtains the combined word blocks when a match is found, and outputs the word blocks together with their part-of-speech tags and phonetic transcription.
10. The word segmentation, phonetic transcription and ligature writing device based on SC grammar according to claim 9, characterized in that the dictionary library, combinational ambiguity dictionary, ligature corpus statistics library, combination ligature rule base and segmentation ambiguity rule base can all be maintained, so that they are continually updated and improved as the times develop, thereby improving the accuracy of word segmentation.
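For readers unfamiliar with the forward maximum matching split and the overlapping-ambiguity check named in claim 3, the following Python sketch illustrates the general technique. It is not taken from the patent: the function names, the `max_len` limit, the toy lexicon and the specific intersection test (a proper suffix of the left word plus a prefix of the right word forming another dictionary word) are assumptions chosen for illustration.

```python
def fmm_segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def overlap_ambiguous(left, right, lexicon):
    """True if a proper suffix of `left` plus a prefix of `right` is also a word."""
    for k in range(1, len(left)):
        for m in range(1, len(right) + 1):
            if left[k:] + right[:m] in lexicon:
                return True
    return False


lexicon = {"研究", "研究生", "生命", "命"}  # toy lexicon, for illustration only
words = fmm_segment("研究生命", lexicon)
flags = [overlap_ambiguous(a, b, lexicon) for a, b in zip(words, words[1:])]
print(words, flags)
```

Here the greedy split yields 研究生 followed by 命, while 生命 is also a dictionary word spanning the boundary, so the pair is flagged as a candidate for the rule-based disambiguation step.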
CN201510994505.9A 2015-12-23 2015-12-25 Word segmentation phonetic transcription and ligature writing method and device based on SC grammar Pending CN105630770A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510994505.9A CN105630770A (en) 2015-12-23 2015-12-25 Word segmentation phonetic transcription and ligature writing method and device based on SC grammar

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201510977335 2015-12-23
CN2015109773353 2015-12-23
CN201510994505.9A CN105630770A (en) 2015-12-23 2015-12-25 Word segmentation phonetic transcription and ligature writing method and device based on SC grammar

Publications (1)

Publication Number Publication Date
CN105630770A true CN105630770A (en) 2016-06-01

Family

ID=56045727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510994505.9A Pending CN105630770A (en) 2015-12-23 2015-12-25 Word segmentation phonetic transcription and ligature writing method and device based on SC grammar

Country Status (1)

Country Link
CN (1) CN105630770A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1591414A (en) * 2004-06-03 2005-03-09 华建电子有限责任公司 Automatic translating converting method for Chinese language to braille
US20080027933A1 (en) * 1999-10-20 2008-01-31 Araha, Inc. System and method for location, understanding and assimilation of digital documents through abstract indicia
CN101135940A (en) * 2007-09-07 2008-03-05 中国科学院计算技术研究所 Braille computer pointing words input system, device and method thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
陈优阳: "汉盲翻译中的分词连写处理算法研究", 《网络安全技术与应用》 *
黄河燕 等: "基于多知识分析的汉盲转换算法", 《语言计算与基于内容的文本处理——全国第七届计算语言学联合学术会议论文集》 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107368474B (en) * 2017-07-07 2020-08-04 浙江理工大学 Automatic efficient translation and conversion method from Chinese to braille
CN107368474A (en) * 2017-07-07 2017-11-21 浙江理工大学 A kind of automatical and efficient translation conversion method of Chinese to braille
CN107424612B (en) * 2017-07-28 2021-07-06 北京搜狗科技发展有限公司 Processing method, apparatus and machine-readable medium
CN107424612A (en) * 2017-07-28 2017-12-01 北京搜狗科技发展有限公司 Processing method, device and machine readable media
CN108255815A (en) * 2018-02-07 2018-07-06 苏州金螳螂文化发展股份有限公司 The segmenting method and device of text
CN109271625A (en) * 2018-08-28 2019-01-25 江苏省基础地理信息中心 A kind of phonetic spelling normalization method of Chinese place name
CN109271625B (en) * 2018-08-28 2023-07-14 江苏省基础地理信息中心 Pinyin spelling standardization method for Chinese place names
CN110222182A (en) * 2019-06-06 2019-09-10 腾讯科技(深圳)有限公司 A kind of statement classification method and relevant device
CN110222182B (en) * 2019-06-06 2022-12-27 腾讯科技(深圳)有限公司 Statement classification method and related equipment
CN111274806A (en) * 2020-01-20 2020-06-12 医惠科技有限公司 Method and device for recognizing word segmentation and part of speech and method and device for analyzing electronic medical record
CN111274806B (en) * 2020-01-20 2020-11-06 医惠科技有限公司 Method and device for recognizing word segmentation and part of speech and method and device for analyzing electronic medical record
US11074419B1 (en) * 2020-07-06 2021-07-27 Morgan Stanley Services Group Inc. Systems and methods for providing online content in braille
CN113065002A (en) * 2021-04-19 2021-07-02 北京理工大学 Chinese semantic disambiguation method based on knowledge graph and context
CN113065002B (en) * 2021-04-19 2022-10-14 北京理工大学 Chinese semantic disambiguation method based on knowledge graph and context

Similar Documents

Publication Publication Date Title
CN105630770A (en) Word segmentation phonetic transcription and ligature writing method and device based on SC grammar
CN107463553B (en) Text semantic extraction, representation and modeling method and system for elementary mathematic problems
Candito et al. Benchmarking of statistical dependency parsers for french
US20110040553A1 (en) Natural language processing
Priyadarshi et al. Towards the first Maithili part of speech tagger: Resource creation and system development
Sibarani et al. A study of parsing process on natural language processing in bahasa Indonesia
CN103927179A (en) Program readability analysis method based on WordNet
Parameswarappa et al. Kannada word sense disambiguation using decision list
Shafi et al. UNLT: Urdu natural language toolkit
CN102929865A (en) PDA (Personal Digital Assistant) translation system for inter-translating Chinese and languages of ASEAN (the Association of Southeast Asian Nations) countries
Sun et al. Towards accurate and efficient Chinese part-of-speech tagging
Lo et al. Cool English: A grammatical error correction system based on large learner corpora
CN102135957A (en) Clause translating method and device
Khoufi et al. Chunking Arabic texts using conditional random fields
CN105045784A (en) English expression access device method and device
Gambäck et al. Experiences with developing language processing tools and corpora for Amharic
Sarkar et al. Bengali noun phrase chunking based on conditional random fields
Ariaratnam et al. A shallow parser for Tamil
Tedla Tigrinya morphological segmentation with bidirectional long short-term memory neural networks and its effect on English-Tigrinya machine translation
Shetty et al. An approach to identify Indic languages using text classification and natural language processing
WO2008017188A1 (en) System and method for making teaching material of language class
Lu et al. Language model for Mongolian polyphone proofreading
Altenbek et al. Identification of basic phrases for kazakh language using maximum entropy model
Sodhar et al. Word by Word Labelling of Romanized Sindhi Text by using Online Python Tool
Duo et al. Transition based neural network dependency parsing of Tibetan

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160601