CN101458682A - Mapping method based on Chinese character and Japanese Chinese character and use thereof - Google Patents

Info

Publication number
CN101458682A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008101631551A
Other languages
Chinese (zh)
Inventor
黄勤
孙宝乐
陈磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Fangjie Information Technology Co Ltd
Original Assignee
Hangzhou Fangjie Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Fangjie Information Technology Co Ltd
Priority to CNA2008101631551A
Publication of CN101458682A

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a method for mapping between the Chinese characters used in China and the Chinese characters (kanji) used in Japan, and applications of the method. China and Japan both belong to the Chinese-character cultural sphere and habitually write with Chinese characters, and the two countries' characters share many common and similar forms. The method matches Chinese and Japanese characters by comparing, in turn, glyph shape, UNICODE code, and pronunciation, and so builds a mapping table in which each character of one country is represented by a character of the other. The characters of a cross-border short message between China and Japan can then be mapped one by one, so that the recipient's handset displays the mapped message correctly and its original meaning is preserved. The invention effectively solves the garbled-text and missing-character problems in short message communication between China and Japan. The mapping table covers a large number of characters, is fast to look up, and is extensible. It is convenient for short message communication between the two countries and is also suitable for any setting in which one country's characters must be represented with the other's.

Description

A mapping method based on Chinese characters and Japanese kanji, and its application
Technical field
The present invention relates to text conversion technology in the field of communications, and in particular to a mapping method based on Chinese characters and Japanese kanji and the application of this method to short message (SMS) exchange.
Background art
China and Japan, close neighbors separated only by a narrow strip of water, have a long history of exchange. As bilateral relations have warmed again, the number of people travelling between the two countries has risen rapidly: from 2005 to 2007, tourist visits between the two countries grew from 4.04 million to more than 5 million. More than one million Chinese workers, students, trainees, and long-term residents live in Japan ("Japanese Foreign Ministry white paper 2007"), and according to Japanese Foreign Ministry statistics, more than 110,000 Japanese nationals were long-term residents of China in 2006. These people travel between the two countries for family visits and business. As attitudes in both countries have become more open, the number of international marriages has also increased noticeably. All of this has greatly increased the demand for communication between people in the two countries.
Short messaging is one of the most widely used mobile services. It is fast, inexpensive, and convenient, and is very popular in both China and Japan. From 2000 to 2007, China's annual short message volume was 1.0 billion, 18.9 billion, 90.0 billion, 137.1 billion, 217.7 billion, 304.6 billion, 429.6 billion, and 592.1 billion messages respectively, and in 2007 an average of 1.7 billion messages were sent per day. In Japan, 85% of the population owns a mobile phone, and its short message usage ranks first in the world. Connecting the short message services of China and Japan would provide a convenient channel for exchange between the two countries.
In the usual short message workflow, the sender writes the message content and submits it to the destination number. The message travels through the base station to the operator's short message service center (SMSC), which routes it according to the destination number, and it is finally delivered to the recipient and decoded to recover the content. This process only transmits the information and does not process its content. That suits communication within a single country and effectively protects the freedom and privacy of correspondence. For cross-border messages, however, the operators and handsets of each country give priority to their own national script and support other scripts poorly. If a message is still transmitted by the domestic workflow, the recipient is likely to see an incomplete or entirely undisplayable message and cannot recover its meaning, so the exchange fails and the recipient is left confused. A domestic handset not only cannot display the other country's script, it also cannot be used to write it: usually Japanese cannot be entered on a Chinese handset, and likewise Chinese cannot be entered on a Japanese handset.
This problem hampers written communication between the two countries, and various organizations have tried to solve or reduce it, producing several methods, chiefly translation, romanization (pinyin), and character splitting. These methods are instructive, but each has shortcomings.
The basic requirements of cross-border short message exchange are that the message is transmitted quickly, is received successfully, and can be understood by the other party. Understanding does not require that the received text be identical to the text that was sent; the text may be converted into another form suitable for short messages, provided the recipient can grasp the meaning from that form. The problem is eased only if one country's script is converted into the other's, that is, if Chinese is converted into characters that exist in Japanese, or Japanese into characters that exist in Chinese.
Translation is often used for cross-border exchange. Because it converts one country's text into another's according to meaning, the script is converted along the way, which avoids garbled source text on the recipient's handset. However, existing machine translation tools are not accurate enough and are not yet widely used for this purpose. Common machine translation tools are little more than automatic dictionaries: they translate the text word by word and splice the results together in order. Even Google Translate, the most advanced tool currently available, must first store text amounting to billions of words and then build its translation models with statistical learning techniques, and its output still does not reach the fluency of a native speaker, let alone the skill of a professional translator ("Google Translate frequently asked questions and answers"). Translation today is therefore mostly done by people. But human translators cannot fully understand the context shared by the two parties and are affected by their working mood, so the accuracy and reliability of the result cannot be guaranteed. Human translation also requires access to both parties' original text, so their privacy cannot be protected. For these reasons human translation has not been adopted for short message exchange either.
Existing handsets and operators all support letters and digits, so romanization (Hanyu pinyin for Chinese characters and romaji for Japanese kanji) quickly removes the garbled-text problem without any script conversion. It is the most obvious solution. However, the mapping between romanization and characters suffers from homophones: one sound corresponds to many characters. Homophones among everyday words can seriously disrupt understanding. In Chinese, for example, the pinyin "shanxi" corresponds to both 山西 (Shanxi) and 陕西 (Shaanxi). In Japanese, the romaji "kenkou" corresponds to eight words, including 健康 ("health"), 建工, and 兼行. With romanization, the reader must judge from context which characters are meant, and, worse, the context itself is written in romanization and has the same ambiguity. Text written in romanization is therefore easily misunderstood, and its accuracy is low.
To address homophones (mainly a problem in Chinese pinyin), the method was refined into tone-numbered pinyin, in which a digit is appended to the pinyin letters to indicate the tone: "0" for the neutral tone, "1" for the first (high level) tone, "2" for the second (rising) tone, "3" for the third (falling-rising) tone, and "4" for the fourth (falling) tone. For example, "ni", "nī", "ní", "nǐ", and "nì" are written "ni0", "ni1", "ni2", "ni3", and "ni4". Tone-numbered pinyin reduces the likelihood of homophones but cannot eliminate them. Taking "ni2" as an example, it still corresponds to at least 泥 ("mud"), 倪 (the surname Ni), and 霓 ("secondary rainbow").
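The tone-number notation can be illustrated with a minimal Python sketch. It assumes the input syllable carries standard Unicode tone marks; the function name `to_tone_number` and the `HOMOPHONES` excerpt are hypothetical and are not part of the patent.

```python
import unicodedata

# Combining marks that carry the four Mandarin tones once a syllable is decomposed.
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}

def to_tone_number(syllable: str) -> str:
    """Rewrite a tone-marked syllable as letters plus a tone digit (0 = neutral)."""
    tone = 0
    letters = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]  # remember the tone, drop the mark
        else:
            letters.append(ch)
    # Recompose so that letters such as "ü" stay a single character.
    return unicodedata.normalize("NFC", "".join(letters)) + str(tone)

# Hypothetical excerpt of a homophone table keyed by tone-numbered pinyin.
HOMOPHONES = {"ni2": ["泥", "倪", "霓"]}

print([to_tone_number(s) for s in ["ni", "nī", "ní", "nǐ", "nì"]])
print(HOMOPHONES[to_tone_number("ní")])
```

Even with the tone digit, the lookup still returns several characters, which is the residual ambiguity the text describes.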
Chinese pinyin distinguishes front and back nasal finals, and Japanese romaji distinguishes voiceless and voiced sounds, so writers easily omit a required letter or add a spurious one. In Chinese, "shanliang" ("kind-hearted") is easily written as "shangliang"; in Japanese, "kakkou" ("just right") is easily written as "kakou" ("processing"). In addition, because of pronunciation differences between northern and southern China, "N" and "H" are easily pronounced as "L" and "F", which affects how messages are written and recognized. For example, "wo zai hu zhou deng ni" ("I am waiting for you in Huzhou") may be written as "wo zai fu zhou deng ni" ("I am waiting for you in Fuzhou"): a tiny error that leads far astray.
Romanization is a system for writing down pronunciation. It is normally used only as an aid for learning characters and cannot replace them. Text written in romanization is hard to understand fully; it is not a formal way of writing but a reading aid.
Romanization is made up of letters, and most Chinese characters need three or more letters, which takes more space than a two-byte Chinese character. Within the limited space of a message, romanization therefore conveys less information than Chinese characters.
Another method is character splitting: a character that the other party's character set cannot display is split into the simplest combination of radicals that the other party's character set can display. For example, the Chinese character 你 ("you") is split into 亻 and 尔. This works for characters with a left-right structure, but for characters with a top-bottom, half-enclosed, fully enclosed, or compound structure the split form does not let the reader intuitively recognize the original character. The method damages the text, reducing characters to patchworks of parts.
Communicating in Chinese characters is customary for both Chinese and Japanese people; it is readily accepted and is the first choice for writing and reading. Chinese characters form a logographic (morpho-syllabic) writing system, and conveying meaning through the character itself is one of its key features. People who know Chinese characters prefer to use them to convey information. When reading, they obtain the meaning directly from the characters, without the second conversion step that romanization requires and that lengthens the time needed to take in the information. Practice shows that Chinese characters convey the intended meaning more accurately than romanization and are easier to recognize.
Compare the commonly used Chinese characters of China (the GB2312 character set, whose full name is "Chinese Character Coded Character Set for Information Interchange, Basic Set", the national standard for simplified Chinese characters in mainland China) with the commonly used kanji of Japan (the Shift_JIS character set, a character encoding commonly used in Japan). First, many of the characters are identical or similar in glyph shape. Second, some characters can be matched through the traditional forms of Chinese characters by way of their UNICODE codes. Only the remaining characters differ markedly and must be associated in some other way. These characters can therefore be organized into a mapping table between Chinese characters and Japanese kanji, and the table can then be used to map between the two scripts: from GB2312 Chinese to Shift_JIS Japanese, or from Shift_JIS Japanese to GB2312 Chinese.
Summary of the invention
The object of the invention is to address the deficiencies of the prior art by providing a mapping method based on Chinese characters and Japanese kanji, and its application.
The object is achieved by the following technical solution: a mapping method based on Chinese characters and Japanese kanji that first generates a mapping table and then maps between Chinese characters and Japanese kanji according to that table. The mapping table is generated by the following steps:
(1) obtain and organize the GB2312 and Shift_JIS character tables;
(2) perform identical-glyph mapping;
(3) perform UNICODE code mapping;
(4) perform similar-glyph mapping;
(5) perform pronunciation mapping.
Further, step (2) is specifically as follows. Taking the GB2312 characters in the GB2312 text file as the reference, the characters in the Shift_JIS column of the Shift_JIS text file are matched against them. The glyphs in the two character tables are compared one by one by exhaustive search, and whenever a pair with identical glyphs is found, the pair is extracted and put into the mapping table. The remaining Chinese characters are saved as the text file of the Chinese character table GB2312-A.
Further, step (3) is specifically as follows. The characters in the GB2312 column of the GB2312-A text file are compared with the characters in the Shift_JIS column of the Japanese character table Shift_JIS text file. When the condition is met, the pair is extracted and merged into the mapping table produced by the identical-glyph step. The remaining Chinese characters are saved as the text file of the Chinese character table GB2312-B.
Further, step (4) is specifically as follows. The characters to be mapped are taken from the GB2312 column of the GB2312-B text file. Each character's radical composition and stroke order are analyzed and the character is given a score. Characters with a close score, obtained by the same procedure, are then sought in the Shift_JIS column of the Shift_JIS text file. When more than one candidate is found, the most suitable one is selected and added to the mapping table. The remaining Chinese characters are saved as the text file of the Chinese character table GB2312-C.
Further, step (5) is specifically as follows. The characters to be mapped are taken from the GB2312 column of the GB2312-C text file and their pronunciations are obtained; for a character with several readings, the most common reading is used. Characters with the same or a similar pronunciation, obtained by the same procedure, are then sought in the Shift_JIS column of the Shift_JIS text file. When more than one candidate is found, the most suitable one is selected and added to the mapping table.
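The cascade of steps (2) to (5), in which each step maps what it can and hands the remainder to the next, can be sketched as follows. This is only an illustration: `lookup_stage`, `build_table`, and the tiny tables inside `STAGES` are hypothetical stand-ins for the real matchers and data, which the patent does not give in code form.

```python
def lookup_stage(pairs):
    """Build a stage from a dict of known pairs (a stand-in for a real matcher)."""
    def stage(chars):
        mapped = {c: pairs[c] for c in chars if c in pairs}
        rest = [c for c in chars if c not in mapped]
        return mapped, rest
    return stage

def build_table(chars, stages):
    """Run the stages in order; each one only sees what earlier stages left over."""
    table, remaining = {}, list(chars)
    for name, stage in stages:
        mapped, remaining = stage(remaining)
        table.update(mapped)
        print(f"{name}: mapped {len(mapped)}, remaining {len(remaining)}")
    return table, remaining

# Hypothetical excerpts; the real steps work on the full character tables.
STAGES = [
    ("identical glyph", lookup_stage({"正": "正", "羽": "羽"})),
    ("UNICODE code", lookup_stage({"伞": "傘"})),
    ("similar glyph", lookup_stage({})),
    ("pronunciation", lookup_stage({})),
]

table, leftover = build_table(["正", "羽", "伞", "们"], STAGES)
```

The lists handed from one stage to the next play the role of the intermediate files GB2312-A, GB2312-B, and GB2312-C.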
The application of the above mapping method to short message exchange is as follows. The sender writes the message content and the destination number, and the message is transmitted through the base station to the operator's short message service center. The service center forwards the message to the mapping table node, which obtains the content and destination number, maps the content with the mapping table to produce the target message content, and then submits the mapped content to the destination number.
Compared with the background art, the invention has the following beneficial effects.
The mapping method reveals the intrinsic relationship between Chinese and Japanese text: Chinese characters and Japanese kanji both derive from ancient Chinese characters and have much in common in glyph shape and meaning. China and Japan both belong to the Chinese-character cultural sphere, are similar in ways of thinking and usage habits, and are both accustomed to writing and reading Chinese characters.
Although Chinese characters and Japanese kanji share an origin, they have evolved along different paths and to different degrees, so some characters cannot be mapped directly. To avoid missing characters in a message, these characters are handled in a special way, as "interchangeable characters" (tongjiazi). Interchangeable characters occasionally appear in everyday writing, and the two parties can infer the original character from grammar, general knowledge, and the setting of the exchange; such inference is more accurate than reading text written in romanization.
Unlike translation, the invention does not interpret the meaning of the message content; it only maps the message characters one by one. The process can be carried out entirely by a program, without human involvement, which effectively protects the privacy of user communications.
The mapping table built by this method is comprehensive and contains many entries. The table can be loaded into memory and the mapping computed there. It can also run on a symmetric multiprocessing (SMP) hardware platform, where parallel mapping raises throughput.
Description of drawings
Fig. 1 is a flow chart of the mapping method based on Chinese characters and Japanese kanji;
Fig. 2 is a block diagram of identical-glyph mapping;
Fig. 3 is a block diagram of UNICODE code mapping;
Fig. 4 is a block diagram of similar-glyph mapping;
Fig. 5 is a block diagram of similar-pronunciation mapping;
Fig. 6 is a schematic diagram of mapping Chinese characters to Japanese kanji through the mapping table;
Fig. 7 is a schematic diagram of mapping Japanese kanji to Chinese characters through the mapping table.
Detailed description of the embodiments
The invention is further described below with reference to the accompanying drawings.
As shown in Fig. 1, the mapping method based on Chinese characters and Japanese kanji consists of two stages: first, generating the mapping table; second, mapping between Chinese characters and Japanese kanji according to the table.
The mapping table is built by the following steps:
1. obtain and organize the GB2312 and Shift_JIS character tables;
2. perform identical-glyph mapping;
3. perform UNICODE code mapping;
4. perform similar-glyph mapping;
5. perform pronunciation mapping.
The following takes mapping from Chinese characters to Japanese kanji as an example. The Chinese characters serve as the reference, and Japanese kanji are matched against them.
Obtain the Chinese GB2312 code table from the Internet and convert the grid-form correspondences in the table into code pairs. Each code pair consists of a GB2312 code, a separator, and the character for that code, for example "C9A1 伞" and "B8F1 格". Obtain CP936.txt from the official Unicode website (CP936 is the code page alias of GB2312; http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/cp936.txt). Its layout is: column 1 is the CP936 code (hexadecimal, 0xXXXX), column 2 is the Unicode code (hexadecimal, 0xYYYY), and column 3 is the name or character of the Unicode code point. Join the GB2312 codes in the GB2312 table with the CP936 codes in the CP936 table to obtain a GB2312 character table that also carries the UNICODE code, and store it in a text file named GB2312. Each row has the form Unicode code, GB2312 code, GB2312 character, delimited by the separator; two example rows are "4F1E C9A1 伞" and "683C B8F1 格" (without the quotation marks). The organized table has about 8,000 rows; after adding some commonly used characters that are not in GB2312 but are in GBK, it has about 8,995.
Obtain the Japanese Shift_JIS code table from the Internet and convert the grid-form correspondences in the table into code pairs. Each code pair consists of a Shift_JIS code, a separator, and the character for that code, for example "8948 羽" and "90B3 正". Obtain CP932.txt from the official Unicode website (CP932 is the code page alias of Shift_JIS; http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/cp932.txt). Its layout is: column 1 is the CP932 code (hexadecimal, 0xXXXX), column 2 is the Unicode code (hexadecimal, 0xYYYY), and column 3 is the name or character of the Unicode code point. Join the Shift_JIS codes in the Shift_JIS table with the CP932 codes in the CP932 table to obtain a Shift_JIS character table that also carries the UNICODE code, and store it in a text file named Shift_JIS. Each row has the form Unicode code, Shift_JIS code, Shift_JIS character, delimited by the separator; two example rows are "7FBD 8948 羽" and "6B63 90B3 正" (without the quotation marks). The organized table has about 9,000 rows; after adding some commonly used kanji that are not in Shift_JIS but are in JIS, it has about 9,397.
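As a rough sketch of the table preparation described above, the following Python parses text in the three-column layout (native code, Unicode code, comment) and emits rows of the form Unicode code, native code, character. The sample lines are written by hand in that layout rather than downloaded, and `parse_mapping` is a hypothetical helper name.

```python
# Hand-written sample in the three-column layout; real files are far longer.
SAMPLE_CP936 = """\
# native code, Unicode code, comment
0xB8F1\t0x683C\t#CJK UNIFIED IDEOGRAPH
0xC9A1\t0x4F1E\t#CJK UNIFIED IDEOGRAPH
0x81\t\t#DBCS LEAD BYTE
"""

def parse_mapping(text):
    """Return (Unicode hex, native hex, character) rows from mapping-file text."""
    rows = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments
        if not data:
            continue
        fields = data.split()
        if len(fields) < 2:
            continue  # lines without a Unicode value carry no mapping
        native, uni = int(fields[0], 16), int(fields[1], 16)
        rows.append((f"{uni:04X}", f"{native:04X}", chr(uni)))
    return rows

for row in parse_mapping(SAMPLE_CP936):
    print(" ".join(row))
```

The same function would serve for the Japanese table, since both files are described as using the same column layout.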
The largest and easiest portion, the characters with identical glyphs, is mapped first, as shown in Fig. 2. Taking the GB2312 characters in the GB2312 text file as the reference, the characters in the Shift_JIS column of the Shift_JIS text file are matched against them. The glyphs in the two character tables are compared one by one by exhaustive search, and whenever a pair with identical glyphs is found, the pair is extracted and put into the mapping table. Each row of the mapping table consists of the UNICODE code, GB2312 code, GB2312 character, Shift_JIS code, and Shift_JIS character, delimited by a separator. The remaining Chinese characters are saved as the text file of the Chinese character table GB2312-A for the next mapping step. This step resolves 60% of the GB2312 table and 57% of the Shift_JIS table.
Both Chinese and Japanese text contain digits, letters, and symbols, so these characters are handled in the same way as identical glyphs. Japanese hiragana and katakana are handled this way as well.
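The identical-glyph step can be sketched in Python as below. Two assumptions are made here: the character tables are derived from Python's built-in `gb2312` and `shift_jis` codecs instead of the text files, and a dictionary lookup replaces the exhaustive pairwise comparison, which gives the same pairs when the test is exact equality. The function names are hypothetical.

```python
def table_rows(chars, codec):
    """Rows of (Unicode hex, native code hex, character) for the given codec."""
    return [(f"{ord(c):04X}", c.encode(codec).hex().upper(), c) for c in chars]

def map_identical(gb_rows, sjis_rows):
    """Pair characters present in both tables; return (mapping rows, remainder)."""
    sjis_code = {char: code for _, code, char in sjis_rows}
    mapped, remaining = [], []
    for uni, gb_code, char in gb_rows:
        if char in sjis_code:
            # Row layout: Unicode, GB2312 code, GB2312 char, Shift_JIS code, Shift_JIS char
            mapped.append((uni, gb_code, char, sjis_code[char], char))
        else:
            remaining.append((uni, gb_code, char))
    return mapped, remaining

gb_rows = table_rows("正羽伞", "gb2312")
sjis_rows = table_rows("正羽傘", "shift_jis")
mapped, gb2312_a = map_identical(gb_rows, sjis_rows)
print(mapped)
print(gb2312_a)  # left over for the next step, like the GB2312-A file
```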
Fig. 3 shows the UNICODE code mapping step, which mainly handles characters that correspond between Chinese and Japanese as simplified and traditional forms or that have the same meaning. When the Unicode Consortium designed its character repertoire it took the correspondence among Chinese, Japanese, and Korean (CJK) characters into account, and this can be exploited to build the mapping table. As in identical-glyph mapping, the characters in the GB2312 column of the remaining Chinese character table GB2312-A are compared with the characters in the Shift_JIS column of the Japanese character table Shift_JIS. When the condition is met, the pair is extracted and merged into the mapping table produced by the identical-glyph step. The remaining Chinese characters are saved as the text file of the Chinese character table GB2312-B for the next mapping step. This step resolves 28% of the GB2312 table and 27% of the Shift_JIS table.
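The text does not spell out the exact matching condition for this step. One possible reading, sketched below, is a lookup through a simplified-to-traditional variant table followed by a check that the variant exists in Shift_JIS. The `VARIANTS` excerpt and the function names are hypothetical; Python's standard library has no such variant data, so a real table would have to come from an external source.

```python
# Hypothetical excerpt of a simplified-to-traditional variant table.
VARIANTS = {"伞": "傘", "语": "語", "马": "馬"}

def in_codec(ch, codec):
    """True if the character can be encoded in the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

def map_by_variant(chars, variants=VARIANTS):
    """Map each character to its variant when that variant exists in Shift_JIS."""
    mapped, remaining = {}, []
    for ch in chars:
        target = variants.get(ch)
        if target is not None and in_codec(target, "shift_jis"):
            mapped[ch] = target
        else:
            remaining.append(ch)  # handed on, like the GB2312-B file
    return mapped, remaining

mapped, gb2312_b = map_by_variant("伞语们")
print(mapped, gb2312_b)
```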
The mapping table formed by these two rounds of matching already contains most commonly used characters and is sufficient for ordinary daily communication.
A small number of characters remain. Because they follow no clear rule, they are mapped by more complex methods with manual correction, and these mappings can be adjusted according to everyday usage.
Fig. 4 shows the similar-glyph comparison. Similar glyphs differ from identical glyphs in that, among the several radicals that make up the character, one or more differ. In everyday life such characters are often mistaken for one another as "wrong characters"; although they are easily confused in pronunciation, their glyphs are still clearly distinguishable. In this step, the characters to be mapped are first taken from the GB2312 column of the GB2312-B text file. Each character's radical composition and stroke order are analyzed and the character is given a score. Characters with a close score, obtained by the same procedure, are then sought in the Shift_JIS column of the Shift_JIS text file. When more than one candidate is found, the most suitable one must be selected manually before it is added to the mapping table. The remaining Chinese characters are saved as the text file of the Chinese character table GB2312-C for the next mapping step. This step resolves 9% of the GB2312 table and 8% of the Shift_JIS table.
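The scoring rule itself is not specified in the text. As a stand-in, the sketch below scores a pair of characters by the Jaccard overlap of their component sets, which is a different technique from the per-character score described above and is used here only to show the candidate-selection flow. The `COMPONENTS` data, the threshold, and the example characters are all hypothetical and do not represent mappings from the patent.

```python
# Hypothetical decomposition data: character -> set of components.
COMPONENTS = {
    "们": {"亻", "门"},
    "何": {"亻", "可"},
    "仲": {"亻", "中"},
    "問": {"門", "口"},
}

def similarity(a, b):
    """Jaccard overlap of two characters' component sets (0.0 to 1.0)."""
    ca, cb = COMPONENTS[a], COMPONENTS[b]
    return len(ca & cb) / len(ca | cb)

def similar_candidates(char, pool, threshold=0.3):
    """Candidates at or above the threshold, best score first."""
    scored = [(other, similarity(char, other)) for other in pool]
    keep = [(other, score) for other, score in scored if score >= threshold]
    return sorted(keep, key=lambda item: item[1], reverse=True)

candidates = similar_candidates("们", ["何", "仲", "問"])
needs_manual_choice = len(candidates) > 1  # more than one candidate: a person decides
print(candidates, needs_manual_choice)
```

When the candidate list is empty, the character would fall through to the pronunciation step.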
When the score of a candidate deviates too far from the score of the reference character, the two characters are considered unsuitable for glyph matching, and the character is matched by pronunciation instead, as shown in Fig. 5. The characters to be mapped are taken from the GB2312 column of the GB2312-C text file and their pronunciations are obtained (ignoring tone); for a character with several readings, the most common reading is used. Characters with the same or a similar pronunciation (ignoring tone), obtained by the same procedure, are then sought in the Shift_JIS column of the Shift_JIS text file. When more than one candidate is found, the most suitable one must be selected manually from the candidates before it is added to the mapping table. This step resolves about 2% of the GB2312 table and about 2% of the Shift_JIS table.
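A minimal sketch of the pronunciation step follows. It assumes toneless romanized readings for both sides, compares them as plain strings, and uses `difflib.SequenceMatcher` as one possible measure of "similar" pronunciation; the text does not say how similarity is judged. The reading tables are hypothetical excerpts, with the most common Chinese reading listed first.

```python
from difflib import SequenceMatcher

# Hypothetical excerpts; the first Chinese reading is taken as the most common.
CN_READINGS = {"您": ["nin"], "哪": ["na", "nei"], "装": ["zhuang"]}
JP_READINGS = {"那": "na", "忍": "nin", "人": "nin"}

def sound_candidates(char, threshold=0.8):
    """Japanese characters whose reading equals, or closely resembles, the Chinese one."""
    reading = CN_READINGS[char][0]  # most common reading, tone ignored
    exact = [k for k, r in JP_READINGS.items() if r == reading]
    if exact:
        return exact
    return [k for k, r in JP_READINGS.items()
            if SequenceMatcher(None, reading, r).ratio() >= threshold]

for ch in "您哪装":
    found = sound_candidates(ch)
    note = "manual choice needed" if len(found) > 1 else ""
    print(ch, found, note)
```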
The very few remaining characters cannot be matched and are left unmapped for now. They amount to less than 1% of the mapping table, and the chance of meeting them in daily use is extremely low.
Because there are more Japanese entries than Chinese entries to begin with, and because the mapping table shows cases in which one Japanese kanji corresponds to several different Chinese characters, about 6% of the Japanese kanji take no part in the mapping from Chinese to Japanese.
The mapping table from Japanese kanji to Chinese characters is built by the same process.
In daily communication a message consists of many characters, and because the mapping table has many entries, mapping speed is bound to suffer. The mapping table therefore needs to be optimized:
1. Reorder the characters so that commonly used characters come first, which raises the hit rate and greatly reduces the lookup offset.
2. Use an in-memory database and load the whole mapping table into memory; because memory access is faster than storage devices such as hard disks, this shortens table lookups.
3. Use a symmetric multiprocessing (SMP) hardware platform; because short messages are naturally separated in time, they are easy to process in parallel on such a platform.
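The second and third optimizations can be illustrated with a short Python sketch: the table is held in memory as a dictionary, and independent messages are mapped concurrently with a worker pool. The `MAPPING` excerpt and function names are hypothetical. With a hash-based dictionary the entry order does not affect lookup time, so the first optimization matters mainly for sequentially scanned tables. Threads are used here for brevity; for CPU-bound work in CPython a process pool may be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical excerpt of the in-memory mapping table.
MAPPING = {"伞": "傘", "语": "語"}

def map_message(text):
    """Map one message character by character; unmapped characters pass through."""
    return "".join(MAPPING.get(ch, ch) for ch in text)

def map_batch(messages, workers=4):
    """Map independent messages concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(map_message, messages))

print(map_batch(["日语", "雨伞"]))
```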
Once the mapping table has been built, it can be stored in a database table or loaded into memory and used to map Chinese characters in SMS exchange between China and Japan.
Many mapping methods exist for SMS exchange between China and Japan, but by the mechanism used to process the text they fall into four groups: translation-based, pinyin-based, radical-based, and encoding-based. The method of the present invention, which is based on mapping Chinese characters to Japanese kanji, avoids the drawbacks of pinyin (one sound for many characters), of radicals (characters that cannot be decomposed), and of translation (manual participation required), and builds on developments in encoding. With this method both parties can write messages conveniently, delivered messages contain no garbled characters, and the recognition rate when reading is high. It provides the most accurate method currently available for Chinese-Japanese SMS.
The mapping method of the present invention first combines Chinese characters (the GB2312 character set) with Japanese kanji (the Shift_JIS character set) and, using features such as glyph, UNICODE code, and pronunciation, builds the Chinese-Japanese mapping table in the order of identical or similar glyph, identical UNICODE code, and identical or similar pronunciation.
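The staged construction can be pictured as a pipeline in which each stage maps what it can and hands the remainder to the next stage, much as the remaining characters are refiled as GB2312-A, GB2312-B, and GB2312-C. The sketch below is an assumption-laden illustration: `build_mapping`, `same_code_point`, and `make_table_stage` are hypothetical names, and the lookup-table stage merely stands in for the glyph and pronunciation comparisons, which the patent does not specify in code.

```python
def build_mapping(source_chars, target_chars, stages):
    """Run the matching stages in order; unmatched characters pass to the next stage."""
    mapping = {}
    remaining = list(source_chars)
    for stage in stages:
        unmatched = []
        for ch in remaining:
            match = stage(ch, target_chars)
            if match is None:
                unmatched.append(ch)
            else:
                mapping[ch] = match
        remaining = unmatched  # plays the role of GB2312-A, -B, -C in turn
    return mapping, remaining


def same_code_point(ch, targets):
    """Stage: the same character (same code point) exists in the target set."""
    return ch if ch in targets else None


def make_table_stage(table):
    """Stage backed by a prepared lookup; a stand-in for similarity scoring."""
    def stage(ch, targets):
        candidate = table.get(ch)
        return candidate if candidate in targets else None
    return stage


# Tiny illustrative character sets, not the real GB2312 or Shift_JIS tables.
targets = set("短業麻")
stages = [same_code_point, make_table_stage({"业": "業"})]
mapping, leftover = build_mapping("短业吗", targets, stages)
```

Characters left in `leftover` after the last stage correspond to the small set that stays unmapped.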
The message content that is received is then mapped character by character with the mapping table and converted into the target message content. A Chinese-character message encoded in GB2312 and sent from a Chinese mobile phone becomes a Japanese-kanji message encoded in Shift_JIS that can be displayed on a Japanese mobile phone; likewise, a Japanese-kanji message encoded in Shift_JIS and sent from a Japanese mobile phone becomes a Chinese-character message encoded in GB2312 that can be displayed on a Chinese mobile phone.
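A minimal sketch of the Chinese-to-Japanese direction, using Python's built-in `gb2312` and `shift_jis` codecs. The function `map_sms` and the three sample entries are assumptions for illustration; the real table is the one produced by the steps described above, and passing unmapped characters through unchanged is a choice made here, not something the patent states.

```python
# Illustrative entries only; not taken from the patent's mapping table.
SAMPLE_TABLE = {"业": "業", "务": "務", "门": "門"}


def map_sms(payload, table):
    """Decode a GB2312 message, map it character by character, re-encode as Shift_JIS."""
    text = payload.decode("gb2312")
    mapped = "".join(table.get(ch, ch) for ch in text)
    # Characters that Shift_JIS cannot represent become "?" instead of raising.
    return mapped.encode("shift_jis", errors="replace")


outgoing = "短信业务".encode("gb2312")
delivered = map_sms(outgoing, SAMPLE_TABLE)
```

The reverse direction would use a Japanese-to-Chinese table with the two codec names swapped.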
With the method of the present invention, the SMS flow becomes the following: the sender writes the message content and the destination number; the message is transmitted through the base station to the operator's SMS service center; the SMS service center forwards the message to the mapping table site; the mapping table site obtains the content and destination number of the message, maps the content with the mapping table, and obtains the mapped target content; the target content is then submitted to the site serving the destination number and finally sent to the recipient, where it is decoded to yield the target message content.
The use of the mapping table is described below, taking storage in a database as an example.
Create an MS SQL SERVER database table named HanCode for the mapping table and import the mapping table into it. The table structure is:
CREATE TABLE [HanCode] (
    [UNICODE] [char](4) COLLATE Chinese_PRC_CI_AS NOT NULL,
    [FROM_CODE] [char](4) COLLATE Chinese_PRC_CI_AS NULL,
    [FROM_TEXT] [varchar](2) COLLATE Chinese_PRC_CI_AS NULL,
    [DEST_CODE] [char](4) COLLATE Chinese_PRC_CI_AS NULL,
    [DEST_TEXT] [varchar](2) COLLATE Chinese_PRC_CI_AS NULL,
    CONSTRAINT [PK_HanCode] PRIMARY KEY CLUSTERED
    (
        [UNICODE]
    ) ON [PRIMARY]
) ON [PRIMARY]
After a message is received, its content is split character by character to obtain a character string. The characters in the string (Chinese characters, symbols, digits, and letters) are taken in order. The glyph of each character is used as the query condition, the query statement looks up the target character in the mapping table, the glyph of the target character is obtained, and the target character is appended to the target message. Once all characters have been processed in sequence, the target message is assembled and sent to the recipient.
The query statement is:
SELECT DEST_CODE, DEST_TEXT
FROM HanCode
WHERE FROM_TEXT = '<character to be mapped>'
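The per-character lookup can be tried out with the sketch below. It uses Python's built-in `sqlite3` module as a stand-in for MS SQL SERVER, so the collation and clustering options are omitted. The helpers `build_demo_table` and `map_text`, the sample pairs, filling `UNICODE` with the source character's code point, and keeping unmapped characters unchanged are all assumptions for illustration.

```python
import sqlite3


def build_demo_table(pairs):
    """Create an in-memory HanCode table and load a few (source, target) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE HanCode ("
        "[UNICODE] CHAR(4) NOT NULL PRIMARY KEY, "
        "FROM_CODE CHAR(4), FROM_TEXT VARCHAR(2), "
        "DEST_CODE CHAR(4), DEST_TEXT VARCHAR(2))"
    )
    for src, dst in pairs:
        conn.execute(
            "INSERT INTO HanCode VALUES (?, ?, ?, ?, ?)",
            (
                format(ord(src), "04X"),                # assumed: source code point
                src.encode("gb2312").hex().upper(),     # GB2312 code as hex
                src,
                dst.encode("shift_jis").hex().upper(),  # Shift_JIS code as hex
                dst,
            ),
        )
    return conn


def map_text(conn, text):
    """Look up each character in turn; a bound parameter replaces the placeholder."""
    out = []
    for ch in text:
        row = conn.execute(
            "SELECT DEST_CODE, DEST_TEXT FROM HanCode WHERE FROM_TEXT = ?", (ch,)
        ).fetchone()
        out.append(row[1] if row else ch)  # assumed fallback: keep the character
    return "".join(out)


conn = build_demo_table([("业", "業"), ("务", "務")])
result = map_text(conn, "短信业务")
```

Binding the character as a parameter, rather than pasting it into the statement text, avoids quoting problems when the message contains a quote character.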
Embodiment 1
A Chinese-character message encoded in GB2312 and sent from a Chinese mobile phone is mapped through the mapping table from Chinese characters to Japanese kanji, becomes a Japanese-kanji message encoded in Shift_JIS, and is displayed on the Japanese mobile phone. For example, of the 9 Chinese characters "your short message service is convenient", 4 characters "note is convenient" are mapped by identical glyph; 2 characters "business" are mapped by identical UNICODE code; 2 characters "you" are mapped by similar glyph; and the last character "" is mapped by similar pronunciation. After these characters have been mapped through the mapping table, the 9 Japanese kanji "Mi Door note industry Service convenient" are obtained. The mapping process is shown in Figure 6.
Embodiment 2
A Japanese-kanji message encoded in Shift_JIS and sent from a Japanese mobile phone is mapped through the mapping table from Japanese kanji to Chinese characters, becomes a Chinese-character message encoded in GB2312, and can be displayed on the Chinese mobile phone. For example, of the 9 Japanese kanji "Mi Door note industry Service conveniently", 5 characters "note is convenient" are mapped by identical glyph; 3 characters "traffic gate" are mapped by identical UNICODE code; 1 character "you" is mapped by similar glyph; and 0 characters are mapped by similar pronunciation. After these characters have been mapped through the mapping table, the 9 Chinese characters "your short message service is convenient" are obtained. The mapping process is shown in Figure 7.

Claims (7)

1. A mapping method based on Chinese characters and Japanese kanji, characterized in that the method first generates a mapping table and then performs mutual mapping between Chinese characters and Japanese kanji according to the mapping table, wherein the mapping table is generated by the following steps:
(1) obtain and organize the GB2312 character set and the Shift_JIS character set;
(2) perform identical-glyph mapping;
(3) perform UNICODE code mapping;
(4) perform similar-glyph mapping;
(5) perform pronunciation mapping.
2. The mapping method according to claim 1, characterized in that step (2) is specifically: under the identical-glyph condition, using the GB2312 character column and the Shift_JIS character column in the GB2312 and Shift_JIS character set texts, and taking the GB2312 characters of the GB2312 text as the reference, the Shift_JIS characters of the Shift_JIS text are matched against the Chinese characters of the GB2312 text; by exhaustive search, the glyphs in the two character tables are compared one by one, and when a pair with identical glyphs is found, the pair is extracted and placed in the mapping table. The remaining Chinese characters are refiled as the Chinese character table text GB2312-A.
3. The mapping method according to claim 1, characterized in that step (3) is specifically: the Chinese characters in the GB2312 character column of the Chinese character table text GB2312-A are compared with the characters in the Shift_JIS character column of the Japanese character table text Shift_JIS; when the condition is met, the pair is extracted and merged into the mapping table produced by the identical-glyph step. The remaining Chinese characters are refiled as the Chinese character table text GB2312-B.
4. The mapping method according to claim 1, characterized in that step (4) is specifically: first extract the Chinese characters in the GB2312 character column of the GB2312-B text that need mapping, analyze the radical composition and stroke order of each character, score the character, and determine its score; then search the Shift_JIS character column of the Shift_JIS text for characters whose score, obtained by the same method, is close; when more than one candidate is obtained, select the most suitable pairing and add it to the mapping table. The remaining Chinese characters are refiled as the Chinese character table text GB2312-C.
5. The mapping method according to claim 1, characterized in that step (5) is specifically: extract the Chinese characters in the GB2312 character column of the GB2312-C text that need mapping and obtain the pronunciation of each character; when the character is a polyphone, select its most common pronunciation; then search the Shift_JIS character column of the Shift_JIS text for characters whose pronunciation, obtained by the same method, is identical or similar; when more than one candidate is obtained, select the most suitable pairing from the candidates and add it to the mapping table.
6. Use of the mapping method based on Chinese characters and Japanese kanji according to claim 1 in SMS exchange.
7. The use according to claim 6, characterized in that the use is specifically: the sender writes the message content and the destination number; the message is transmitted through the base station to the operator's SMS service center; the SMS service center forwards the message to the mapping table site; the mapping table site obtains the content and destination number of the message, maps the content with the mapping table, and obtains the mapped target content; the target content is then submitted to the site serving the destination number.
CNA2008101631551A 2008-12-18 2008-12-18 Mapping method based on Chinese character and Japanese Chinese character and use thereof Pending CN101458682A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008101631551A CN101458682A (en) 2008-12-18 2008-12-18 Mapping method based on Chinese character and Japanese Chinese character and use thereof

Publications (1)

Publication Number Publication Date
CN101458682A true CN101458682A (en) 2009-06-17

Family

ID=40769548

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008101631551A Pending CN101458682A (en) 2008-12-18 2008-12-18 Mapping method based on Chinese character and Japanese Chinese character and use thereof

Country Status (1)

Country Link
CN (1) CN101458682A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102662926A (en) * 2012-03-29 2012-09-12 常州华文文字技术有限公司 Storage and access methods for word stock
CN104424165A (en) * 2013-09-06 2015-03-18 北大方正集团有限公司 Messy code detection method and system for text documents
CN104424165B (en) * 2013-09-06 2018-05-25 北大方正集团有限公司 A kind of text document mess code detection method and system
CN106650716A (en) * 2016-12-12 2017-05-10 福建字客网络科技有限公司 Identification method and device for computer font
CN109145315A (en) * 2018-09-05 2019-01-04 腾讯科技(深圳)有限公司 Text interpretation method, device, storage medium and computer equipment
CN111368565A (en) * 2018-09-05 2020-07-03 腾讯科技(深圳)有限公司 Text translation method, text translation device, storage medium and computer equipment
CN111368565B (en) * 2018-09-05 2022-03-18 腾讯科技(深圳)有限公司 Text translation method, text translation device, storage medium and computer equipment
US11853709B2 (en) 2018-09-05 2023-12-26 Tencent Technology (Shenzhen) Company Limited Text translation method and apparatus, storage medium, and computer device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20090617