WO2018228101A1 - Chinese meaning based chinese encoding method and system, and medium device - Google Patents

Chinese meaning based Chinese encoding method and system, and medium device

Info

Publication number
WO2018228101A1
Authority
WO
WIPO (PCT)
Prior art keywords
chinese
code
morpheme
word
meaning
Prior art date
Application number
PCT/CN2018/086500
Other languages
French (fr)
Chinese (zh)
Inventor
夏诠真
Original Assignee
佛山辞荟源信息科技有限公司
Priority date
Filing date
Publication date
Application filed by 佛山辞荟源信息科技有限公司
Publication of WO2018228101A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding

Definitions

  • The invention relates to computer data processing technology, and in particular to a Chinese digital encoding method and system, and a medium device, based on Chinese meaning.
  • For Chinese to be processed digitally, for example by computer, it must first be encoded. The code serves as the intermediary for input and digital processing and makes it possible to store and transmit information, so that the Chinese character system can be used in the information age represented by the personal computer, the World Wide Web, and smartphones.
  • Telegraph code, the four-corner method, Big5, the national standard (GB) code, and Unicode are all single-character code systems: each code represents one Chinese character.
  • The weakness of Unicode as commonly used is that it can only represent the glyph of a Chinese character. It ignores the meaning of characters and words, so the meaning of a text cannot be understood and processed directly. As a result, the advantages of Chinese over Western languages are not fully exploited, and the combination of the basic form (stroke shape), sound (pinyin), and meaning of Chinese cannot be effectively digitized.
  • The present invention provides a Chinese encoding method, system, and medium device based on Chinese meaning. To overcome the intrinsic defects of existing digital coding schemes for Chinese characters, it digitally encodes the basic constituent elements of Chinese meaning (morphemes), unlike existing schemes.
  • Encoding the element of Chinese meaning, that is, the morpheme, can address problems that arise when computers process Chinese characters, such as one character having several meanings and one character having several pronunciations.
  • The Chinese digital encoding method, system, and medium device based on Chinese meaning (morphemes) comprise morpheme codes based on Chinese meaning (neutral codes), word codes based on Chinese words and phrases (meaningful sets of morphemes; also known as "concept codes"), and a Chinese database that maps these codes to the large body of Chinese meanings.
  • a morpheme-based Chinese encoding method provided for implementing the present invention includes the following steps:
  • The database includes a morpheme table corresponding to the morphemes and a vocabulary list corresponding to the words and phrases; the morpheme table contains the morpheme code of each morpheme, and the vocabulary list contains the word codes of the words and phrases.
  • The morpheme table includes a glyph summary table and a word-meaning summary table, which are associated with each other through the morpheme code.
  • the encoding processing method based on the Chinese meaning further includes the following steps:
  • Chinese is input using a morpheme-assisted input method based on the neutral code and the word code.
  • Receiving morpheme selection information and determining the morpheme code corresponding to the selected morpheme
  • the encoding of a Chinese morpheme includes encoding a single Chinese character
  • the canonical code is combined with the morpheme number to form a morpheme code of a different meaning morpheme of the single Chinese character.
  • There are multiple vocabulary lists.
  • The word code is a 32-bit number, written as eight hexadecimal digits.
  • the storage of the word code in the Chinese code database is implemented by using an eight-dimensional matrix space.
  • the word code is stored in an eight-dimensional matrix space, including:
  • Each word code is used as a point in the eight-dimensional matrix space, and the points are positioned by eight hexadecimal values of X, Y, Z, P, Q, R, S, and T, and the eight-dimensional matrix space structure is used as the storage address.
  • Using the eight-dimensional matrix space structure as the storage address of the word code includes:
  • The word code consists of three parts in sequence: the classification value, the sorting link value, and the protection code.
  • the structure is as follows:
  • the classification value includes three values, respectively representing X-axis, Y-axis and Z-axis coordinate values;
  • the sorting link value includes four values, respectively representing P-axis, Q-axis, R-axis and S-axis coordinate values,
  • the protection code is the last digit and represents the T-axis coordinate value.
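The three-part layout described above can be sketched as follows. This is a minimal illustration, assuming the word code is written as a string of eight hexadecimal digits in the order X, Y, Z, P, Q, R, S, T; the names `WordCode`, `split_word_code`, and `coordinates` are hypothetical and do not come from the patent.

```python
from typing import Dict, NamedTuple

HEX_DIGITS = "0123456789ABCDEF"
AXES = "XYZPQRST"  # axis names taken from the text


class WordCode(NamedTuple):
    classification: str  # X, Y, Z digits
    sort_link: str       # P, Q, R, S digits
    protection: str      # T digit


def split_word_code(code: str) -> WordCode:
    """Split an 8-digit hexadecimal word code into its three parts."""
    code = code.upper()
    if len(code) != 8 or any(ch not in HEX_DIGITS for ch in code):
        raise ValueError("a word code must be exactly 8 hexadecimal digits")
    return WordCode(code[:3], code[3:7], code[7])


def coordinates(code: str) -> Dict[str, int]:
    """Map each axis name to the integer value (0-15) of its digit."""
    parts = split_word_code(code)  # validates the code
    return {axis: int(digit, 16) for axis, digit in zip(AXES, "".join(parts))}
```

For instance, `split_word_code("01A2B3CF")` separates the classification value `01A`, the sorting link value `2B3C`, and the protection code `F`.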
  • the classification value represents a point in the three-dimensional space, and all the classification values are stored in a data layout diagram of the three-dimensional space, wherein the data layout diagram is a table;
  • the vocabulary list is divided into a plurality of types, and different categorical values correspond to different types of vocabulary tables.
  • The protection code is a control value calculated from the classification value and the sorting link value by means of a code map.
  • The code map is a table structure composed of a plurality of vectors and a matrix.
  • The vocabulary lists include: a dictionary-type sentence list, a dictionary-type word list, a list of poetry and ancient sentences, and a history list.
  • a storage medium for storing the computer program instructions of the encoding processing method based on Chinese meaning is also provided.
  • a coding processing software system based on Chinese meaning comprising the storage medium, wherein computer program instructions in the storage medium are called to complete an encoding process based on Chinese meaning.
  • an object of the present invention is to provide an encoding processing device based on Chinese meaning, including a central processing unit, and the storage medium connected to a central processing unit;
  • the central processor invokes computer program instructions in the storage medium to perform an encoding process based on Chinese meaning.
  • The invention is a breakthrough design. It takes full advantage of the convenience and accuracy of using morphemes as the code system for Chinese characters, and uses the neutral code and the word code as its core to solve problems in Chinese digital processing such as homophones written with different characters and characters with different meanings.
  • Applications built on the morpheme table include, for example, a smart-prompt input method.
  • The coding is powerful, flexible, accurate, rich, and complete, enabling people to input Chinese more easily and accurately and to understand the semantics of Chinese.
  • This coding system has the potential to help improve the electronic processing efficiency of Chinese in the era of computer digitization, make Chinese more suitable for the information processing requirements of the digital age, and contribute to the promotion of Chinese culture in the digital age.
  • FIG. 1 is a flowchart of a morpheme-based Chinese encoding processing method according to an embodiment of the present invention;
  • FIG. 2 shows an implementation of step S100 in FIG. 1;
  • FIG. 3 shows an implementation of step S200 in FIG. 1;
  • FIG. 4 shows an implementation of step S400 in FIG. 1;
  • FIG. 5 shows a morpheme-based Chinese encoding system according to an embodiment of the present invention.
  • As the basic element of Chinese semantics, a morpheme meets the following requirements: (1) it has only one pronunciation and one precise basic meaning; (2) it has no glyph: it is neutral with respect to written form and does not distinguish between simplified and traditional characters, which facilitates the search, statistics, and analysis of information.
  • The morpheme is the smallest language unit that carries Chinese meaning. The same character may correspond to multiple morphemes according to its meanings. The morpheme is an element of Chinese word formation and a uniquely Chinese semantic unit; it depends on words and phrases and cannot exist alone.
  • The character 传 ("pass") corresponds to two morphemes (to send or communicate; biography);
  • The character 历 ("calendar") corresponds to two morphemes (history; calendar);
  • The character 日 ("day") corresponds to three morphemes (sun; day; Japan).
  • Morphemes have a unique pronunciation and a meaning.
  • Morphemes are encoded, and the resulting code is called a neutral code; words and phrases are encoded, and the resulting code is called a word code.
  • A Chinese encoding method based on Chinese meaning, shown in FIG. 1, includes the following steps:
  • Step S100 encoding a morpheme of a Chinese language to obtain a morpheme code of each of the morphemes;
  • morphemes are detected, and each morpheme is defined and encoded to obtain a morpheme code, which is a neutral code.
  • Existing Chinese characters are used to record language: articles are composed of sentences, sentences of words and phrases, and words and phrases of Chinese characters.
  • Unlike the writing of Western languages, Chinese characters have three attributes: form (stroke shape), sound (pinyin), and meaning.
  • A single Chinese character form can have multiple meanings and pronunciations. This ambiguity hinders the automatic processing of information and the big-data analysis of Chinese text, and makes searching, dissemination, translation, input, and so on difficult.
  • the embodiment of the present invention encodes a plurality of meaning attributes of Chinese, and encodes the morphemes to obtain a neutral code.
  • The breakthrough of the embodiment of the present invention is to abandon this entrenched traditional method and to use morpheme codes as the units of word formation. Information processing with the morpheme as its core structure is not possible in other language systems (including English and French), as shown in Table 1.
  • With the morpheme as the core of Chinese, conversion between simplified and traditional characters no longer relies on context analysis. It relies instead on the morpheme table (a collection of morphemes in which both simplified and traditional characters are defined). There is no need to identify whether a character is simplified or traditional, and retrieval accuracy can reach essentially 100%.
  • Each morpheme code is constructed on the basis of an existing Chinese character, and each meaning corresponds to one neutral code.
  • the morpheme coding method of the embodiment of the present invention combines shape, sound, and meaning, that is, each morpheme is encoded using a neutral code.
  • The information of a Chinese character is encoded using two master tables: the Chinese glyph summary table and the word-meaning summary table. The glyph summary table records only the "form" attributes of a character (radical, stroke count, stroke order, phonetic component). The word-meaning summary table records only the "meaning" and "sound" attributes. Characters with the same sound and the same meaning (such as 尘/塵, 陈/陳, 峰/峯) share a single code, regardless of whether they are written in traditional, simplified, or variant form. As long as they are synonymous they are treated as one character, which is why the code of a morpheme in the word-meaning summary table is also called a "neutral code".
  • Both a word-meaning summary table and a glyph summary table are used.
  • The word-meaning summary table records the "meaning" of a morpheme, not its "form".
  • The glyph summary table records the "form" of a morpheme, not its "meaning".
  • This data structure follows the principles of Dr. Edgar Frank Codd, the inventor of the relational database.
  • Database integrity is designed according to the third normal form. By entering the simplified, traditional, and variant forms of Chinese characters in the summary tables, the complex many-to-many relationships among the form, sound, and meaning of existing Chinese characters are turned into simple "many-to-one" and "one-to-one" relationships.
  • The dual master tables change the procedure for digitizing Chinese character information: input and storage of Chinese use the word-meaning summary table, and output of Chinese (display or printing of text) uses the glyph summary table.
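The split between a meaning table (used for input and storage) and a glyph table (used for output) can be sketched as below. This is only an illustration under stated assumptions: the neutral codes `9001A` and `9002M` are invented for the example, the table layout is hypothetical, and the function names `codes_for` and `render` are not from the patent.

```python
from typing import List

# Word-meaning summary table: one row per neutral code (sound and meaning only).
# The codes here are hypothetical placeholders, not real assignments.
MEANING_TABLE = {
    "9001A": {"pinyin": "chen", "gloss": "dust"},
    "9002M": {"pinyin": "chen", "gloss": "Chen (surname)"},
}

# Glyph summary table: one row per written form, linked by neutral code.
GLYPH_TABLE = [
    ("尘", "9001A", "simplified"),
    ("塵", "9001A", "traditional"),
    ("陈", "9002M", "simplified"),
    ("陳", "9002M", "traditional"),
]


def codes_for(glyph: str) -> List[str]:
    """Input side: every neutral code a written form can stand for."""
    return sorted({code for g, code, _ in GLYPH_TABLE if g == glyph})


def render(codes: List[str], script: str) -> str:
    """Output side: turn stored neutral codes into glyphs of one script."""
    lookup = {(code, s): glyph for glyph, code, s in GLYPH_TABLE}
    return "".join(lookup[(code, script)] for code in codes)
```

Because text is stored as neutral codes, the same stored sequence can be displayed in simplified or traditional form without any context analysis, which is the many-to-one relationship the text describes.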
  • the separate processing of input and output is a major innovation in information processing that changes people's work habits.
  • A neutral code is set for each distinct meaning of a canonical Chinese character; this morpheme code is the neutral code.
  • The purpose of the morpheme is to define each meaning of a canonical Chinese character precisely. Because of polysemy, one canonical character (from the canonical character set published by the State Council of China in 2013) can correspond to multiple morphemes.
  • The code of a canonical character consists of four Arabic numerals.
  • The structure of the "neutral code" is "canonical character code" + "morpheme serial number", as follows:
  • For example, the character 行 is polysemous. The code of the canonical character 行 is "0483", and it has several meanings: (1) to walk, (2) row, (3) trade (business), and so on. The embodiment of the present invention therefore defines for the canonical character 行: (1) 0483A (the "walk" morpheme), (2) 0483B (the "row" morpheme), (3) 0483C (the "trade" morpheme), and so on. These morphemes clearly distinguish the different meanings of 行, as shown in Table 2.
  • A morpheme code is formed from the code of an existing canonical Chinese character (for example, "0483" is the canonical character code of 行) plus an identifying letter (A, B, C, D, E, F, G, ...).
  • This yields the codes of the different meaning morphemes. In the example above, 0483A is the code of the "walk" morpheme, 0483B is the code of the "row" morpheme, and 0483C is the code of the "trade" morpheme.
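The structure "four-digit canonical character code plus one identifying letter" can be sketched in a few lines. The helper names `make_morpheme_code` and `parse_morpheme_code` are hypothetical, and the validation rule (exactly four digits followed by one uppercase letter) is an assumption drawn from the examples in the text.

```python
import re
from typing import Tuple

# Four Arabic numerals followed by one identifying letter, e.g. "0483A".
_MORPHEME_CODE = re.compile(r"^(\d{4})([A-Z])$")


def make_morpheme_code(canonical_code: str, letter: str) -> str:
    """Combine a canonical character code with a morpheme letter."""
    code = canonical_code + letter.upper()
    if not _MORPHEME_CODE.match(code):
        raise ValueError(f"not a valid morpheme code: {code!r}")
    return code


def parse_morpheme_code(code: str) -> Tuple[str, str]:
    """Split a morpheme code into (canonical character code, letter)."""
    match = _MORPHEME_CODE.match(code.upper())
    if match is None:
        raise ValueError(f"not a valid morpheme code: {code!r}")
    return match.group(1), match.group(2)
```

With the values from the text, `make_morpheme_code("0483", "A")` gives `0483A`, and parsing `0483C` recovers the canonical code `0483` and the letter `C`.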
  • N is an integer indicating that the existing character has N morpheme codes in total.
  • Alternatively, the morpheme code is formed from the code of the existing canonical character plus a number N, where N is an integer meaning that the canonical character has N morphemes. For example, the morpheme code of the canonical character 行 is 04833, where the final digit 3 indicates that the character has 3 morphemes.
  • Step S200 encoding words and phrases in Chinese, and obtaining word codes of the words and phrases;
  • a morpheme code (neutral code) is used as a construction unit, and a word or a phrase is set to obtain a word code.
  • Words and phrases are the basic units of human thinking, reasoning, and exchange of information. By analogy with chemistry, characters (the morphemes of the embodiment of the present invention) are like atoms, and words and phrases are like molecules or genes. Just as the analysis of a substance's properties should stop at molecules or genes, the analysis of an article should be based on words and phrases. In the embodiment of the present invention, a morpheme is an element that makes up a word, and a word is the basic unit that makes up a sentence. A Chinese word or phrase is treated as a single unit.
  • A "word" is any monosyllabic morpheme, or combination of several morphemes, that can stand alone.
  • Words and phrases include single-character words, multi-character words, and phrases such as idioms, conjunctions, proverbs, afterwords, maxims, famous sentences, names of people, place names, institution names, brand names, trade names, and specialist terms.
  • morphemes have "unicity".
  • The function of writing is to record, and the clearer the better.
  • The main purpose of encoding information is to achieve "uniqueness" and to eliminate the ambiguities and inaccuracies of ordinary language.
  • A morpheme is a unit from which words and phrases are built, and it should allow words and phrases to be pronounced accurately.
  • In the embodiment of the present invention, the codes of words and phrases built from morphemes are collectively called word codes.
  • From the perspective of word formation, morphemes (and the word codes built from them) fall into the following eight categories: (1) language morphemes; (2) surname morphemes; (3) personal-name morphemes; (4) place-name morphemes; (5) science morphemes; (6) ancient Chinese morphemes; (7) meaningless phonetic morphemes; (8) table morphemes; and so on.
  • With the morpheme as the construction unit, words or phrases built from table morphemes and meaningless morphemes (which carry no meaning and only express pronunciation) are encoded as pseudo word codes.
  • Table 3 lists, for each morpheme category, its morpheme number and examples of words that can be composed:
  • Language morpheme; morpheme number A, B, C, D, E, F, G, H, I, J, K, L; examples: snake, ascetic, self.
  • Surname morpheme; morpheme number M; examples: Chen, Li, Zhang, Wang, He.
  • Personal-name morpheme; morpheme number N; examples: Empress Dowager Cixi, Li Bai, Zhu Bangfu.
  • Place-name morpheme; morpheme number P; examples: Shanghai, Paris, Maling Road.
  • Technology morpheme; morpheme number R; examples: Bentley telegraph code, organic luminescent material.
  • Ancient Chinese morpheme; morpheme number T; example: "If the husband is not fighting and the temple is the winner, it’s too much."
  • The words “horse lane” and “Maling Road” both contain the characters 马 (“horse”, ma) and 道 (dao).
  • The canonical character number of 马 is 2777, and that of 道 is 2745.
  • The 马 morpheme in “horse lane” is 2777A, and the 马 morpheme in “Maling Road” is 2777P; the 道 morpheme in “horse lane” is 2745B, and the 道 morpheme in “Maling Road” is 2745P. The 马 and 道 of the two words are different morphemes, because “horse lane” is a common word and “Maling Road” is a geographical term. Without this distinction at the morpheme level, information retrieval cannot be accurate: the animal “horse” and the 马 of the geographical term are mixed together, so the results of data analysis are not accurate.
  • The animal "horse" morpheme (2777A) can form words and phrases such as: horses, successes, etc.;
  • The place-name "Ma" morpheme (2777P) can form: Maling Road, Ma Yipo, ...
  • The phonetic "Ma" morpheme (2777V) can form: motor (马达), Rome (罗马), Madrid (马德里), etc.;
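The way the trailing letter separates the animal, place-name, and phonetic senses of the same character can be sketched as a lookup. The letter ranges follow Table 3 as far as it survives in the text; the letter V for phonetic morphemes is inferred from the code 2777V, and any letter not listed is reported as unknown. The names `CATEGORY_BY_LETTER`, `morpheme_category`, and `filter_by_category` are hypothetical.

```python
from typing import Iterable, List

# Letter -> category, per Table 3 (A-L, M, N, P, R, T) plus V from "2777V".
CATEGORY_BY_LETTER = {letter: "language" for letter in "ABCDEFGHIJKL"}
CATEGORY_BY_LETTER.update({
    "M": "surname",
    "N": "personal name",
    "P": "place name",
    "R": "technology",
    "T": "ancient Chinese",
    "V": "phonetic",
})


def morpheme_category(code: str) -> str:
    """Return the category implied by the last letter of a morpheme code."""
    return CATEGORY_BY_LETTER.get(code[-1].upper(), "unknown")


def filter_by_category(codes: Iterable[str], category: str) -> List[str]:
    """Keep only the morpheme codes that belong to one category."""
    return [code for code in codes if morpheme_category(code) == category]
```

A search restricted to `filter_by_category(codes, "language")` would then match the animal sense 2777A but not the place-name sense 2777P, which is the retrieval benefit the text claims.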
  • The morpheme is the word-forming unit; words or phrases built from meaningless phonetic morphemes and table morphemes are encoded to obtain pseudo word codes.
  • Table 4 is a comparison table showing the relationship between morphemes and words or phrases:
  • Code: a morpheme has a morpheme code (four Arabic numerals plus a morpheme number); a word or phrase has a word code.
  • Role: a morpheme is the unit that makes up a word or phrase; a word or phrase is the unit that makes up a compound word or sentence.
  • Core tables: morphemes are held in the morpheme table plus auxiliary tables (radicals, phonetic components); words and phrases are distributed over multiple tables by category and purpose.
  • The canonical glyph table and the morpheme table are sister tables. The number and content of the fields in each table differ, and the tables are linked to each other.
  • Take the idiom table as an example. Each idiom consists of four (or more) morphemes, as shown in Table 5 below:
  • Among the seven thousand idioms in the idiom table, 75 include the character 道 (dao), but 道 corresponds to 6 or 7 morphemes. When the idiom table is encoded, the embodiment of the present invention therefore indicates which morpheme (one of A, B, C, ...) makes up each idiom.
  • The first sentence of the first chapter of Laozi's Tao Te Ching is 道可道，非常道. The character 道 appears three times in this sentence with different meanings, so three different morphemes should be used.
  • The first 道 is a noun, the "Tao" of the Tao Te Ching; the second 道 is a verb meaning "to speak"; the third 道 means "method".
  • 道可道，非常道 can be translated as: "The truth that can be put into words is not the eternal truth."
  • The philosophy of the Tao Te Ching is profound, and the explanations of later generations may not reflect the views of Laozi himself. Without the concept of the morpheme, the meaning intended by the author of the Tao Te Ching cannot be accurately translated.
  • The embodiment of the present invention distributes all Chinese vocabulary (the target is one million entries) across tens to hundreds of tables according to word class (common words, idioms, linguistics, slang, proverbs, maxims, allusions, names of people, names of places, names of schools and organizations, specialist terms, ...). Lexical coding means that words and phrases are defined by morphemes; for example, the idiom 安贫乐道 is divided into four morphemes: 安, 贫, 乐, and 道.
  • The core method is as follows. (1) Each word and each phrase (idiom, place name, specialist term, ...) is encoded. Each code represents a concept and does not represent a Chinese character string. Words or phrases for the same concept (such as the two Chinese words for a computer mouse, or "astronaut" / "spaceman") use the same code even though their strings differ. A word or phrase with N meanings is represented by N codes; for example, the word 粉丝 has two clearly different meanings, a food (vermicelli) and "FANS", so it is represented by two different codes.
  • (2) Each word code must be accurately defined. For the sake of accuracy, in many cases the embodiment of the present invention adds the corresponding English or French word (for example, using "FANS" to define "fans" precisely). (3) Strings (words and phrases) are expressed in neutral codes (morphemes). This makes them more independent and accurate, so they are not affected by differences among simplified, traditional, and variant glyphs. (4) Depending on the nature of the word or phrase, the embodiment of the present invention uses tables with different structures to record its attributes; for example, the number and content of the fields of the common vocabulary table, the idiom table, and the place name table are completely different.
  • step A the Chinese vocabulary is collected and stored in a plurality of forms in the relational database according to the part of speech/word class;
  • the table includes, but is not limited to, a common vocabulary, an idiom list, a philanthropy list, an allusion table, a Chinese place name table, and the like.
  • Step B the table is divided into morphemes
  • the idioms of "Apocalypse” are divided into four norms: security, poverty, music, and Tao.
  • Each normative word is defined by four Arabic numerals (for example, the word “dao” in the middle of poverty is represented by 2745).
  • step C the above-mentioned canonical word is replaced by an appropriate morpheme by adding a morpheme number (A, B, C, ...) to each of the canonical words, for example, the morpheme number of the word “dao" of the sinister music is "A” ( So the morpheme code is 2745A).
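Steps B and C can be sketched for the four-character idiom used as the running example, taken here to be 安贫乐道. Only the code 2745 and the letter A for 道 come from the text; the other canonical codes and letters below are invented placeholders, and the names `CANONICAL_CODE`, `SENSE_LETTERS`, and `encode_phrase` are hypothetical.

```python
from typing import List

# Step B: canonical character -> four-digit code.
# Only "道" -> "2745" is given in the text; the rest are placeholders.
CANONICAL_CODE = {"安": "1001", "贫": "1002", "乐": "1003", "道": "2745"}

# Step C: per phrase, the morpheme letter chosen for each character.
# Only the final "A" (for 道) is given in the text.
SENSE_LETTERS = {"安贫乐道": ["A", "A", "B", "A"]}


def encode_phrase(phrase: str) -> List[str]:
    """Return the morpheme codes that define a phrase, one per character."""
    letters = SENSE_LETTERS[phrase]
    if len(letters) != len(phrase):
        raise ValueError("one morpheme letter is needed per character")
    return [CANONICAL_CODE[ch] + letter for ch, letter in zip(phrase, letters)]
```

In a real table the letters would be assigned by an editor who decides which sense of each character the idiom uses; the sketch only shows how the stored definition is assembled.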
  • Step S300 constructing a Chinese code database
  • The database includes a morpheme table corresponding to the morphemes and a vocabulary list corresponding to the words and phrases; the morpheme table contains the morpheme code of each morpheme,
  • and the vocabulary list contains the word codes of the words and phrases.
  • the neutral code is classified and summarized, and the word code is combined to form a morpheme database based on semantic coding.
  • the morpheme table is integrated into an eight-dimensional matrix space and sorted and linked.
  • Each word code is represented, through a combination of several vectors and a matrix, by a 32-bit number, that is, 8 hexadecimal digits or 4 bytes.
  • The Chinese encoding method of the present invention thus encodes words and phrases as 32-bit numbers (8 hexadecimal digits, 4 bytes in length); the morpheme code itself is a 16-bit encoding.
  • Each code (including neutral codes and word codes) is regarded as a point in an eight-dimensional matrix space, located by eight values X, Y, Z, P, Q, R, S, and T, each of which is a hexadecimal digit (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F).
  • The embodiment of the present invention uses the eight-dimensional matrix space structure as the storage address of the information.
  • the coding is composed of three parts: classification value, sorting link value and protection code. Its structure is:
  • The classification value is a point in a three-dimensional matrix space, and the classification values are stored in a table laid out in three-dimensional space.
  • This table can be called a data allocation table. Its task is to record the numbering.
  • The sorting link value is the sequence number of a record within its table, expressed as four hexadecimal digits.
  • Records may be sorted by pinyin (a, o, e, ...) or by stroke count (for example 一, 乙, ...).
  • Links may connect records that share the same string, for example the name “Li Bai” and his poems, or records that share the same meaning, for example all morphemes related to “filial piety”.
  • The protection code is a control digit calculated from the classification value and the sorting link value by means of a code map.
  • The code map is a table structure composed of a number of vectors and a matrix (array).
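The text does not disclose the contents of the code map, so the following is only a sketch of how a control digit of this kind could work. It swaps in a simple weighted sum modulo 16 with an invented weight vector; the weights, the function names `protection_digit` and `is_valid`, and the use of a single vector rather than vectors plus a matrix are all assumptions, not the patent's method.

```python
# Hypothetical "code map": one weight per digit of X, Y, Z, P, Q, R, S.
# Odd weights are coprime to 16, so changing any single digit changes the sum.
WEIGHTS = (7, 3, 1, 7, 3, 1, 7)


def protection_digit(classification: str, sort_link: str) -> str:
    """Compute a control digit from the 3 + 4 leading hexadecimal digits."""
    digits = [int(ch, 16) for ch in classification + sort_link]
    if len(digits) != len(WEIGHTS):
        raise ValueError("expected 3 classification and 4 sorting digits")
    total = sum(w * d for w, d in zip(WEIGHTS, digits))
    return format(total % 16, "X")


def is_valid(word_code: str) -> bool:
    """Check that the last digit (T) matches the first seven digits."""
    if len(word_code) != 8:
        return False
    return protection_digit(word_code[:3], word_code[3:7]) == word_code[7].upper()
```

Under these assumed weights, the classification value `01A` with sorting link value `2B3C` yields the control digit `3`, so `01A2B3C3` passes the check while a code with one altered digit does not.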
  • The classification values include, but are not limited to, the following table classifications:
  • Dictionary class. The following are the main tables of the code tables (the dictionary class) of the embodiment of the present invention; each table is an independent file that stores one kind of information:
  • The character table, which collects 13,000 Chinese characters. Its main fields are simplified and traditional glyphs, pinyin, short spelling, radical, stroke count, phonetic component, stroke order, short interpretation, and detailed explanation.
  • The commonly used glyph table, which collects about 6,000 commonly used simplified, traditional, and variant Chinese glyphs. Its main fields are: glyph, Unicode code point, radical, strokes, stroke order, sound, basic sound, and basic meaning. Glyphs that are synonymous but differ in form each use one record to register their information;
  • The word-meaning table (that is, the commonly used morpheme table). Its fields are: word-meaning code (that is, the morpheme code), default glyph, simplified glyph, traditional glyph, variants, definition, pinyin, and short spelling.
  • The radical table, which collects 260 simplified and traditional radicals (for example, 钅, 釒, and 金 are three different radicals);
  • The phonetic-component table, which collects 1,000 phonetic components. About 80% of Chinese characters are phono-semantic characters, and the phonetic component is used in much the same way as the radical.
  • The keyword table, which collects 500 keywords made up of morphemes. In one implementation, the 500 most commonly used morphemes are selected and called "keywords". Searching information by semantic keyword is a basic function. Unlike in other Chinese coding systems, keywords here are defined by meaning, so their meaning is precise and searches can be very fine-grained, which other Chinese systems cannot achieve.
  • The commonly used vocabulary table, which collects about 60,000 commonly used words. It is built on the principle of one code, one meaning: each record has only one basic meaning.
  • Its main fields are: simplified characters, traditional characters, pinyin, short spelling, definition, example sentences, English words, French words, keywords, first character, last character, words.
  • Phrase class. The following are the tables of the phrase and poetry dictionary classes, etc.:
  • Idiom which collects about 7,000 idioms, its fields are: simplified string, traditional string, pinyin, even spell, simplified interpretation, traditional interpretation, use example sentences, English translation, keywords, first words, tail words;
  • Couplets: this table can collect 3,000 couplets (two-line sayings or famous lines). Its fields are: couplet, pinyin, annotation, short comment, source, category, keyword, first character;
  • Proverbs which collect about two thousand common proverbs, whose fields are: proverbs, categories, explanations;
  • the maxim which collects about two thousand common adages, whose fields are: maxim, category, source, interpretation;
  • Fable which collects about 2,000 Chinese fables, including but not limited to: fables, title, category, author, etc.;
  • idioms collect about two thousand common idioms, and its fields are: idioms, categories, sources, and explanations.
  • Song ci (lyric poetry), which consists of two parts: ci content and ci author. Its fields include but are not limited to: tune name (cipai), ci title, author, author introduction, original text, annotation, comment, Chinese translation, English translation, French translation;
  • Baixiang Cipu, which was compiled by Shu Menglan of Jing'an during the Jiaqing period of the Qing Dynasty. It selects a total of 100 ci from the Tang to the Qing and is a valuable reference for writing ci.
  • The fields of this table include but are not limited to: tune name, author, title, original text, test, practice;
  • Pei Wen poetry, which collects 105 poems. Its fields include but are not limited to: rhyme name, major category, rhyme number, attached rhyme characters;
  • Guwen Guanzhi, a collection of Chinese prose through the dynasties, 218 articles in total. It is a primer of classical Chinese texts selected by Wu Chucai and Wu Tiaohou during the Kangxi reign of the Qing Dynasty. The fields of this table are: author, author introduction, dynasty, title, article title, original text, comment, vernacular translation, short comment
  • The Four Books, which is the collective name of "The Analects of Confucius", "Mencius", "The Great Learning", and "The Doctrine of the Mean".
  • "The Analects of Confucius" records the words and deeds of Confucius, and "Mencius" records the words and deeds of Mencius. "The Doctrine of the Mean" and "The Great Learning" are two chapters that the Southern Song Dynasty scholar Zhu Xi selected from the "Book of Rites".
  • The authors of the Four Books are Confucius, Zi Si, Mencius, Cheng Zi, Zhu Xi, and others, spanning 1,800 years. After the Song and Yuan Dynasties, the Four Books became official school textbooks and required reading for the imperial examinations.
  • The Tao Te Ching, which was written by Laozi (Li Er) in the Spring and Autumn Period of China. It consists of 81 chapters and has been translated into many languages. The fields include, but are not limited to, chapters, original texts, vernacular translations, English translations, French translations, and reviews;
  • a selection of Chinese folk songs which collects about 300 Chinese folk songs.
  • Dynasties, whose fields include, but are not limited to: dynasty name, starting year (AD), founder, capital, present-day location of the capital, main figures, and notes;
  • China's major towns, Chinese geographical terms, and China's famous attractions, whose fields include but are not limited to: province name (or district name), abbreviation, major category, subcategory, level, short description, detailed introduction, pictures;
  • Countries and capitals, whose fields include but are not limited to: region, country name (Chinese + English), capital (Chinese + English), area, population, short introduction, remarks, national flag, national anthem.
  • Step S400: using an electronic device, input Chinese with a morpheme-assisted input method according to the neutral code and the word code.
  • Step S400 includes the following steps:
  • Step S410: receive input data information;
  • Step S420: provide a morpheme selection prompt according to the input data information;
  • Step S430: receive morpheme selection information and determine the morpheme code corresponding to the selected morpheme;
  • Step S440: query the word meaning summary table and provide the Chinese characters corresponding to the morpheme code;
  • Step S450: query the glyph summary table according to the selected Chinese characters and determine the Chinese to be entered;
  • Step S460: display and enter the determined Chinese.
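The flow of steps S410 to S460 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: the `Morpheme` class, the two dictionaries, and the `prompt` and `enter` functions are hypothetical names, and the two small tables stand in for the word meaning summary table and the glyph summary table. The codes B3AF and B9D0 are the ones quoted in Example 2 of this description. Unlike Example 2, which lists each written form as a separate item, this sketch groups the forms under their morpheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    code: str      # neutral code as a hex string
    gloss: str     # meaning shown in the selection prompt
    glyphs: tuple  # written forms that share this neutral code


# Hypothetical miniature word meaning summary table, keyed by typed pinyin.
MEANING_TABLE = {
    "chen": (
        Morpheme("B3AF", "Chen (surname)", ("陳", "陈")),
        Morpheme("B9D0", "dust", ("塵", "尘")),
    ),
}

# Hypothetical miniature glyph summary table: form attributes only.
GLYPH_TABLE = {
    "陳": "traditional", "陈": "simplified",
    "塵": "traditional", "尘": "simplified",
}


def prompt(keys):
    """S410-S420: receive keystrokes and return the morphemes to choose from."""
    return MEANING_TABLE.get(keys.lower(), ())


def enter(keys, choice, script):
    """S430-S460: resolve the chosen morpheme to a neutral code and a glyph."""
    morpheme = prompt(keys)[choice]       # S430: selected morpheme and its code
    for glyph in morpheme.glyphs:         # S440: characters for that code
        if GLYPH_TABLE[glyph] == script:  # S450: pick the form to enter
            return morpheme.code, glyph   # S460: result to display and enter
    raise LookupError(f"no {script} form for code {morpheme.code}")


print(enter("CHEN", 1, "simplified"))
```

The stored value is the neutral code; the written form is chosen only at display time, which is why the simplified and traditional variants can share one code.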
  • Chinese text can be stored and transmitted in three different formats: (1) Unicode, (2) neutral code, (3) word code. Take the string "汉语百宝箱" ("Chinese Treasure Chest") as an example. Archived in Unicode, the internal code is "6C49 8BED 767E 5B9D 7BB1"; archived in neutral code, the internal code is "BA7E BB79 A6CA C45F BD63"; archived in word code, the internal code is "ABCD1234".
  • The digits 0 to 9 and the letters A to F are the sixteen hexadecimal digits. Because computers work in binary (base 2, using only 0 and 1; the "digital" in "digital camera" is the everyday name for binary), Chinese codes are written in binary-based (hexadecimal) notation rather than decimal.
  • A hexadecimal digit corresponds to 4 binary digits (0, 1); two hexadecimal digits form a byte (the smallest addressable unit of computer memory).
  • A Unicode code and a neutral code are each two bytes long; a word code is four bytes long and is composed of eight hexadecimal digits.
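The byte counts above can be checked with a short Python sketch. The Unicode values are computed from the string itself; the neutral codes and the word code are simply the hex strings quoted in this description, since the tables that produce them are not reproduced here. The helper name `byte_length` is an assumption made for illustration.

```python
TEXT = "汉语百宝箱"

# Unicode code points of each character, as four hex digits.
unicode_units = [f"{ord(ch):04X}" for ch in TEXT]

# Values quoted in the description (not computed here).
NEUTRAL_UNITS = "BA7E BB79 A6CA C45F BD63".split()
WORD_CODE = "ABCD1234"


def byte_length(hex_digits):
    """Two hexadecimal digits make one byte."""
    return len(bytes.fromhex(hex_digits))


sizes = {
    "unicode": sum(byte_length(u) for u in unicode_units),
    "neutral": sum(byte_length(u) for u in NEUTRAL_UNITS),
    "word": byte_length(WORD_CODE),
}
print(unicode_units)
print(sizes)
```

In this example the five-character phrase takes ten bytes as Unicode or neutral codes, while a single word code for the whole phrase takes four bytes.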
  • the matrix coding table of the embodiment of the present invention is conceived according to the hexadecimal notation.
  • the embodiment of the present invention should develop a new input method to enter information.
  • Input may be by handwriting, speech, pinyin, radicals, zhuyin, strokes, foreign languages, and so on. Whatever the manner (pinyin, handwriting, speech, etc.), the embodiment of the present invention builds on the prior art and passes the input through the morpheme input method of the embodiment.
  • the input unit is a word or a phrase rather than a single character. Using the word as the input unit reduces the rate of duplicate candidates. If the input is polysemous, a dialog box is displayed asking the user to select the intended meaning.
  • steps 11 to 13 are the same as in the prior art; steps 14 and 15 belong to the embodiment of the present invention: the user selects either the "food" meaning or the "fans" meaning and obtains the corresponding word code.
  • Example 2: when the user types "CHEN", the system displays: 1. Chen 2. Chen 3. dust 4. dust 5. morning 6. ... (items 1 and 2, and items 3 and 4, are different written forms of the same morpheme). If the user selects item 1 or 2, the internal code is B3AF; if the user selects item 3 or 4, the internal code is B9D0. B3AF is the neutral code of "Chen (surname)" and B9D0 is the neutral code of "dust". At the input stage only the meaning of the character matters and the glyph is ignored: the simplified and traditional forms are recorded with the same neutral code.
  • Example 3: when the user types "BAI", the system displays: 1. white (color) 2. white (speaking) 3. white (surname) 4. worship 5. pendulum 6. defeat ...; to enter the character "white", the user chooses one of the first three items, that is, the "white" of color, the "white" of speech, or the "white" of the surname. The internal code is then A5D5, A5D6, or A5D7 respectively, which makes the meaning of "white" unambiguous.
  • an embodiment of the present invention further provides a storage medium for storing computer program instructions according to the Chinese meaning encoding processing method according to the embodiment of the present invention.
  • an embodiment of the present invention further provides an encoding processing software system based on Chinese meaning, including the storage medium, where computer program instructions in the storage medium are called to complete encoding processing based on Chinese meaning.
  • the software system includes a morpheme encoding module 10, a sentence encoding module 20, a table module 30, and an input module 40, wherein:
  • the morpheme encoding module 10 is configured to encode the morphemes of Chinese to obtain the morpheme code of each morpheme.
  • the sentence encoding module 20 is configured to encode words and phrases in Chinese to obtain the word codes of the words and phrases.
  • the table module 30 is configured to construct a Chinese encoding database, where the database includes a morpheme table corresponding to the morphemes and a word and phrase table corresponding to the words and phrases; the morpheme table contains the morpheme code of each morpheme, and the word and phrase table contains the word codes of the words and phrases.
  • the input module 40 is configured to input Chinese by using a morpheme-assisted input method according to a neutral code and a word code using an electronic device.
  • an encoding processing device based on Chinese meaning is further provided, including a central processing unit and the storage medium connected to the central processing unit;
  • the central processor invokes computer program instructions in the storage medium to perform an encoding process based on Chinese meaning.
  • the working process of the storage medium, the software system, and the processing device in the embodiment of the present invention is basically the same as the Chinese encoding method based on the Chinese meaning. Therefore, in the specific embodiment, the detailed description will not be repeated.
  • to help overseas Chinese and foreigners who do not know pinyin to input Chinese, the present embodiment also provides glyph-based input methods (handwriting, Cangjie, Wubi, radical strokes, zhuyin, strokes).
  • the system performs recognition after a whole word (such as "daytime", literally "white day") or phrase (such as "Li Bai") has been entered by a traditional method such as handwriting, Cangjie, Wubi, radical strokes, zhuyin, or strokes. The system therefore knows that "daytime" and "Li Bai" are indivisible strings and looks up the word "daytime" or "Li Bai".
  • a morpheme is used to define a word or phrase; the unit of input is not a morpheme but a word or phrase.
  • each morpheme corresponds one-to-one with its code, so codes are unique. Chinese words and phrases are analyzed, collected, and stored in hundreds of tables in a relational database (Access, Oracle, or another), and no two codes are the same;
  • each morpheme has a hyperlink function.
  • the user can browse the entire knowledge base at will. For example, while reading Bai Juyi's "The Song of Everlasting Sorrow", the user clicks on the morpheme "Yuyang Drums" to see its explanation; the user then clicks on the morpheme "An Lushan" in the explanatory text and is taken to the "Ancient Chinese Names List", which shows the life of An Lushan and an account of the An Shi Rebellion; after reading the explanation, the user can return to the verse of "The Song of Everlasting Sorrow".
  • existing language encoding systems have been in use for many years; the Unicode system in particular has been in use for more than 20 years and has become dated. It cannot carry the new needs created by rapid technological progress in the information age. The embodiment of the present invention therefore introduces the neutral code (morpheme coding) and the word code (word and phrase coding), so that the vitality of this method and system promotes the continued development of language, culture, and technology.
  • language systems are diverse. Taking Chinese characters as an example, independent development has, for historical reasons, produced two written forms, simplified and traditional, which hinders cultural and economic exchange. At the same time, the adoption of new words, the translation of foreign words, and the coining of technical vocabulary are far from uniform, which hinders the interaction of language and culture.
  • the embodiment of the present invention serves the people of the world through technical reform. It collects language material, attempts to unify the processing of foreign words and new words, and uses the neutral code (morpheme coding) logic to let multiple written forms coexist, so that users can conveniently select and use them.
  • the embodiment of the present invention collects a large morpheme vocabulary, encodes it (adding corresponding foreign-language words, etc.), and obtains: 1) a linguistic knowledge base with morphemes at its core; 2) a language processing system with the neutral code and the word code as its backbone; 3) an intelligent-prompt input method backed by the language knowledge base.
  • the three major modules of knowledge base, coding system and input method can be applied independently or combined.
  • the morpheme-based language processing method and system of the embodiment of the present invention processes language information conveniently, precisely, and flexibly. It supports search, analysis, and statistics over large language data sets and can traverse a large relational database of the language, which substantially increases its value.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention provides a Chinese encoding method and system based on Chinese meaning, and a medium device. The method comprises the following steps: encoding the morphemes of Chinese to obtain a morpheme code for each morpheme; encoding the words and phrases of Chinese to obtain word and phrase codes for the words and phrases (note: the word and phrase code is also known as the "concept code"); and constructing a Chinese encoding database that comprises a morpheme table corresponding to the morphemes and a word and phrase table corresponding to the words and phrases, the morpheme table including the morpheme code of each morpheme and the word and phrase table including the word and phrase codes of the words and phrases. With this method, language information is processed conveniently, meticulously, and flexibly, so that big data of the language can be searched, parsed, and counted, and the large relational database of the language can be traversed, which effectively enhances its value.

Description

Chinese encoding method and system based on Chinese meaning, and medium device
Technical field
The invention relates to computer data processing technology, and in particular to a digital encoding method and system for Chinese that encodes by Chinese meaning, and to a medium device.
Background
Generally, before Chinese can be processed digitally, for example by a computer, it must first be encoded: text is entered and processed through codes acting as intermediaries, so that information can be stored and transmitted. Only then can the Chinese character system be used in the information age represented by personal computers, the World Wide Web, and smartphones.
The earliest Chinese encoding was the Chinese Commercial Telegraph Code, published in 1880. It was followed by Wang Yunwu's four-corner method in the early years of the Republic of China, Taiwan's Big5 code and mainland China's national standard (GB) code in the 1980s, and finally the international Unicode standard at the end of the 20th century. With these increasingly complete encoding systems, Chinese data has followed the Latin-script languages closely into the digital world.
However, the telegraph code, the four-corner method, Big5, the GB code, and Unicode are all single-character code systems, in which each code represents one Chinese character. The weakness of Unicode as used in the prior art is that it represents only the glyph of a Chinese character and ignores its pronunciation and meaning, so the meaning of a text cannot be understood and processed directly. As a result, the advantages of Chinese over Western languages have not been fully exploited, and the nature of Chinese as a combination of form (strokes), sound (pinyin), and meaning has not been effectively digitized.
Existing Chinese characters represent form, and a single written form can carry several meanings. From ancient times to the present, people have treated the written character as the unit of word formation, and all digital information systems, including computer processing, digital search, dissemination, and translation, use the written form of characters as the basic unit of digital information processing.
A Chinese text is a collection of words and phrases rather than a collection of characters. It is a word or a phrase that represents a complete concept; a single character cannot carry out this task. The traditional approach of using glyph-based single-character codes as the medium for digital processing, information storage, and electronic dissemination therefore limits the spread of Chinese in the culture of the digital age, cannot give strong support to information search and analysis, lacks room for extension, and needs further improvement.
Summary of the invention
To overcome the intrinsic defects of existing digital encoding schemes for Chinese characters, the present invention provides a Chinese encoding method and system based on Chinese meaning, and a medium device. It encodes the basic constituent elements of Chinese meaning (morphemes) rather than glyph elements. Using meaning elements, that is, morphemes, addresses accuracy problems that commonly arise when computers process Chinese characters, such as the same character carrying different meanings or different pronunciations.
The Chinese digital encoding method and system and medium device based on Chinese meaning (morphemes) comprise: morpheme codes based on Chinese meaning, called neutral codes; word codes (also called "concept codes") based on Chinese words and phrases (meaningful combinations of morphemes); and a Chinese database, built from these codes, that covers the large body of Chinese meanings.
A morpheme-based Chinese encoding method provided to implement the present invention includes the following steps:
encoding the morphemes of Chinese to obtain a morpheme code for each morpheme;
encoding the words and phrases of Chinese to obtain a word code for each word and phrase;
constructing a Chinese encoding database that includes a morpheme table corresponding to the morphemes and a word and phrase table corresponding to the words and phrases, where the morpheme table contains the morpheme code of each morpheme and the word and phrase table contains the word codes of the words and phrases.
Preferably, the morpheme table includes a glyph summary table and a word meaning summary table, which are linked to each other through the morpheme code.
Preferably, the encoding processing method based on Chinese meaning further includes the following step:
using an electronic device, inputting Chinese with a morpheme-assisted input method according to the neutral code and the word code.
This step includes:
receiving input data information;
providing a morpheme selection prompt according to the input data information;
receiving morpheme selection information and determining the morpheme code corresponding to the selected morpheme;
querying the word meaning summary table and providing the Chinese characters corresponding to the morpheme code;
querying the glyph summary table according to the selected Chinese characters and determining the Chinese to be entered;
displaying and entering the determined Chinese.
Preferably, encoding the Chinese morphemes includes encoding single Chinese characters,
which includes the following steps:
constructing a unique canonical character code for each single Chinese character;
determining the number of different meanings the single Chinese character carries;
assigning a morpheme sequence number to each meaning of the single Chinese character;
combining the canonical character code and the morpheme sequence number to form the morpheme code of each distinct-meaning morpheme of the single Chinese character.
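The four steps above can be illustrated with a minimal Python sketch. The text does not specify how the canonical character code and the morpheme sequence number are combined, so the layout used here (canonical code, a hyphen, then a 1-based sequence number) is purely an assumption, as are the function name `morpheme_codes` and the use of the Unicode code point as a stand-in canonical code. The three meanings of 日 are the ones given in the detailed description.

```python
def morpheme_codes(canonical_code, meanings):
    """Pair one canonical character code with a sequence number per meaning.

    The "CODE-N" layout is a hypothetical choice for illustration only.
    """
    return {
        f"{canonical_code}-{number}": meaning
        for number, meaning in enumerate(meanings, start=1)
    }


# Stand-in canonical code: the Unicode code point of the character (assumption).
canonical = f"{ord('日'):04X}"
codes = morpheme_codes(canonical, ["sun", "day", "Japan"])
for code, meaning in codes.items():
    print(code, meaning)
```

Whatever the real layout, the point is that one character with three meanings yields three distinct morpheme codes that share a common character part.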
Preferably, there are multiple word and phrase tables.
Preferably, the word code is represented as a 32-bit number written in hexadecimal.
Preferably, word codes are stored in the Chinese encoding database using an eight-dimensional matrix space.
Preferably, storing the word codes in an eight-dimensional matrix space includes:
treating each word code as a point in the eight-dimensional matrix space, located by eight hexadecimal values X, Y, Z, P, Q, R, S, and T, and using the eight-dimensional matrix structure as the storage address.
Preferably, using the eight-dimensional matrix structure as the storage address of the word code includes:
the word code consists, in order, of three parts: a classification value, a sorting link value, and a protection code, structured as follows:
Figure PCTCN2018086500-appb-000001
Here the classification value comprises three digits, giving the X-, Y-, and Z-axis coordinates; the sorting link value comprises four digits, giving the P-, Q-, R-, and S-axis coordinates; and the protection code is the last digit, giving the T-axis coordinate.
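The three-part layout just described can be sketched as a small parser. This is an illustration only: the names `WordCode` and `parse_word_code` are hypothetical, the sample value "ABCD1234" is the word code quoted elsewhere in this description, and the sketch only splits the digits. It does not compute or verify the protection code, because the coding map used for that is not given in the text.

```python
from typing import NamedTuple


class WordCode(NamedTuple):
    classification: tuple  # X, Y, Z coordinates
    link: tuple            # P, Q, R, S coordinates (sorting link value)
    protection: int        # T coordinate (protection code)


def parse_word_code(code):
    """Split an eight-digit hexadecimal word code into its eight coordinates."""
    if len(code) != 8:
        raise ValueError("a word code has exactly eight hexadecimal digits")
    digits = [int(ch, 16) for ch in code]  # raises ValueError on non-hex input
    return WordCode(tuple(digits[0:3]), tuple(digits[3:7]), digits[7])


print(parse_word_code("ABCD1234"))
```

Each coordinate ranges over the sixteen hexadecimal values, so the eight axes together address 16 to the power 8 points, which matches a 32-bit code.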
Preferably, the classification value represents a point in three-dimensional space, and all classification values are stored in a data layout map of that three-dimensional space, the data layout map being a table; and
the word and phrase tables are divided into several types, with different classification values corresponding to different types of word and phrase table.
Preferably, the protection code is a control value calculated from the classification value and the sorting link value by means of a coding map, the coding map being a table structure composed of several vectors and matrices.
Preferably, the word and phrase tables include: character-dictionary tables, word-dictionary tables, poetry and classical-text tables, and history and geography tables.
To achieve the object of the invention, a storage medium is also provided for storing computer program instructions of the encoding processing method based on Chinese meaning.
To achieve the object of the invention, an encoding processing software system based on Chinese meaning is further provided, which includes the storage medium; the computer program instructions in the storage medium are invoked to carry out encoding processing based on Chinese meaning.
To achieve the object of the invention, an encoding processing device based on Chinese meaning is further provided, including a central processing unit and the storage medium connected to the central processing unit;
the central processing unit invokes the computer program instructions in the storage medium to carry out encoding processing based on Chinese meaning.
The Chinese encoding method and system and medium device based on Chinese meaning have the following advantages:
The invention has a breakthrough design that takes full account of the convenience and accuracy of using morphemes as the code system for digitizing Chinese characters. With the neutral code and the word code at its core, it addresses problems in Chinese digital processing such as homophones written with different characters and single characters with different meanings. The morpheme table, and applications derived from it such as the intelligent-prompt input method, provide flexible, accurate, and comprehensive encoding, so that people can input Chinese and understand its semantics more conveniently and accurately. The encoding system has the potential to improve the efficiency of electronic processing of Chinese, to adapt Chinese better to the information-processing requirements of the digital age, and to contribute to the promotion of Chinese culture in the digital age.
Brief description of the drawings
To describe the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in that description are briefly introduced below. The drawings described below show some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of the morpheme-based Chinese encoding processing method according to an embodiment of the present invention;
FIG. 2 shows one possible implementation of step S100 in FIG. 1;
FIG. 3 shows one possible implementation of step S200 in FIG. 1;
FIG. 4 shows one possible implementation of step S400 in FIG. 1;
FIG. 5 shows the morpheme-based Chinese encoding system according to an embodiment of the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in detail below with reference to specific embodiments, as shown in FIGS. 1 to 5. Descriptions of well-known structures and techniques are omitted to avoid unnecessarily obscuring the concepts of the invention. The descriptions are exemplary only and do not limit the scope of the invention.
As the basic element of Chinese semantics, a morpheme must satisfy two requirements: (1) it has exactly one pronunciation and one precise basic meaning; (2) it carries no notion of glyph: it is neutral with respect to written form and does not distinguish simplified from traditional forms, which facilitates searching, statistics, and analysis of information.
In the embodiments of the invention, a morpheme is the smallest unit of language that carries Chinese meaning. A single character may correspond to several morphemes, depending on how many meanings it has. Morphemes are the elements from which Chinese words are formed; each is a unique semantic unit that depends on characters and phrases and cannot exist on its own. For example, 传 corresponds to two morphemes (English: send, biography), 历 corresponds to two morphemes (English: history, calendar), and 日 corresponds to three morphemes (English: sun, day, Japanese). Each morpheme has exactly one pronunciation and one meaning.
In one implementation, the code formed by encoding a morpheme is called a neutral code, and the code formed by encoding a word or phrase is called a word code.
A Chinese encoding method based on Chinese meaning according to an embodiment of the invention, as shown in FIG. 1, includes the following steps:
Step S100: encode the morphemes of Chinese to obtain the morpheme code of each morpheme;
As shown in FIG. 2, in this embodiment the multiple meaning attributes of Chinese are analyzed, morphemes are identified, and each morpheme is defined and encoded to obtain its morpheme code, that is, its neutral code.
Chinese characters are used to record things: texts are made of sentences, sentences of words and phrases, and words and phrases of characters. Unlike Western scripts, a Chinese character has three attributes at once: form (strokes), sound (pinyin), and meaning. One written form can have several meanings and pronunciations. This polysemy hinders automated information processing and big-data analysis of encoded Chinese, and makes retrieval, dissemination, translation, and input considerably harder. To address this weakness, the embodiment analyzes the multiple meaning attributes of Chinese and encodes by morpheme to obtain neutral codes.
Words, morphemes, and characters differ as follows: (1) a word is the unit of sentence construction; (2) a morpheme is the unit of word formation; (3) a character is the written unit that records words and morphemes. The first two belong to the system of linguistic signs and carry meaning; the last belongs to the writing system, where the glyph attribute dominates and the meaning attribute is vague. The most obvious difference between a morpheme and a character is that a morpheme expresses meaning and is neutral: it can be displayed in several different written forms, so its code may be called a morpheme code, that is, a neutral code;
The embodiment departs from this entrenched traditional method by taking morpheme codes as the unit of word formation. Information processing built around morphemes is something other languages (including English and French) cannot achieve, as shown in Table 1.
In addition, with morphemes at the core of Chinese, conversion between simplified and traditional characters no longer relies on context analysis. It is instead driven by lookups in the morpheme table (the morpheme table is the collection of morphemes, and both simplified and traditional forms can be defined in it), with no need to identify whether a character is simplified or traditional; retrieval accuracy can essentially reach 100%.
Table 1:
Figure PCTCN2018086500-appb-000002
Figure PCTCN2018086500-appb-000003
Preferably, as one possible implementation, a code is constructed for each morpheme on the basis of the existing Chinese characters, with each code corresponding to one neutral code.
As one possible implementation, the morpheme coding method of the embodiments takes shape, sound, and meaning into account together; that is, each morpheme is encoded with a neutral code.
In the embodiments of the present invention, the information of Chinese characters is encoded using two master tables: the glyph master table and the meaning master table. The glyph master table encodes only the "shape" attributes of a character (radical, stroke count, stroke order, phonetic component); the meaning master table registers only the "meaning" and "sound" attributes. Characters with the same sound and the same meaning (for example 塵/尘, 陳/陈, 峯/峰) share a single code: whether written in traditional, simplified, or variant form, they are treated as the same character as long as sound and meaning match. The morpheme codes in the meaning master table are therefore also called "neutral codes".
To solve the problems caused by a single written form often carrying several meanings, the embodiments adopt a meaning master table and a glyph master table. The meaning master table represents the "meaning" of a morpheme and not its "shape"; the glyph master table represents the "shape" of a morpheme and not its "meaning". This data structure is designed according to the third law of database integrity proposed by Dr. Edgar Frank Codd, the inventor of the relational database. Its purpose is to add simplified-form and traditional-form fields to the meaning master table and thereby turn the complex many-to-many relationship among the shape, sound, and meaning of existing Chinese characters into simple many-to-one and one-to-one relationships.
Furthermore, the two master tables change the procedure for digital processing of Chinese text: input and storage use the meaning master table, while output (display or printing of text) uses the glyph master table. Handling input and output separately is a major innovation in information processing and will change how people work.
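The split between the two master tables can be pictured with a minimal sketch. It assumes a meaning-table record that carries simplified and traditional fields, as the text describes; the record layout, the function names, and the neutral codes `9001A` and `9002M` are hypothetical, since the text gives no codes for these characters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeaningRecord:
    """One row of the meaning master table (field names are illustrative)."""
    neutral_code: str   # hypothetical value, not taken from the patent text
    simplified: str
    traditional: str
    pinyin: str
    gloss: str


MEANING_TABLE = {
    rec.neutral_code: rec
    for rec in (
        MeaningRecord("9001A", "尘", "塵", "chén", "dust"),
        MeaningRecord("9002M", "陈", "陳", "chén", "surname Chen"),
    )
}


def codes_for_glyph(glyph: str) -> list[str]:
    """Input side: any written form resolves to the neutral code(s) it can stand for."""
    return sorted(
        code
        for code, rec in MEANING_TABLE.items()
        if glyph in (rec.simplified, rec.traditional)
    )


def render(codes: list[str], script: str = "simplified") -> str:
    """Output side: the glyph form is chosen only when displaying or printing."""
    if script not in ("simplified", "traditional"):
        raise ValueError(f"unknown script: {script}")
    return "".join(getattr(MEANING_TABLE[code], script) for code in codes)
```

Stored text is a list of neutral codes, so the same stored data can be rendered in either script without any conversion step.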
Studies show that one third of Chinese characters are polysemous. The embodiments assign a neutral code to each distinct meaning of a standard Chinese character, where the neutral code is the code of a morpheme. The purpose of the morpheme is to define each meaning of a standard character precisely; because of polysemy, one standard character (from the character set published by the State Council of China in 2013) can correspond to several morphemes. A standard character is encoded as four Arabic digits, and the structure of a neutral code is "standard character code" + "morpheme serial number", as follows:
Figure PCTCN2018086500-appb-000004
For example, the character 行 has several pronunciations and several meanings. The code of the standard character 行 is "0483", and its meanings include (1) walk, (2) row, (3) business, and others. The embodiments therefore define several morphemes for the standard character 行: (1) 0483A (the "walk" morpheme), (2) 0483B (the "row" morpheme), (3) 0483C (the "business" morpheme), and so on, which clearly separates the different meanings of 行, as shown in Table 2.
Table 2: Example of the morpheme table
Standard character | Standard character code | Morpheme code | Morpheme pronunciation | Morpheme meaning | Words it can form
行 | 0483 | 0483A | xíng | 走 (walk) | 行走, 步行, 旅行, 行踪
行 | 0483 | 0483B | háng | 排 (row) | 单行, 双行, 雁飞成行
行 | 0483 | 0483C | háng | 行业 (business) | 外行, 同行如敌国
A morpheme code takes the code of an existing standard character (for example, "0483" is the standard character code of 行) and appends an identifying letter (A, B, C, D, E, F, G, ...) to distinguish morphemes with different meanings. In the example above, 0483A is the code of the "walk" morpheme, 0483B is the code of the "row" morpheme, and 0483C is the code of the "business" morpheme.
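The "four digits plus one letter" structure can be checked mechanically. The following is a minimal sketch; the function names and the regular expression are illustrative, and only the codes 0483A, 0483B, and 0483C come from the text.

```python
import re

# Four Arabic digits (standard character code) followed by one serial letter.
_MORPHEME_CODE = re.compile(r"^(\d{4})([A-Z])$")


def make_morpheme_code(char_code: str, serial: str) -> str:
    """Join a standard character code and a serial letter into a neutral code."""
    code = f"{char_code}{serial}"
    if not _MORPHEME_CODE.match(code):
        raise ValueError(f"not a valid morpheme code: {code!r}")
    return code


def parse_morpheme_code(code: str) -> tuple[str, str]:
    """Split a neutral code back into (standard character code, serial letter)."""
    match = _MORPHEME_CODE.match(code)
    if match is None:
        raise ValueError(f"not a valid morpheme code: {code!r}")
    return match.group(1), match.group(2)
```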
As a more preferred implementation, the number of morpheme codes N is added on top of the morpheme coding, giving N morpheme codes, where N is an integer indicating that the existing Chinese character has N morpheme codes in total.
That is, the number N is appended to the code of the existing character, where N is an integer and the existing standard character has N morpheme codes. For example, the morpheme code of the existing standard character 行 is 04833, where the final digit 3 indicates that this standard character has 3 morphemes.
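A short sketch of the count suffix, assuming N is written as plain decimal digits after the four-digit character code (the text shows only the single-digit case 04833, so the handling of N of ten or more is an open assumption). The function name is illustrative.

```python
def count_suffixed_code(char_code: str, morpheme_codes: list[str]) -> str:
    """Append N, the number of morphemes defined for this standard character."""
    own = [code for code in morpheme_codes if code.startswith(char_code)]
    return f"{char_code}{len(own)}"
```

For the three morphemes of 行 listed in Table 2 this yields `04833`, matching the example in the text.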
Step S200: encode the words and phrases of Chinese to obtain the word-phrase codes of the words and phrases;
As shown in FIG. 3, in the embodiments of the present invention, codes are assigned to words or phrases with the morpheme code (neutral code) as the building unit, giving word-phrase codes.
Words and phrases, which the embodiments build from morphemes, are the basic units of human thought, reasoning, and information exchange. By analogy with chemistry, a character is like an atom, and a word or phrase is like a molecule or a gene (DNA); the analysis of a substance's properties should stop at the molecule or gene, and the analysis of a text should take words and phrases as its basic units. From the standpoint of the embodiments, the morpheme is the element that makes up a word, and the word is the basic unit that makes up a sentence. The embodiments treat Chinese words (essentially the words of the language) and phrases as one category. From this point of view, a "word" is any single-syllable morpheme, or combination of several morphemes, that can carry meaning independently. Single-character words, multi-character words, and phrases (idioms, couplets, proverbs, common sayings, two-part allegorical sayings, maxims, famous lines, set expressions, personal names, place names, organization names, brand names, product names, technical terms, ...) should therefore all be encoded as word-phrase codes, and word-phrase codes are built from morpheme codes (neutral codes).
The greatest benefit of defining words and phrases by morphemes is that morphemes have "uniqueness" (unicity). The function of writing is to record, and the clearer the better. The main purpose of encoding information is to obtain uniqueness and eliminate the vagueness and inaccuracy of ordinary language.
A morpheme is the unit from which words and phrases are composed, and it should be able to indicate their pronunciation accurately. The codes assigned to words and phrases on the basis of morphemes are collectively called word-phrase codes in the embodiments. As shown in Table 3, from the standpoint of word formation they fall into categories including but not limited to the following eight: (1) language morphemes (word-phrase codes); (2) surname morphemes (word-phrase codes); (3) personal-name morphemes (word-phrase codes); (4) place-name morphemes (word-phrase codes); (5) science and technology morphemes (word-phrase codes); (6) classical Chinese morphemes (word-phrase codes); (7) meaningless phonetic morphemes (word-phrase codes); (8) graphic morphemes (word-phrase codes); and so on. The last two categories are not recognized as true morphemes (true word-phrase codes) in the prior art, but the embodiments also encode them to support precise retrieval and big-data analysis, and call them "pseudo-morphemes (pseudo word-phrase codes)".
With the morpheme as the building unit, words or phrases made of meaningless phonetic morphemes (which carry no meaning and only express pronunciation) and graphic morphemes are encoded as pseudo word-phrase codes.
As one possible implementation: many characters that make up words or phrases (especially loanwords) indicate only sound, not meaning, such as 马 and 达 in 马达 (motor), or 雪, 铁, and 龙 in 雪铁龙 (Citroën). These characters, used to transliterate foreign products, trademarks, personal names, and place names, serve only to indicate sound; 马, 达, 雪, 铁, 龙 here have nothing to do with their original meanings.
The word-phrase codes of the embodiments use a "meaningless phonetic morpheme" table to collect all phonetic characters and encode each one, which greatly improves the accuracy of information retrieval and analysis.
Table 3: Morpheme coding rules
Morpheme category | Morpheme serial number | Words it can form (examples)
Language morphemes | A, B, C, D, E, F, G, H, I, J, K, L | 蛇行, 苦行僧, 自行其是
Surname morphemes | M | 陈, 李, 张, 王, 何
Personal-name morphemes | N | 慈禧太后, 李白, 朱邦復
Place-name morphemes | P | 上海, 巴黎, 马陵道
Science and technology morphemes | R | 本特雷电报码, 有机发光材料
Classical Chinese morphemes | T | 夫未战而庙算胜者,得算多也
Meaningless phonetic morphemes | V | 马歇尔, 三文治, 雷达
Graphic morphemes | Z | 图书馆/圖書館
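The serial-letter ranges of Table 3 lend themselves to a direct lookup. This is a minimal sketch: the mapping mirrors the table, while the category labels in English and the function names are illustrative.

```python
# Serial letter -> morpheme category, following Table 3.
CATEGORY_BY_LETTER = {
    **{letter: "language" for letter in "ABCDEFGHIJKL"},
    "M": "surname",
    "N": "personal name",
    "P": "place name",
    "R": "science and technology",
    "T": "classical Chinese",
    "V": "meaningless phonetic",
    "Z": "graphic",
}

# The text calls the last two categories "pseudo-morphemes".
PSEUDO_LETTERS = {"V", "Z"}


def category(morpheme_code: str) -> str:
    """Return the Table 3 category of a code such as '2777P'."""
    return CATEGORY_BY_LETTER[morpheme_code[-1]]


def is_pseudo(morpheme_code: str) -> bool:
    """True for meaningless phonetic and graphic morphemes."""
    return morpheme_code[-1] in PSEUDO_LETTERS
```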
For example, the words 马车道 (carriage road) and 马陵道 (Maling Road) both contain the characters 马 and 道. The standard character number of 马 is 2777, and the standard character number of 道 is 2745. The 马 morpheme in 马车道 is 2777A, while the 马 morpheme in 马陵道 is 2777P; the 道 morpheme in 马车道 is 2745B, while the 道 morpheme in 马陵道 is 2745P. The 马 and 道 morphemes of the two words differ because 马车道 is a common word and 马陵道 is a geographical name. Without a distinction at the morpheme level, information search cannot be precise: 马 meaning the animal and 马 as a sound in a geographical name get mixed together, and the results of data analysis become inaccurate.
Examples:
The animal 马 morpheme (2777A) can form words or phrases such as 马路, 赛马, 马到成功;
The place-name 马 morpheme (2777P) can form phrases such as 马陵道, 马嵬坡;
The phonetic 马 morpheme (2777V) can form phrases such as 马达, 罗马, 马德里;
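The difference between searching by glyph and searching by morpheme can be sketched with the example words above. The index layout and function names are illustrative; the codes and word lists are the ones given in the text.

```python
# Words indexed by the morpheme code of the character 马 they contain.
WORDS_BY_MORPHEME = {
    "2777A": ["马路", "赛马", "马到成功"],   # animal
    "2777P": ["马陵道", "马嵬坡"],           # place name
    "2777V": ["马达", "罗马", "马德里"],     # sound only
}


def search_by_glyph(glyph: str) -> list[str]:
    """Plain string search: every word containing the glyph, whatever it means."""
    return [
        word
        for words in WORDS_BY_MORPHEME.values()
        for word in words
        if glyph in word
    ]


def search_by_morpheme(code: str) -> list[str]:
    """Morpheme search: only words in which the character carries this meaning."""
    return list(WORDS_BY_MORPHEME.get(code, []))
```

A glyph search for 马 returns all eight words, while a search for `2777P` returns only the two place names.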
With the morpheme as the unit of word formation, words or phrases other than those of meaningless phonetic morphemes and graphic morphemes are encoded to obtain pseudo word-phrase codes.
Table 4 below, the "Morpheme and word comparison table", shows the relationship between morphemes and words.
Table 4: Morpheme and word comparison table
Item | Morpheme | Word/phrase
Code | Morpheme code (neutral code) = 4 Arabic digits + morpheme serial number | Word-phrase code (stored across multiple tables by category and use)
Function | The smallest information-processing unit that makes up a word or phrase | The unit that makes up compound words or sentences
Core tables | Morpheme Table + auxiliary tables (radicals, phonetic components, ...) | Common words, idioms, personal names, place names, terms, ... (dozens of tables)
Table structure | The Standard Character Glyph Table and the Morpheme Table are sister tables | Each table has a different number of fields and different contents; the tables are linked to one another
Basic uses | Glyph lookup, pronunciation lookup, meaning lookup, morpheme retrieval, ... | Information access, search, and analysis; support for predictive input methods
Whether common words or phrases, they are all defined by one or more morphemes ("word/phrase" = "morpheme 1" + "morpheme 2" + "morpheme 3" ...). Taking the Idiom Table as an example, each idiom consists of four (or more) morphemes, as shown in Table 5 below:
Table 5:
Idiom | Pinyin | 1st character | 2nd character | 3rd character | 4th character | Meaning of 道 | Serial number | Morpheme code
安贫乐道 | ān pín lè dào | 安 | 贫 | 乐 | 道 | law, morality | A | 2745A
班荆道故 | bān jīng dào gù | 班 | 荆 | 道 | 故 | to say, to speak | C | 2745C
背道而驰 | bèi dào ér chí | 背 | 道 | 而 | 驰 | road, path | B | 2745B
兵行诡道 | bīng xíng guǐ dào | 兵 | 行 | 诡 | 道 | law, morality | A | 2745A
惨无人道 | cǎn wú rén dào | 惨 | 无 | 人 | 道 | law, morality | A | 2745A
豺狼当道 | chái láng dāng dào | 豺 | 狼 | 当 | 道 | road, path | B | 2745B
称孤道寡 | chēng gū dào guǎ | 称 | 孤 | 道 | 寡 | to say, to speak | C | 2745C
称兄道弟 | chēng xiōng dào dì | 称 | 兄 | 道 | 弟 | to say, to speak | C | 2745C
盗亦有道 | dào yì yǒu dào | 盗 | 亦 | 有 | 道 | law, morality | A | 2745A
道傍之筑 | dào bàng zhī zhù | 道 | 傍 | 之 | 筑 | road, path | B | 2745B
道边苦李 | dào biān kǔ lǐ | 道 | 边 | 苦 | 李 | road, path | B | 2745B
道山学海 | dào shān xué hǎi | 道 | 山 | 学 | 海 | law, morality | A | 2745A
In the embodiments of the present invention, taking the Idiom Table as an example, 75 of its seven thousand idioms contain the character 道, but 道 corresponds to 6 or 7 morphemes. When encoding the Idiom Table, the embodiments should therefore state exactly which morpheme (one of A, B, C, ...) the character 道 in each idiom is.
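With the morpheme of 道 recorded per idiom, the rows of Table 5 can be filtered and counted by meaning. The sketch below uses the twelve rows shown in Table 5; the variable and function names are illustrative.

```python
from collections import Counter

# (idiom, morpheme code of 道), taken from the rows of Table 5.
IDIOM_DAO_CODES = [
    ("安贫乐道", "2745A"), ("班荆道故", "2745C"), ("背道而驰", "2745B"),
    ("兵行诡道", "2745A"), ("惨无人道", "2745A"), ("豺狼当道", "2745B"),
    ("称孤道寡", "2745C"), ("称兄道弟", "2745C"), ("盗亦有道", "2745A"),
    ("道傍之筑", "2745B"), ("道边苦李", "2745B"), ("道山学海", "2745A"),
]


def idioms_with(code: str) -> list[str]:
    """Idioms in which 道 carries the given morpheme."""
    return [idiom for idiom, dao_code in IDIOM_DAO_CODES if dao_code == code]


# How often each sense of 道 occurs among the sample rows.
SENSE_COUNTS = Counter(dao_code for _, dao_code in IDIOM_DAO_CODES)
```

A string search for 道 would return all twelve rows; `idioms_with("2745C")` returns only the three idioms in which 道 means "to say".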
In the embodiments of the present invention, consider the first sentence of the first chapter of Laozi's Dao De Jing: "道可道,非常道". The character 道 appears three times in this sentence with three different meanings, so three different morphemes should be used to express and store them (this affects interpretation and translation). The first 道 is a noun meaning the "Dao" of the Dao De Jing; the second 道 is a verb meaning "to speak" (talk); the third 道 means "method". The sentence can be translated as: "The truth that can be put into words is not the eternal truth." The philosophy of the Dao De Jing is profound, and the interpretations of later generations may not reflect Laozi's own view; without the concept of the morpheme, what the author of the Dao De Jing wished to express cannot be translated accurately.
The embodiments distribute the entire Chinese vocabulary (the target is one million items) by category across dozens to hundreds of tables (common words, idioms, couplets, set expressions, two-part allegorical sayings, proverbs, common sayings, maxims, allusions, personal names, place names, names of schools and organizations, technical terms, ...). Encoding the vocabulary means defining words and phrases by morphemes. In the example above, 安贫乐道 is split into the four morphemes 安, 贫, 乐, 道, and the "neutral code" of the 道 morpheme is given as 2745A; 班荆道故 is split into the four morphemes 班, 荆, 道, 故, and the "neutral code" of the 道 morpheme is given as 2745C; and so on. The four Arabic digits 2745 represent the standard character 道, and A and C are serial numbers of 道 morphemes.
In the embodiments of the present invention, the core method is as follows. (1) Every word and every phrase (idiom, place name, technical term, ...) is encoded. Each code represents a concept, not a string of Chinese characters. Words or phrases for the same concept (for example 鼠标/滑鼠 for "mouse", or 宇航员/太空人 for "astronaut") differ as strings but are represented by the same code. A word or phrase with N meanings is represented by N codes (for example 粉丝 has the two clearly different meanings "food" and "FANS", so it is represented by two different codes). (2) Every word-phrase code is defined precisely; for the sake of precision, the embodiments in many cases add the corresponding English or French word (for example using "FANS" to define 粉丝 precisely). (3) Strings (words/phrases) are expressed in neutral codes (word-phrase morphemes); words and phrases expressed in neutral codes are more independent and more precise, and are therefore not affected by differences among simplified, traditional, and variant glyphs. (4) Depending on the nature of the word or phrase, the embodiments record its attributes in tables of different structures (for example, the common word table, the idiom table, and the place-name table differ completely in the number and content of their fields).
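Point (1) of the core method, one code per concept, can be sketched as a small lookup. The eight-digit codes and the dictionary layout below are hypothetical placeholders; only the example strings and their glosses come from the text.

```python
# Concept code -> definition and the strings that express it.
# All code values are made up for illustration.
CONCEPTS = {
    "00A10001": {"gloss": "computer mouse", "strings": ["鼠标", "滑鼠"]},
    "00A10002": {"gloss": "astronaut", "strings": ["宇航员", "太空人"]},
    "00A10003": {"gloss": "food", "strings": ["粉丝"]},
    "00A10004": {"gloss": "FANS", "strings": ["粉丝"]},
}


def codes_for_string(text: str) -> list[str]:
    """All concept codes a given character string can stand for."""
    return sorted(
        code for code, concept in CONCEPTS.items() if text in concept["strings"]
    )
```

Different strings for one concept resolve to the same code, and a string with N meanings resolves to N codes.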
Taking the Idiom Table as an example, the "word/phrase" encoding procedure is:
Step A: collect and organize the Chinese vocabulary, and store it by part of speech or word category in multiple tables of a relational database;
As one possible implementation, the tables include but are not limited to a common word table, an idiom table, a maxim table, an allusion table, a table of Chinese place names, and so on.
Step B: split the vocabulary in the tables into morphemes;
Each table is reviewed, supplemented, and corrected by Chinese-language experts with the assistance of technical staff, and once a complete table has been produced, an encoding program splits the vocabulary into morphemes.
For example, the idiom 安贫乐道 is split into the four standard characters 安, 贫, 乐, 道, and each standard character is defined by four Arabic digits (for example, the 道 of 安贫乐道 is represented by 2745).
Step C: replace each standard character with the appropriate morpheme by attaching a morpheme serial number (A, B, C, ...) to it. For example, the morpheme serial number of 道 in 安贫乐道 is "A" (so the morpheme code is 2745A).
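Steps B and C can be sketched as two small functions. In this sketch only the code 2745 for 道 and its serial letter A come from the text; the codes for 安, 贫, 乐 and their serial letters are placeholders, and the choice of serial letter is assumed to be supplied by the expert review described above.

```python
# Standard character codes. Only "2745" is from the text; the rest are placeholders.
CHAR_CODES = {"安": "9101", "贫": "9102", "乐": "9103", "道": "2745"}


def split_to_char_codes(phrase: str) -> list[str]:
    """Step B: split a phrase into characters and look up each four-digit code."""
    return [CHAR_CODES[ch] for ch in phrase]


def attach_serials(char_codes: list[str], serials: list[str]) -> list[str]:
    """Step C: append the chosen serial letter to each character code."""
    if len(char_codes) != len(serials):
        raise ValueError("one serial letter is needed per character")
    return [code + serial for code, serial in zip(char_codes, serials)]


# The serial letters for the first three characters are placeholders.
anpin_ledao = attach_serials(split_to_char_codes("安贫乐道"), ["A", "A", "A", "A"])
```

The last element of `anpin_ledao` is `2745A`, the morpheme code the text gives for 道 in this idiom.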
Step S300: build a Chinese encoding database. The database includes a morpheme table corresponding to the morphemes and a word-phrase table corresponding to the words and phrases; the morpheme table contains the morpheme code of each morpheme, and the word-phrase table contains the word-phrase codes of the words and phrases.
The neutral codes are classified, aggregated, and linked, and combined with the word-phrase codes to form a morpheme database based on semantic encoding.
As a preferred implementation, the morpheme table is integrated into an eight-dimensional matrix space, where it is sorted and linked.
Each word-phrase code is a 32-bit number, that is, 8 hexadecimal digits or 4 bytes in length; the whole word-phrase code can be represented by a combination of several vectors and matrices.
Because the number of word-phrase codes reaches one million, far beyond the limit of 16 bits (current coding systems, including Unicode, take 16 bits as the standard, and 16 bits can produce only 65,536 codes), the Chinese encoding method of the present invention encodes words and phrases with a 32-bit hexadecimal number (8 hexadecimal digits, 4 bytes long); the morphemes themselves are encoded with 16 bits.
Encoding one million vocabulary items with widely differing structures and attributes is a very complex problem that must take into account the uniqueness (unicity) of the codes, the need for updates, ease of access, and so on. As one possible implementation, each code (both neutral codes and word-phrase codes) is treated as a point in an eight-dimensional matrix space, located by eight hexadecimal values X, Y, Z, P, Q, R, S, T (each one of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F). The embodiments use the structure of the eight-dimensional matrix space as the storage address of the information.
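The mapping between a 32-bit code and a point with eight hexadecimal coordinates can be sketched as follows. The axis names X, Y, Z, P, Q, R, S, T come from the text; the order in which digits are assigned to axes (most significant digit to X) and the function names are assumptions.

```python
AXES = "XYZPQRST"


def to_point(code: int) -> dict[str, str]:
    """Spread a 32-bit code over eight axes, one hexadecimal digit per axis."""
    if not 0 <= code <= 0xFFFFFFFF:
        raise ValueError("code must fit in 32 bits")
    digits = f"{code:08X}"          # zero-padded to exactly eight hex digits
    return dict(zip(AXES, digits))


def from_point(point: dict[str, str]) -> int:
    """Rebuild the 32-bit code from its eight coordinates."""
    return int("".join(point[axis] for axis in AXES), 16)
```

Four hexadecimal digits give 16**4 = 65,536 values, which is the 16-bit limit mentioned above, while eight digits give 16**8 = 4,294,967,296 values, comfortably more than one million.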
A code consists of three parts: the class value, the sort link value, and the protection code. Its structure is:
Figure PCTCN2018086500-appb-000005
The class value is a point in a three-dimensional matrix space and is stored in a table in that space. The table can be called the data allocation table; its task is to record the attributes of the several hundred tables;
The sort link value is the sequential number of a record in each table, expressed as four hexadecimal digits;
As one possible implementation, the sorting can follow pinyin letters, for example a, o, e, ...; it can also follow stroke count, for example 一, 乙, ....
As one possible implementation, the linking can connect items with the same written form, for example linking the poems under the name 李白 (Li Bai); it can also connect items with the same meaning, for example linking all morphemes of 孝 (filial piety).
The protection code is a control digit computed from the class value and the sort value according to a code map; as one possible implementation, it can be a control value computed from the class value and the sort value. The code map is a table structure composed of several vectors and matrices (arrays).
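A sketch of how the three parts could be packed into eight hexadecimal digits. Two things here are assumptions and not the method of the text: the 3 + 4 + 1 digit layout is inferred from "three-dimensional" class value, four-digit sort link value, and eight digits in total, and the weighted-sum check below is a stand-in for the code map, whose contents the text does not give.

```python
def check_digit(class_value: int, sort_value: int) -> int:
    """Stand-in control digit: position-weighted sum of the seven digits, mod 16."""
    digits = f"{class_value:03X}{sort_value:04X}"
    return sum((i + 1) * int(d, 16) for i, d in enumerate(digits)) % 16


def encode(class_value: int, sort_value: int) -> str:
    """Pack class value (3 hex digits), sort value (4) and control digit (1)."""
    if not 0 <= class_value <= 0xFFF:
        raise ValueError("class value must fit in three hex digits")
    if not 0 <= sort_value <= 0xFFFF:
        raise ValueError("sort value must fit in four hex digits")
    return f"{class_value:03X}{sort_value:04X}{check_digit(class_value, sort_value):X}"


def is_valid(code: str) -> bool:
    """Recompute the control digit and compare it with the stored one."""
    if len(code) != 8:
        return False
    try:
        cls, srt, chk = int(code[:3], 16), int(code[3:7], 16), int(code[7], 16)
    except ValueError:
        return False
    return check_digit(cls, srt) == chk
```

A simple check of this kind catches many accidental changes to a code but not all of them; a real code map may use a different and stronger rule.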
As one possible implementation, the class values include but are not limited to the following table classes:
(1) Character dictionary class. The following are the main tables of the code tables (that is, the character dictionary class) of the embodiments (each table is an independent file storing information of one kind):
Single-character table, which collects 13,000 Chinese characters. Its main fields are: simplified and traditional forms, pinyin, abbreviated pinyin, radical, stroke count, phonetic component, stroke order, short definition, detailed explanation.
Common character glyph table, which collects about 6,000 common simplified, traditional, and variant Chinese glyphs. Its main fields are: glyph, Unicode code, radical, stroke count, stroke order, phonetic component, basic pronunciation, basic meaning (for example: 碍/礙, 袄/襖, 鑑/鉴/鋻/鑒; although the characters in each group are synonymous, every distinct glyph has its own record);
Common character meaning table (that is, the common morpheme table), which holds the meaning codes produced from 4,500 common Chinese characters (an estimated 8,000 codes). It is the blueprint for producing neutral codes in the future. Its main fields are: meaning code (that is, morpheme code), default glyph, simplified glyph, traditional glyph, variant glyph, meaning code definition, pinyin, abbreviated pinyin, short definition, detailed definition, examples, English equivalent (for example: 碍/礙, 袄/襖, 鑑/鉴/鋻/鑒; although the glyphs differ, each group of characters has only one record. Some polysemous characters are different: 藏 in 储藏 (to store) and 藏 in 西藏 (Tibet) share a glyph, but the former is read cáng and the latter zàng, and their meanings are entirely unrelated, so two records are used);
Radical table, which collects 260 simplified and traditional radicals (for example: 钅, 釒, and 金 are three different radicals);
Phonetic component table, which collects 1,000 phonetic components. 80% of Chinese characters are phono-semantic compounds; in the embodiments, phonetic components allow characters to be looked up by their sound component, in much the same way as radicals.
Keyword table, which collects 500 keywords. Words and phrases are made up of morphemes; as one possible implementation, the 500 most commonly used morphemes are selected and called "keywords". Searching for information by semantic keywords is a basic function. Unlike in other Chinese coding systems, the keywords are defined by meaning codes and are therefore precise, so searches can be far more fine-grained than other Chinese systems allow.
(2) Word dictionary class. The following are tables of the poetry and word dictionary class of the embodiments:
Common word table, which collects about sixty thousand common words. It is built on the principle of one code, one meaning, so each record has only one basic meaning. Its main fields are: simplified form, traditional form, pinyin, joined pinyin, definition, example sentence, English word, French word, keyword, first character, last character, character count.
Phrase class. The following are tables of the phrase, poetry, and word dictionary class:
Idioms, which collects about seven thousand idioms. Its fields are: simplified string, traditional string, pinyin, joined pinyin, simplified definition, traditional definition, example sentence, English translation, keyword, first character, last character;
Chinese couplets, which can collect three thousand couplets (idioms or famous sayings made of two lines). Its fields are: couplet, pinyin, annotation, short commentary, origin, source, category, keyword, first character;
Proverbs, which collects about two thousand common proverbs. Its fields are: proverb, category, explanation;
Two-part allegorical sayings (xiehouyu), which collects about two thousand such sayings. Its fields are: saying, category, explanation;
Famous sayings, which collects about two thousand Chinese famous sayings. Its fields are: saying, category, source, author;
Maxims, which collects about two thousand common maxims. Its fields are: maxim, category, source, explanation;
Fables, which collects about two thousand Chinese fables. Its fields include but are not limited to: fable, title, category, author, and so on;
Set phrases, also called habitual expressions, which collects about two thousand common set phrases. Its fields are: set phrase, category, source, explanation.
(3) Poetry and classical texts. The main tables in this category are:
Three Hundred Tang Poems (唐诗三百首): made up of two tables, poem content and poem authors. Fields include, but are not limited to: poem title, author, author biography, poetic form, original text, annotation, commentary, Chinese translation, English translation, French translation;
Three Hundred Song Ci (宋词三百首): made up of two tables, ci content and ci authors. Fields include, but are not limited to: tune name (词牌名), ci title, author, author biography, original text, annotation, commentary, Chinese translation, English translation, French translation;
Baixiang Cipu (白香词谱): compiled by Shu Menglan of Jing'an during the Jiaqing reign of the Qing dynasty. It selects 100 ci from the Tang to the Qing, one for each of 100 tunes, and is a valuable reference for composing ci. Fields include, but are not limited to: tune name, author, title, original text, notes on the tune (题考), composition method (作法);
Selected Poems Through the Dynasties (历代诗选): about 2,000 poems from the Qin dynasty to modern times;
Selected Ci Through the Dynasties (历代词选): about 2,000 ci from the Tang dynasty to modern times;
Peiwen Shiyun (佩文诗韵): 105 poetic rhymes. Fields include, but are not limited to: rhyme name, major category, number of rhymes, characters belonging to the rhyme;
Guwen Guanzhi (古文观止): an anthology of Chinese prose through the dynasties, 218 pieces in all. It is a reader for studying classical Chinese compiled by Wu Chucai and Wu Diaohou during the Kangxi reign of the Qing dynasty. Fields: author, author biography, dynasty, book title, article title, original text, annotation, vernacular translation, short commentary;
The Four Books (四书): the collective name for the Analects (《论语》), Mencius (《孟子》), the Great Learning (《大学》) and the Doctrine of the Mean (《中庸》). The Analects records the words and deeds of Confucius, and Mencius records those of Meng Ke; the Doctrine of the Mean and the Great Learning are two chapters that the Southern Song Neo-Confucian scholar Zhu Xi took from the Book of Rites (《礼记》) and made into books. The authors of the Four Books are Confucius, Zisi, Mencius, Chengzi, Zhu Xi and others, and their compilation spans some 1,800 years. After the Song and Yuan dynasties, the Four Books became officially prescribed school textbooks and required reading for the imperial examinations;
Tao Te Ching (道德经): written by Laozi (Li Er) in China's Spring and Autumn period, 81 chapters in all, and translated into many languages. Fields include, but are not limited to: chapter, original text, vernacular translation, English translation, French translation, commentary;
Selected Chinese Folk Songs (中国民歌精选): about 300 Chinese folk songs.
(4) History and geography. The main tables in this category are:
Chinese dynasties (中国朝代): fields include, but are not limited to: dynasty name, start and end years (CE), founder, capital, present-day location, major figures, notes;
Major events in Chinese history (中国历史大事): fields include, but are not limited to: year, dynasty, brief description of the event;
Notable figures of ancient China (中国古代名人): this table holds information on notable figures of ancient China (statesmen, philosophers, military strategists, writers, artists and so on). Fields include, but are not limited to: name, dynasty, category, brief biography;
Notable figures of modern China (中国近代名人): this table holds information on notable figures of modern China (statesmen, philosophers, military strategists, writers, artists, scientists and so on). Fields include, but are not limited to: name, category, brief biography;
Chinese provinces (中国省份): fields include, but are not limited to: province (or region) name, abbreviation, category, area, population, provincial (or regional) capital, major cities and towns;
Major Chinese cities and towns, Chinese geographical terms, and famous Chinese scenic spots: fields include, but are not limited to: province (or region) name, abbreviation, major category, subcategory, level, short description, detailed introduction, pictures;
Countries and capitals (国名首都表): fields include, but are not limited to: region, country name (Chinese + English), capital (Chinese + English), area, population, short introduction, remarks, national flag, national anthem.
Step S400: using an electronic device, input Chinese with a morpheme-assisted input method based on the neutral codes and word-phrase codes (词句码).
As shown in FIG. 4, step S400 comprises the following steps:
Step S410: receive input data;
Step S420: provide a morpheme selection prompt according to the input data;
Step S430: receive the morpheme selection and determine the morpheme code corresponding to the selected morpheme;
Step S440: query the general table of character meanings (字义总表) and provide the Chinese characters corresponding to the morpheme code;
Step S450: according to the selected Chinese character, query the general table of character forms (字形总表) and determine the Chinese text to be entered;
Step S460: display and enter the determined Chinese text.
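The flow of steps S410 to S460 can be sketched as a pair of table lookups. The following is a minimal illustration only: the table names, their layout and the `prompt`/`enter` functions are assumptions made for this sketch, not structures defined by the patent. The codes B3AF and B9D0 are the neutral codes quoted in Example 2 of the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    code: str   # morpheme (neutral) code as 4 hex digits
    gloss: str  # sense shown to the user in the selection prompt


# Hypothetical stand-ins for the database tables (illustrative rows only).
KEY_INDEX = {
    "chen": [Morpheme("B3AF", "surname Chen"), Morpheme("B9D0", "dust")],
}
MEANING_TABLE = {"B3AF": "陈", "B9D0": "尘"}  # morpheme code -> standard character
FORM_TABLE = {  # standard character -> written forms
    "陈": {"simplified": "陈", "traditional": "陳"},
    "尘": {"simplified": "尘", "traditional": "塵"},
}


def prompt(keys: str) -> list[Morpheme]:
    """S410-S420: receive the input and offer candidate morphemes."""
    return KEY_INDEX.get(keys.lower(), [])


def enter(keys: str, choice: int, script: str = "simplified") -> tuple[str, str]:
    """S430-S460: resolve the chosen morpheme to its code and written form."""
    morpheme = prompt(keys)[choice]           # S430: selected morpheme -> code
    character = MEANING_TABLE[morpheme.code]  # S440: meaning table lookup
    glyph = FORM_TABLE[character][script]     # S450: form table lookup
    return morpheme.code, glyph               # S460: text to display and enter
```

In this sketch the stored value is the morpheme code, while the simplified or traditional glyph is chosen only at display time, which mirrors the separation between the meaning table and the form table.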
Once the neutral codes and word-phrase codes have been set up as a morpheme database, Chinese can be stored and transmitted in three different formats: (1) Unicode, (2) neutral code, (3) word-phrase code. Take the string "汉语百宝箱" ("Chinese treasure chest") as an example. Stored as Unicode, the internal code is "6C49 8BED 767E 5B9D 7BB1"; stored as neutral code, the internal code is "BA7E BB79 A6CA C45F BD63"; stored as a word-phrase code, the internal code is "ABCD1234".
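The Unicode values quoted for this string can be reproduced with the Python standard library. This short sketch covers only the Unicode format; the neutral-code and word-phrase-code values depend on the patent's own tables and are not derived here. The helper name `unicode_hex` is chosen for this illustration.

```python
def unicode_hex(text: str) -> str:
    """Return the code point of each character as four hex digits."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


sample = "汉语百宝箱"
print(unicode_hex(sample))  # 6C49 8BED 767E 5B9D 7BB1

# The same values appear as two bytes per character in UTF-16 (big-endian).
print(sample.encode("utf-16-be").hex().upper())
```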
As one possible implementation, each of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E and F represents one hexadecimal digit. Because computers compute in binary (digits 0 and 1; the "digital" in "digital camera" is a colloquial name for binary), the Chinese codes use binary rather than decimal notation. One hexadecimal digit consists of 4 binary digits (0 or 1), and two hexadecimal digits make up one byte, the smallest addressable unit of computer memory. Unicode codes and neutral codes are both two bytes long; a word-phrase code is four bytes long and consists of 8 hexadecimal digits. The matrix coding table of this embodiment is designed around hexadecimal notation.
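The digit and byte arithmetic above can be checked directly. The sketch below uses the sample codes quoted in the description (B3AF as a neutral code, ABCD1234 as a word-phrase code); the function names are illustrative.

```python
def hex_to_bits(code: str) -> str:
    """Expand each hexadecimal digit into its 4 binary digits."""
    return " ".join(f"{int(digit, 16):04b}" for digit in code)


def byte_length(code: str) -> int:
    """Two hexadecimal digits make one byte."""
    return len(bytes.fromhex(code))


print(hex_to_bits("B3AF"))      # 1011 0011 1010 1111
print(byte_length("B3AF"))      # 2 (neutral code: two bytes)
print(byte_length("ABCD1234"))  # 4 (word-phrase code: four bytes, 32 bits)
```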
Because the internal codes differ, this embodiment requires a new input method for entering text. The input mode may be handwriting, speech, pinyin, radicals, phonetic components, stroke order, a foreign language, and so on. Whichever mode is used (pinyin, handwriting, speech and so on), the embodiment builds on the prior art and processes the input through the morpheme input method of the embodiment.
Working from the morpheme codes, the user enters the desired Chinese character with an existing Chinese input method; the system then lists all the morphemes of that character, and the user clicks one to obtain the morpheme.
Many input methods exist for Chinese characters, such as pinyin and Wubi. In general, whichever method is used, the unit of input is the word or phrase rather than the single character. Using the word as the input unit reduces the duplicate-code rate (重码率), that is, the number of candidates sharing the same input code. When an ambiguous case arises, a dialog box asks the user to choose the appropriate sense.
Example 1: entering "粉丝" in the sense of English "fans":
11) The user types "fs" and presses Enter;
12) The system displays: (1) 方式 (2) 发射 (3) 发生 (4) 附属 (5) 粉丝 ...;
13) The user selects (5) 粉丝 and presses Enter; the system asks which sense is meant: (1) the food (vermicelli), or (2) the transliteration of English "fans";
14) The user selects (2), the transliteration of English "fans", and presses Enter;
15) The system selects the word-phrase code meaning English "fans" and then displays a fresh input screen.
Steps 11) to 13) are the same as in the prior art; steps 14) and 15) belong to this embodiment: the user chooses between the food sense and the "fans" sense and obtains the appropriate word-phrase code.
Example 2: when the user types the letter group "CHEN", the system displays: 1. 陈 2. 陳 3. 尘 4. 塵 5. 晨 6. 趁 .... If the user selects item 1 or item 2, the internal code is B3AF in both cases; if the user selects item 3 or item 4, the internal code is B9D0 in both cases. B3AF is the neutral code for the surname Chen, and B9D0 is the neutral code for "dust". At the input stage only the meaning matters, not the written form: whether the character is written in simplified or traditional form, it is stored under the same neutral code.
Example 3: when the user types the letter group "BAI", the system displays: 1. 白 (color) 2. 白 (to speak) 3. 白 (surname) 4. 拜 5. 摆 6. 败 .... If the user wants the character 白, they choose one of senses 1 to 3: 白 as the color white, 白 as in speech, or 白 as the surname Bai. The internal codes are A5D5, A5D6 and A5D7 respectively, which makes the intended meaning of 白 explicit.
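Examples 2 and 3 show two complementary properties: different written forms of one meaning share a neutral code, and different meanings of one written form get different codes. A minimal sketch of that mapping follows. Only the candidates whose codes are stated in the examples are included, and the `CANDIDATES` layout and function names are assumptions made for illustration.

```python
# (written form, sense, neutral code) rows, in the order the menus list them.
CANDIDATES = {
    "CHEN": [
        ("陈", "surname Chen", "B3AF"),
        ("陳", "surname Chen", "B3AF"),
        ("尘", "dust", "B9D0"),
        ("塵", "dust", "B9D0"),
    ],
    "BAI": [
        ("白", "color", "A5D5"),
        ("白", "to speak", "A5D6"),
        ("白", "surname", "A5D7"),
    ],
}


def neutral_code(keys: str, item: int) -> str:
    """Return the neutral code for a menu item (numbered from 1)."""
    return CANDIDATES[keys.upper()][item - 1][2]


def forms_sharing(code: str) -> set[str]:
    """Return every written form stored under the same neutral code."""
    return {form for rows in CANDIDATES.values() for form, _, c in rows if c == code}


print(neutral_code("CHEN", 1) == neutral_code("CHEN", 2))  # True: 陈 and 陳 share B3AF
print(forms_sharing("B9D0"))                               # the set containing 尘 and 塵
print([neutral_code("BAI", i) for i in (1, 2, 3)])         # ['A5D5', 'A5D6', 'A5D7']
```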
Where the duplicate-code rate is too high (for example with single-character words and uncommon two-character words), the user can type the full pinyin of the word or phrase (initial + final) and then type its category (that is, its classification value) to reduce the duplicate-code rate.
To achieve the object of the invention, an embodiment further provides a storage medium for storing the computer program instructions of the Chinese-meaning-based encoding processing method described in the embodiments.
To achieve the object of the invention, an embodiment further provides a Chinese-meaning-based encoding processing software system comprising said storage medium; the computer program instructions in the storage medium are invoked to carry out the Chinese-meaning-based encoding processing.
As one possible implementation, as shown in FIG. 5, the software system comprises a morpheme encoding module 10, a word-phrase encoding module 20, a table module 30 and an input module 40, where:
The morpheme encoding module 10 encodes the morphemes of Chinese to obtain the morpheme code of each morpheme.
The word-phrase encoding module 20 encodes the words and phrases of Chinese to obtain the word-phrase codes of the words and phrases.
The table module 30 builds the Chinese encoding database. The database comprises a morpheme table corresponding to the morphemes and a word-phrase table corresponding to the words and phrases; the morpheme table contains the morpheme code of each morpheme, and the word-phrase table contains the word-phrase codes of the words and phrases.
The input module 40 is used, on an electronic device, to input Chinese with a morpheme-assisted input method based on the neutral codes and word-phrase codes.
To achieve the object of the invention, a Chinese-meaning-based encoding processing device is still further provided, comprising a central processing unit and said storage medium connected to the central processing unit;
The central processing unit invokes the computer program instructions in the storage medium to carry out the Chinese-meaning-based encoding processing.
The storage medium, software system and processing device of the embodiments operate in essentially the same way as the Chinese-meaning-based Chinese encoding method, so they are not described again in detail in the specific embodiments.
As a preferred embodiment, to help overseas Chinese and foreigners who do not know pinyin to input Chinese, the embodiment also supports shape-based input (handwriting, Cangjie, Wubi, radical and stroke, phonetic component, stroke order). The logic is that the whole word (for example "白日") or phrase (for example "李白") is entered by one of these traditional methods before the system is asked to recognize it. The system then knows that "白日" and "李白" are indivisible strings and looks up the word-phrase code of "白日" or "李白".
Morphemes are used to define words and phrases; the unit of input is not the morpheme but the word or phrase. When the user enters "李白", the system looks up the code of the phrase "李白" and, from the code value (the values on the X, Y and Z axes of the matrix), determines that "李白" is a personal name.
The Chinese-meaning-based language processing method, system and medium device of the embodiments offer the following benefits:
(1) Uniqueness (unicity): because the encoding is keyed to a unique meaning, each morpheme corresponds one-to-one with its code, so codes are unique. Chinese words and phrases are collected and analyzed, and their word-phrase codes are stored by category in hundreds of tables in a relational database (Access, Oracle or another); no two codes are the same;
(2) Accessibility: because a word-phrase code is a data address, data is retrieved by direct access, which is very fast;
(3) Navigability: every morpheme acts as a hyperlink, so the user can roam the whole knowledge base without leaving the computer environment in which the invention runs. For example, while reading Bai Juyi's "Song of Everlasting Sorrow" (《长恨歌》), the user clicks the morpheme "渔阳鼙鼓" (the war drums of Yuyang) to display its explanation; from the explanatory text the user clicks the morpheme "安禄山" (An Lushan) to enter the table of ancient Chinese personal names, which shows An Lushan's life and a commentary on the An Lushan Rebellion (安史之乱); after reading the explanation, the user can return to the line "渔阳鼙鼓动地来" in the poem.
(1) Existing language systems have been in use for many years. The Unicode system in particular has been in use for more than twenty years, is beginning to show its age, and cannot carry the new demands raised by the rapid technological progress of the information age. The embodiment introduces neutral codes (morpheme encoding) and word-phrase codes (word and phrase encoding), so that the vitality of this method and system can drive the continued development of language, culture and technology.
(2) Language systems are diverse. Taking Chinese characters as an example, for historical reasons the simplified and traditional Chinese-speaking worlds developed independently, which hinders cultural and economic exchange. In the new era, moreover, the use of new words, the translation of loanwords and the coining of technical vocabulary are far from uniform, which impedes linguistic and cultural interaction. The embodiment aims to serve people worldwide through technical reform: it collects language material, unifies the handling of old and new loanwords as far as possible, and enables neutral-code (morpheme encoding) logic in which multiple scripts coexist, so that users can conveniently choose among them.
(3) The embodiment collects a large vocabulary of morphemes and encodes it (adding foreign-language equivalents and so on), yielding 1) a language knowledge base centered on morphemes; 2) a language processing system built on neutral codes and word-phrase codes; 3) an intelligent prompting input method backed by the language knowledge base. The three main modules (knowledge base, encoding system and input method) can each be applied independently or in combination.
In summary, the morpheme-based language processing method and system of the embodiments handle language information conveniently, finely and flexibly, support search, analysis and statistics over large language data sets, and provide the ability to navigate a large relational database of language, which substantially increases its value.
The specific embodiments described above explain the objects, technical solutions and beneficial effects of the invention in further detail. It should be understood that the above are only specific embodiments of the invention and are not intended to limit its scope of protection; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (15)

  1. A Chinese-meaning-based encoding processing method, characterized by comprising the following steps:
    encoding the morphemes of Chinese to obtain a morpheme code for each morpheme;
    encoding the words and phrases of Chinese to obtain word-phrase codes for the words and phrases;
    building a Chinese encoding database, the database comprising a morpheme table corresponding to the morphemes and a word-phrase table corresponding to the words and phrases, the morpheme table containing the morpheme code of each morpheme and the word-phrase table containing the word-phrase codes of the words and phrases.
  2. The Chinese-meaning-based encoding processing method according to claim 1, characterized in that the morpheme table comprises a general table of character forms and a general table of character meanings, the two tables being linked to each other through the morpheme codes.
  3. The Chinese-meaning-based encoding processing method according to claim 2, characterized by further comprising the following step:
    entering Chinese using an electronic device;
    which comprises the following steps:
    receiving input data;
    providing a morpheme selection prompt according to the input data;
    receiving a morpheme selection and determining the morpheme code corresponding to the selected morpheme;
    querying the general table of character meanings and providing the Chinese characters corresponding to the morpheme code;
    querying the general table of character forms according to the selected Chinese character and determining the Chinese text to be entered;
    displaying and entering the determined Chinese text.
  4. The Chinese-meaning-based encoding processing method according to claim 3, characterized in that the encoding of Chinese morphemes includes the encoding of single Chinese characters,
    comprising the following steps:
    constructing a unique standard-character code (规范字代码) for each single Chinese character;
    determining the number of distinct meanings of the single Chinese character;
    assigning a morpheme sequence number to each meaning of the single Chinese character;
    combining the standard-character code and the morpheme sequence number to form the morpheme code of each distinct-meaning morpheme of the single Chinese character.
  5. The Chinese-meaning-based encoding processing method according to claim 1, characterized in that there are a plurality of word-phrase tables.
  6. The Chinese-meaning-based encoding processing method according to claim 2, characterized in that the word-phrase code is represented as a 32-bit hexadecimal number.
  7. The Chinese-meaning-based encoding processing method according to claim 6, characterized in that the word-phrase codes are stored in the Chinese encoding database using an eight-dimensional matrix space.
  8. The Chinese-meaning-based encoding processing method according to claim 7, characterized in that storing the word-phrase codes in the eight-dimensional matrix space comprises:
    treating each word-phrase code as a point in the eight-dimensional matrix space, the point being located by eight hexadecimal values X, Y, Z, P, Q, R, S and T, and using the eight-dimensional matrix space structure as the storage address.
  9. The Chinese-meaning-based encoding processing method according to claim 8, characterized in that using the eight-dimensional matrix space structure as the storage address of the word-phrase code comprises:
    the word-phrase code consists of three parts in order, a classification value, a sort-link value and a protection code, structured as follows:
    Figure PCTCN2018086500-appb-100001
    wherein the classification value contains three values, representing the X-axis, Y-axis and Z-axis coordinates respectively; the sort-link value contains four values, representing the P-axis, Q-axis, R-axis and S-axis coordinates respectively; and the protection code is the last digit, representing the T-axis coordinate.
  10. The Chinese-meaning-based encoding processing method according to claim 9, characterized in that the classification value represents a point in three-dimensional space, all classification values are stored in a data layout map of the three-dimensional space, and the data layout map is a table; and
    the word-phrase tables are of multiple types, different classification values corresponding to different types of word-phrase table.
  11. The Chinese-meaning-based encoding processing method according to claim 9, characterized in that the protection code is a control value calculated from the classification value and the sort-link value by means of a code map, the code map being a table structure made up of multiple vectors and matrices.
  12. The Chinese-meaning-based encoding processing method according to claim 5, characterized in that the word-phrase tables comprise: character-dictionary (字典) tables, word-dictionary (词典) tables, poetry and classical-text tables, and history and geography tables.
  13. A storage medium, characterized in that it stores computer program instructions for the Chinese-meaning-based encoding processing of any one of claims 1-12.
  14. A Chinese-meaning-based encoding processing software system, characterized by comprising the storage medium of claim 13, the computer program instructions in the storage medium being invoked to carry out the Chinese-meaning-based encoding processing method.
  15. A Chinese-meaning-based encoding processing device comprising a central processing unit, characterized by further comprising the storage medium of claim 13 connected to the central processing unit;
    the central processing unit invoking the computer program instructions in the storage medium to carry out the Chinese-meaning-based encoding processing.
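The word-phrase code layout recited in claims 8 and 9 (eight hexadecimal digits read as coordinates X, Y, Z, P, Q, R, S, T) can be sketched as a simple parser. This is an illustration under stated assumptions: the class and function names are invented for the sketch, the sample value ABCD1234 is the one quoted in the description, and the protection code is only extracted, not verified, because the code map used to compute it is not given in this section.

```python
from typing import NamedTuple

HEX_DIGITS = "0123456789ABCDEFabcdef"


class WordPhraseCode(NamedTuple):
    classification: tuple[int, ...]  # X, Y, Z coordinates
    sort_link: tuple[int, ...]       # P, Q, R, S coordinates
    protection: int                  # T coordinate (not verified here)


def parse_word_phrase_code(code: str) -> WordPhraseCode:
    """Split an 8-digit hexadecimal code into its three parts."""
    if len(code) != 8 or any(ch not in HEX_DIGITS for ch in code):
        raise ValueError("a word-phrase code is exactly 8 hexadecimal digits")
    digits = [int(ch, 16) for ch in code]
    return WordPhraseCode(tuple(digits[:3]), tuple(digits[3:7]), digits[7])


parsed = parse_word_phrase_code("ABCD1234")
print(parsed.classification)  # (10, 11, 12)
print(parsed.sort_link)       # (13, 1, 2, 3)
print(parsed.protection)      # 4
```

Reading the classification value as its own three-digit coordinate is what would let a lookup decide the category of an entry (for example, a personal name) without consulting the rest of the code.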
PCT/CN2018/086500 2017-06-14 2018-05-11 Chinese meaning based chinese encoding method and system, and medium device WO2018228101A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710446499.2 2017-06-14
CN201710446499 2017-06-14

Publications (1)

Publication Number Publication Date
WO2018228101A1 true WO2018228101A1 (en) 2018-12-20

Family

ID=64659985

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/086500 WO2018228101A1 (en) 2017-06-14 2018-05-11 Chinese meaning based chinese encoding method and system, and medium device

Country Status (3)

Country Link
CN (1) CN109086257A (en)
TW (1) TW201915775A (en)
WO (1) WO2018228101A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050027547A1 (en) * 2003-07-31 2005-02-03 International Business Machines Corporation Chinese / Pin Yin / english dictionary
CN101923399A (en) * 2010-06-07 2010-12-22 范显镔 Encoding method of computer Chinese character encoded texts capable of being used as input codes and internal codes
CN105955936A (en) * 2016-04-18 2016-09-21 王欣 Novel Mandarin Chinese information ASCII code
CN105975597A (en) * 2016-05-10 2016-09-28 北京信息科技大学 Digitized international sharing platform of Dongba classic ancient book inheriting system
CN106372039A (en) * 2016-08-18 2017-02-01 王欣 Standard Chinese information ASCII system codes

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005309164A (en) * 2004-04-23 2005-11-04 Nippon Hoso Kyokai <Nhk> Device for encoding data for read-aloud and program for encoding data for read-aloud
CN101419760A (en) * 2007-10-24 2009-04-29 张基文 Morpheme type setting scheme for Chinese Pinyin

Also Published As

Publication number Publication date
CN109086257A (en) 2018-12-25
TW201915775A (en) 2019-04-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18817366

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18817366

Country of ref document: EP

Kind code of ref document: A1