WO1996011442A1 - Method for processing character data, and associated apparatus - Google Patents

Method for processing character data, and associated apparatus

Info

Publication number
WO1996011442A1
Authority
WO
WIPO (PCT)
Prior art keywords
code
text information
component
internal
input
Prior art date
Application number
PCT/CN1995/000078
Other languages
English (en)
Chinese (zh)
Inventor
Shengyuan Wu
Original Assignee
Shengyuan Wu
Priority date
Filing date
Publication date
Application filed by Shengyuan Wu
Priority to AU36032/95A
Publication of WO1996011442A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/02 Input arrangements using manually operated switches, e.g. using keyboards or dials
    • G06F3/023 Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes

Definitions

  • the invention relates to a method and a device for processing text information.
  • it relates to a method and device for processing words or phrases in text information as a unit.
  • The invention is a method and device for text information processing that directly processes text information containing multilevel internal codes.
  • Background art
  • The internal code of a text is based on the encoding of the letters or basic units of the text.
  • Processing text information is in fact processing its internal codes.
  • This internal code is called a single-level internal code, also known as a first-level internal code. Therefore, in existing text information processing systems, text information is processed in the form of single-level internal codes.
  • the text information has a large storage volume, a large transmission volume, and a slow processing speed.
  • An object of the present invention is to provide a method for processing text information containing multilevel internal codes, so as to increase the amount of text information that fits on a storage medium, improve transmission efficiency, and increase processing speed.
  • A second object of the present invention is to use the multilevel inner code encoding method in the encoding of existing text component input codes.
  • A third object of the present invention is to use a multilevel internal code component library device for inputting existing characters.
  • A fourth object of the present invention is to use the multi-directional conversion method of multilevel inner codes for matching character strings.
  • A fifth object of the present invention is to provide a device for processing text information containing multilevel internal codes. The device can improve the processing efficiency of text information.
  • The internal code is the representation of the text information inside the machine. It is called the internal code or the in-machine code.
  • Single-level internal codes are internal codes corresponding to literal characters or basic units. Such as ASCII code and Chinese character machine code.
  • a single-level inner code can also be called a first-level inner code.
  • The text component is the part of the text corresponding to a word or phrase.
  • A multilevel internal code is an internal code corresponding to a text component, that is, a multilevel internal code is the representation of a word or phrase inside the machine. Multilevel internal codes are used not only for the storage and transmission of text information, but also for the calculation and processing of text information.
  • A single-level internal code can be regarded as a special case of a multilevel internal code, so a system capable of processing multilevel internal codes can naturally also process text information containing only single-level internal codes.
  • The level of a multilevel inner code is 1 more than the highest level of the inner codes contained in the corresponding component item.
  • For example, suppose the single-level internal codes corresponding to "productivity" are a, b, and c.
  • (Note: internal codes written here as a, b, c, ... or A, B, C, ... are for illustrative purposes only and are not true internal code values.)
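The level rule can be illustrated with a minimal Python sketch. The dictionary `COMPONENT_LIBRARY`, the function `level`, and the symbolic codes (lowercase for single-level, uppercase for multilevel) are assumptions made for illustration only; they are not real internal code values or the patent's own data structures.

```python
# Hypothetical component library: each multilevel code maps to its component
# item, a list of single-level codes (lowercase) and/or multilevel codes
# (uppercase). The codes are symbolic placeholders.
COMPONENT_LIBRARY = {
    "A": ["a", "b"],  # contains only single-level codes
    "D": ["A", "c"],  # contains the multilevel code A
}


def level(code):
    """Return 1 for a single-level code, otherwise 1 plus the highest
    level among the codes in the corresponding component item."""
    if code not in COMPONENT_LIBRARY:
        return 1
    return 1 + max(level(part) for part in COMPONENT_LIBRARY[code])
```

Under these assumptions, `A` is a second-level code and `D` is a third-level code.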
  • A multilevel inner code is a multi-byte encoding, which can use bit identification, byte identification, string identification, or no identification.
  • Multilevel inner codes should be easily distinguishable from single-level inner codes.
  • Multilevel inner codes are not only used for storage and transmission, but also for operations and processing.
  • the multilevel internal code is related to the structure of the component library device.
  • The component item is the text information corresponding to a multilevel inner code. It may contain single-level inner codes, multilevel inner codes, or both.
  • The sum of the lengths of the corresponding single-level internal codes contained in the component item is called the actual length of the component item, and the sum of the lengths of the codes as stored in the item is the entry length of the component item.
  • The entry length or actual length of the component item is called the component item length.
  • A component library device is a device that arranges component items according to a certain rule. For example, the arrangement rule of the basic component library device is to segment according to the item length of the component item, and then arrange the segments according to a certain rule. The encoding of a multilevel internal code is linked to the address of the corresponding component item in the component library.
  • Multilevel internal codes are codes for text components. Like single-level internal codes, they are internal codes, so text information containing multilevel internal codes can be processed directly.
  • Unidirectional conversion is the conversion from a high-level inner code to a low-level inner code. Generally, it refers to the conversion to a single-level inner code. Multi-directional conversion is the conversion from low-level inner code to high-level inner code. Generally, it refers to the conversion from a single-level inner code to a multi-level inner code.
  • A unidirectional conversion device is a device that implements unidirectional conversion.
  • a multi-directional conversion device is a device that implements multi-directional conversion.
  • the pipeline conversion device contains a pipeline composed of pipeline components, which can realize multidirectional conversion and unidirectional conversion.
  • a one-way conversion operation is an operation that performs a one-way conversion.
  • a multidirectional conversion operation is an operation that performs a multidirectional conversion.
  • the input code input multi-directional conversion operation refers to an operation in which an input code inputted by an input code input conversion device is converted into a corresponding multi-code internal code.
  • The transmission of text information refers to the exchange and transmission of text information between the parts of a word processing device, or the transmission and communication of text information between word processing devices. Examples include text information transmission between the host and a printer, display terminal, or external storage; text information transmission and communication within or between computer networks; and text information transmission and communication between communication devices.
  • the text information transmission operation with multilevel internal codes refers to the transmission of text information with multiple internal codes.
  • the pipeline conversion operation is an operation to realize multi-directional conversion and unidirectional conversion through a pipeline conversion device.
  • the comparison operation is a comparison operation of internal codes.
  • The comparison operation with multilevel internal codes refers to the comparison of a single-level internal code with a multilevel internal code, or the comparison of a multilevel internal code with a multilevel internal code.
  • Multilevel internal codes can be compared with single-level internal codes: the multilevel codes are converted into single-level internal codes and then compared. Equality comparison can be performed directly among multilevel codes. If the order of the multilevel inner codes is exactly the same as that of the corresponding single-level inner codes, the multilevel inner codes can also be compared by size. Therefore, the comparison of a single-level inner code with a multilevel inner code can also be converted into a comparison between multilevel inner codes.
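The first strategy described above (convert to single-level codes, then compare) might look like the following Python sketch. The table `LIBRARY`, the helpers `expand` and `compare`, and the symbolic codes are hypothetical; Python's ordinary sequence ordering stands in for internal code order.

```python
# Hypothetical component library with symbolic codes: uppercase letters are
# multilevel codes, lowercase letters are single-level codes.
LIBRARY = {"A": "ab", "D": "Ac"}


def expand(codes):
    """Convert a sequence of codes to a list of single-level codes."""
    out = []
    for code in codes:
        if code in LIBRARY:
            out.extend(expand(LIBRARY[code]))  # multilevel: expand recursively
        else:
            out.append(code)                   # already single-level
    return out


def compare(left, right):
    """Return -1, 0 or 1 after converting both sides to single-level codes."""
    a, b = expand(left), expand(right)
    return (a > b) - (a < b)
```

For instance, `compare("D", "abc")` yields 0 under these assumptions, because `D` expands to `abc`.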
  • the operation containing multiple internal codes refers to the search, replace, insert, delete operation, one-way conversion operation, multi-directional conversion operation, input code input and multi-directional conversion operation of the text information containing multiple internal code.
  • the method and device for implementing multi-directional conversion are related to the mapping component library device and the index device.
  • the mapping component term is composed of a component term, a component term length, and a corresponding multilevel inner code.
  • the mapping component library device is a device that arranges the mapping component items according to a certain rule.
  • the arrangement rules of the mapping component library devices are generally arranged in ascending (or descending) order of the corresponding single-level inner code of the mapping component items.
  • The indexing device is composed of index entries. Each index entry mainly consists of an address entry giving the address at which the first (or first few) internal codes of a mapping component item first appear in the mapping component library device.
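As a rough illustration, an index over a sorted mapping component library could be built as in the Python sketch below. The function name `build_index` and the use of a dictionary are assumptions; a missing key plays the role of a "no entry" marker.

```python
def build_index(mapping_library):
    """Map each leading code to the address (list position) of the first
    mapping component item that starts with it.

    mapping_library is assumed to be a list of non-empty strings sorted in
    ascending order, each standing for one mapping component item.
    """
    index = {}
    for address, item in enumerate(mapping_library):
        index.setdefault(item[0], address)  # keep only the first occurrence
    return index
```

A lookup such as `build_index(["ab", "abc", "ba"]).get("c")` returns `None`, meaning no item begins with that code.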
  • The text component is input from the input device, and multi-directional conversion can be completed by adding a multi-directional conversion device to the input device. It can also be completed by the input code input conversion device.
  • the text input code input multi-directional conversion operation is an operation for converting an input code of a text component into a multi-code internal code.
  • the input code input conversion device is a device that implements the input code input conversion operation. For the convenience of description, the encoding of the text component input code is described as follows.
  • the encoding of the text component input code can be an equal length code or an unequal length code, and generally uses three letters or four letters.
  • the following uses the four-letter code length as an example to describe the method of structural level coding.
  • The basic idea of structural level coding is to first decompose the text component as a whole into parts, which is the first level, and then to decompose the parts of the first level, which is the second level, and so on in turn.
  • the coding principle for the first level decomposition into different part numbers is as follows.
  • Two parts: take two codes for each part; a part with fewer than two codes contributes one code.
  • Three parts: one code for each of the first and second parts, and two codes for the third part.
  • Five or more parts: one code for each of the first, second, and third parts, and one code for the last part.
  • English is alphabetic. Structurally, English words and phrases can be divided into continuous and discontinuous types. Continuous means that there are no separators, such as spaces or commas, in the middle.
  • The letters are used as input codes in sequence.
  • The second layer is decomposed according to the letters corresponding to the syllables.
  • In the discontinuous case, the text is decomposed into parts according to the separators, and the second layer is decomposed as in the continuous case. For example, father is divided into two syllables, and fate is the input code; difference is four syllables, and dfrc is the input code; differentiate is five syllables, and dfra can be used as the input code; aafj can be used as the input code for at a full jump; and aba can be used as the input code for all bark and no bite.
  • Non-letter-structured text encoding is more complicated.
  • The following uses Chinese characters as examples to illustrate the method of structural level coding.
  • Two-character words are divided into two parts by character, and characters can be broken down by pronunciation or glyph.
  • For example, physical education (体育, tiyu) is divided into two parts, 体 and 育. If the codes are taken by pronunciation, tiyu can be used as the input code. Stadium (体育场) can be broken down into 体育 and 场, so tyia is preferred (here, o, i, u are used instead of zh, ch, sh, respectively).
  • The Supreme People's Court (最高人民法院) is decomposed into three parts: 最高 (Supreme), 人民 (People's), and 法院 (Court), so zrfy can be used as the input code.
  • The Supreme People's Procuratorate (最高人民检察院) is divided into 最高 (Supreme), 人民 (People's), and 检察院 (Procuratorate).
  • The Procuratorate (检察院) is in turn divided into 检察 and 院. Therefore, zrjy can be used as the input code.
  • The input codes of the last two cases are different. However, if the first code of the first, second, third, and last characters is taken as the input code, the input codes of the last two cases are the same. Because decompositions of three- or four-character words sometimes differ in meaning, an approximate structural level decomposition method can be used.
  • The input code table device is a correspondence table between component input codes and the component items of the component library device and their multilevel internal codes. The following uses English and Chinese as examples. Each input code is represented by 5 bits, so 4 input codes total 20 bits.
  • An entry of the input code table device consists of 3 bytes. Of the remaining 4 bits, 3 are used as the one-, two-, and three-code flags, and 1 is used as the recode flag.
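The 3-byte entry layout can be sketched in Python as follows. The bit order (four 5-bit letters first, then the three short-code flags, then the recode flag), the letter numbering (a = 0 through z = 25), and the names `pack_entry` and `unpack_entry` are assumptions; the text only fixes the field widths.

```python
def pack_entry(input_code, short_flags=(False, False, False), recode=False):
    """Pack a four-letter input code and four flag bits into 3 bytes."""
    if len(input_code) != 4 or any(not ("a" <= ch <= "z") for ch in input_code):
        raise ValueError("expected four lowercase letters")
    if len(short_flags) != 3:
        raise ValueError("expected three short-code flags")
    value = 0
    for ch in input_code:                 # 4 letters x 5 bits = 20 bits
        value = (value << 5) | (ord(ch) - ord("a"))
    for flag in short_flags:              # one-, two-, three-code flags
        value = (value << 1) | int(bool(flag))
    value = (value << 1) | int(bool(recode))  # recode flag in the last bit
    return value.to_bytes(3, "big")       # 24 bits in total


def unpack_entry(entry):
    """Inverse of pack_entry: return (letters, short_flags, recode)."""
    value = int.from_bytes(entry, "big")
    letters = "".join(
        chr(((value >> shift) & 0x1F) + ord("a")) for shift in (19, 14, 9, 4)
    )
    short_flags = tuple(bool((value >> shift) & 1) for shift in (3, 2, 1))
    return letters, short_flags, bool(value & 1)
```

Packing `"zrfy"` with any flag combination yields exactly three bytes, and `unpack_entry` recovers the same fields.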
  • The entries are arranged in the order of the corresponding component items of the component library device. Words or phrases with corresponding component items can be entered character by character, or they can be entered by their component input codes. Regardless of whether a one-, two-, or three-letter short code or the full code is entered, a special key must finally be entered to mark the end of the input code.
  • a text information processing method in which text characters are represented by an internal code.
  • the internal code is also referred to as a single-level internal code.
  • the processing of text information is achieved by processing a single-level internal code.
  • the word or phrase of the text is also expressed as an internal code.
  • the internal code of a word or phrase is called a multilevel internal code.
  • the text information is also processed by processing the multilevel internal code in the text information.
  • A word processing device that can process only text information in single-level internal codes, and not text information in multilevel internal codes, is referred to as the first type of word processing device.
  • A word processing device that can process text information containing multilevel internal codes is the second type of word processing device.
  • The encoding method and processing method of the multilevel inner code can be used in the first type of word processing device.
  • multidirectional transformations can be used in the matching of literal strings, such as in word segmentation.
  • the following is a description of the characteristics of multi-directional conversion for maximum string matching.
  • A text information processing method in which a character or the basic unit of a character is represented by an internal code.
  • The internal code is also called a single-level internal code.
  • The text information is processed in the form of single-level internal codes.
  • the text information processing method is characterized by:
  • the method of multi-directional conversion can be used in the maximum matching of a string containing a single inner code, and can also be used in the maximum matching of a string containing a multi-level inner code;
  • mapping component item corresponding to the address of the mapping component library device
  • mapping components move one item in the ascending (or descending) direction, and go to step
  • the last matched mapping component item is the text information of the largest match, and is returned. If not, go to (8);
  • the execution of the matching operation in step 5 above is generally performed by "setting the matching flag, and the matching pointer points to the mapping component item".
  • the purpose of the matching operation is to finally return the maximum value.
  • the multi-directional conversion method can be used in the maximum matching segmentation of a text information sequence containing only a single level of internal code, and can also be used in the maximum matching segmentation of a text information sequence containing a multi-level internal code;
  • The dictionary is composed of dictionary items; the dictionary items contain component items for segmentation, and the dictionary items are arranged in the order of the corresponding internal code size of the component items;
  • the text information sequence may be input by an input device, may be input by a storage device, or may be input by a communication device;
  • step (4) If the result meets the jump condition, go to step (4);
  • step (3) Move the dictionary item in the forward direction by the mobile device, and go to step (2);
  • If there is a match, the matching device returns the maximum segmented text information.
  • the execution of the matching operation in the above steps is generally performed by "setting the matching flag, and the matching pointer points to the component item".
  • the purpose of the matching operation is to finally return the component item with the largest match.
  • The jump-out condition is as follows: when the component items are arranged in ascending order of the corresponding single-level inner code, the component item is greater than the text information being segmented; when they are arranged in descending order, the component item is smaller than the text information being segmented.
  • The forward direction above is as follows: when the component items are arranged in ascending order of the corresponding single-level inner code, it is the ascending direction of the component items; when they are arranged in descending order, it is the descending direction of the component items.
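The maximum-matching idea with an ascending dictionary, a forward scan, and a jump-out condition can be sketched in Python. This is a simplified illustration under stated assumptions, not the claimed procedure: `longest_match` and `segment` are hypothetical names, plain strings stand in for single-level internal codes, dictionary items are assumed non-empty, and the standard `bisect` module plays the role of the index.

```python
from bisect import bisect_left


def longest_match(text, start, dictionary):
    """Return the longest dictionary item that is a prefix of text[start:],
    or None. dictionary must be sorted in ascending order."""
    rest = text[start:]
    i = bisect_left(dictionary, text[start])  # first item >= the leading code
    best = None
    while i < len(dictionary):
        item = dictionary[i]
        if rest.startswith(item):
            best = item        # later matches are longer, so keep the last one
        elif item > rest:
            break              # jump-out: no later item can be a prefix
        i += 1                 # move one item in the ascending direction
    return best


def segment(text, dictionary):
    """Split text greedily into longest matches; unmatched codes stand alone."""
    pieces, pos = [], 0
    while pos < len(text):
        piece = longest_match(text, pos, dictionary) or text[pos]
        pieces.append(piece)
        pos += len(piece)
    return pieces
```

Because every prefix of the text sorts at or before the text itself, the scan can stop at the first non-matching item that is greater than the remaining text.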
  • FIG. 2, FIG. 3 and FIG. 8. When Chinese word segmentation is performed using the above string maximum matching method, time complexity and space complexity can be reduced. According to theoretical analysis and practical testing, the time complexity of the existing word segmentation method is 12.32, and the time complexity of the above method is 2.89.
  • The encoding method of the multilevel inner code can be used for input in the first type of word processing device.
  • the component library device containing multi-level internal codes can be used for inputting the text components of the first type of word processing device.
  • the input code input conversion operation refers to an operation in which an input code inputted by an input code input conversion device is converted into a corresponding multi-code or single-code internal code.
  • the input code of the text component can adopt various encoding schemes.
  • the location code of the text component is also an input code of the text component.
  • Structure-level coding includes approximate structure-level coding.
  • a text information processing method in which text characters are represented by an internal code.
  • the internal code is called a single-level internal code.
  • the processing of text information is performed in the form of a single internal code.
  • the input code of the input text component to convert to the corresponding single-code internal code is as follows:
  • the text information processing method is characterized by: a code.
  • The following describes the application of a component library device containing multilevel internal codes in the first type of text processing method. Because a component library containing multilevel internal codes is compressed, a Chinese thesaurus can also be compressed into an equal-length thesaurus, which makes retrieval convenient.
  • A text information processing method, in which text characters are represented by an internal code.
  • This internal code is also called a single internal code.
  • the processing of the text information is performed in the form of a single-level internal code. Code, convert it to the corresponding single code.
  • the operation steps to convert the input code of the input text component to the corresponding single code are:
  • the component item of the component library device contains multilevel internal codes.
  • the input code input conversion device is a device that performs an input code conversion operation.
  • FIG. 4 is a schematic diagram of an input code input conversion device. 1 is the input device; 2 is the conversion part; 3 is the input code table device; 4 is the component library device.
  • the method of the present invention is used to guide the writing of a computer program, generate computer instructions, and control a computer to complete corresponding operations. These methods can be described in the form of methods or devices.
  • A text information processing device. In this text processing device, a character or the basic unit of a character is represented as an internal code, also called a single-level internal code. The processing of text information takes the form of processing single-level internal codes. The word processing device includes:
  • Input device 1: a device for inputting text information
  • Processing device 2: processes the text information in single-level codes
  • Output device 3: outputs text information
  • the word or phrase of the text is represented as an internal code.
  • the internal code of the word or phrase is called a multilevel internal code.
  • the text information containing the multilevel internal code can participate in the calculation and processing of the text information. Processing is actually processing text information containing multiple levels of internal codes;
  • the text information processing device further includes:
  • the multidirectional conversion device 4 is a device that receives instructions from a processing device and implements multidirectional conversion
  • the unidirectional conversion device 5 is a device that receives instructions from a processing device to implement unidirectional conversion
  • the input device 1 can input text information containing a single code, or text information containing multiple codes.
  • the processing device 2 processes the text information containing the multi-byte internal code, and issues corresponding conversion instructions to the multi-directional conversion device or the uni-directional conversion device as required;
  • the output device 3 outputs text information containing a single inner code or text information containing a plurality of levels of internal codes as required.
  • FIG. 6 is a schematic diagram of a text information processing device. 1 is an input device; 2 is a processing device; 3 is an output device; 4 is a multidirectional conversion device; 5 is a unidirectional conversion device.
  • Overview of the drawings
  • FIG. 2 is a schematic diagram of a comparison matching process in a multi-directional conversion device
  • FIG. 3 is a schematic diagram of a multi-directional conversion device
  • FIG. 4 is a schematic diagram of an input code input conversion device
  • FIG. 5 is a schematic diagram of a pipeline multi-directional conversion device
  • FIG. 6 is a schematic diagram of a text information processing device
  • FIG. 7 is a schematic diagram of a unidirectional conversion device
  • FIG. 8 is a schematic diagram of a comparison matching device in a multi-directional conversion device.
  • Best Mode of the Invention
  • the encoding of the multi-level internal code is related to the component library device.
  • the component library device is different, and the encoding of the multi-level internal code is different.
  • the component library device may be a basic component library device, or an isometric component library device, or a semi-indexed component library device, or a fully indexed component library device.
  • The entry lengths of the component items of the isometric component library device are all equal. When the entry lengths of a small number of component items differ, an index can be set in those component items, which are then called index component items.
  • Another, auxiliary component library device is built, and the index component item contains the position of the component item in the auxiliary component library, its entry length, and other information. The length of the index component items is equal to the entry length of the remaining component items. This kind of component library device is called a semi-index component library device. When there is a large difference in the entry lengths of the component items, the component items are all replaced by index component items and the real component items are all in the auxiliary component library device; this is called a full-index component library device.
  • the content of the index component items of the semi-index component library device should be distinguishable from the single-code internal code and the multiple-code internal code.
  • The one-way conversion device corresponding to a full-index or semi-index component library device needs to be slightly modified. For the semi-index component library device, it is necessary to determine whether an entry of the component library device is a component item or an index component item. Steps for accessing the auxiliary component library device are added for the index component items of the full-index and semi-index component library devices.
  • The entries of equal-length, semi-index, and full-index basic component library devices are equal in length and can be sorted in the order of their corresponding single-level internal codes, so the order of the multilevel internal codes and the single-level internal codes is completely the same, and the multilevel internal codes are comparable in size. At the same time, the order of the component library device and the mapping component library device is completely the same. When the mapping component item contains only the component item, they can be combined into one.
  • The mapping component library device is arranged in ascending (or descending) order of the corresponding single-level code of the mapping component item.
  • The mapping component item is composed of the component item length, the component item, and the multilevel internal code; or the component item length and the component item; or the component item and the multilevel inner code; or the component item length and the multilevel inner code; or the component item alone; or the multilevel inner code alone.
  • The index entry of the indexing device is composed of an address entry or a flag entry; or an address entry and a multilevel internal code, or a flag entry; or an internal code entry and an address entry; or an internal code entry, an address entry, and a multilevel internal code.
  • the following uses two-byte bit identification codes and equal-length component libraries as examples to describe the encoding of multilevel inner codes.
  • The internal code of a Chinese character is 2 bytes, so a two-character word is 4 bytes long.
  • Its entry length is also 4 bytes.
  • For the "Liberation Army", let the single-level internal codes of its characters be "a", "b", and "c", and let the second-level internal code corresponding to "ab" be A.
  • The entry length of the three-character component item "Ac" is also 4 bytes.
  • The second-level internal code of "China" is B.
  • The second-level internal code of "People" is C.
  • The entry length of the four-character component item "BC", composed of two second-level internal codes, is also 4 bytes.
  • Let the third-level internal code corresponding to "Ac" be D and the third-level internal code corresponding to "BC" be E. Then the entry length of the component item "ED", corresponding to the seven-character "Chinese People's Liberation Army", is also 4 bytes, and the corresponding multilevel internal code is a fourth-level code.
  • The component items are arranged in ascending order of the corresponding single-level inner code to form the component library device. Let each area contain 94 component items.
  • If the high-order bit of the first byte of the code is 1 and the high-order bit of the second byte is 0, then the first byte of the multilevel inner code is the area number plus A0H, and the second byte of the multilevel inner code is the position number plus 20H.
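The byte arithmetic just described can be checked with a small Python sketch. The function names and the 1 to 94 range for the area number are assumptions (the text only states 94 component items per area); the "position number" is the quantity the text adds 20H to.

```python
ITEMS_PER_AREA = 94  # from the text: each area holds 94 component items


def encode_multilevel(area, position):
    """Build a two-byte multilevel inner code: area + A0H, position + 20H."""
    if not (1 <= area <= 94 and 1 <= position <= ITEMS_PER_AREA):
        raise ValueError("area and position are assumed to lie in 1..94")
    return bytes([area + 0xA0, position + 0x20])


def decode_multilevel(code):
    """Recover (area, position), i.e. the address of the component item."""
    first, second = code
    return first - 0xA0, second - 0x20
```

With these ranges the first byte always has its high-order bit set and the second byte never does, which is what allows such a code to be told apart from a two-byte Chinese character code whose bytes both have the high-order bit set.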
  • The two-byte identification code encodes the high-order bits of the two bytes as different combinations of 0 and 1. (Note: inner codes written as a, b, c ... or A, B, C ... are for illustration purposes only and are not actual inner code values.)
  • When the mapping component item contains only the component item, the mapping component library device and the component library device are the same.
  • Each index item of the indexing device is an address item or a mark item, and the 6763 Chinese characters yield an indexing device containing 6763 index items.
  • The index items are sorted in ascending order by Chinese character internal code. If an index item is a mark item, it indicates that no word in the mapping component library starts with that Chinese character. If it is an address item, it is the address in the mapping component library of the first mapping component item whose word begins with that Chinese character. Of course, this address can also be represented by the multilevel internal code of the corresponding word.
  • Let the three-level internal code corresponding to "PLA" be D. If D occurs in a text sequence, it is first recognized as a multi-level internal code: the single-level internal code of a Chinese character is also a two-byte code, but the high-order bit of each of its bytes is 1, whereas in a multi-level internal code the high-order bit of one byte is 1 and that of the other is 0. D carries the area number and position number of its component item in the component library, that is, the relevant address information, so the component item "Ac" can be fetched. Because this component item contains the multi-level internal code A, the process is repeated.
  • the component item "ab" of A replaces A, and the component item "ab" contains no multi-level internal code.
  • the one-way conversion ends.
  • the conversion result of D is "abc", that is, "the People's Liberation Army". (Note: the internal codes written as a, b, c ... or A, B, C ... are for illustration only; they are not actual internal code values.)
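The worked example above can be traced with a short sketch. It uses the document's illustrative symbols (uppercase letters for multi-level internal codes, lowercase letters for single-level ones); the dictionary-based component library and the function names are assumptions made for illustration, and membership in the dictionary stands in for the high-order-bit test.

```python
# Illustrative component library: multi-level code -> component item.
COMPONENT_LIBRARY = {
    "A": "ab",  # two-level code for "Liberation"
    "D": "Ac",  # three-level code for "PLA"; its item contains A
}


def is_multilevel(symbol):
    """Stand-in for the recognition step (really a test on high-order bits)."""
    return symbol in COMPONENT_LIBRARY


def one_way_convert(text):
    """Replace multi-level codes by their component items until none remain."""
    # Assumption: the library has no cyclic definitions, so the loop terminates.
    while any(is_multilevel(symbol) for symbol in text):
        text = "".join(COMPONENT_LIBRARY.get(symbol, symbol) for symbol in text)
    return text
```

Calling `one_way_convert("D")` first yields "Ac" and then, on the second pass, "abc", matching the two replacement steps described in the text.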
  • mapping component item of the first word. Suppose that the mapping component items are arranged in the order "ab", "Ac", ..., and that the single-level internal code "ab" of "Liberation" is compared with "ab" in the mapping component item. Because they are equal, the match flag is set and the matching pointer points to that mapping component item.
  • the mapping component item is moved in the ascending direction, and the text "abc" being converted is compared with the current mapping component item "Ac".
  • A is converted to "ab" by one-way conversion, so "Ac" becomes "abc". Because they are equal, the match flag is set, the matching pointer is pointed at the mapping component item "Ac", and the mapping component item is then moved forward by one item. At this point "abcd" (where "d" denotes the single-level internal code of the following character) is compared with the mapping component item,
  • the English component library device can use a basic component library device or a full index basic component library device.
  • the index device can use a HASH query method.
  • the first letter is one of the 26 letters
  • the second letter is one of the 26 letters or spaces.
  • a total of 26 × 27 = 702 index entries.
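The 702-entry index for English can be sketched as a direct address computation. The ordering of the slots, with the space (or a missing second letter) placed after "z", is an assumption; the text only fixes the count of 26 × 27 entries.

```python
def english_index(word):
    """Return the index slot (0..701) for a word, from its first two characters."""
    word = word.lower()
    if not word or not "a" <= word[0] <= "z":
        raise ValueError("word must start with a letter")
    first = ord(word[0]) - ord("a")            # 26 choices
    if len(word) == 1 or word[1] == " ":
        second = 26                             # the 27th choice: a space
    elif "a" <= word[1] <= "z":
        second = ord(word[1]) - ord("a")
    else:
        raise ValueError("second character must be a letter or a space")
    return first * 27 + second
```

A hash-based lookup, as mentioned above, would be an alternative to this fixed table when the key space is larger.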
  • several two-byte identification codes can be used. For example, the first byte is 0 or 1, and the second byte is 0.
  • the recognition step: according to the coding characteristics of the multi-level internal code, identify whether the corresponding text information contains a multi-level internal code, and act on the result as follows: if a multi-level internal code is included, continue the conversion process; otherwise the conversion ends;
  • the multi-directional conversion steps are:
  • the internal code in the text information is identified according to the coding characteristics of the internal code; in the index checking step, the index device is looked up with the corresponding internal code to find the corresponding mapping component item of the mapping component library device;
  • the comparison and matching step compares the mapping component with the corresponding converted text information and makes the following selections based on the results:
  • the multi-level internal code corresponding to the last matched mapping component item is returned.
  • mapping component item is moved one item in the forward direction, and the comparison matching step is continued.
  • the jump-out condition refers to: when the mapping component items of the mapping component library are arranged in ascending order of the single-level internal code of the corresponding component item, the component item of the mapping component item is greater than the converted text information;
  • the above forward direction refers to: when the mapping component items of the mapping component library are arranged in ascending order of the single-level internal code of the corresponding component item, the ascending direction of the mapping component library;
  • in descending order, it is the descending direction of the mapping component library.
  • One-way conversion can convert a multi-level internal code into a single-level internal code, or into a lower-level multi-level internal code.
  • the following is a more detailed description of the one-way conversion operation and multi-directional conversion.
  • The example below converts to a single-level internal code to illustrate its characteristics.
  • If the component item contains multi-level internal codes, return to step (2);
  • Figure 1 is a schematic diagram of the one-way conversion process.
  • 1 is the multi-level internal code to be converted; 2 is the conversion part; 3 is the component library device; 4 is the judgment part; 5 is the completed conversion.
  • Multidirectional conversion can convert text information containing a single level of internal code into the corresponding multi-level internal code, or it can convert low-level multilevel internal codes into high-level multilevel internal codes. The characteristics are described below.
  • the operation steps of the multi-directional conversion operation are:
  • mapping component item corresponding to the address of the mapping component library device
  • the jump-out condition refers to: when the mapping component items of the mapping component library are arranged in ascending order of the single-level internal code of the corresponding component item, the component item of the mapping component item is greater than the converted text information; in descending order, the component item of the mapping component item is smaller than the converted text information.
  • the foregoing forward direction refers to: when the mapping component items of the mapping component library are arranged in ascending order of the single-level internal code of the corresponding component item, the ascending direction of the mapping component library;
  • in descending order, it is the descending direction of the mapping component library.
  • the matching operation in step 5 above generally performs "set the matching flag, and point the matching pointer to the mapping component item". If the sizes of multi-level internal codes can be compared correctly, it may instead perform "replace the corresponding part of the compared text with the multi-level internal code of the current mapping component item". In short, the purpose of the matching operation is to return the correct multi-level internal code at the end.
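The index lookup, compare-and-match loop, jump-out condition, and forward movement described above can be put together in a small sketch. It simplifies the patent's scheme in one respect, stated here as an assumption: the mapping component items are stored already expanded to single-level codes, so no one-way conversion is needed during comparison. All names and the sample library are hypothetical.

```python
# Mapping component library, ascending by expanded component item.
# Each entry: (component item in single-level codes, multi-level code).
MAPPING_LIBRARY = [("ab", "A"), ("abc", "D"), ("xy", "X")]

# Index device: first symbol -> address of the first entry that starts with it.
# A missing key plays the role of a mark item.
INDEX = {}
for address, (item, _code) in enumerate(MAPPING_LIBRARY):
    INDEX.setdefault(item[0], address)


def longest_match(text, start):
    """Return (code, length) of the last matching entry, or None."""
    address = INDEX.get(text[start])
    if address is None:                      # mark item: nothing starts here
        return None
    matched = None                           # match flag and pointer combined
    while address < len(MAPPING_LIBRARY):
        item, code = MAPPING_LIBRARY[address]
        window = text[start:start + len(item)]
        if item > window:                    # jump-out condition (ascending order)
            break
        if item == window:                   # matching operation
            matched = (code, len(item))
        address += 1                         # move in the forward direction
    return matched


def multidirectional_convert(text):
    """Replace every longest matching component item by its multi-level code."""
    out, i = [], 0
    while i < len(text):
        match = longest_match(text, i)
        if match:
            out.append(match[0])
            i += match[1]
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Because the library is sorted, the scan can stop as soon as an entry compares greater than the text: no later entry can then be a prefix of the text at that position.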
  • the input code input multi-directional conversion in the text information processing method containing multiple internal codes can be described as follows:
  • the maximum matching segmentation method can be used in the input of Chinese characters and words, and it can also be used in a shorthand recorder.
  • the maximum matching segmentation method for text information sequences can be applied to maximum-matching segmentation of a Chinese pinyin sequence.
  • the dictionary device is called a character-and-word library device. It contains character items and word items. A character item contains the internal codes of the Chinese characters that share the same syllable;
  • a word item contains a pinyin item and the multi-level or single-level internal codes of the words corresponding to that pinyin item; characters or words with the same pinyin are arranged according to a certain rule, and the character items and word items are arranged in the library device according to
  • the jump-out condition refers to: when the pinyin items of the words are arranged in ascending order of the single-level internal code of the corresponding pinyin characters, the pinyin item of the word is greater than the segmented pinyin information;
  • in descending order, the pinyin item of the word is smaller than the segmented pinyin information.
  • the foregoing forward direction refers to: when the pinyin items of the words are arranged in ascending order of the single-level internal code of the corresponding pinyin characters, the ascending direction of the pinyin items of the words;
  • the matching operation in step 3 above is generally performed by "setting the matching flag, and pointing the matching pointer to the word item".
  • the purpose of the matching operation is to return the longest word at the end.
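A forward maximum-matching segmentation of a pinyin sequence, in the spirit of the steps above, might look like the following sketch. The word list is a tiny hypothetical sample, `bisect` stands in for the index lookup, and the internal codes attached to each word item are omitted for brevity.

```python
import bisect

# Pinyin items of the word items, kept in ascending order.
PINYIN_ITEMS = sorted(["fang", "jie", "jiefang", "jiefangjun", "jun"])


def longest_word(pinyin, start):
    """Return the longest pinyin item that matches at position start, or None."""
    # Stand-in for the index device: first item not smaller than the first letter.
    address = bisect.bisect_left(PINYIN_ITEMS, pinyin[start])
    best = None
    for item in PINYIN_ITEMS[address:]:      # move in the ascending direction
        window = pinyin[start:start + len(item)]
        if item > window:                    # jump-out condition
            break
        if item == window:                   # matching operation: remember it
            best = item
    return best


def segment(pinyin):
    """Split a pinyin string into the longest matching items, left to right."""
    pieces, i = [], 0
    while i < len(pinyin):
        word = longest_word(pinyin, i) or pinyin[i]  # fall back to one letter
        pieces.append(word)
        i += len(word)
    return pieces
```

Since the last match found before the jump-out is also the longest one, the loop returns the longest word without any backtracking.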
  • text information is communicated through the transmission of text information containing multilevel internal codes.
  • Multi-directional conversion and unidirectional conversion can also use pipeline conversion devices.
  • Pipeline component library devices can be divided into pipeline single-segment component library devices and pipeline multi-segment component library devices.
  • the pipeline single-segment component library device is arranged in ascending (or descending) order of the single-level internal code of the pipeline component items.
  • the lowest end is called the beginning and the highest end is called the terminal.
  • A pipeline component item is formed by the length of the component item, the component item, and a multi-level internal code; or by a component item and a multi-level internal code; or by a component item alone.
  • the pipeline of the pipeline multi-segment component library device is segmented according to the level of the internal code, and the segments are then connected from the lowest-level internal code to the highest-level internal code to form the pipeline.
  • the outer end of the lowest-level internal code segment is called the beginning, and the outer end of the highest-level internal code segment is called the terminal.
  • the component item of the pipeline multi-segment component library device is formed by the length of the component item, the component item, and the multi-level internal code, or by the component item and the multi-level internal code.
  • the pipeline conversion device includes a pipeline component library device, which can realize multi-directional conversion and unidirectional conversion.
  • the pipeline conversion operation is the operation of multi-directional conversion and unidirectional conversion through the pipeline conversion device.
  • the pipeline multi-directional conversion steps are as follows: the text information enters from the beginning and is compared with the component items as it advances; where they are equal, the corresponding text information is replaced with the corresponding multi-level internal code; the text information then comes out of the terminal, completing the multi-directional conversion;
  • the pipeline unidirectional conversion steps are as follows: the text information enters from the terminal and is compared with the multi-level internal codes as it advances; where they are equal, the corresponding multi-level internal code is replaced with the corresponding component item, completing the one-way conversion.
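The two pipeline directions can be illustrated with a multi-segment sketch. The segment contents reuse the document's illustrative symbols; the list-of-segments representation and the use of plain string replacement are assumptions, and the sketch ignores the optional length field of a pipeline component item.

```python
# Pipeline multi-segment component library: one segment per internal-code level,
# connected from the lowest level (the beginning) to the highest (the terminal).
PIPELINE = [
    [("ab", "A")],   # two-level segment: component item -> multi-level code
    [("Ac", "D")],   # three-level segment: its item contains the two-level code A
]


def pipeline_multidirectional(text):
    """Text enters at the beginning and leaves at the terminal."""
    for segment in PIPELINE:
        for item, code in segment:
            text = text.replace(item, code)   # compare and replace while advancing
    return text


def pipeline_unidirectional(text):
    """Text enters at the terminal and travels back toward the beginning."""
    for segment in reversed(PIPELINE):
        for item, code in segment:
            text = text.replace(code, item)   # multi-level code -> component item
    return text
```

Ordering the segments by level matters: "abc" can only become D after its prefix "ab" has already been replaced by A in the lower segment.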
  • the pipeline conversion operation is an operation to realize multi-directional conversion and unidirectional conversion through a pipeline conversion device
  • Multidirectional conversion includes the following steps:
  • the comparison and replacement operation compares the converted text information in the pipeline with the component items. If they are equal, the corresponding text information is replaced with corresponding multilevel internal codes;
  • One-way conversion includes the following steps:
  • Compare and replace operations compare the converted text information in the pipeline with the component items, if
  • Figure 5 is a schematic diagram of a pipeline multi-directional conversion device.
  • 1 is the text information to be multidirectionally converted; 2 is the text information after multidirectional conversion; 3 is the pipeline component library device; 4 is the beginning; 5 is the terminal; 6 is the text information to be one-way converted; 7 is the text information after one-way conversion.
  • the following is a description of a general unidirectional conversion device and a multidirectional conversion device.
  • One-way conversion device includes
  • a recognition device which recognizes whether a component item contains a multi-level internal code, and can select the following actions based on the recognition result
  • the unidirectional conversion continues; otherwise, the unidirectional conversion ends.
  • Fig. 7 is a schematic diagram of a unidirectional conversion device: 1 is the component library device; 2 is the calculation device; 3 is the conversion device; 4 is the identification device; 5 is the entrance; 6 is the exit.
  • the multi-directional conversion device includes:
  • an indexing device which can be used to determine whether there is a component item starting with the corresponding internal code in the mapping component database, and if it exists, the address in the mapping component database is given;
  • a mapping component library device, whose mapping component items are arranged in ascending (or descending) order of the single-level internal code of the corresponding component item;
  • a comparison and matching device, which includes:
  • a judging device: by comparing the converted text information with the component item of the mapping component item, it selects the following actions based on the comparison result:
  • the loop is exited, and if there was a match, the last matching operation is performed; if the judgment result of the judging device is "equal", the matching device is entered;
  • a matching device: performs the matching operation and then enters the moving device;
  • a moving device: moves the mapping component item in the forward direction and then enters the judging device.
  • the jump-out condition refers to: when the mapping component items of the mapping component library are arranged in ascending order of the single-level internal code of the corresponding component item, the component item of the mapping component item is greater than the converted text information; in descending order, the component item of the mapping component item is smaller than the converted text information.
  • the foregoing forward direction refers to: when the mapping component items of the mapping component library are arranged in ascending order of the single-level internal code of the corresponding component item, the ascending direction of the mapping component library;
  • in descending order, it is the descending direction of the mapping component library.
  • the matching operation of step 5 above generally performs "set the matching flag, and point the matching pointer to the mapping component item". If the sizes of multi-level internal codes can be compared correctly, it may instead perform "replace the corresponding part of the compared text with the multi-level internal code of the corresponding mapping component item". In short, the purpose of the matching operation is to return the correct multi-level internal code at the end.
  • Figure 3 is a schematic diagram of a multi-directional conversion device.
  • 1 is the identification device, which recognizes the internal code in the text information, and searches the index device accordingly;
  • 2 is the comparison and matching device;
  • 3 is the index device;
  • 4 is the mapping component library device;
  • 5 is the component library device.
  • in the figure, the component library device is connected by a dashed line: it is required only if a unidirectional conversion is needed in the comparison operation.
  • Fig. 8 is a schematic diagram of the comparison and matching device in a multi-directional conversion device: 1 is the judging device, 2 is the matching device, 3 is the moving device, 4 is the entrance, and 5 is the exit.
  • Fig. 2 is a schematic diagram of the comparison and matching process in a multi-directional conversion device. The internal code in the text information is recognized and the index device is checked accordingly; if an address is found, step 1 is entered from entry 5;
  • Step 1: Compare whether the converted text information is greater than or equal to (or less than or equal to) the component item of the mapping component item; when the condition is not met, leave the comparison process through exit 6;
  • Step 2: If the result of step 1 is "greater than", go to step 4;
  • Step 3: Perform the matching operation;
  • Step 4: Move the mapping component item in the ascending (or descending) direction.
  • The matching operation in step 3 above is generally performed by "setting the matching flag, matching
  • a unidirectional conversion device is added to the first type of text output device to become a second type of text output device. For example, adding software or hardware containing a unidirectional conversion device to a printer enables the printer to print text information containing multiple levels of internal codes.
  • An input code input conversion device, a multi-directional conversion device, or a combination of the above devices is added to the first type of text input device to become a second type of text input device.
  • Because text information containing multi-level internal codes can be processed directly, the processing speed of text information can be improved. There are three main reasons for the increase in processing speed. The first is that text information containing multi-level internal codes is shorter than text information containing only single-level internal codes. The second is that the text information to be calculated is shorter. Because text information containing multi-level internal codes is short,
  • the method and device provided by the present invention can be widely used in the first type of word processing device, and examples are as follows.
  • the input code input conversion device can be used to convert the input codes of components in the first type of word processing device to the corresponding single-level internal codes;
  • the unidirectional conversion device and the multidirectional conversion device can be used to compress text information for storage and communication in the first type of word processing device.
  • unidirectional and multidirectional conversion devices are added to the file operation or disk operation system, so that the text information is automatically stored in a compressed form.
  • compared with text information containing only single-level internal codes, text information containing multi-level internal codes can be stored, transmitted, and processed more efficiently.
  • text information containing multi-level internal codes is processed faster, increases the capacity of the storage medium, and raises the efficiency of the transmission medium, thereby improving performance inside the system and the machine and increasing the efficiency of the word processing device.
  • the second type of word processing device also simplifies processing in language engineering. For example, part or all of the word segmentation work can be omitted, which is useful in natural language understanding, machine translation, and text-to-speech conversion. In Chinese text-to-speech conversion, for instance, the problems of word segmentation, stress of Chinese characters, and prosody must all be solved, and Chinese word segmentation is currently a difficult problem. Once the word segmentation problem is solved, the stress problem is solved with it.
  • the invention can also be used in word segmentation, especially in Chinese word segmentation and pinyin input of Chinese characters and words.
  • the technical solution of the present invention can be widely used in various fields of text information processing, and can also guide the design and manufacture of related software, semi-software, firmware, and integrated circuits, which brings considerable economic benefits.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Method for processing character data, and associated apparatus. In the conventional method for processing character data, and in the associated apparatus, the character symbol or basic unit corresponds to an internal machine code, also called a single-level internal machine code. In practice, processing character data means processing single-level internal machine codes. In the method and apparatus of the invention, words and phrases also correspond to certain internal machine codes, and in this case the internal machine code is called a multi-level internal machine code. Character data containing multi-level internal machine codes can be used in the computation and processing of character data. Consequently, processing character data in practice means processing character data containing multi-level internal machine codes. This increases the storage capacity for character data, the transmission and processing speed, and the efficiency. The invention applies to the transmission, storage, and processing of character data in or between computer networks, telecommunication networks, and various character processing devices, as well as to various multimedia and language engineering applications.
PCT/CN1995/000078 1994-10-05 1995-10-05 Procede de traitement de donnees de caracteres, et appareil associe WO1996011442A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU36032/95A AU3603295A (en) 1994-10-05 1995-10-05 Character information processing method and apparatus for the same

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN94114104.7 1994-10-05
CN94114104A CN1068688C (zh) 1994-10-05 1994-10-05 一种文字信息处理方法和装置

Publications (1)

Publication Number Publication Date
WO1996011442A1 true WO1996011442A1 (fr) 1996-04-18

Family

ID=5037040

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN1995/000078 WO1996011442A1 (fr) 1994-10-05 1995-10-05 Procede de traitement de donnees de caracteres, et appareil associe

Country Status (3)

Country Link
CN (1) CN1068688C (fr)
AU (1) AU3603295A (fr)
WO (1) WO1996011442A1 (fr)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7921059B2 (en) * 2005-12-15 2011-04-05 Microsoft Corporation Licensing upsell
US8010471B2 (en) * 2007-07-13 2011-08-30 Microsoft Corporation Multiple-instance pruning for learning efficient cascade detectors
KR101424718B1 (ko) * 2007-10-17 2014-08-04 삼성전자 주식회사 원격 접속 환경에서 접속 가능한 홈 네트워크 정보를제공하는 장치 및 그 방법
KR100960152B1 (ko) * 2007-10-24 2010-05-28 플러스기술주식회사 네트워크상의 복수 단말을 검출하여 인터넷을 허용 및차단하는 방법
US8239345B2 (en) * 2007-12-27 2012-08-07 Microsoft Corporation Asynchronous replication
JP4775463B2 (ja) * 2009-03-12 2011-09-21 カシオ計算機株式会社 電子計算機及びプログラム

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN86102418A (zh) * 1986-04-11 1987-11-25 石油部勘探开发科学研究院计算中心站 汉语音节处理机及汉语音节处理方法
CN1030985A (zh) * 1987-07-23 1989-02-08 中国商用机器公司 表意文字的处理方法及装置
CN1006251B (zh) * 1986-10-19 1989-12-27 中国民主促进会邯郸市委员会 词字二元编码输入汉字系统及键盘
CN1053960A (zh) * 1990-02-06 1991-08-21 松下电器产业株式会社 中文连续汉字变换装置

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN86102518A (zh) * 1986-09-10 1988-03-23 施国梁 模糊词汇键盘输入技术

Also Published As

Publication number Publication date
CN1122476A (zh) 1996-05-15
CN1068688C (zh) 2001-07-18
AU3603295A (en) 1996-05-02

Similar Documents

Publication Publication Date Title
US8321442B2 (en) Searching and matching of data
US6873986B2 (en) Method and system for mapping strings for comparison
US6507678B2 (en) Apparatus and method for retrieving character string based on classification of character
US9110980B2 (en) Searching and matching of data
JPH08194719A (ja) 検索装置および辞書/テキスト検索方法
JPS6211932A (ja) 情報検索方法
WO2004109492A1 (fr) Procede et appareil de traitement et de representation d'objets
JP6447161B2 (ja) 意味構造検索プログラム、意味構造検索装置、及び意味構造検索方法
US5560037A (en) Compact hyphenation point data
WO2020037794A1 (fr) Procédé de construction d'index pour nom géographique anglais, et procédé et appareil d'interrogation associés
WO1996011442A1 (fr) Procede de traitement de donnees de caracteres, et appareil associe
CN114595665A (zh) 一种二进制极短码字符词编码集的构建方法
JPH04326164A (ja) データベース検索システム
CN108595584B (zh) 一种基于数字标记的汉字输出方法和系统
JPH056398A (ja) 文書登録装置及び文書検索装置
JP3253657B2 (ja) 文書検索方法
JPS63263561A (ja) 日本語文の圧縮方法
CN102478971A (zh) 一种方块字的键盘输入方法及具有键盘的数字电子装置
JPH07282040A (ja) 日本語情報圧縮方式
JP2921119B2 (ja) 数値検索装置および数値検索方法
JPH06251070A (ja) 単語検索のための電子辞書圧縮方法及び装置
JP3526748B2 (ja) 文字列探索装置および方法
JPH0140370B2 (fr)
JP2993539B2 (ja) データベース検索システムおよびその方法
JP2526678B2 (ja) 単語辞書検索装置

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AM AT AU BB BG BR BY CA CZ DE DK EE ES FI GB GE HU IS JP KE KG KP KR KZ LK LR LT LU LV MD MG MN MW MX NO NZ PL PT RO RU SD SE SG SI SK TJ TM TT UA UG US UZ VN

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): KE MW SD SZ UG AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
ENP Entry into the national phase

Ref country code: US

Ref document number: 1997 817539

Date of ref document: 19970404

Kind code of ref document: A

Format of ref document f/p: F

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase