UA64907A

UA64907A - Adaptive device for compressing text messages

Info

Publication number: UA64907A
Application number: UA2003010412A
Authority: UA
Inventors: Viktor Stepanovych Chernega
Original assignee: Univ Sevastopol Nat Technical
Priority date: 2003-01-16
Filing date: 2003-01-16
Publication date: 2004-03-15

Abstract

The proposed adaptive device for compressing text messages contains an input data register, a register for forming the prefix code of a text string, a comparator, a random-access memory unit, a timing pulse generator, and additionally a unit for determining the text type, a unit for selecting a module of the random-access memory, a counter module, and a control unit. The input of the register for forming the prefix code is connected to the first output of the input data register. The second output of the input data register and the output of the register for forming the prefix code are connected to the corresponding inputs of the comparator. The unit for determining the text type and the unit for selecting a module of the random-access memory are connected in series between the input of the device and the input of the random-access memory unit. The first output of the comparator is connected to the corresponding input of the unit for determining the text type, and the second output is connected to the input of the control unit. The outputs of the control unit are connected to the corresponding control inputs of the input data register, the register for forming the prefix code, and the counter module.

Description

The invention relates to the field of efficient coding of information and can be used in computer communication nodes and on the Internet.

A known device compresses text messages by the LZW method (Chernega V.S. Compression of information in computer networks / V.S. Chernega. - Sevastopol: SevGTU, 1997. - 214 p.). It contains a code table, a code counter, a hash function generation unit, a comparison circuit, and a prefix register. However, this design does not take into account the informational and statistical properties of the language of the text message. As a result, at the initial stage of coding, while the code table of variable-length strings is still being built, the device sends to the transmission channel single characters encoded with code words longer than 8 bits, which expands the transmitted message. As the code table fills with substrings of the text being compressed, and two or more characters (substrings) are encoded with a single code word, the volume of information sent to the channel decreases because the compression ratio increases.
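The initial expansion described above can be illustrated with a minimal sketch of a plain string-table coder. The sketch assumes the classic scheme with a table preloaded with 256 single characters and a fixed code width of 9 bits; the source only says the code words are longer than 8 bits, and the names `CODE_BITS`, `lzw_encode`, and `expansion_ratio` are illustrative, not taken from the cited works.

```python
CODE_BITS = 9  # assumed code width; the text only says "longer than 8 bits"


def lzw_encode(text):
    """Encode single-byte text; return the list of emitted codes."""
    table = {chr(i): i for i in range(256)}  # all single characters are preloaded
    prefix = ""
    codes = []
    for ch in text:
        candidate = prefix + ch
        if candidate in table:
            prefix = candidate               # keep extending the current string
        else:
            codes.append(table[prefix])      # emit the longest known string
            table[candidate] = len(table)    # learn the new string
            prefix = ch
    if prefix:
        codes.append(table[prefix])
    return codes


def expansion_ratio(text):
    """Output bits divided by input bits; a value above 1 means expansion."""
    return CODE_BITS * len(lzw_encode(text)) / (8 * len(text))
```

On a text with no repeated substrings, such as "abcdefgh", every character is emitted as its own 9-bit code, so the output is larger than the input. Repetition, as in "abababab", is needed before the ratio drops below 1.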

The prototype is a device for encoding variable-length strings (Welch T.A. A technique for high-performance data compression / IEEE Computer. - 1984. - Vol.17. - No.6. - P. 8-19). This device contains: an input character register, to which are connected a hash function generation unit and a register for forming the code of the substring built from the characters of the text being compressed; a comparison unit, which searches the table of variable-length strings for the substring received in the input register from the device input; and a table of variable-length strings, which holds the substrings the device has found in the text. The output of the substring code register is connected to the hash function generation unit, which speeds up the search for a string in the table of variable-length strings. The output of the substring code register is also the output of the device. An initial counter and the hash function generation unit are connected to the string table address selection unit. The table of variable-length strings is connected to the output of the address selection unit, and the comparison unit for the substring code register and the code counter are in turn connected to the table.

The invention addresses the task of increasing the degree of compression and reducing the compression time of text information, especially at the initial stage of coding. The task is solved as follows. Texts are divided into types: colloquial speech, fiction, business texts, and so on. Each type of text in a given language has its own substring statistics, which can be collected in advance by processing several texts of that type. Each table therefore holds the substrings that are most probable for its type of text. Using these tables avoids building inefficient code words at the initial stage of coding.
To implement this method, the following are introduced into the text compression device: a text type evaluation unit; a unit of coding tables of variable-length strings for the different text types, implemented as a block of random-access memory (RAM) devices; and a table switch. The drawing (Fig.) shows the block diagram of the device.
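Collecting the per-type substring statistics in advance can be sketched as follows. The source does not say how the statistics are gathered, so counting all substrings of 2 to `max_len` characters and keeping the `size` most frequent ones is an assumption, and `build_type_table` is a hypothetical helper name.

```python
from collections import Counter


def build_type_table(samples, max_len=3, size=4):
    """Return the `size` most frequent substrings (2..max_len chars) in `samples`."""
    counts = Counter()
    for text in samples:
        for n in range(2, max_len + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    # most_common sorts by count, highest first
    return [substring for substring, _ in counts.most_common(size)]
```

One such table would be built per text type (for example from several business letters, or from several works of fiction) and loaded into the RAM block before coding starts.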

The device contains a text type evaluation unit 1, an input character register 2, a clock pulse generator 3, a prefix register 4, an output register 5, a control unit 6, a table switch 7, a RAM block 8, a comparison unit 9, and a code counter block 10.

The first inputs of units 1 and 2 are connected in parallel, and their junction is the input of the device. The output of unit 1 is connected to the input of unit 7, whose outputs are connected to units 8 and 6. The outputs of unit 6 are connected to the second inputs of units 2, 4, and 10, and its inputs are connected to the outputs of units 3, 9, and 7. The outputs of unit 2 are connected to the inputs of units 4 and 9, respectively. The outputs of unit 10 are connected to unit 8, whose outputs are connected to unit 9. The corresponding outputs of unit 9 are connected to units 1 and 6. The outputs of unit 4 are connected to the inputs of units 9 and 5, respectively. The output of unit 5 is the output of the compression device.
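For reference, the connections listed above can be written down as a directed graph keyed by the unit numbers. This is only a bookkeeping sketch: the labels "in" and "out" for the device input and output, and the helper `reachable`, are illustrative names that do not appear in the source.

```python
# Signal flow between the numbered units; "in" and "out" are the device terminals.
WIRING = {
    "in": [1, 2],   # device input feeds units 1 and 2 in parallel
    1: [7],         # text type evaluation -> table switch
    2: [4, 9],      # input register -> prefix register, comparison unit
    3: [6],         # clock pulse generator -> control unit
    4: [9, 5],      # prefix register -> comparison unit, output register
    5: ["out"],     # output register -> device output
    6: [2, 4, 10],  # control unit -> registers and code counters
    7: [8, 6],      # table switch -> RAM block, control unit
    8: [9],         # RAM block -> comparison unit
    9: [1, 6],      # comparison unit -> text type evaluation, control unit
    10: [8],        # code counters -> RAM block
}


def reachable(graph, start):
    """Return every node that can be reached from `start` by following the wiring."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in graph.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen
```

Walking the graph from "in" reaches "out" and every unit except the clock pulse generator 3, which is a pure source with no inputs.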

The device operates as follows.

The characters to be encoded arrive sequentially in the input character register 2. The character in register 2 is then shifted into the prefix register 4, and the next character of the input stream is loaded into the input register 2. The code held in register 4 is the prefix, which together with the character in the input register forms a string. The text type evaluation unit 1 searches for the string in each RAM table in turn. The table in which the largest number of substrings matching the substrings being encoded is found (in the experimental model that was built, after [illegible] thousand characters) is designated the current table. The index of the string's row in the table is the code of the string and is written to the output register. At the initial stage of coding, a table generalized for all text types is used.

The code of the string that was found becomes the new prefix and is written to the prefix register 4. The next character of the stream is then loaded into the input register 2 and joined to the prefix, and the search for a matching substring is repeated. This continues until the newly formed string is found to be absent from every table. In parallel, the number of substrings per unit of time is counted.

If the frequency of the newly formed string exceeds a threshold, the string is entered into the current code table at the first free row.

The code word in the output register 5 is sent to the output of the device. The character in the input register 2 is moved to the prefix register, and the coding procedure starts again.
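The coding cycle described above can be sketched in software as a string-table loop with a frequency gate on new entries. This is a simplified model under stated assumptions, not the hardware design: it works on a single current table, the table is a dictionary whose codes are the contiguous row indices 0..len-1, every single character of the input is already in the table, and the frequency of a string is modeled as the number of times it has been formed so far. The names `encode`, `threshold`, and `capacity` are hypothetical.

```python
from collections import Counter


def encode(text, table, threshold=1, capacity=4096):
    """Return the list of codes for `text`; `table` (string -> row index) is updated."""
    formed = Counter()  # how many times each missing string has been formed
    codes = []
    prefix = ""         # models the prefix register
    for ch in text:     # `ch` models the input character register
        candidate = prefix + ch
        if candidate in table:
            prefix = candidate             # found: its code becomes the new prefix
            continue
        codes.append(table[prefix])        # not found: output the code of the prefix
        formed[candidate] += 1
        if formed[candidate] > threshold and len(table) < capacity:
            table[candidate] = len(table)  # enter at the first free row
        prefix = ch                        # input character moves to the prefix register
    if prefix:
        codes.append(table[prefix])
    return codes
```

With `threshold=0` every new string is stored at once, which reduces the loop to the ordinary string-table scheme. A higher threshold keeps rarely seen strings out of the table. The source does not describe how the decoder stays synchronized with the table, so that side is left out of the sketch.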

After M characters have arrived at the input of the compressor, where M is enough for a reliable estimate of the text class, the search for a matching string in the RAM coding tables is no longer carried out by scanning all of them in sequence. It is carried out only in the table that corresponds to the estimated text class.

This significantly reduces the time spent searching for a matching string and so increases the speed of the compressor.

The estimate of the text class may change as incoming characters are encoded. When it does, the order in which the RAM coding tables are scanned for a matching substring changes accordingly.
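The selection of the current table, and its re-evaluation as the text goes on, can be sketched as counting table hits over a window of recent characters. The source does not specify the scoring rule, so counting non-overlapping occurrences of each table entry in the window is an assumption. The helper name `pick_table` and the example tables are hypothetical.

```python
def pick_table(window, tables):
    """Return the name of the table whose entries occur most often in `window`."""
    def hits(entries):
        # str.count counts non-overlapping occurrences of each entry
        return sum(window.count(entry) for entry in entries)
    # on a tie, max keeps the first table in iteration order
    return max(tables, key=lambda name: hits(tables[name]))


# Hypothetical tables for two text types
TABLES = {
    "business": ["invoice", "payment"],
    "fiction": ["she said", "once upon"],
}
```

Calling `pick_table` on successive windows of the input models the changing estimate: a window such as "the payment for this invoice" selects "business", while a later window such as "once upon a time, she said" selects "fiction".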

The control unit, the text evaluation unit, and the input and output registers are implemented on a signal microprocessor. The clock pulse generator is based on a quartz oscillator. The RAM block is built from memory chips.

Thus, using the adaptive text compression device in systems for transmitting and storing text information increases the degree of data compression through the introduction of the text evaluation unit and probabilistic tables of variable-length strings for different text types. This reduces the time needed to transmit text data over communication channels and the amount of memory occupied by texts stored on magnetic and optical media.

Fig. Block diagram of the compression device.

Claims

An adaptive device for compressing text messages, comprising an input character register, to which a prefix register for forming the substring code is connected, the output of the prefix register and the second output of the input character register being connected to the corresponding inputs of a comparison unit, which in turn is connected to a block of random-access memory devices, and a clock pulse generator, characterized in that a text type evaluation unit and a switching unit for the random-access memory devices, connected in series, are introduced between the input of the device and the block of random-access memory devices, the first output of the comparison unit is connected to the corresponding input of the text type evaluation unit, and its second output is connected to a control unit, which is connected to the input character register, the prefix register for forming the substring code, and a code counter block.