CN116032292A - Efficient big data storage method based on translation file - Google Patents
- Publication number
- CN116032292A CN202310300380.XA
- Authority
- CN
- China
- Prior art keywords
- sequence
- character
- buffer
- phrase
- characters
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The invention relates to the technical field of data processing for data compression, and in particular to an efficient big data storage method based on a translation file, comprising the following steps: obtaining a translation file and preprocessing it to obtain a translation sequence; compressing the translation sequence by combining a character search buffer and a phrase search buffer to obtain a compressed sequence; obtaining the binary sequences corresponding to all coding objects, and encoding the compressed sequence according to these binary sequences to obtain a coded sequence; and sending the coded sequence to a terminal at the conference site. By setting a smaller search buffer for the LZ77 compression algorithm, together with a character search buffer and a phrase search buffer, the invention keeps the execution time of compressing the translation short while keeping the compression efficiency high. This resolves the conflict in the LZ77 compression algorithm, in which the size of the search buffer affects execution time and compression efficiency in opposite directions, and achieves efficient storage and rapid acquisition of the translation file.
Description
Technical Field
The invention relates to the technical field of data processing for data compression, in particular to a high-efficiency storage method for big data based on a translation file.
Background
Simultaneous interpretation is a mode of translation in which the translator delivers the translation while still listening to the speaker. Its chief advantages are that it is efficient and that it does not interrupt the speaker's train of thought, so the speech stays coherent. In some situations the translator cannot be present at the venue; the audio recorded while the speaker talks must then be sent to the translator, who produces the corresponding translation sequence, and that sequence is sent back to the terminal at the conference site.
Because simultaneous interpretation places high demands on timeliness, the translation sequence must be compressed with high efficiency and short execution time, so a simple and fast compression algorithm is needed for the translation sequence to be obtained efficiently and quickly at the conference site. Moreover, because simultaneous interpretation takes place in real time, the statistical characteristics of the translation cannot be known in advance, so the translation cannot be compressed by an algorithm that relies on such statistics, for example the Huffman compression algorithm.
Given these requirements, the translation is compressed with the LZ77 compression algorithm. The LZ77 compression algorithm uses the already compressed data of the translation as a search buffer and the data still to be compressed as a look-ahead buffer, and achieves efficient compression by exploiting the locality of the data. The size of the search buffer determines both the execution time and the compression efficiency of the algorithm, and it affects the two in opposite directions. How to obtain the translation sequence efficiently and quickly at the conference site is therefore a problem that urgently needs to be solved.
Disclosure of Invention
The invention provides a high-efficiency storage method of big data based on a translation file, which aims to solve the existing problems.
The invention discloses a high-efficiency storage method for big data based on a translation file, which adopts the following technical scheme:
the embodiment of the invention provides a method for efficiently storing big data based on a translation file, which comprises the following steps:
obtaining a translation file, and preprocessing the translation file to obtain a translation sequence;
compressing the translation sequence by combining the character search buffer and the phrase search buffer to obtain a compressed sequence, comprising:
S1, padding the front of the translation sequence with a second preset length of empty characters, taking the sequence formed by the first first-preset-length characters of the padded translation sequence as the sliding window, and setting an empty sequence as the compressed character sequence;
S2, taking the sequence formed by the first second-preset-length characters in the sliding window as the character search buffer, and the sequence formed by the remaining characters in the sliding window as the look-ahead buffer;
S3, obtaining the phrase search buffer according to the compressed character sequence;
S401, obtaining all character strings of the look-ahead buffer, and judging whether the phrase search buffer contains a phrase identical to the 1st character string of the look-ahead buffer; if not, executing S402; if so, searching the phrase search buffer to obtain the maximum match of the look-ahead buffer, obtaining an output result and a sliding amount according to the maximum match, and executing S5;
S402, searching the character search buffer to obtain the maximum match of the look-ahead buffer, obtaining an output result and a sliding amount according to the maximum match, and executing S5;
S5, sliding the sliding window by the sliding amount and taking the window after sliding as the new sliding window; re-executing from S2 with the new sliding window until the size of the new sliding window is smaller than the second preset length; recording the sequence formed by all output results, in order, as the compressed sequence of the translation sequence;
obtaining the binary sequences corresponding to all coding objects, and encoding the compressed sequence according to these binary sequences to obtain a coded sequence; and temporarily storing the coded sequence and sending it to the terminal at the conference site.
Further, the preprocessing of the translation file to obtain a translation sequence includes the following specific steps:
taking a first preset symbol as the identifier of the end of an English word and a second preset symbol as the identifier of the end of a sentence; replacing every end-of-word identifier in the translation file with the first preset symbol and every end-of-sentence identifier with the second preset symbol; adding a first preset symbol at the beginning of the translation file; and, treating the first preset symbol, the second preset symbol and all English letters as characters, recording the sequence formed by all characters contained in the translation file, in order, as the translation sequence.
Further, obtaining the phrase search buffer according to the compressed character sequence comprises the following specific steps:
a sequence formed by all the English letters between two symbols in the compressed character sequence is recorded as a phrase, and each symbol in the compressed character sequence is also recorded as a phrase, the symbols being of two kinds, the first preset symbol and the second preset symbol; all phrases obtained from the compressed character sequence are arranged in order, and the sequence formed by the last phrases, as many as the second preset length, is taken as the phrase search buffer.
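The construction of the phrase search buffer can be sketched as follows. This is a minimal illustration, not the patent's reference implementation: the function names, the use of "#" and "&" as the two symbols, and the buffer size of 8 are assumptions taken from the worked example later in the description. The phrase "between two symbols" is read strictly here, so a trailing run of letters that has no closing symbol yet is not counted as a phrase.

```python
import re

def split_phrases(compressed_chars: str) -> list[str]:
    """Split already-compressed characters into phrases (letter runs and symbols)."""
    tokens = re.findall(r"[a-z]+|[#&]", compressed_chars)
    # Assumption: a trailing letter run is not yet enclosed by two symbols,
    # so it is not treated as a phrase.
    if tokens and tokens[-1].isalpha():
        tokens.pop()
    return tokens

def phrase_search_buffer(compressed_chars: str, size: int = 8) -> list[str]:
    """Keep only the last `size` phrases (size = second preset length, assumed 8)."""
    return split_phrases(compressed_chars)[-size:]

print(phrase_search_buffer("#i#have#a#dog&"))
```

Because the buffer holds whole phrases rather than single characters, eight entries can reach much further back into the text than an eight-character buffer.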
Further, the obtaining all the character strings of the look-ahead buffer includes the following specific steps:
dividing the look-ahead buffer into a plurality of subsequences according to symbols, marking each subsequence and each symbol as a character string, and obtaining all character strings of the look-ahead buffer, wherein the symbols comprise two kinds of symbols, namely a first preset symbol and a second preset symbol.
Further, searching the phrase search buffer to obtain the maximum match of the look-ahead buffer, and obtaining the output result and the sliding amount according to the maximum match, comprises the following specific steps:
a phrase identical to the 1st character string of the look-ahead buffer is found in the phrase search buffer and denoted as the i-th phrase of the phrase search buffer; if there exists a largest integer z, limited by the number s of character strings in the look-ahead buffer, such that the phrases from the i-th phrase to the (i+z)-th phrase of the phrase search buffer are identical, one by one, to the character strings from the 1st to the (1+z)-th character string of the look-ahead buffer, then the sequence formed by the 1st to the (1+z)-th character strings of the look-ahead buffer is recorded as the maximum match; the sequence number i of the matching phrase in the phrase search buffer is recorded as the offset, and the integer z is taken as the matching length; the pair consisting of the offset and the matching length is taken as the output result; the sum of the lengths of the 1st to the (1+z)-th character strings of the look-ahead buffer is taken as the sliding amount.
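A hedged sketch of the phrase-level search follows. The helper names are hypothetical, offsets are assumed to be 1-based positions in the phrase search buffer, and the matching length is taken to be the number of matched character strings, which is how the worked example in the description appears to count it.

```python
import re

def split_strings(lookahead: str) -> list[str]:
    """Split the look-ahead buffer into letter runs and single symbols."""
    return re.findall(r"[a-z]+|[#&]", lookahead)

def match_phrases(phrase_buffer: list[str], lookahead: str):
    """Return ((offset, length), slide) for the longest phrase match, or None."""
    strings = split_strings(lookahead)
    best = None
    for i, phrase in enumerate(phrase_buffer):
        if not strings or phrase != strings[0]:
            continue
        n = 0
        # Extend the match phrase by phrase while both sides agree.
        while (i + n < len(phrase_buffer) and n < len(strings)
               and phrase_buffer[i + n] == strings[n]):
            n += 1
        if best is None or n > best[1]:
            best = (i + 1, n)  # assumed 1-based offset
    if best is None:
        return None
    # The window slides by the total character length of the matched strings.
    slide = sum(len(s) for s in strings[:best[1]])
    return best, slide

buffer = ["#", "i", "#", "have", "#", "a", "#", "dog"]
print(match_phrases(buffer, "#dog"))
```

With this buffer, the look-ahead "#dog" is matched as two phrases and the window slides by four characters, so one pair replaces four literals.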
Further, searching the character search buffer to obtain the maximum match of the look-ahead buffer, and obtaining the output result and the sliding amount according to the maximum match, comprises the following specific steps:
judging whether the character search buffer contains a character identical to the 1st character of the look-ahead buffer:
if the character search buffer contains no character identical to the 1st character of the look-ahead buffer, the 1st character of the look-ahead buffer is taken as the maximum match, the maximum match is taken as the output result, and the number of characters in the maximum match is taken as the sliding amount;
if the character search buffer contains a character identical to the 1st character of the look-ahead buffer, that character is denoted as the j-th character of the character search buffer; if there exists a largest integer r, limited by the first preset length and the second preset length, such that the characters from the j-th character to the (j+r)-th character of the character search buffer are identical, one by one, to the characters from the 1st to the (1+r)-th character of the look-ahead buffer, then the sequence formed by the 1st to the (1+r)-th characters of the look-ahead buffer is recorded as the maximum match; the sequence number j of the matching character in the character search buffer is recorded as the offset, and the integer r is taken as the matching length; the pair consisting of the offset and the matching length is taken as the output result; the matching length is taken as the sliding amount.
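The character-level fallback can be sketched in the same spirit. The function name is hypothetical; offsets are assumed to be 1-based positions inside the character search buffer, the matching length is counted in characters, and matches are confined to the search buffer, as the text describes.

```python
def match_chars(search: str, lookahead: str):
    """Return (output, slide): a literal character or an (offset, length) pair."""
    first = lookahead[0]
    best = None
    for j, ch in enumerate(search):
        if ch != first:
            continue
        n = 0
        # Extend the match character by character inside the search buffer.
        while (j + n < len(search) and n < len(lookahead)
               and search[j + n] == lookahead[n]):
            n += 1
        if best is None or n > best[1]:
            best = (j + 1, n)  # assumed 1-based offset
    if best is None:
        return first, 1        # no match: emit the literal, slide by one
    return best, best[1]

print(match_chars("#i#have#", "a#dog&th"))
print(match_chars("$$$$$$$$", "#i#have#"))
```

Under these assumptions the first call yields the pair (5, 1) and the second yields the literal "#", both with a sliding amount of 1.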
Further, the obtaining the binary sequences corresponding to all the encoding objects includes the following specific steps:
recording the first preset symbol, the second preset symbol and all English letters as characters, and recording all characters together with all integers in the value range of the offset and the matching length as coding objects;
a short binary sequence of length ⌈log₂ w⌉ + 1 represents a character, where w is the number of kinds of characters and ⌈·⌉ denotes rounding up; the short binary sequences corresponding to all characters are set as follows: the 1st bit of the short binary sequence is recorded as the distinguishing bit and set to 0; any ⌈log₂ w⌉-bit binary data is used as the last ⌈log₂ w⌉ bits of the short binary sequence, provided that the short binary sequences corresponding to any two characters are different;
a long binary sequence represents each integer in that value range, whose size is determined by the second preset length; the long binary sequences corresponding to all integers are set as follows: the 1st bit of the long binary sequence is recorded as the distinguishing bit and set to 1; the binary data of the integer occupies the bits from the 2nd bit up to the second-to-last bit of the long binary sequence; the last bit of the long binary sequence is recorded as the recognition bit, which is set to 1 if the offset and the matching length were obtained from the character search buffer, and to 0 if they were obtained from the phrase search buffer.
Further, encoding the compressed sequence according to the binary sequences corresponding to all coding objects to obtain the coded sequence comprises the following specific steps:
obtaining the short binary sequences corresponding to all characters in the compressed sequence and the long binary sequences corresponding to all offsets and matching lengths, and recording the sequence formed by all these short and long binary sequences, in order, as the coded sequence.
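The two-width code can be illustrated with a small encoder and decoder. Everything below is a sketch under stated assumptions: the ordering of the 28-character alphabet, the 4-bit integer width, and the representation of a match as an (offset, length, from_char_buffer) tuple are all hypothetical choices made for the example, not details given in the text.

```python
import math

ALPHABET = "#&abcdefghijklmnopqrstuvwxyz"            # 28 characters, assumed order
CHAR_BITS = math.ceil(math.log2(len(ALPHABET)))      # 5 data bits per character
INT_BITS = 4                                         # assumed width for integers

def encode(compressed) -> str:
    """Literals are str; matches are (offset, length, from_char_buffer) tuples."""
    bits = []
    for item in compressed:
        if isinstance(item, str):
            # Short sequence: distinguishing bit 0 + character index.
            bits.append("0" + format(ALPHABET.index(item), f"0{CHAR_BITS}b"))
        else:
            offset, length, from_chars = item
            flag = "1" if from_chars else "0"        # recognition bit
            for value in (offset, length):
                # Long sequence: distinguishing bit 1 + integer + recognition bit.
                bits.append("1" + format(value, f"0{INT_BITS}b") + flag)
    return "".join(bits)

def decode(bits: str) -> list:
    items, ints, pos = [], [], 0
    while pos < len(bits):
        if bits[pos] == "0":                         # short sequence
            items.append(ALPHABET[int(bits[pos + 1:pos + 1 + CHAR_BITS], 2)])
            pos += 1 + CHAR_BITS
        else:                                        # long sequence
            ints.append(int(bits[pos + 1:pos + 1 + INT_BITS], 2))
            from_chars = bits[pos + 1 + INT_BITS] == "1"
            pos += 2 + INT_BITS
            if len(ints) == 2:                       # offset and length collected
                items.append((ints[0], ints[1], from_chars))
                ints = []
    return items

sample = ["#", "i", (7, 1, True), "h", (6, 1, False), (7, 2, False)]
print(decode(encode(sample)) == sample)
```

The leading distinguishing bit is what lets the decoder know how many bits to read next, so sequences of different widths can be concatenated without separators.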
The technical scheme of the invention has the following beneficial effects: by setting a smaller search buffer for the LZ77 compression algorithm, the invention keeps the execution time of the compression algorithm short; by setting both a character search buffer and a phrase search buffer, the compression algorithm can exploit the locality of characters that are close together in the translation as well as the locality of characters that are far apart. As a result, the translation is compressed with a short execution time and high compression efficiency, which resolves the conflict in the LZ77 compression algorithm whereby the size of the search buffer affects execution time and compression efficiency in opposite directions, achieves efficient storage of translation files, and ensures that the translation sequence can be obtained quickly and efficiently at the conference site.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of steps of a method for efficiently storing big data based on a translation file according to the present invention.
Detailed Description
In order to further describe the technical means and effects adopted by the invention to achieve the preset aim, the following is a detailed description of a specific implementation, structure, characteristics and effects of the method for efficiently storing big data based on translation files according to the invention, which is provided by the invention, with reference to the accompanying drawings and the preferred embodiments. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The following specifically describes a specific scheme of the high-efficiency storage method of big data based on translation files.
Referring to fig. 1, a flowchart of a method for efficiently storing big data based on a translation file according to an embodiment of the present invention is shown, where the method includes the following steps:
S001, obtaining a translation file, and preprocessing the translation file to obtain a translation sequence.
In this embodiment, a terminal at the conference site receives the speaker's voice signal from sound pickup equipment and sends the received voice signal to the translation end; the translator at the translation end translates the received voice signal online in real time to obtain translated text, and all the translated text within a preset time period forms a translation file.
In order to ensure that the conference participants understand the speech throughout, simultaneous interpretation has high requirements on timeliness, and efficient and rapid acquisition of a translation sequence is required on the conference site, so in this embodiment, the preset time period is 5 seconds, and in other embodiments, the implementation personnel can set the preset time period as required.
It should be noted that the translation file in this embodiment is an English translation file, so it consists of English words, end-of-word identifiers and end-of-sentence identifiers, where each English word consists of several English letters. A first preset symbol is taken as the identifier of the end of an English word and a second preset symbol as the identifier of the end of a sentence; every end-of-word identifier in the translation file is replaced with the first preset symbol and every end-of-sentence identifier with the second preset symbol; a first preset symbol is added at the beginning of the translation file; and, treating the first preset symbol, the second preset symbol and all English letters as characters, the sequence formed by all characters contained in the translation file, in order, is recorded as the translation sequence. In this embodiment the first preset symbol is "#" and the second preset symbol is "&".
For example, in this embodiment the translation file is "i have a dog. The dog", and the translation sequence obtained by replacing all end-of-word and end-of-sentence identifiers in the translation file is "#i#have#a#dog&the#dog".
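The preprocessing step can be sketched as below. This is an illustrative reading of the example above rather than the patent's own code: the function name is hypothetical, and lowercasing the text, treating ".", "!" and "?" as sentence endings, and dropping any other punctuation are assumptions.

```python
import re

WORD_END = "#"       # first preset symbol
SENTENCE_END = "&"   # second preset symbol

def preprocess(text: str) -> str:
    """Turn English translation text into the translation sequence."""
    text = text.strip().lower()
    # Sentence-ending punctuation, with any following spaces, becomes "&".
    text = re.sub(r"[.!?]+\s*", SENTENCE_END, text)
    # Remaining whitespace runs mark word boundaries.
    text = re.sub(r"\s+", WORD_END, text)
    # Keep only letters and the two symbols (assumption: drop everything else).
    text = re.sub(r"[^a-z#&]", "", text)
    # A first preset symbol is added at the beginning.
    return WORD_END + text

print(preprocess("i have a dog. The dog"))
```

Running it on the example sentence gives "#i#have#a#dog&the#dog", the translation sequence stated above.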
S002, compressing the translation sequence by combining the character searching buffer area and the phrase searching buffer area to obtain a compressed sequence.
It should be noted that, because simultaneous interpretation places high demands on timeliness, the translation sequence must be compressed with high efficiency and short execution time, so a simple and fast compression algorithm is needed for the translation sequence to be obtained efficiently and quickly at the conference site. Moreover, because simultaneous interpretation takes place in real time, the statistical characteristics of the translation cannot be known in advance, so the translation cannot be compressed by an algorithm that relies on such statistics, for example the Huffman compression algorithm. This embodiment therefore compresses and transmits the translation with the LZ77 compression algorithm.
The LZ77 compression algorithm uses the already compressed characters of the translation as the search buffer and the characters still to be compressed as the look-ahead buffer, and achieves efficient compression by exploiting the locality of the characters. The size of the search buffer determines both the execution time and the compression efficiency of the algorithm. On the one hand, when the translation is compressed, the longest match for the look-ahead buffer has to be searched for in the search buffer, so the smaller the search buffer, the shorter the search and hence the shorter the execution time of the algorithm. On the other hand, the LZ77 compression algorithm checks whether the characters to be compressed in the look-ahead buffer already occur among the compressed characters and, if so, replaces the repetition with a reference to the earlier occurrence. When the search buffer is small, fewer compressed characters are available: only the locality of characters that are close together in the translation can be exploited, not that of characters that are far apart, so the locality of the characters is not fully used. The smaller the search buffer, therefore, the lower the compression efficiency of the algorithm. In summary, the size of the search buffer of the LZ77 compression algorithm affects execution time and compression efficiency in opposite directions.
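For comparison, the baseline behaviour described above can be seen in a textbook LZ77 sketch. Note that this is the classic (offset, length, next character) form, not the two-buffer variant of this embodiment; the function names and the default buffer sizes are assumptions. The inner loop visits every position of the search buffer, which is why a larger `search_size` costs more time per step.

```python
def lz77_compress(data: str, search_size: int = 8, lookahead_size: int = 8):
    """Classic LZ77: emit (offset, length, next_char) triples."""
    i, out = 0, []
    while i < len(data):
        start = max(0, i - search_size)
        best_off, best_len = 0, 0
        # Leave one character free so that next_char always exists.
        max_len = min(lookahead_size, len(data) - i - 1)
        for j in range(start, i):          # cost grows with the search buffer
            k = 0
            while k < max_len and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decompress(triples) -> str:
    buf = []
    for off, length, ch in triples:
        for _ in range(length):
            buf.append(buf[-off])          # copy from `off` characters back
        buf.append(ch)
    return "".join(buf)

text = "#i#have#a#dog&the#dog"
for size in (4, 16):
    triples = lz77_compress(text, search_size=size)
    print(size, len(triples), lz77_decompress(triples) == text)
```

Comparing the number of triples for the two buffer sizes gives a feel for the trade-off on a given input.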
To ensure that the translation sequence is obtained efficiently and quickly at the conference site, a smaller search buffer is required, but a smaller search buffer lowers the compression efficiency. The invention therefore sets two search buffers, a character search buffer and a phrase search buffer. The character search buffer works on single characters and achieves efficient compression by exploiting the locality of characters that are close together in the translation; the phrase search buffer works on phrases formed by several characters and achieves efficient compression by exploiting the locality of characters that are far apart in the translation. In this way the translation sequence can be obtained efficiently and quickly at the conference site.
1. Set the size of the sliding window and the size of the search buffer, specifically: the size of the sliding window is set equal to the first preset length, and the size of the search buffer within the sliding window is set equal to the second preset length. To ensure that translations can be sent quickly, in this embodiment the first preset length is 16 and the second preset length is 8, as can be seen from the example in step 3. In other embodiments, the practitioner may set the first preset length and the second preset length as desired, but must ensure that the first preset length is greater than the second preset length.
2. Obtain the sliding window and the compressed character sequence: pad the front of the translation sequence with a second preset length of empty characters, take the sequence formed by the first first-preset-length characters of the padded translation sequence as the sliding window, and set an empty sequence as the compressed character sequence.
3. Obtain the character search buffer and the look-ahead buffer in the sliding window, specifically: the sequence formed by the first second-preset-length characters of the sliding window is taken as the character search buffer, and the sequence formed by the remaining characters at the rear of the sliding window is taken as the look-ahead buffer. For example, for the padded translation sequence "$$$$$$$$#i#have#a#dog&the#dog", the first sliding window obtained is "$, $, $, $, $, $, $, $, #, i, #, h, a, v, e, #", the character search buffer in the sliding window is "$, $, $, $, $, $, $, $", and the look-ahead buffer in the sliding window is "#, i, #, h, a, v, e, #"; this embodiment uses "$" to denote an empty character.
4. Obtain the phrase search buffer according to the compressed character sequence, specifically: a sequence formed by all the English letters between two symbols in the compressed character sequence is recorded as a phrase, and each symbol in the compressed character sequence is also recorded as a phrase, the symbols being of two kinds, the first preset symbol and the second preset symbol; all phrases obtained from the compressed character sequence are arranged in order, and the sequence formed by the last phrases, as many as the second preset length, is taken as the phrase search buffer.
5. Search for the maximum match of the look-ahead buffer, first in the phrase search buffer and otherwise in the character search buffer, and obtain the output result and the sliding amount according to the maximum match, specifically:
(1) Obtain all character strings of the look-ahead buffer, specifically: divide the look-ahead buffer into subsequences at the symbols, and record each subsequence and each symbol as a character string, thereby obtaining all character strings of the look-ahead buffer; the symbols are of two kinds, the first preset symbol and the second preset symbol. For example, the character strings of the look-ahead buffer "#, i, #, h, a, v, e, #" are "#", "i", "#", "have" and "#".
(2) Judge whether the phrase search buffer contains a phrase identical to the 1st character string of the look-ahead buffer. There are two cases:
Case 1: the phrase search buffer contains no phrase identical to the 1st character string of the look-ahead buffer; step (3) is performed.
Case 2: the phrase search buffer contains a phrase identical to the 1st character string of the look-ahead buffer; this phrase is denoted as the i-th phrase of the phrase search buffer. If there exists a largest integer z, limited by the number s of character strings in the look-ahead buffer, such that the phrases from the i-th phrase to the (i+z)-th phrase of the phrase search buffer are identical, one by one, to the character strings from the 1st to the (1+z)-th character string of the look-ahead buffer, then the sequence formed by the 1st to the (1+z)-th character strings of the look-ahead buffer is recorded as the maximum match. The sequence number i of the matching phrase in the phrase search buffer is recorded as the offset, and the integer z is taken as the matching length; the pair consisting of the offset and the matching length is taken as the output result; the sum of the lengths of the 1st to the (1+z)-th character strings of the look-ahead buffer is taken as the sliding amount; step 6 is performed.
(3) Judge whether the character search buffer contains a character identical to the 1st character of the look-ahead buffer. There are two cases:
Case 1: the character search buffer contains no character identical to the 1st character of the look-ahead buffer; the 1st character of the look-ahead buffer is taken as the maximum match, the maximum match is taken as the output result, and the number of characters in the maximum match is taken as the sliding amount; step 6 is performed.
Case 2: the character search buffer contains a character identical to the 1st character of the look-ahead buffer; this character is denoted as the j-th character of the character search buffer. If there exists a largest integer r such that the characters from the j-th character to the (j+r)-th character of the character search buffer are identical, one by one, to the characters from the 1st to the (1+r)-th character of the look-ahead buffer, then the sequence formed by the 1st to the (1+r)-th characters of the look-ahead buffer is recorded as the maximum match. The sequence number j of the matching character in the character search buffer is recorded as the offset, and the integer r is taken as the matching length; the pair consisting of the offset and the matching length is taken as the output result; the matching length is taken as the sliding amount; step 6 is performed.
6. Obtain a new sliding window according to the sliding amount, specifically: slide the sliding window to the right by a length equal to the sliding amount, and take the window after sliding as the new sliding window; re-execute from step 3 with the new sliding window until the size of the new sliding window is smaller than the second preset length.
The sequence formed by all the output results, in order, is recorded as the compressed sequence of the translation sequence.
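The window handling in steps 2, 3 and 6 can be sketched as a small driver loop. The names are hypothetical, and the sizes 16 and 8 are taken from the example in step 3. To keep the sketch short, the matcher is a placeholder that always emits one literal character; a real matcher would apply the phrase and character searches of step 5. The loop stops as soon as the look-ahead buffer is empty, which is an assumed reading of the stopping rule in step 6.

```python
PAD = "$"          # empty character
WINDOW_LEN = 16    # first preset length (from the example in step 3)
SEARCH_LEN = 8     # second preset length

def literal_step(search: str, lookahead: str):
    """Placeholder matcher: emit the first look-ahead character, slide by 1."""
    return lookahead[0], 1

def run(seq: str, step=literal_step):
    """Slide the window over the padded sequence; return outputs and windows."""
    padded = PAD * SEARCH_LEN + seq
    start, outputs, windows = 0, [], []
    while True:
        window = padded[start:start + WINDOW_LEN]
        search, lookahead = window[:SEARCH_LEN], window[SEARCH_LEN:]
        if not lookahead:              # nothing left to compress
            break
        windows.append((search, lookahead))
        out, amount = step(search, lookahead)
        outputs.append(out)
        start += amount                # slide right by the sliding amount
    return outputs, windows

outputs, windows = run("#i#have#a#dog&the#dog")
print(windows[0])
print(windows[1])
```

The first window splits into eight empty characters and the look-ahead "#i#have#", matching the example in step 3; after sliding by one, the search buffer ends in "#".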
As shown in Table 1, compressing the translation sequence "#i#have#a#dog&the#dog" according to the above steps yields the compressed sequence "#, i, 7, 1, h, a, v, e, 6, 1, 5, 1, 4, 1, d, o, g, &, t, h, e, 7, 2".
TABLE 1
In this embodiment, setting a smaller search buffer for the LZ77 compression algorithm keeps the execution time of the compression algorithm short; setting both a character search buffer and a phrase search buffer lets the compression algorithm exploit the locality of characters that are close together in the translation as well as the locality of characters that are far apart. As a result, the translation is compressed with a short execution time and high compression efficiency, which resolves the conflict in the LZ77 compression algorithm whereby the size of the search buffer affects execution time and compression efficiency in opposite directions, achieves efficient storage of translation files, and ensures that the translation sequence can be obtained quickly and efficiently at the conference site.
S003, coding the compressed sequence according to binary sequences corresponding to all coding objects to obtain a coding sequence.
The compressed sequence obtained in step S002 consists of characters and pairs of an offset and a matching length. The characters comprise the first preset symbol, the second preset symbol and all English letters, while the offset and the matching length are, in essence, integers within a fixed range bounded by the second preset length. The objects to be encoded therefore comprise all characters and all integers in that range. Null characters carry no semantics and serve only to keep the buffer complete, so they need not be encoded and are not considered.
The invention encodes the compressed sequence with binary sequences, each consisting of several 0s and 1s. The specific encoding process is as follows:
1. Obtain the binary sequences corresponding to all coding objects.
All integers in the range are decimal data, so the binary data of each integer is obtained as follows: because the number of distinct integers is fixed, the base-2 logarithm of that number, rounded up, gives the number of bits of binary data needed to represent any integer.
There are $w = 28$ characters in total (the first preset symbol, the second preset symbol and the 26 English letters), so $\lceil \log_2 w \rceil = 5$ bits of binary data suffice to represent any character.
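The bit counts above can be checked with a few lines of Python. This is only a sketch: the helper name `bits_needed` is hypothetical, and the symbols `#` and `&` stand in for the first and second preset symbols as in the worked example.

```python
import string

def bits_needed(kinds: int) -> int:
    """Smallest number of bits that can distinguish `kinds` different values.

    Equivalent to ceil(log2(kinds)) for kinds >= 2, computed with integers only.
    """
    return max(1, (kinds - 1).bit_length())

# '#' and '&' are assumed stand-ins for the two preset symbols.
ALPHABET = "#&" + string.ascii_lowercase
w = len(ALPHABET)
char_bits = bits_needed(w)
print(w, char_bits)  # 28 5
```

Integer arithmetic is used instead of `math.log2` so that exact powers of two are not affected by floating-point rounding.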
Note that the binary data of an integer and the binary data of a character differ in bit length, and integers and characters are interleaved in the compressed sequence. Their binary data would therefore be interleaved in the coding sequence, and without further information the coding sequence could not be decoded. The invention therefore sets a distinguishing bit in the binary sequence corresponding to each integer and each character.
The integers comprise all offsets and matching lengths. In the compressed sequence, some offsets and matching lengths are obtained from the character search buffer and others from the phrase search buffer. To ensure that the coding sequence can be decoded accurately, an identification bit must also be set in the binary sequence corresponding to each integer.
In summary, short binary sequences of length $1 + \lceil \log_2 w \rceil$ represent all characters, and all characters in the compressed sequence are encoded with these short binary sequences. The short binary sequence of each character is set as follows: the 1st bit is the distinguishing bit and is set to 0; any $\lceil \log_2 w \rceil$-bit binary data is used as the remaining $\lceil \log_2 w \rceil$ bits. The short binary sequences corresponding to any two characters must differ.
Long binary sequences represent all integers, and all offsets and matching lengths in the compressed sequence are encoded with them. The length of a long binary sequence is the bit length of an integer's binary data plus 2. The long binary sequence of each integer is set as follows: the 1st bit is the distinguishing bit and is set to 1; the binary data of the integer occupies the bits from the 2nd bit to the second-to-last bit; the last bit is the identification bit, which is set to 1 if the offset and the matching length were obtained from the character search buffer and to 0 if they were obtained from the phrase search buffer.
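A minimal sketch of how the short and long binary sequences might be built is given below. The names `encode_char` and `encode_int`, the alphabet ordering, and the value `INT_BITS = 4` are hypothetical choices made only for illustration; the actual integer width follows from the integer range defined in the specification.

```python
import string

# Assumed alphabet: '#' and '&' stand in for the two preset symbols.
ALPHABET = "#&" + string.ascii_lowercase
CHAR_BITS = (len(ALPHABET) - 1).bit_length()  # 5 bits for 28 characters
INT_BITS = 4  # hypothetical width of an integer's binary data

def encode_char(ch: str) -> str:
    """Short binary sequence: distinguishing bit 0, then the character's index."""
    return "0" + format(ALPHABET.index(ch), f"0{CHAR_BITS}b")

def encode_int(value: int, from_char_buffer: bool) -> str:
    """Long binary sequence: distinguishing bit 1, the integer, then the identification bit."""
    if not 0 <= value < 2 ** INT_BITS:
        raise ValueError("integer does not fit in INT_BITS bits")
    ident = "1" if from_char_buffer else "0"  # 1: character buffer, 0: phrase buffer
    return "1" + format(value, f"0{INT_BITS}b") + ident
```

For example, `encode_char("a")` gives `"000010"` and `encode_int(7, True)` gives `"101111"`. Because every sequence starts with its distinguishing bit, a decoder can tell from the first bit how many bits to read next.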
Note that the correspondence between all coding objects and their binary sequences must be stored at both the terminal of the conference site and the translation terminal.
2. Encode the compressed sequence according to the binary sequences corresponding to all coding objects to obtain the coding sequence.
Obtain the short binary sequences corresponding to all characters in the compressed sequence and the long binary sequences corresponding to all offsets and matching lengths. The sequence formed by all of these short and long binary sequences, in order, is recorded as the coding sequence, which completes the encoding of the compressed sequence.
S004, the coding sequence is sent to a terminal on the conference site, and a translation sequence is obtained by decoding and decompressing the coding sequence.
The coding sequence obtained by the translation terminal is sent to a terminal of the conference site in real time, and the terminal of the conference site obtains the translation sequence by decoding and decompressing the coding sequence, so that the conference site can obtain the translation sequence quickly and efficiently; splitting all English words according to all first preset symbols in the translated text sequence, splitting all sentences according to all second preset symbols in the translated text sequence, recording the split sequence as a translation file, processing the translation file into audio by a terminal on a conference site, and transmitting the translation file and the audio to all participants.
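The final splitting of the recovered translation sequence into sentences and words can be sketched as follows. The function name `restore_translation` is hypothetical, and `#` and `&` are assumed to be the first and second preset symbols, as suggested by the worked example "#i#have#a#dog&the#dog".

```python
def restore_translation(sequence: str, word_mark: str = "#", sentence_mark: str = "&"):
    """Split a translation sequence into sentences, each a list of words."""
    sentences = []
    for chunk in sequence.split(sentence_mark):
        # The leading word mark produces an empty piece, which is dropped.
        words = [word for word in chunk.split(word_mark) if word]
        if words:
            sentences.append(words)
    return sentences
```

With these assumptions, `restore_translation("#i#have#a#dog&the#dog")` returns `[["i", "have", "a", "dog"], ["the", "dog"]]`.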
1. Decode the coding sequence according to the correspondence between coding objects and binary sequences to obtain the compressed sequence.
(1) Obtain the 1st bit of the coding sequence. In the correspondence between coding objects and binary sequences, the 1st bit of both the short binary sequence of a character and the long binary sequence of an integer is the distinguishing bit. Therefore, if the 1st bit of the coding sequence is 0, the leading bits of the coding sequence, as many as the length of a short binary sequence, are taken as a short binary sequence, and the corresponding character is looked up in the correspondence. If the 1st bit is 1, the leading bits of the coding sequence, as many as the length of a long binary sequence, are taken as a long binary sequence, and the corresponding integer is looked up in the correspondence.
(2) Remove the obtained short or long binary sequence from the coding sequence to obtain a new coding sequence.
(3) Re-execute from step (1) with the new coding sequence until the new coding sequence is empty.
The sequence formed by all obtained characters and integers, in order, is recorded as the compressed sequence.
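The decoding loop of steps (1) to (3) might look like the sketch below. The alphabet ordering, the width `INT_BITS = 4` and the function name `decode` are assumptions for illustration; a moving position pointer replaces the repeated removal of the leading binary sequence, which is equivalent.

```python
import string

# Assumed code table, shared by encoder and decoder.
ALPHABET = "#&" + string.ascii_lowercase
CHAR_BITS = (len(ALPHABET) - 1).bit_length()  # 5
INT_BITS = 4  # hypothetical width of an integer's binary data

def decode(bits: str) -> list:
    """Turn a coding sequence (string of '0'/'1') back into a compressed sequence.

    Characters are returned as str; integers as (value, from_char_buffer) tuples.
    """
    out = []
    pos = 0
    while pos < len(bits):
        if bits[pos] == "0":
            # Distinguishing bit 0: a short binary sequence holding a character.
            end = pos + 1 + CHAR_BITS
            out.append(ALPHABET[int(bits[pos + 1:end], 2)])
        else:
            # Distinguishing bit 1: a long binary sequence holding an integer.
            end = pos + 1 + INT_BITS + 1
            value = int(bits[pos + 1:end - 1], 2)
            from_char_buffer = bits[end - 1] == "1"  # identification bit
            out.append((value, from_char_buffer))
        pos = end  # equivalent to removing the decoded prefix
    return out
```

Under these assumptions, `decode("000000" + "001010" + "101111" + "100011")` returns `['#', 'i', (7, True), (1, True)]`. The sketch does not validate truncated or malformed input.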
2. Decompressing the compressed sequence to obtain a translation sequence.
(1) Set an empty sequence as the translation sequence, and take the sequence formed by the leading characters of the translation sequence, as many as the first preset length, as the sliding window.
(2) Take the sequence formed by the leading characters of the sliding window, as many as the second preset length, as the character search buffer, and the sequence formed by the remaining characters at the rear of the sliding window as the look-ahead buffer.
(3) A sequence formed by all English letters between two symbols in the translation sequence obtained so far is recorded as a phrase, and each symbol is also recorded as a phrase; the symbols are of two kinds, the first preset symbol and the second preset symbol. All phrases obtained are arranged in order, and the sequence formed by the last phrases, as many as the second preset length, is used as the phrase search buffer.
(4) For the v-th datum in the compressed sequence: if it is a character, append the character to the end of the translation sequence. If it is an integer, take this integer and the adjacent integer to its right as a pair, with the first integer as the offset p and the second integer as the matching length q. If the identification bit in the long binary sequences corresponding to the offset and the matching length is 0, append to the end of the translation sequence the character string formed by the p-th to (p+q)-th phrases in the phrase search buffer, and take the length of that character string as the sliding amount. If the identification bit is 1, append to the end of the translation sequence the character string formed by the p-th to (p+q)-th characters in the character search buffer, and take the matching length q as the sliding amount.
(5) Slide the sliding window to the right by a length equal to the sliding amount, and take the result as the new sliding window. Re-execute from step (2) with the new sliding window until the new sliding window is smaller than the second preset length. The translation sequence at that point is the decompression result of the compressed sequence.
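As a hedged illustration of sub-step (3), the phrase search buffer could be derived from the already recovered text as sketched below. The function name `phrase_search_buffer`, the symbols `#` and `&`, the restriction to lowercase letters, and the decision to ignore a trailing letter run that is not yet closed by a symbol are all assumptions of this sketch.

```python
import re

# A phrase is either a single symbol, or a run of letters that is
# followed by a symbol (so an unfinished trailing word is not a phrase).
_PHRASE = re.compile(r"[#&]|[a-z]+(?=[#&])")

def phrase_search_buffer(text: str, size: int) -> list:
    """Return the last `size` phrases of `text` (size is assumed positive)."""
    phrases = _PHRASE.findall(text)
    return phrases[-size:]
```

For instance, `phrase_search_buffer("#i#have#a#dog&", 3)` returns `['#', 'dog', '&']`, while `phrase_search_buffer("#i#have", 8)` returns `['#', 'i', '#']` because "have" is not yet followed by a symbol.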
In summary, the invention keeps the execution time of the compression algorithm short by setting a smaller search buffer for the LZ77 compression algorithm. By setting both a character search buffer and a phrase search buffer, the algorithm exploits the locality of characters at short distances in the translation as well as the locality of characters at longer distances. Compression of the translation is therefore both fast and efficient, which resolves the conflict in the LZ77 compression algorithm between the effect of the search buffer size on execution time and its effect on compression efficiency, achieves efficient storage of translation files, and ensures that the translation sequence can be obtained quickly and efficiently at the conference site.
It should be noted that: the sequence of the embodiments of the present invention is only for description, and does not represent the advantages and disadvantages of the embodiments. The processes depicted in the accompanying drawings do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments.
Claims (8)
1. The method for efficiently storing big data based on the translation file is characterized by comprising the following steps:
obtaining a translation file, and preprocessing the translation file to obtain a translation sequence;
compressing the translation sequence in combination with the character search buffer and the phrase search buffer to obtain a compressed sequence, comprising:
S1, padding the front of the translation sequence with null characters numbering the second preset length, taking the sequence formed by the first characters of the translation sequence, numbering the first preset length, as a sliding window, and setting an empty sequence as the compressed character sequence;
S2, taking the sequence formed by the first characters in the sliding window, numbering the second preset length, as the character search buffer, and taking the sequence formed by the remaining characters in the sliding window as the look-ahead buffer;
S3, obtaining the phrase search buffer according to the compressed character sequence;
S401, obtaining all character strings of the look-ahead buffer, and judging whether the phrase search buffer contains a phrase identical to the 1st character string of the look-ahead buffer; if not, executing S402; if so, searching the phrase search buffer to obtain the maximum matching item of the look-ahead buffer, obtaining the output result and the sliding amount according to the maximum matching item, and executing S5;
S402, searching the character search buffer to obtain the maximum matching item of the look-ahead buffer, obtaining the output result and the sliding amount according to the maximum matching item, and executing S5;
S5, sliding the sliding window according to the sliding amount and taking the sliding window after sliding as the new sliding window; re-executing from S2 with the new sliding window until the size of the new sliding window is smaller than the second preset length; recording the sequence formed by all output results, in order, as the compressed sequence of the translation sequence;
obtaining binary sequences corresponding to all the coding objects, coding the compressed sequences according to the binary sequences corresponding to all the coding objects to obtain coding sequences, temporarily storing the obtained coding sequences, and sending the obtained coding sequences to a terminal on a conference site.
2. The method for efficiently storing big data based on a translation file according to claim 1, wherein the step of preprocessing the translation file to obtain a translation sequence comprises the following specific steps:
the method comprises the steps of taking a first preset symbol as an identifier of the end of an English word, taking a second preset symbol as an identifier of the end of a sentence, replacing the identifiers of the end of all English words in a translation file with the first preset symbol, replacing the identifiers of the end of all sentences in the translation file with the second preset symbol, adding a first preset symbol at the beginning of the translation file, taking the first preset symbol, the second preset symbol and all English letters as characters, and recording a sequence formed by all the characters contained in the translation file according to a sequence as a translation sequence.
3. The method for efficiently storing big data based on a translation file according to claim 1, wherein the step of obtaining a phrase search buffer according to a compressed character sequence comprises the following specific steps:
a sequence formed by all English letters between two symbols in the compressed character sequence is recorded as a phrase, and each symbol in the compressed character sequence is also recorded as a phrase, wherein the symbols are of two kinds, namely the first preset symbol and the second preset symbol; all phrases obtained from the compressed character sequence are arranged in order, and the sequence formed by the last phrases, numbering the second preset length, is used as the phrase search buffer.
4. The method for efficiently storing big data based on a translation file according to claim 1, wherein the step of obtaining all the strings in the look-ahead buffer comprises the following specific steps:
dividing the look-ahead buffer into a plurality of subsequences according to symbols, marking each subsequence and each symbol as a character string, and obtaining all character strings of the look-ahead buffer, wherein the symbols comprise two kinds of symbols, namely a first preset symbol and a second preset symbol.
5. The method for efficiently storing big data based on a translation file according to claim 1, wherein the searching in the phrase searching buffer area obtains the largest matching item of the look-ahead buffer area, and the outputting result and the sliding amount are obtained according to the largest matching item, comprising the following specific steps:
if the phrase search buffer contains a phrase identical to the 1st character string of the look-ahead buffer, obtaining that phrase, which is the i-th phrase in the phrase search buffer; if there exists an integer z, bounded by the number s of character strings in the look-ahead buffer, such that all phrases from the i-th phrase to the (i+z)-th phrase in the phrase search buffer are identical to all character strings from the 1st character string to the (1+z)-th character string of the look-ahead buffer, recording the sequence formed by the 1st to (1+z)-th character strings of the look-ahead buffer as the maximum matching item; recording the sequence number i of the matching phrase as the offset and taking the integer z as the matching length; taking the pair consisting of the offset and the matching length as the output result; and taking the sum of the lengths of the 1st to (1+z)-th character strings of the look-ahead buffer as the sliding amount.
6. The method for efficiently storing big data based on a translation file according to claim 1, wherein searching the character search buffer to obtain the maximum matching item of the look-ahead buffer and obtaining the output result and the sliding amount according to the maximum matching item comprises the following specific steps:
judging whether the character search buffer contains a character identical to the 1st character of the look-ahead buffer:
if the character search buffer contains no character identical to the 1st character of the look-ahead buffer, taking the 1st character of the look-ahead buffer as the maximum matching item, taking the maximum matching item as the output result, and taking the number of characters in the maximum matching item as the sliding amount;
if the character search buffer contains a character identical to the 1st character of the look-ahead buffer, obtaining that character, which is the j-th character in the character search buffer; if there exists an integer r, bounded in terms of the first preset length and the second preset length, such that all characters from the j-th character to the (j+r)-th character in the character search buffer are identical to all characters from the 1st character to the (1+r)-th character of the look-ahead buffer, recording the sequence formed by the 1st to (1+r)-th characters of the look-ahead buffer as the maximum matching item; recording the sequence number j of the matching character as the offset and taking the integer r as the matching length; taking the pair consisting of the offset and the matching length as the output result; and taking the matching length as the sliding amount.
7. The method for efficiently storing big data based on a translation file according to claim 1, wherein the step of obtaining binary sequences corresponding to all the encoded objects comprises the following specific steps:
recording the first preset symbol, the second preset symbol and all English letters as characters, and recording all characters and all integers in the range bounded by the second preset length as coding objects;
representing all characters with short binary sequences of length $1 + \lceil \log_2 w \rceil$, wherein w represents the number of kinds of all characters, the short binary sequence corresponding to each character being set as follows: the 1st bit in the short binary sequence is recorded as the distinguishing bit and set to 0; any $\lceil \log_2 w \rceil$-bit binary data is used as the remaining $\lceil \log_2 w \rceil$ bits of the short binary sequence, and the short binary sequences corresponding to any two characters are different;
representing all integers in the range bounded by the second preset length with long binary sequences, the long binary sequence corresponding to each integer being set as follows: the 1st bit in the long binary sequence is recorded as the distinguishing bit and set to 1; the binary data corresponding to the integer occupies the bits from the 2nd bit to the second-to-last bit of the long binary sequence; the last bit in the long binary sequence is recorded as the identification bit, which is set to 1 if the offset and the matching length are obtained from the character search buffer and to 0 if they are obtained from the phrase search buffer.
8. The method for efficiently storing big data based on a translation file according to claim 7, wherein the encoding the compressed sequence according to the binary sequences corresponding to all the encoding objects to obtain the encoded sequence comprises the following specific steps:
and obtaining short binary sequences corresponding to all characters in the compressed sequence and long binary sequences corresponding to all offsets and matching lengths, and recording the sequences formed by all the obtained short binary sequences and all the long binary sequences according to the sequence as a coding sequence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310300380.XA CN116032292B (en) | 2023-03-27 | 2023-03-27 | Efficient big data storage method based on translation file |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116032292A (en) | 2023-04-28 |
CN116032292B (en) | 2023-06-09 |
Family
ID=86089448
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310300380.XA Active CN116032292B (en) | 2023-03-27 | 2023-03-27 | Efficient big data storage method based on translation file |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116032292B (en) |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1992002989A1 (en) * | 1990-08-09 | 1992-02-20 | Telcor Systems Corporation | Compounds adaptive data compression system |
US20020140583A1 (en) * | 2000-12-22 | 2002-10-03 | Cilys 53 Inc. | System and method for compressing and decompressing data in real time |
JP2007257188A (en) * | 2006-03-22 | 2007-10-04 | Casio Comput Co Ltd | Dictionary search device and its control program |
JP2013162474A (en) * | 2012-02-08 | 2013-08-19 | Tamura Seisakusho Co Ltd | Data compression method and device |
CN202931289U (en) * | 2012-11-14 | 2013-05-08 | 无锡芯响电子科技有限公司 | Hardware LZ 77 compression implement system |
US20170373702A1 (en) * | 2016-06-22 | 2017-12-28 | Fujitsu Limited | Data compression device and data decompression device |
US20180102789A1 (en) * | 2016-10-06 | 2018-04-12 | Fujitsu Limited | Computer-readable recording medium, encoding apparatus, and encoding method |
CN108768403A (en) * | 2018-05-30 | 2018-11-06 | 中国人民解放军战略支援部队信息工程大学 | Lossless data compression, decompressing method based on LZW and LZW encoders, decoder |
US20220121770A1 (en) * | 2020-10-19 | 2022-04-21 | Duality Technologies, Inc. | Efficient secure string search using homomorphic encryption |
CN114567331A (en) * | 2022-01-29 | 2022-05-31 | 山东云海国创云计算装备产业创新中心有限公司 | LZ 77-based compression method, device and medium thereof |
Non-Patent Citations (2)
Title |
---|
D. R. VASANTHI, R. ANUSHA AND B. K. VINAY: "Implementation of Robust Compression Technique Using LZ77 Algorithm on Tensilica\'s Xtensa Processor", 《2016 INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY (ICIT)》, pages 148 - 153 * |
满天星: "改进的LZ系列压缩文本上的搜索算法", 《中国优秀硕士学位论文全文数据库信息科技辑》, pages 138 - 353 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116683916A (en) * | 2023-08-03 | 2023-09-01 | 山东五棵松电气科技有限公司 | Disaster recovery system of data center |
CN116683916B (en) * | 2023-08-03 | 2023-10-10 | 山东五棵松电气科技有限公司 | Disaster recovery system of data center |
Also Published As
Publication number | Publication date |
---|---|
CN116032292B (en) | 2023-06-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||