CN114048710A - Text compression method, text decompression method, text compression device, text decompression device, computer equipment and storage medium - Google Patents

Text compression method, text decompression method, text compression device, text decompression device, computer equipment and storage medium

Info

Publication number
CN114048710A
CN114048710A
Authority
CN
China
Prior art keywords
text
word
compressed
target
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111355474.4A
Other languages
Chinese (zh)
Inventor
黄泼
刘知胜
罗桦槟
肖佳威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Storlead Technology Co ltd
Original Assignee
Shenzhen Storlead Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Storlead Technology Co ltd filed Critical Shenzhen Storlead Technology Co ltd
Priority to CN202111355474.4A
Publication of CN114048710A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a text compression method, a text decompression method, a text compression device, a text decompression device, a computer device and a storage medium. The method comprises the following steps: performing text preprocessing on a text to be compressed to obtain word vectors of a plurality of target segmented words, and aggregating the word vectors of the target segmented words into a semantic vector corresponding to the whole text, the semantic vector containing the meaning the text expresses. Compressing the text by means of its semantic vector yields a more compact representation, and the compression ratio is greatly improved compared with compressing the text according to character frequency and simple arrangement rules.

Description

Text compression method, text decompression method, text compression device, text decompression device, computer equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for compressing and decompressing text, a computer device, and a storage medium.
Background
Traditional compression algorithms fall mainly into statistical coding and dictionary coding. Statistical coding mainly exploits the uneven statistical frequency of symbols and the repetition of context characters, while dictionary coding mainly exploits the surface frequency and arrangement information of characters. When text is compressed with either of these two encoding approaches, however, a compressed file with a high compression ratio cannot be obtained.
Disclosure of Invention
In order to solve the technical problem, the application provides a text compression method, a text decompression method, a text compression device, a text decompression device, a computer device and a storage medium.
In a first aspect, the present application provides a text compression method, including:
performing text preprocessing on a text to be compressed to obtain word vectors of a plurality of target segmented words;
generating a semantic vector corresponding to the text to be compressed based on the word vectors of the target segmented words;
and compressing the semantic vector corresponding to the text to be compressed to generate a compressed text.
In a second aspect, the present application provides a text decompression method, including:
performing vector decoding processing on a text to be decompressed to obtain a semantic vector corresponding to the text to be decompressed;
dividing the semantic vector corresponding to the text to be decompressed to generate a plurality of sub-vectors;
and performing numbering decoding processing on each sub-vector to generate a decompressed text.
In a third aspect, the present application provides a text compression apparatus comprising:
the preprocessing module is used for performing text preprocessing on a text to be compressed to obtain word vectors of a plurality of target segmented words;
the generating module is used for generating a semantic vector corresponding to the text to be compressed based on the word vectors of the target segmented words;
and the compression module is used for compressing the semantic vector corresponding to the text to be compressed to generate a compressed text.
In a fourth aspect, the present application provides a text decompression apparatus, comprising:
the first decoding module is used for carrying out vector decoding processing on a text to be decompressed to obtain a semantic vector corresponding to the text to be decompressed;
the dividing module is used for dividing the semantic vector corresponding to the text to be decompressed to generate a plurality of sub-vectors;
and the second decoding module is used for carrying out numbering decoding processing on each sub-vector to generate a decompressed text.
In a fifth aspect, the present application provides a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
performing text preprocessing on a text to be compressed to obtain word vectors of a plurality of target segmented words;
generating a semantic vector corresponding to the text to be compressed based on the word vectors of the target segmented words;
and compressing the semantic vector corresponding to the text to be compressed to generate a compressed text.
In a sixth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
performing text preprocessing on a text to be compressed to obtain word vectors of a plurality of target segmented words;
generating a semantic vector corresponding to the text to be compressed based on the word vectors of the target segmented words;
and compressing the semantic vector corresponding to the text to be compressed to generate a compressed text.
Based on the above text compression method, text preprocessing is performed on the text to be compressed to obtain word vectors of a plurality of target segmented words, and the word vectors of the target segmented words are then aggregated into a semantic vector corresponding to the whole text, the semantic vector containing the meaning the text expresses. Compressing with the semantic vector of the text yields a more compact representation, and compared with compressing the text according to character frequency and simple arrangement rules, the compression ratio can be greatly improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below; it is obvious that those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a flow diagram illustrating a method for text compression in one embodiment;
FIG. 2 is a diagram illustrating an iterative learning process for semantic vectors according to an embodiment;
FIG. 3 is a schematic diagram illustrating an embodiment of a process for superimposing probabilities of occurrence;
FIG. 4 is a flowchart illustrating a text decompression method according to an embodiment;
FIG. 5 is a flowchart illustrating a text decompression method according to an embodiment;
FIG. 6 is a block diagram of program modules of the text compression apparatus in one embodiment;
FIG. 7 is a block diagram of program modules of the text decompression apparatus according to an embodiment;
FIG. 8 is a block diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In an embodiment, fig. 1 is a flowchart illustrating a text compression method; referring to fig. 1, a text compression method is provided. This embodiment is mainly illustrated by applying the method to the server 120. The text compression method specifically includes the following steps:
step S210, text preprocessing is carried out on the text to be compressed to obtain word vectors of a plurality of target word segments.
Specifically, the text to be compressed refers to text represented by characters that has not yet been compressed. The text to be compressed includes a plurality of target segmented words, and each target segmented word is converted into a word vector by word embedding. A word vector comprises a plurality of vector values, one per vector dimension; in this embodiment, the vector dimension of a word vector is 256, i.e., one segmented word is represented by 256 vector values. Using word vectors to represent words greatly reduces the amount of data to be calculated and stored.
In general, there are two methods of converting a segmented word into a word vector. The first calculates the probability that two words appear simultaneously in a large corpus and maps words that frequently appear together to nearby positions in a vector space. The second predicts the likely adjacent words from one word or several words, and the word vector corresponding to a word is learned naturally during this prediction.
Pretrained open-source word vectors, such as those provided by the gensim library, may also be used.
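As an illustrative sketch of the second, prediction-based method, the following Python snippet trains a skip-gram word2vec model with the open-source gensim library. The toy corpus and all hyperparameters except the 256-dimensional vectors are assumptions made for illustration:

```python
# Minimal sketch: learn 256-dimensional word vectors by predicting
# neighbouring words (skip-gram), as described above. Assumes gensim 4.x.
from gensim.models import Word2Vec

corpus = [
    ["a", "word", "embedding", "based", "chinese", "text", "compression", "algorithm"],
    ["text", "compression", "uses", "semantic", "vectors"],
]  # toy corpus; in practice a large (domain-specific) corpus is assumed

model = Word2Vec(
    sentences=corpus,
    vector_size=256,  # the embodiment represents one segmented word by 256 values
    window=5,         # context window size (assumed)
    min_count=1,      # keep every word of the toy corpus
    sg=1,             # 1 = skip-gram: predict adjacent words from a word
)

vec = model.wv["compression"]  # the learned word vector
print(vec.shape)               # (256,)
```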
Step S220, generating a semantic vector corresponding to the text to be compressed based on the word vectors of the target segmented words.
Specifically, the semantic vector contains the meaning the text expresses. The relevance and similarity between words in the text to be compressed can be captured through the semantic vector, which facilitates subsequent processing such as keyword search or text recommendation based on the semantic vector.
Step S230, performing compression processing on the semantic vector corresponding to the text to be compressed to generate a compressed text.
Specifically, compressing the text by means of its semantic vector yields a more compact representation; compared with compressing the text according to character frequency and simple arrangement rules, the compression ratio can be greatly improved.
In one embodiment, performing text preprocessing on the text to be compressed to obtain word vectors of a plurality of target segmented words includes: constructing a coding dictionary based on a plurality of segmented words in a word embedding table; performing word segmentation processing on the text to be compressed based on the coding dictionary to obtain a plurality of target segmented words, wherein each target segmented word carries a corresponding word code; and determining the word vector corresponding to each target segmented word based on the word embedding table.
The word embedding table includes W segmented words and the word vector corresponding to each segmented word, for example [word_1: (0.234, 0.252, …, 0.234); word_2: (0.254, 0.227, …, 0.284); …; word_W: (0.256, 0.297, …, 0.384)], where word_1 and word_2 each correspond to one segmented word. A coding dictionary is constructed from the segmented words in the word embedding table; the coding dictionary contains all target segmented words in the text to be compressed. The total length of the coding dictionary is N, i.e., the coding dictionary contains N segmented words, and each segmented word in it corresponds to one word code, for example {text: 1, compression: 2, based on: 3, Chinese: 4, us: 5, one kind: 6, system: 7, word: 8, embedding: 9, …, recurrent neural network: 11, …, algorithm: 13, …}.
Specifically, word segmentation processing may be performed on the text to be compressed with an open-source tokenizer based on the coding dictionary; the tokenizer may be a Hanlp or jieba tokenizer. For example, the text to be compressed is "a word-embedding-based Chinese text compression algorithm", and segmenting it yields "one kind | Chinese | text | compression | algorithm | based on | word | embedding", i.e., a plurality of target segmented words, any two of which are separated by a delimiter. The word code corresponding to each target segmented word can then be found in the coding dictionary; for example, the word code corresponding to "one kind" is 6 and the word code corresponding to "based on" is 3, and so on for every target segmented word.
The word vector corresponding to each target segmented word is looked up in the word embedding table, and the representation of the text is generated from the word vectors and the word codes of the target segmented words. The whole text to be compressed corresponds to a numbering sequence S = (S1, S2, …, SM), where S1 and S2 correspond to different target segmented words. Based on the text to be compressed in the above example, S1 is the matrix number of the first target segmented word in the text, and the value at that matrix number is that word's word code, i.e., S1 = 6; SM likewise indicates the M-th (last) target segmented word. Stacking the word vectors of the M target segmented words gives the encoding matrix corresponding to the text to be compressed, an M × 256 matrix; in this embodiment, based on the text to be compressed in the above example, M = 9.
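A minimal sketch of this preprocessing step is shown below; jieba is one of the tokenizers named above, while coding_dict and embedding_table are hypothetical stand-ins for the coding dictionary and word embedding table (real word vectors would come from a trained embedding table, not random values):

```python
# Sketch: segment the text, look up word codes, and build the M x 256 matrix.
import numpy as np
import jieba

text = "一种基于词嵌入的中文文本压缩算法"  # "a word-embedding-based Chinese text compression algorithm"
tokens = jieba.lcut(text)                 # the target segmented words

# Hypothetical coding dictionary and embedding table built from the tokens.
coding_dict = {w: i + 1 for i, w in enumerate(dict.fromkeys(tokens))}
embedding_table = {w: np.random.rand(256).astype(np.float32) for w in coding_dict}

word_codes = [coding_dict[w] for w in tokens]                     # S = (S1, ..., SM)
encoding_matrix = np.stack([embedding_table[w] for w in tokens])  # M x 256 matrix
print(word_codes, encoding_matrix.shape)
```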
In one embodiment, generating a semantic vector corresponding to the text to be compressed based on the word vectors of the plurality of target segmented words includes: dividing the target segmented words into word segments; generating a semantic vector corresponding to each word segment based on the word vectors of the target segmented words in the word segment; and generating the semantic vector corresponding to the text to be compressed based on the semantic vectors corresponding to the word segments.
Specifically, the plurality of target segmented words is divided into a plurality of word segments according to a designated number: each word segment contains the designated number X of target segmented words, i.e., one division is made for every X target segmented words, so the number of word segments obtained is M / X. Learning and training on the word vectors of the target segmented words in each word segment yields a semantic vector representing that word segment, which amounts to compressing the X word vectors of the X target segmented words into one semantic vector that can express those X segmented words; in terms of quantity, X word vectors are compressed into one semantic vector. Combining all the word segments of the text to be compressed then produces the semantic vector corresponding to the text, i.e., the semantic vector of the text to be compressed is composed of the semantic vectors corresponding to the D word segments. Because the word embedding table captures the semantic information of the text, the semantic vector is more concise, and by training a word embedding table for a specified domain, the compression ratio for text in that domain can be adaptively improved.
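The segment division itself reduces to simple chunking. The sketch below groups word vectors X at a time; X = 3 is an assumed value:

```python
# Sketch: group M word vectors into word segments of X consecutive vectors,
# giving D = M / X word segments.
def split_into_segments(word_vectors, X):
    return [word_vectors[i:i + X] for i in range(0, len(word_vectors), X)]

segments = split_into_segments(list(range(9)), X=3)  # stand-in for 9 word vectors
print(segments)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]] -> D = 3 word segments
```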
In one embodiment, generating a semantic vector corresponding to each word segment based on the word vectors of the target segmented words in the word segment includes: determining the semantic vector of an earlier target segmented word in the word segment; and determining the semantic vector of a later target segmented word in the word segment based on the semantic vector of the earlier target segmented word.
Specifically, referring to fig. 2, iterative learning is performed on the word vectors of the X target segmented words in each word segment. The earlier target segmented word and the later target segmented word are two adjacent segmented words in the word segment, with the earlier one positioned first, i.e., the matrix number of the earlier segmented word is smaller than that of the later one. In fig. 2, x_(t-1) indicates the semantic vector of the earlier segmented word relative to x_t, and x_t indicates the semantic vector of the later segmented word relative to x_(t-1). After the earlier segmented word is learned, its semantic vector is obtained; when the later segmented word is learned, iterative learning is performed in combination with the semantic vector of the earlier segmented word to generate the semantic vector of the later one. This can be implemented with an LSTM network model or a GRU model.
In the learning of each later target segmented word, the learning results of all earlier target segmented words are combined, until the last segmented word in the word segment has been learned and the semantic vector corresponding to the word segment is generated; that is, the iterative learning process associates the semantic vectors of all target segmented words in the word segment, yielding a semantic vector that contains the semantic information of those segmented words. The semantic vector corresponding to the text to be compressed is generated from the semantic vectors corresponding to the word segments and is denoted C = (C1, C2, …, Ci, …, CD), where Ci indicates the semantic vector corresponding to the i-th word segment in the text to be compressed. Ci comprises l vector strings, each containing b digits, i.e., Ci is a vector of l × b digits; for example, with b = 3 and l = 3, Ci = (123, 245, 356). For the D word segments of the text to be compressed, this gives a vector composed of D × l × b digits as the semantic vector corresponding to the text.
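As an illustrative sketch of this iterative learning, the following PyTorch snippet uses a GRU, one of the two model families named above. The hidden state after the last of the X steps carries the semantics of all earlier segmented words and serves as the word segment's semantic vector; the weights are untrained placeholders:

```python
# Sketch: encode one word segment's X word vectors into a semantic vector.
import torch
import torch.nn as nn

X, dim = 3, 256  # X segmented words per word segment, 256-dimensional vectors
gru = nn.GRU(input_size=dim, hidden_size=dim, batch_first=True)

segment = torch.randn(1, X, dim)  # stand-in for one word segment's word vectors
outputs, h_n = gru(segment)       # h_n: hidden state after the last word

semantic_vector = h_n.squeeze(0)  # the word segment's semantic vector C_i
print(semantic_vector.shape)      # torch.Size([1, 256])
```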
In one embodiment, compressing the semantic vector corresponding to the text to be compressed to generate a compressed text includes: determining the occurrence probability of each character in the semantic vector corresponding to each word segment; sorting the occurrence probabilities in descending order by value to generate a descending probability table; successively superimposing a later occurrence probability on an earlier occurrence probability based on the descending probability table to obtain a superposition value; when the superposition value reaches a preset value, marking the later occurrence probability and the earlier occurrence probability respectively with preset characters; generating a compressed code corresponding to each word segment based on the preset characters corresponding to the occurrence probabilities involved in the word segment; and generating the compressed text based on the compressed codes corresponding to the word segments.
Specifically, the occurrence probability of each character in the semantic vectors corresponding to the word segments is counted. Since each semantic vector consists of l × b numeric characters, the frequency of each digit is counted across the semantic vectors corresponding to all word segments, and the digits are sorted in descending order of occurrence probability to obtain the descending probability table, which therefore contains each digit and its occurrence probability. For example, suppose the text to be compressed is split into 6 word segments and their semantic vectors, each semantic vector being composed of digits from 0 to 9; the descending probability table generated from the 10 digits is shown below:
Digit  Occurrence probability
2      0.4
1      0.2
3      0.1
4      0.1
6      0.06
5      0.04
0      0.04
9      0.03
7      0.02
8      0.01
Here, the later occurrence probability and the earlier occurrence probability are the two smallest values among all occurrence probabilities, with the earlier one greater than or equal to the later one. Referring to the descending probability table, 0.01 and 0.02 are the two smallest probability values; 0.01 is the later occurrence probability relative to 0.02, and 0.02 is the earlier occurrence probability relative to 0.01. Superimposing them gives the superposition value 0.01 + 0.02 = 0.03, which is taken as a new occurrence probability. The two smallest values are then selected for addition from the occurrence probabilities not yet involved in the operation together with the new occurrence probability, i.e., from (0.03, 0.03, 0.04, 0.04, 0.06, 0.1, 0.1, 0.2, 0.4), giving 0.03 + 0.03 = 0.06. With 0.06 as the new occurrence probability, the two smallest values in (0.04, 0.04, 0.06, 0.06, 0.1, 0.1, 0.2, 0.4) are reselected and added to obtain 0.04 + 0.04 = 0.08, and so on, until the superposition value reaches the preset value, which is usually set to 1, thereby forming the tree structure shown in fig. 3.
In the tree structure formed by the occurrence probabilities, each occurrence probability corresponds to a node. Following the tree structure downward, each node forks into child nodes, i.e., forking a parent node yields its child nodes, and each child node is marked: either the left child of a fork is marked 0 and the right child 1, or the left child is marked 1 and the right child 0.
As shown in fig. 3, nodes on the left side of a fork in the tree structure are marked 0 and nodes on the right side are marked 1. The compressed code corresponding to the digit 7 consists of the mark 00000 of its parent nodes and its own mark 0, i.e., 000000; the compressed code corresponding to the digit 8 consists of the parent mark 00000 and its own mark 1, i.e., 000001. Similarly, the compressed code of the digit 9 is 00001, of 6 is 0001, of 3 is 001, of 0 is 01000, of 5 is 01001, of 4 is 0101, of 1 is 011, and of 2 is 1. The compressed text is generated based on the compressed codes corresponding to the digits.
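The superposition-and-marking procedure above is Huffman coding. The sketch below reproduces it in Python using the occurrence probabilities from the table, repeatedly merging the two smallest probabilities and prefixing one branch with 1 and the other with 0; the exact bit patterns depend on how ties are broken, so they may differ from fig. 3 while remaining equally short on average:

```python
# Sketch: build the compressed codes by repeatedly superimposing the two
# smallest occurrence probabilities (Huffman's algorithm).
import heapq
from itertools import count

probs = {"2": 0.4, "1": 0.2, "3": 0.1, "4": 0.1, "6": 0.06,
         "5": 0.04, "0": 0.04, "9": 0.03, "7": 0.02, "8": 0.01}

tie = count()  # tie-breaker so heapq never compares the code dicts
heap = [(p, next(tie), {digit: ""}) for digit, p in probs.items()]
heapq.heapify(heap)

while len(heap) > 1:
    p_later, _, later = heapq.heappop(heap)      # smallest ("later") probability
    p_earlier, _, earlier = heapq.heappop(heap)  # next smallest ("earlier")
    for d in later:                              # mark one branch 1 ...
        later[d] = "1" + later[d]
    for d in earlier:                            # ... and the other branch 0
        earlier[d] = "0" + earlier[d]
    heapq.heappush(heap, (p_later + p_earlier, next(tie), {**later, **earlier}))

codes = heap[0][2]
print(codes["2"])  # the most frequent digit gets the 1-bit code "1"
print(codes["8"])  # the least frequent digit gets a 6-bit code
```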
In this way, the semantic vectors corresponding to the text to be compressed are compressed: characters with a higher occurrence probability are represented with fewer bits, and characters with a lower occurrence probability are represented with more bits. The number of representing bits thus shrinks for part of the text and grows for another part; because the occurrences whose representations shrink outnumber those whose representations grow, the data volume of the compressed text is smaller than the data volume of the text to be compressed.
In one embodiment, referring to fig. 4, a text decompression method is provided, comprising:
step S310, carrying out vector decoding processing on the text to be decompressed to obtain a semantic vector corresponding to the text to be decompressed.
In this embodiment, the text to be decompressed refers to text that has undergone compression processing. The text to be decompressed is formed of a plurality of compressed codes, the compressed codes being binary strings of 0s and 1s. Vector decoding processing is performed on the text to be decompressed based on the Huffman coding algorithm: the binary strings of 0s and 1s are converted into a digit sequence of length D × l × b, and this digit sequence indicates the semantic vector corresponding to the text to be decompressed.
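A minimal sketch of this vector-decoding step: walking the bit string through the inverse of the code table built at compression time recovers the digit sequence. The codes fragment below is illustrative:

```python
# Sketch: prefix-code (Huffman) decoding of a 0/1 string into digits.
def huffman_decode(bits: str, codes: dict) -> str:
    inverse = {code: ch for ch, code in codes.items()}
    digits, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:          # a complete codeword has been read
            digits.append(inverse[buf])
            buf = ""
    return "".join(digits)

codes = {"1": "011", "2": "1", "3": "001"}  # illustrative fragment of the table
print(huffman_decode("1011001", codes))     # -> "213"
```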
Step S320, dividing the semantic vector corresponding to the text to be decompressed to generate a plurality of sub-vectors.
In this embodiment, the semantic vector corresponding to the text to be decompressed is divided, i.e., the digit sequence is split into D sub-vectors, each sub-vector being the semantic vector corresponding to one word segment. That is, the semantic vector corresponding to the text to be decompressed is C = (C1, C2, …, Ci, …, CD), where Ci indicates the semantic vector corresponding to the i-th word segment; Ci comprises l vector strings, each containing b digits, i.e., Ci is a vector of l × b digits. For example, with b = 3 and l = 3, Ci = (123, 245, 356).
And step S330, performing numbering decoding processing on each sub-vector to generate a decompressed text.
In this embodiment, referring to fig. 5, numbering decoding processing is performed on each sub-vector: the sub-vector corresponding to a word segment is input into a recurrent neural network and converted into the word codes of a plurality of segmented words. Since each word segment contains X segmented words, numbering decoding of the sub-vector corresponding to one word segment yields X word codes, and the segmented word corresponding to each word code is looked up in the coding dictionary; for example, for the word code 5, the segmented word found in the coding dictionary is "I". Decoding and converting the sub-vectors corresponding to the D word segments in this way yields the plurality of word segments corresponding to the text to be decompressed, i.e., decompression generates the decompressed text.
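The following hedged sketch shows the shape of this numbering-decoding step. The embodiment specifies only "a recurrent neural network", so the GRU decoder, the linear read-out layer, the dictionary size N, and the dictionary fragment are all assumptions, and with untrained weights the printed output illustrates data flow only:

```python
# Sketch: expand one sub-vector into X word codes, then look the codes up
# in the (inverse) coding dictionary to recover the segmented words.
import torch
import torch.nn as nn

X, dim, N = 3, 256, 14        # segment size, vector dimension, dictionary size (assumed)
decoder = nn.GRU(input_size=dim, hidden_size=dim, batch_first=True)
to_codes = nn.Linear(dim, N)  # hidden state -> scores over the N word codes

sub_vector = torch.randn(1, 1, dim)           # semantic vector C_i of one word segment
steps = sub_vector.repeat(1, X, 1)            # fed at each of the X decoding steps
hidden, _ = decoder(steps)
word_codes = to_codes(hidden).argmax(dim=-1)  # X predicted word codes

inverse_dict = {5: "我", 6: "一种"}            # hypothetical fragment: word code -> word
print([inverse_dict.get(int(c), "?") for c in word_codes[0]])
```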
Figs. 1 and 4 are schematic flow diagrams of a text compression method and a text decompression method in one embodiment. It should be understood that although the steps in the flowcharts of figs. 1 and 4 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in figs. 1 and 4 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 6, there is provided a text compression apparatus including:
the preprocessing module 410 is configured to perform text preprocessing on a text to be compressed to obtain word vectors of a plurality of target segmented words;
a generating module 420, configured to generate a semantic vector corresponding to the text to be compressed based on the word vectors of the target segmented words;
and the compressing module 430 is configured to perform compression processing on the semantic vector corresponding to the text to be compressed to generate a compressed text.
In one embodiment, the preprocessing module 410 is further configured to:
constructing a coding dictionary based on a plurality of segmented words in a word embedding table; the word embedding table comprises a plurality of segmented words and the word vector corresponding to each segmented word, and each segmented word in the coding dictionary corresponds to one word code;
performing word segmentation processing on the text to be compressed based on the coding dictionary to obtain a plurality of target segmented words; wherein each target segmented word carries a corresponding word code;
and determining the word vector corresponding to each target segmented word based on the word embedding table.
In one embodiment, the generating module 420 is further configured to:
dividing the target segmented words into word segments; wherein each word segment comprises at least two consecutive target segmented words;
generating a semantic vector corresponding to each word segment based on the word vectors of the target segmented words in the word segment;
and generating the semantic vector corresponding to the text to be compressed based on the semantic vectors corresponding to the word segments.
In one embodiment, the generating module 420 is further configured to:
determining the semantic vector of an earlier target segmented word in the word segment;
and determining the semantic vector of a later target segmented word in the word segment based on the semantic vector of the earlier target segmented word.
In one embodiment, the compression module 430 is further configured to:
determining the occurrence probability of each character in the semantic vector corresponding to each word segment;
sorting the occurrence probabilities in descending order by value to generate a descending probability table;
successively superimposing a later occurrence probability on an earlier occurrence probability based on the descending probability table to obtain a superposition value;
when the superposition value reaches a preset value, marking the later occurrence probability and the earlier occurrence probability respectively with preset characters; generating a compressed code corresponding to each word segment based on the preset characters corresponding to the occurrence probabilities involved in the word segment;
and generating the compressed text based on the compressed codes corresponding to the word segments.
In one embodiment, referring to fig. 7, a text decompression apparatus is provided, the apparatus comprising:
the first decoding module 510 is configured to perform vector decoding processing on a text to be decompressed to obtain a semantic vector corresponding to the text to be decompressed;
a dividing module 520, configured to divide the semantic vector corresponding to the text to be decompressed to generate a plurality of sub-vectors;
and a second decoding module 530, configured to perform numbering decoding processing on each of the sub-vectors to generate a decompressed text.
FIG. 8 is a diagram illustrating an internal structure of a computer device in one embodiment. The computer device may be a server. As shown in fig. 8, the computer device includes a processor, a memory, a network interface, an input device, and a display screen connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the text compression and decompression methods. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the text compression and decompression methods. Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, the text compression and decompression apparatuses provided in the present application may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in fig. 8. The memory of the computer device may store the program modules constituting the text compression and decompression apparatuses, such as the preprocessing module 410, the generating module 420, and the compression module 430 shown in fig. 6. These program modules constitute a computer program that causes the processor to execute the steps of the text compression and decompression methods of the embodiments of the present application described in this specification.
The computer device shown in fig. 8 can perform text preprocessing on the text to be compressed through the preprocessing module 410 of the text compression apparatus shown in fig. 6 to obtain word vectors of a plurality of target segmented words. The computer device can generate, through the generating module 420, the semantic vector corresponding to the text to be compressed based on the word vectors of the plurality of target segmented words. The computer device can compress the semantic vector corresponding to the text to be compressed through the compression module 430 to generate a compressed text.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method according to any of the above embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the method of any of the above embodiments.
It will be understood by those skilled in the art that all or part of the processes for implementing the methods of the embodiments described above may be implemented by instructing the relevant hardware through a computer program, which may be stored in a non-volatile computer-readable storage medium and which, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It is noted that, herein, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing is merely exemplary of the invention, which can be understood and carried into effect by those skilled in the art. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of text compression, the method comprising:
performing text preprocessing on a text to be compressed to obtain word vectors of a plurality of target segmented words;
generating a semantic vector corresponding to the text to be compressed based on the word vectors of the target segmented words;
and compressing the semantic vector corresponding to the text to be compressed to generate a compressed text.
2. The method of claim 1, wherein the performing text preprocessing on the text to be compressed to obtain word vectors of a plurality of target segmented words comprises:
constructing a coding dictionary based on a plurality of segmented words in a word embedding table; wherein the word embedding table comprises a plurality of segmented words and the word vector corresponding to each segmented word, and each segmented word in the coding dictionary corresponds to one word code;
performing word segmentation processing on the text to be compressed based on the coding dictionary to obtain a plurality of target segmented words; wherein each target segmented word carries a corresponding word code;
and determining the word vector corresponding to each target segmented word based on the word embedding table.
3. The method according to claim 2, wherein the generating a semantic vector corresponding to the text to be compressed based on the word vectors of the plurality of target segmented words comprises:
dividing the target segmented words into word segments; wherein each word segment comprises at least two consecutive target segmented words;
generating a semantic vector corresponding to each word segment based on the word vectors of the target segmented words in the word segment;
and generating the semantic vector corresponding to the text to be compressed based on the semantic vectors corresponding to the word segments.
4. The method of claim 3, wherein the generating a semantic vector corresponding to each word segment based on the word vectors of the target segmented words in the word segment comprises:
determining the semantic vector of an earlier target segmented word in the word segment;
and determining the semantic vector of a later target segmented word in the word segment based on the semantic vector of the earlier target segmented word.
5. The method according to claim 3, wherein the compressing the semantic vector corresponding to the text to be compressed to generate a compressed text comprises:
determining the occurrence probability of each character in the semantic vector corresponding to each word segment;
sorting the occurrence probabilities in descending order by value to generate a descending probability table;
successively superimposing a later occurrence probability on an earlier occurrence probability based on the descending probability table to obtain a superposition value;
when the superposition value reaches a preset value, marking the later occurrence probability and the earlier occurrence probability respectively with preset characters;
generating a compressed code corresponding to each word segment based on the preset characters corresponding to the occurrence probabilities involved in the word segment;
and generating the compressed text based on the compressed codes corresponding to the word segments.
6. A text decompression method, the method comprising:
performing vector decoding processing on a text to be decompressed to obtain a semantic vector corresponding to the text to be decompressed;
dividing the semantic vector corresponding to the text to be decompressed to generate a plurality of sub-vectors;
and carrying out numbering decoding processing on each sub-vector to generate a decompressed text.
7. A text compression apparatus, the apparatus comprising:
the preprocessing module is used for performing text preprocessing on a text to be compressed to obtain word vectors of a plurality of target segmented words;
the generating module is used for generating a semantic vector corresponding to the text to be compressed based on the word vectors of the target segmented words;
and the compression module is used for compressing the semantic vector corresponding to the text to be compressed to generate a compressed text.
8. A text decompression apparatus, characterized in that the apparatus comprises:
the first decoding module is used for carrying out vector decoding processing on a text to be decompressed to obtain a semantic vector corresponding to the text to be decompressed;
the dividing module is used for dividing the semantic vector corresponding to the text to be decompressed to generate a plurality of sub-vectors;
and the second decoding module is used for carrying out numbering decoding processing on each sub-vector to generate a decompressed text.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 6 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202111355474.4A 2021-11-16 2021-11-16 Text compression method, text decompression method, text compression device, text decompression device, computer equipment and storage medium Pending CN114048710A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111355474.4A CN114048710A (en) 2021-11-16 2021-11-16 Text compression method, text decompression method, text compression device, text decompression device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111355474.4A CN114048710A (en) 2021-11-16 2021-11-16 Text compression method, text decompression method, text compression device, text decompression device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114048710A (en) 2022-02-15

Family

ID=80209376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111355474.4A Pending CN114048710A (en) 2021-11-16 2021-11-16 Text compression method, text decompression method, text compression device, text decompression device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114048710A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115310409A (en) * 2022-06-29 2022-11-08 杭州似然数据有限公司 Data encoding method, system, electronic device and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination