CN117240304B

CN117240304B - Electronic invoice data processing method and system

Info

Publication number: CN117240304B
Application number: CN202311490708.5A
Authority: CN
Inventors: 李洪波; 石文博; 米杰; 毛伟
Original assignee: Hunan Zhongsi Information Technology Co ltd
Current assignee: Hunan Zhongsi Information Technology Co ltd
Priority date: 2023-11-10
Filing date: 2023-11-10
Publication date: 2024-01-26
Anticipated expiration: 2043-11-10
Also published as: CN117240304A

Abstract

The invention relates to the technical field of data processing, in particular to a method and a system for processing electronic invoice data, comprising the following steps: acquiring electronic invoice data, and obtaining codes corresponding to each type of characters in the character sequence according to the electronic invoice data; obtaining a risk priority sequence according to risk coefficients of codes corresponding to any two types of characters in the character sequence, obtaining a character-code mapping sequence according to codes corresponding to each type of characters in the character sequence, and obtaining a compression loss rate according to the risk priority sequence and the character-code mapping sequence; and obtaining a final character-code mapping sequence according to the compression loss rate, and carrying out code compression on the electronic invoice data according to the final character-code mapping sequence. The invention analyzes the codes of the characters in the data and adjusts the codes through the risk priority of the codes, thereby achieving the beneficial effects of reducing coding errors and reducing data loss in the data coding process.

Description

Electronic invoice data processing method and system

Technical Field

The invention relates to the technical field of data processing, in particular to a method and a system for processing electronic invoice data.

Background

With the rapid development of electronic commerce and digital technology, electronic invoices are gradually replacing traditional paper invoices and becoming the standard form of record for transactions, purchases and sales. Because the number of electronic invoices is large, transmitting them not only takes time but may also incur a large communication cost. An electronic invoice contains key information about the purchaser, comprising a large amount of numeric and text information, and its security and integrity must be ensured during transmission and storage. The data therefore needs to be compressed, so that the possibility of data loss during transmission and storage is reduced.

A conventional Huffman code is a prefix code: no codeword may be a prefix of any other codeword. This property is the basis of Huffman coding, but it also leads to error propagation problems in the Huffman coding process. In the process of compressing electronic invoice data, if a single-bit error occurs, a series of errors can follow. Electronic invoice data contains important information such as invoice numbers, purchase prices and dates, and errors in it can seriously affect the validity and completeness of the invoice.

Disclosure of Invention

In order to solve the problems, the invention provides a method and a system for processing electronic invoice data.

The invention discloses a method and a system for processing electronic invoice data, which adopts the following technical scheme:

one embodiment of the invention provides an electronic invoice data processing method, which comprises the following steps:

collecting electronic invoice data, wherein the electronic invoice data comprises a plurality of characters;

obtaining a character sequence according to the electronic invoice data, and obtaining codes corresponding to each type of characters in the character sequence according to the character sequence;

obtaining the types of all binary character combinations in the electronic invoice data according to the electronic invoice data, obtaining a first character sequence according to the types of all binary character combinations in the electronic invoice data, and obtaining risk coefficients of codes corresponding to any two types of characters in the character sequence according to the character sequence, the codes corresponding to each type of characters in the character sequence and the first character sequence;

obtaining a risk priority sequence according to risk coefficients of codes corresponding to any two types of characters in the character sequence, obtaining a character-code mapping sequence according to codes corresponding to each type of characters in the character sequence, wherein the character-code mapping sequence is a two-dimensional sequence, the first dimension is different characters in the character sequence, the second dimension is codes corresponding to the characters, obtaining a first character-code mapping sequence according to the risk priority sequence and the character-code mapping sequence, and obtaining a compression loss rate after adjustment of the mapping sequence according to the code length difference of the corresponding characters in the character-code mapping sequence and the first character-code mapping sequence;

and obtaining a final character-code mapping sequence according to the compression loss rate after the mapping sequence is adjusted, and carrying out code compression on the electronic invoice data according to the final character-code mapping sequence.

Further, the step of obtaining a character sequence according to the electronic invoice data and obtaining codes corresponding to each type of characters in the character sequence according to the character sequence comprises the following specific steps:

traversing all characters in the electronic invoice data from left to right to obtain the types of all characters in the electronic invoice data, counting the frequency corresponding to each type of character, and sorting all types of characters in the electronic invoice data by their corresponding frequency from large to small, the sorted result being recorded as the character sequence;

and constructing a Huffman tree by utilizing a Huffman coding algorithm according to the character sequence, and obtaining codes corresponding to each type of characters in the character sequence according to the Huffman tree.

Further, the obtaining the types of all binary character combinations in the electronic invoice data according to the electronic invoice data, and obtaining the first character sequence according to the types of all binary character combinations in the electronic invoice data, includes the following specific steps:

traversing the electronic invoice data from left to right and acquiring the binary character combination formed by each character and its nearest neighbor character on the right, to obtain all binary character combinations in the electronic invoice data; counting the frequency corresponding to each type of binary character combination in the electronic invoice data; removing the binary character combinations whose frequency is 1; and arranging all remaining binary character combinations in the order in which they were acquired from the electronic invoice data to obtain a binary character combination sequence, which is recorded as the first character sequence.

Further, the risk coefficient of the codes corresponding to any two types of characters in the character sequence is obtained according to the character sequence, the codes corresponding to each type of characters in the character sequence and the first character sequence, and the specific steps are as follows:

any two types of characters in the character sequence are acquired and respectively recorded as a first type of character and a second type of character, codes corresponding to the first type of character are acquired, and codes corresponding to the second type of character are acquired;

In the formula, $a_i$ is the i-th bit of the code corresponding to the first type of character; $b_i$ is the i-th bit of the code corresponding to the second type of character; $|\cdot|$ denotes taking the absolute value; $L_1$ is the length of the code corresponding to the first type of character, the length of a code being the number of bits it contains; $L_2$ is the length of the code corresponding to the second type of character; $n$ is obtained by taking the minimum of $L_1$ and $L_2$; $f_{12}$ is the frequency of occurrence, in the first character sequence, of the binary character combination consisting of the first type of character and the second type of character; $f_1$ is the frequency of occurrence of the first type of character in the character sequence; $f_2$ is the frequency of occurrence of the second type of character in the character sequence; $\exp(\cdot)$ is the exponential function with the natural constant as its base; and $R$ is the risk coefficient of the codes corresponding to the first type of character and the second type of character.

Further, the risk priority sequence is obtained according to the risk coefficient of the corresponding codes of any two types of characters in the character sequence, and the specific steps are as follows:

the risk coefficients between the code corresponding to the first type of character and the codes corresponding to the other characters in the character sequence are obtained, and their average value is used as the risk priority of the code corresponding to the first type of character; the risk priority of the code corresponding to each type of character in the character sequence is obtained in the same way, and the risk priorities are sorted from large to small to obtain the risk priority sequence.

Further, the step of obtaining the first character-code mapping sequence according to the risk priority sequence and the character-code mapping sequence includes the following specific steps:

marking any code in the risk priority sequence as a target code;

In the formula, $N$ is the total number of codes corresponding to the characters in the character-code mapping sequence; $k$ is the index of the target code in the risk priority sequence; $\beta$ is the preset base; $s$ is the moving step of the target code; $\lceil\cdot\rceil$ denotes rounding up; and $\log_{\beta}(\cdot)$ denotes the logarithm with base $\beta$;

the character positions in the character-code mapping sequence are kept fixed; the code corresponding to the first risk priority in the risk priority sequence is obtained and recorded as the first code, and the first code is shifted to the right by $s_1$ positions in the character-code mapping sequence, where $s_1$ is the moving step of the first code; the code corresponding to the second risk priority in the risk priority sequence is obtained and recorded as the second code, and the second code is shifted to the right by $s_2$ positions in the character-code mapping sequence, where $s_2$ is the moving step of the second code; this continues until the code corresponding to each risk priority in the risk priority sequence has been shifted to the right in the character-code mapping sequence, and the adjusted character-code mapping sequence finally obtained is recorded as the first character-code mapping sequence.
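The right-shift adjustment can be illustrated with a minimal Python sketch. The step formula is not reproduced in this text, so the moving steps are simply passed in as inputs; the function name `shift_codes_right`, the choice to clamp a shift at the end of the sequence, and the choice to let the codes in between slide one position to the left are all assumptions made for illustration, not details confirmed by the description.

```python
def shift_codes_right(chars, codes, priority_codes, steps):
    """Shift codes to the right while character positions stay fixed.

    chars          -- character types, ordered by frequency (largest first)
    codes          -- codes currently paired with chars, position by position
    priority_codes -- codes in risk-priority order (highest risk first)
    steps          -- moving step for each entry of priority_codes (assumed given)
    """
    codes = list(codes)  # work on a copy
    for code, step in zip(priority_codes, steps):
        pos = codes.index(code)
        # Assumption: a shift that would run past the end is clamped.
        new_pos = min(pos + step, len(codes) - 1)
        # Assumption: removing and re-inserting lets the codes in between
        # slide one position to the left.
        codes.insert(new_pos, codes.pop(pos))
    return list(zip(chars, codes))


adjusted = shift_codes_right(
    ["a", "b", "c", "d"],
    ["0", "10", "110", "111"],
    priority_codes=["0", "10"],
    steps=[2, 1],
)
```

In this sketch a code that moves to the right ends up paired with a lower-frequency character, which matches the stated aim of making high-risk codes occur less often.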

Further, the method for obtaining the compression loss rate after the mapping sequence adjustment according to the coding length difference of the corresponding characters in the character-coding mapping sequence and the first character-coding mapping sequence comprises the following specific steps:

In the formula, $l_j$ is the length of the code corresponding to the j-th character in the character-code mapping sequence; $l'_j$ is the length of the code corresponding to the j-th character in the first character-code mapping sequence; $p_j$ is the frequency corresponding to the j-th character in the character-code mapping sequence; $m$ is the total number of character types in the character-code mapping sequence; $|\cdot|$ denotes taking the absolute value; and $\eta$ is the compression loss rate after the mapping sequence is adjusted.
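As a hedged illustration of how these quantities could be combined, the sketch below computes a frequency-weighted sum of absolute code length differences. The exact formula is not reproduced in this text, so this particular combination (and any normalization the original may apply) is an assumption; the function name `compression_loss_rate` is hypothetical.

```python
def compression_loss_rate(original, adjusted, freq):
    """Frequency-weighted absolute change in code length (assumed form).

    original -- dict: character -> code before adjustment
    adjusted -- dict: character -> code after adjustment
    freq     -- dict: character -> relative frequency of the character
    """
    return sum(
        freq[ch] * abs(len(original[ch]) - len(adjusted[ch]))
        for ch in original
    )


loss = compression_loss_rate(
    {"a": "0", "b": "10", "c": "11"},
    {"a": "10", "b": "0", "c": "11"},
    {"a": 0.5, "b": 0.25, "c": 0.25},
)
```

Here swapping the codes of "a" and "b" changes each of their code lengths by one bit, so the weighted sum is 0.5 + 0.25.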

Further, the final character-code mapping sequence is obtained by adjusting the compression loss rate according to the mapping sequence, which comprises the following specific steps:

An empirical threshold is preset and recorded as $T$, and the compression loss rate after the mapping sequence is adjusted is recorded as $\eta$. When $\eta < T$, a second character-code mapping sequence is obtained according to the risk priority sequence and the first character-code mapping sequence, and a new compression loss rate after adjustment is obtained according to the code length difference of corresponding characters in the character-code mapping sequence and the second character-code mapping sequence and is recorded as the first compression loss rate. The first compression loss rate is compared with the preset empirical threshold; if it is still smaller than the preset empirical threshold, new mapping sequences and new compression loss rates continue to be obtained and compared with the preset empirical threshold until a new compression loss rate is greater than or equal to the preset empirical threshold, at which point the process stops, and the character-code mapping sequence corresponding to the previous compression loss rate when the stopping condition is met is recorded as the final character-code mapping sequence. When $\eta \geq T$, the first character-code mapping sequence is used as the final character-code mapping sequence.
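The stopping rule described above can be sketched as a small loop. The helper callables `adjust` (one round of risk-priority adjustment) and `loss_rate` (compression loss rate relative to the original mapping) are hypothetical placeholders supplied by the caller, and `max_rounds` is a safety guard added for this sketch only.

```python
def final_mapping(original, adjust, loss_rate, threshold, max_rounds=100):
    """Repeat the adjustment until the loss rate reaches the threshold.

    original  -- the unadjusted character-code mapping sequence
    adjust    -- callable: mapping -> next adjusted mapping (hypothetical)
    loss_rate -- callable: (original, mapping) -> float (hypothetical)
    threshold -- preset empirical threshold
    """
    current = adjust(original)  # first character-code mapping sequence
    if loss_rate(original, current) >= threshold:
        # First loss rate already reaches the threshold:
        # the first adjusted mapping is used as the final one.
        return current
    for _ in range(max_rounds):  # guard against endless loops (assumption)
        candidate = adjust(current)
        if loss_rate(original, candidate) >= threshold:
            # Stop and keep the mapping belonging to the previous loss rate.
            return current
        current = candidate
    return current


# Toy stand-ins: a "mapping" is an integer round counter here.
result = final_mapping(
    0,
    adjust=lambda m: m + 1,
    loss_rate=lambda orig, m: 0.1 * (m - orig),
    threshold=0.35,
)
```

With the toy stand-ins, the loss rate first reaches the threshold at round 4, so the mapping from round 3 is returned.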

Further, the electronic invoice data is encoded and compressed according to the final character-encoding mapping sequence, and the method comprises the following specific steps:

and coding each character in the electronic invoice data according to the codes corresponding to the characters in the final character-code mapping sequence to obtain electronic invoice coding data, and compressing the electronic invoice coding data by using an LZW algorithm.
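The two stages of this step, encoding with the final mapping and then applying LZW, can be sketched as follows. This is a textbook LZW over a bit string with a two-symbol initial dictionary; the description does not specify the dictionary layout, output format or any size limit, so those choices and the function names are assumptions for illustration.

```python
def encode(text, mapping):
    """Replace each character with its code from the final mapping."""
    return "".join(mapping[ch] for ch in text)


def lzw_compress(bits):
    """Textbook LZW over a string of '0'/'1' symbols; returns integer codes."""
    table = {"0": 0, "1": 1}
    out, w = [], ""
    for b in bits:
        if w + b in table:
            w += b
        else:
            out.append(table[w])
            table[w + b] = len(table)  # register the new phrase
            w = b
    if w:
        out.append(table[w])
    return out


def lzw_decompress(codes):
    """Inverse of lzw_compress, included to check the round trip."""
    if not codes:
        return ""
    table = {0: "0", 1: "1"}
    w = table[codes[0]]
    out = [w]
    for k in codes[1:]:
        # A code not yet in the table can only be w + first symbol of w.
        entry = table[k] if k in table else w + w[0]
        out.append(entry)
        table[len(table)] = w + entry[0]
        w = entry
    return "".join(out)


bits = encode("abacab", {"a": "0", "b": "10", "c": "11"})
packed = lzw_compress(bits)
```

Decompressing `packed` with `lzw_decompress` returns the original bit string, which can then be decoded with the same character-code mapping.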

The invention also provides an electronic invoice data processing system which comprises a memory and a processor, wherein the processor executes a computer program stored in the memory so as to realize the steps of the electronic invoice data processing method.

The technical scheme of the invention has the following beneficial effects. In conventional Huffman coding, some codes are similar to one another, so coding errors may occur while data is being encoded, and once a coding error occurs it may propagate through the whole coded sequence.

The invention obtains the risk coefficient of each code by calculating the difference between that code and all other codes, calculates the risk priority of each code according to its risk coefficients, and adjusts the codes according to their risk priorities so as to reduce the frequency with which high-risk codes occur in the data. This reduces the possibility of code errors, and hence the possibility of errors in the electronic invoice data during compression, achieving the beneficial effects of reducing coding errors and reducing data loss in the data coding process.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of steps of a method for processing electronic invoice data according to an embodiment of the present invention.

Detailed Description

In order to further describe the technical means and effects adopted by the invention to achieve the preset aim, the following detailed description refers to the specific implementation, structure, characteristics and effects of the electronic invoice data processing method and system according to the invention with reference to the attached drawings and the preferred embodiment. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

The following specifically describes a specific scheme of the electronic invoice data processing method provided by the invention with reference to the accompanying drawings.

Referring to fig. 1, a flowchart of steps of a method for processing electronic invoice data according to an embodiment of the present invention is shown, the method includes the following steps:

Step S001, collecting electronic invoice data.

It should be noted that this embodiment is an electronic invoice data processing method; before processing starts, the corresponding electronic invoice data needs to be collected. An electronic invoice refers to invoice information stored and transmitted in electronic form.

Specifically, an API connection is established with the electronic invoice platform, and the character data in the electronic invoice is obtained by calling the API and recorded as electronic invoice data, wherein the electronic invoice data comprises a plurality of characters. It should be noted that the electronic invoice data is one-dimensional data and includes, but is not limited to, the invoice code, invoice number, invoicing date, the taxpayer identification numbers, names, addresses and telephone numbers of the seller and the buyer, and other relevant information.

Thus, electronic invoice data is obtained.

Step S002, obtaining a character sequence according to the electronic invoice data, and obtaining codes corresponding to each type of characters in the character sequence according to the character sequence.

It should be noted that conventional Huffman coding is a variable-length coding method in which characters with higher frequency use shorter codes and characters with lower frequency use longer codes, so as to achieve a higher compression ratio. However, when a coding error occurs, i.e., one or more codewords are incorrectly decoded as other characters, all subsequent codes are misaligned, which creates an error propagation problem. The higher the frequency of a character, the greater the possibility that it is encoded in error. The leaf nodes of the Huffman tree therefore need to be adaptively adjusted according to the risk coefficient of each code, so that the higher the risk coefficient of a code, the lower the frequency of the character assigned to it.

It should be noted that, in electronic invoice data, some field names, such as the characters corresponding to "invoice", "number" and "person", appear many times, whereas specific personal names appear rarely. The data therefore contains some characters with higher frequency and some with lower frequency. The higher the frequency of a character, the shorter its code; the shorter the codes, the more similar they are, and the greater the possibility of coding errors. The electronic invoice data therefore needs to be traversed, and a conventional Huffman tree constructed according to the frequencies of the various characters in the electronic invoice data, so as to obtain all codes.

Specifically, a character sequence is obtained according to the electronic invoice data, and the character sequence is specifically as follows:

traversing all characters in the electronic invoice data from left to right to obtain the types of all characters in the electronic invoice data, counting the frequency corresponding to each type of character, and sorting all types of characters in the electronic invoice data by their corresponding frequency from large to small, the sorted result being recorded as the character sequence.

It should be noted that the frequency corresponding to each type of character in the character sequence may be obtained by dividing the number of occurrences of that type of character in the electronic invoice data by the total number of characters in the electronic invoice data, which is not described in detail in this embodiment.
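The construction of the character sequence can be sketched in a few lines of Python. The function name `character_sequence` and the return shape (a list of character and frequency pairs) are illustrative choices, not part of the described method.

```python
from collections import Counter


def character_sequence(invoice_text):
    """Return [(character, relative_frequency), ...], largest frequency first."""
    counts = Counter(invoice_text)  # occurrences of each character type
    total = len(invoice_text)       # total number of characters
    # sorted() is stable, so equally frequent characters keep the
    # left-to-right order in which they were first seen.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(ch, n / total) for ch, n in ranked]


seq = character_sequence("aabbbc")
```

For the toy input, "b" occurs three times out of six characters and therefore comes first with frequency 0.5.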

Further, the codes corresponding to each type of characters in the character sequence are obtained according to the character sequence, and the codes are specifically as follows:

A Huffman tree is constructed from the character sequence using the Huffman coding algorithm, and the codes corresponding to each type of character in the character sequence are obtained from the constructed Huffman tree.

So far, the codes corresponding to each type of characters in the character sequence are obtained.
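A minimal sketch of this step, using a standard heap-based Huffman construction, is given below. The tie-breaking rule and the assignment of "0" and "1" to the two branches are arbitrary choices of this sketch, so the exact bit patterns may differ from those of another Huffman implementation even though the code lengths are optimal.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Build Huffman codes for the characters of text; returns {char: code}."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:  # a single symbol still needs a one-bit code
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, unique tie-breaker, {char: partial code}).
    heap = [(n, i, {ch: ""}) for i, (ch, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)   # two lightest subtrees
        w1, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (w0 + w1, tie, merged))
        tie += 1
    return heap[0][2]


codes = huffman_codes("aaaabbc")
```

In the toy input the most frequent character "a" receives a one-bit code, while "b" and "c" receive two-bit codes.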

Step S003, obtaining the types of all binary character combinations in the electronic invoice data according to the electronic invoice data, obtaining a first character sequence according to the types of all binary character combinations in the electronic invoice data, and obtaining risk coefficients of codes corresponding to any two types of characters in the character sequence according to the character sequence, the codes corresponding to each type of characters in the character sequence and the first character sequence.

It should be noted that, in Huffman coding, due to its prefix-code property, a single-bit error in the encoded data during transmission may cause the entire subsequent sequence to be decoded incorrectly. For example, in electronic invoice data the characters "people" and "numbers" occur frequently, and the codes corresponding to them are very similar, differing in only one or two bits. If one of these codes suffers a one-bit error during transmission, for example a lost bit (a character with a longer code is more prone to losing bits during encoding), the code is misinterpreted as another character, which causes decoding errors throughout the subsequent sequence. This situation recurs during data transmission and the errors accumulate, forming error propagation. Therefore, the risk coefficients between codes need to be calculated, and each leaf node adaptively adjusted according to them, so that a code with a higher risk coefficient is assigned to a character with a lower frequency, reducing the possibility of coding errors.

It should be noted that, in the constructed Huffman tree, the closer two codes are to each other, the smaller the difference between them, the more similar they are, and the greater the possibility that one is encoded as the other, and thus the larger the risk coefficient of the two codes. The risk coefficient of two codes is therefore calculated from the differences at corresponding bit positions and the difference between the code lengths.

It should be further noted that electronic invoice data must satisfy certain format requirements, and the presence of fixed characters ensures the consistency and readability of the data. For example, the characters "send" and "ticket" usually appear together, so their frequencies of occurrence are close. The closer the frequencies of occurrence of two characters, the more similar their codes and the greater the probability of an error during encoding. Therefore, the frequency of occurrence of character combinations in the data, and the frequencies of occurrence of the single characters within each combination, need to be considered when calculating risk coefficients.

Specifically, the types of all binary character combinations in the electronic invoice data are obtained according to the electronic invoice data, and the first character sequence is obtained according to the types of all binary character combinations in the electronic invoice data, specifically as follows:

traversing the electronic invoice data from left to right and acquiring the binary character combination (namely, two adjacent characters) formed by each character and its nearest neighbor character on the right, to obtain all binary character combinations in the electronic invoice data; counting the frequency corresponding to each type of binary character combination in the electronic invoice data; removing the binary character combinations whose frequency is 1; and arranging all remaining binary character combinations in the order in which they were acquired from the electronic invoice data to obtain a binary character combination sequence, which is recorded as the first character sequence. It should be noted that the last character in the electronic invoice data has no nearest neighbor character on the right, so no binary character combination is acquired for it.
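The extraction of the first character sequence can be sketched as follows. The sketch keeps every occurrence of a retained combination (so that occurrence counts can later be read off the sequence) and interprets "frequency of 1" as a combination that occurs exactly once; both are readings of the description, and the function name is illustrative.

```python
from collections import Counter


def first_character_sequence(invoice_text):
    """Adjacent character pairs, in order, dropping pairs that occur only once."""
    # Each character paired with its right-hand neighbor; the last
    # character has no neighbor and therefore yields no pair.
    pairs = [invoice_text[i:i + 2] for i in range(len(invoice_text) - 1)]
    counts = Counter(pairs)
    return [p for p in pairs if counts[p] > 1]


bigrams = first_character_sequence("abcab")
```

For the toy input the pairs are "ab", "bc", "ca", "ab"; only "ab" occurs more than once, so the result contains its two occurrences.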

Further, according to the character sequence, the codes corresponding to the characters in the character sequence and the first character sequence, risk coefficients of codes corresponding to any two types of characters in the character sequence are obtained, specifically as follows:

any two types of characters in the character sequence are acquired and respectively recorded as a first type of character and a second type of character, codes corresponding to the first type of character are acquired, and codes corresponding to the second type of character are acquired.

In the formula, $a_i$ is the i-th bit of the code corresponding to the first type of character; $b_i$ is the i-th bit of the code corresponding to the second type of character; $|\cdot|$ denotes taking the absolute value; $L_1$ is the length of the code corresponding to the first type of character, the length of a code being the number of bits it contains; $L_2$ is the length of the code corresponding to the second type of character; $n$ is obtained by taking the minimum of $L_1$ and $L_2$; $f_{12}$ is the frequency of occurrence, in the first character sequence, of the binary character combination consisting of a preceding character of the first type and a following character of the second type; $f_1$ is the frequency of occurrence of the first type of character in the character sequence; $f_2$ is the frequency of occurrence of the second type of character in the character sequence; $\exp(\cdot)$ is the exponential function with the natural constant as its base; and $R$ is the risk coefficient of the codes corresponding to the first type of character and the second type of character. It should be noted that the frequency of occurrence of a character may be obtained by multiplying the frequency corresponding to the character by the total number of characters in the electronic invoice data, which is not described in detail in this embodiment; if the binary character combination corresponding to the two types of characters does not exist in the first character sequence, its frequency of occurrence is 0.

It should be noted that the sum of absolute bitwise differences $\sum_{i=1}^{n} |a_i - b_i|$ (taken over the first $n$ bits of the two codes, $n$ being the shorter code length) represents the difference between the codes corresponding to the first type of character and the second type of character over the same code length; the larger the difference, the larger the value of this sum. The term $|L_1 - L_2|$ represents the difference between the code length corresponding to the first type of character and the code length corresponding to the second type of character. The smaller the difference between the two codes over the same code length, and the smaller the difference between their code lengths, the larger the risk coefficient of the codes corresponding to the first type of character and the second type of character.

The larger the value of the term relating the frequency of occurrence of the binary character combination to the frequencies of occurrence of its two single characters, the closer the frequency of the binary character combination is to the average of the frequencies of the two single characters, and the higher the probability that the binary character combination is a fixed combination in the electronic invoice data; the codes of the two characters are then more similar, and the probability that the codes are in error is higher, that is, the risk coefficient is higher.

So far, the risk coefficients of the codes corresponding to any two types of characters in the character sequence are obtained.
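The qualitative behavior described above can be illustrated with a short Python sketch. The exact formula is not reproduced in this text, so the specific combination used here, namely `exp(-(bit difference + length difference)) * (1 + 2*f12/(f1 + f2))`, is purely an assumption chosen to match the stated monotonic properties (smaller code difference gives higher risk, larger combination frequency relative to the single-character frequencies gives higher risk); it should not be read as the patented formula.

```python
import math


def risk_coefficient(code1, code2, f12, f1, f2):
    """Illustrative risk coefficient for two codes (assumed functional form).

    code1, code2 -- bit strings such as "010"
    f12          -- occurrences of the pair (first char, second char)
    f1, f2       -- occurrences of the first and second character
    """
    n = min(len(code1), len(code2))  # compare over the shorter length
    bit_diff = sum(abs(int(a) - int(b)) for a, b in zip(code1[:n], code2[:n]))
    len_diff = abs(len(code1) - len(code2))
    # Ratio of the pair count to the mean of the single-character counts.
    freq_term = 2 * f12 / (f1 + f2) if (f1 + f2) else 0.0
    return math.exp(-(bit_diff + len_diff)) * (1 + freq_term)


close = risk_coefficient("00", "01", f12=5, f1=5, f2=5)
far = risk_coefficient("00", "111", f12=5, f1=5, f2=5)
```

Under this assumed form, `close` exceeds `far`, because the first pair of codes differs in one bit while the second pair differs in two bits and in length.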

Step S004, according to risk coefficients of codes corresponding to any two types of characters in the character sequence, a risk priority sequence is obtained, according to codes corresponding to each type of characters in the character sequence, a character-code mapping sequence is obtained, according to the risk priority sequence and the character-code mapping sequence, a first character-code mapping sequence is obtained, and according to the code length difference of the corresponding characters in the character-code mapping sequence and the first character-code mapping sequence, the compression loss rate after the mapping sequence is adjusted is obtained.

When the risk coefficients between the code corresponding to one character and the codes corresponding to the other characters are high, the probability of an error occurring when that code is encoded is high. Therefore, in order to reduce the probability of errors, the frequency of the character corresponding to that code needs to be reduced, that is, the number of times the code occurs in the encoded electronic invoice data is reduced; the fewer times the code occurs, the lower the probability that an error occurs. Therefore, the risk priority of the code corresponding to each type of character needs to be calculated according to the risk coefficients between the code corresponding to that type of character and the codes corresponding to the other characters.

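Specifically, a risk priority sequence is obtained according to the risk coefficients of the codes corresponding to any two types of characters in the character sequence, as follows:

The risk coefficients between the code corresponding to the first type of character and the codes corresponding to the other characters in the character sequence are obtained, and their average value is used as the risk priority of the code corresponding to the first type of character. The risk priority of the code corresponding to each type of character in the character sequence is obtained in the same way, and the risk priorities are sorted from large to small to obtain the risk priority sequence.

It should be noted that the larger the risk priority of a code in the risk priority sequence, the smaller the difference between that code and the other codes and the greater the probability of a coding error during encoding, so a character with lower frequency needs to be assigned to that code to reduce the probability of errors.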
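The averaging and sorting that yield the risk priority sequence can be sketched as below. The pairwise risk function is passed in as a parameter, and the lookup table used in the example is made-up toy data; the function name `risk_priority_sequence` is illustrative.

```python
def risk_priority_sequence(chars, risk):
    """Return [(character, priority), ...] sorted from high to low priority.

    chars -- list of character types
    risk  -- callable (char1, char2) -> risk coefficient of their codes
    """
    priorities = {}
    for c in chars:
        # Risk coefficients between this character's code and all others.
        others = [risk(c, o) for o in chars if o != c]
        priorities[c] = sum(others) / len(others) if others else 0.0
    return sorted(priorities.items(), key=lambda kv: kv[1], reverse=True)


# Toy symmetric risk table (made-up values, for illustration only).
table = {
    frozenset("ab"): 0.9,
    frozenset("ac"): 0.1,
    frozenset("bc"): 0.5,
}
order = risk_priority_sequence(
    ["a", "b", "c"], lambda x, y: table[frozenset((x, y))]
)
```

In the toy data the code of "b" has the highest average risk against the other codes, so it comes first in the sequence and would be adjusted first.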

It should be further noted that the codes corresponding to each type of character come from the Huffman tree, whose weighted path length is minimal, so the initial mapping gives the optimal compression effect. When the characters corresponding to the codes are adjusted according to the risk priority of the codes, the compression effect is inevitably reduced, and if it is reduced too much the purpose of compressing the data is lost. The change in the compression effect during the adjustment must therefore also be considered.

Specifically, a character-code mapping sequence is obtained according to the codes corresponding to each type of character in the character sequence. It should be noted that, since the characters and their codes are obtained by constructing the Huffman tree in step S002, the character-code mapping sequence corresponds to the Huffman tree constructed in step S002. The character-code mapping sequence is a two-dimensional sequence: the first dimension is the different characters in the character sequence, and the second dimension is the code corresponding to each character. The character sequence is arranged by character frequency from large to small, and the character-code mapping sequence is arranged in the same order.
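As an illustration of such a mapping sequence, the following Python sketch builds a Huffman code table with the standard library and returns it as a list of (character, code) pairs sorted by frequency from large to small. It is one standard Huffman construction and not necessarily the one used in step S002; the function name `huffman_mapping` and the tie-breaking rule are assumptions.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_mapping(text):
    """Return [(character, code), ...] sorted by descending frequency."""
    freq = Counter(text)
    tie = count()  # tie-breaker so heap entries never compare nodes directly
    heap = [(f, next(tie), ch) for ch, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single character type still needs one bit
        return [(heap[0][2], "0")]
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:  # leaf: a character
            codes[node] = prefix

    walk(heap[0][2], "")
    # First dimension: characters; second dimension: their codes.
    return sorted(codes.items(), key=lambda kv: -freq[kv[0]])


mapping = huffman_mapping("aaaabbc")
```

The most frequent character receives the shortest code, which is the starting point that the later adjustment trades against risk.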

In the risk priority sequence, the earlier a code appears, the higher its priority and the larger its risk coefficient. Codes that appear earlier in the risk priority sequence therefore need to be adjusted first, so that the characters assigned to them have a lower frequency. Codes with larger risk coefficients then occur less often during compression, which reduces the probability of transmission errors.

Specifically, according to the risk priority sequence and the character-code mapping sequence, a first character-code mapping sequence is obtained, specifically as follows:

any code in the risk priority sequence is recorded as a target code.

In the formula, the quantities are as follows: the total number of codes corresponding to the characters in the character-code mapping sequence; the index of the target code in the risk priority sequence, where the codes are indexed in the order of the risk priority sequence; the preset base of the logarithm, for which this embodiment uses a specific value for description; and the moving step of the target code. The formula also uses the round-up operation and the logarithmic function with the preset base. It should be noted that the risk priority sequence is the sequence of the risk priorities of the codes, and that the risk priority sequence and the character-code mapping sequence contain the same total number of codes.

The positions of the characters in the character-code mapping sequence are fixed. The code corresponding to the first risk priority in the risk priority sequence is obtained and recorded as the first code, and the first code is moved right in the character-code mapping sequence by the number of positions given by its moving step. The code corresponding to the second risk priority is then obtained and recorded as the second code, and the second code is moved right by the number of positions given by its moving step. This continues until the code corresponding to each risk priority in the risk priority sequence has been moved right in the character-code mapping sequence; the adjusted character-code mapping sequence finally obtained is recorded as the first character-code mapping sequence. It should be noted that, when a new code needs to be moved right and a fixed code position lies within the range of its moving step, that position must be skipped. For example, if a fixed code position lies within the range of the new code's moving step and the new code can still be moved right, that is, the move does not exceed the range of the sequence, the new code is moved right by 6 positions, that is, the fixed code position is skipped. If a fixed code position lies within the range of the moving step and the new code cannot be moved that far, that is, the move would exceed the range of the sequence, the new code is moved right to the last position of the character-code mapping sequence, and the fixed code position is not used.
Meanwhile, while a code is moved right, the other codes are moved forward in the code sequence. If a fixed code position is encountered while moving forward, that position is skipped and the code continues forward to the next corresponding position. In this way the total number of codes in the code sequence does not change and no empty position is created, while fixed code positions are skipped throughout the move.
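The right-shift procedure can be sketched as follows. This is a hedged illustration under several assumptions that the text does not settle: the moving steps are passed in as plain integers because the step formula is not reproduced here; a position is treated as fixed once a moved code has landed on it; and when the step overruns the sequence, the code stops at the last position that is not fixed. The names `move_code_right` and `adjust_mapping` are hypothetical.

```python
def move_code_right(codes, src, step, fixed):
    """Move codes[src] right by `step` free slots and return its new index.

    `codes` is the code column of the mapping (characters stay in place);
    `fixed` is the set of indices already occupied by moved codes.
    """
    free = [i for i in range(src + 1, len(codes)) if i not in fixed]
    if step <= 0 or not free:
        fixed.add(src)  # nowhere to go: the code stays where it is
        return src
    # Assumption: an overrunning step stops at the last free slot.
    dst = free[min(step, len(free)) - 1]
    path = [src] + [i for i in free if i <= dst]
    moved = codes[src]
    for a, b in zip(path, path[1:]):
        codes[a] = codes[b]  # other codes move forward, skipping fixed slots
    codes[dst] = moved
    fixed.add(dst)
    return dst


def adjust_mapping(codes, priority_order, steps):
    """One round of right shifts, applied in risk priority order."""
    codes = list(codes)
    fixed = set()
    for code, step in zip(priority_order, steps):
        move_code_right(codes, codes.index(code), step, fixed)
    return codes


# Codes listed in character-frequency order; "0" has the highest risk priority.
first_mapping = adjust_mapping(["0", "10", "110", "111"], ["0", "10"], [2, 1])
```

In this example the short code `0` moves from the most frequent character to the third one, the codes in between move forward by one slot, and the second move skips the slot that `0` now occupies. The number of codes is unchanged and no slot is left empty.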

After the mapping relation between the codes and the characters is adjusted, codes with larger risk coefficients occur less often during compression, which reduces the probability of transmission errors. However, the compression effect keeps changing as the codes continue to be moved right, and if it becomes too poor the purpose of compression is lost. The compression effect is therefore measured while the codes are moved right, and a termination condition for the right shift is set. It should be noted in particular that each round of right shifting yields one adjusted character-code mapping sequence.

Specifically, according to the coding length difference of the corresponding character in the character-coding mapping sequence and the first character-coding mapping sequence, the compression loss rate after the mapping sequence is adjusted is obtained, and the method specifically comprises the following steps:

In the formula, the quantities are as follows: the length of the code corresponding to the j-th character in the character-code mapping sequence; the length of the code corresponding to the j-th character in the first character-code mapping sequence; the frequency corresponding to the j-th character in the character-code mapping sequence, that is, the frequency of the j-th character in the character sequence; the total number of character types in the character-code mapping sequence; the absolute-value operation; and the compression loss rate after the mapping sequence is adjusted.

It should be noted that one term of the formula represents the difference in compression rate before and after the mapping relation between the characters and the codes is adjusted. Its value increases as the codes continue to be moved right, and the larger the value, the larger the compression loss of the electronic invoice data and the worse the compression effect. The other term represents the compression rate obtained when the data is compressed with conventional Huffman coding.

So far, the compression loss rate after the mapping sequence adjustment is obtained.
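Because the formula itself is not reproduced in this text, the following Python sketch shows only one reading that is consistent with the listed quantities: the absolute difference between the frequency-weighted code length after adjustment and before adjustment, divided by the frequency-weighted code length of the conventional Huffman mapping. Both this form and the function name `compression_loss_rate` are assumptions, not the formula of the patent.

```python
def compression_loss_rate(freqs, old_lengths, new_lengths):
    """Relative growth in encoded size caused by the adjusted mapping.

    freqs[j]       : frequency of the j-th character
    old_lengths[j] : code length of the j-th character before adjustment
    new_lengths[j] : code length of the j-th character after adjustment
    """
    base = sum(f * l for f, l in zip(freqs, old_lengths))
    adjusted = sum(f * l for f, l in zip(freqs, new_lengths))
    # Assumed form: |adjusted cost - Huffman cost| / Huffman cost.
    return abs(adjusted - base) / base


# Four character types; the first and third characters exchanged their codes.
loss = compression_loss_rate([5, 3, 1, 1], [1, 2, 3, 3], [3, 2, 1, 3])
```

With these hypothetical numbers the encoded size grows from 17 to 25 bits, so the loss rate is 8/17. An unchanged mapping gives a loss rate of 0.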

And S005, obtaining a final character-code mapping sequence according to the compression loss rate after the mapping sequence is adjusted, and carrying out code compression on the electronic invoice data according to the final character-code mapping sequence.

It should be noted that, once the compression loss rate after the mapping sequence is adjusted has been obtained, the final character-code mapping sequence is obtained by setting a suitable termination condition for the right shift, and the electronic invoice data is encoded and compressed according to the final character-code mapping sequence.

Specifically, the final character-code mapping sequence is obtained according to the compression loss rate after the mapping sequence is adjusted, and the specific steps are as follows:

An empirical threshold is preset; this embodiment uses a specific value of the threshold as an example. When the compression loss rate after the mapping sequence is adjusted is smaller than the threshold, the compression loss is considered small, and a second character-code mapping sequence is obtained according to the risk priority sequence and the first character-code mapping sequence. It should be noted that the second character-code mapping sequence is obtained in the same way as the first character-code mapping sequence, which is not repeated in this embodiment. A new compression loss rate after adjustment of the mapping sequence is obtained according to the difference in code length of corresponding characters between the character-code mapping sequence and the second character-code mapping sequence, and is recorded as the first compression loss rate. The first compression loss rate is compared with the preset empirical threshold; if it is still smaller than the threshold, a new mapping sequence and a new compression loss rate are obtained, and the new compression loss rate is compared with the threshold again. This is repeated until the new compression loss rate is greater than or equal to the preset empirical threshold, and the character-code mapping sequence corresponding to the previous compression loss rate when the stop condition is met is recorded as the final character-code mapping sequence. When the compression loss rate after the first adjustment is already greater than or equal to the threshold, the first character-code mapping sequence is used as the final character-code mapping sequence.
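The termination logic can be sketched as a small loop. The sketch below is illustrative only: `adjust` and `loss_rate` are hypothetical callables standing in for one round of right shifting and for the compression loss rate, the threshold is a parameter because its value is not given here, and `max_rounds` is an added safety bound that the text does not mention.

```python
def select_final_mapping(original, adjust, loss_rate, threshold, max_rounds=1000):
    """Keep adjusting until the loss rate reaches the threshold.

    adjust(mapping)              -> next adjusted mapping (one round of shifts)
    loss_rate(original, mapping) -> compression loss rate of `mapping`
    """
    current = adjust(original)  # first character-code mapping sequence
    if loss_rate(original, current) >= threshold:
        return current  # the first mapping is used as the final mapping
    for _ in range(max_rounds):
        candidate = adjust(current)
        if loss_rate(original, candidate) >= threshold:
            return current  # mapping of the previous (still acceptable) loss rate
        current = candidate
    return current


# Toy stand-ins: a "mapping" is just a round counter, each round costs 10%.
final = select_final_mapping(
    0,
    adjust=lambda m: m + 1,
    loss_rate=lambda o, m: (m - o) * 0.1,
    threshold=0.35,
)
```

In the toy run the fourth round would reach a loss rate of 0.4, so the mapping after the third round is returned.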

Further, the electronic invoice data is encoded and compressed according to the final character-encoding mapping sequence, and the method is specifically as follows:

and coding each character in the electronic invoice data according to the codes corresponding to the characters in the final character-code mapping sequence to obtain electronic invoice coding data, and compressing the electronic invoice coding data by using an LZW algorithm. It should be noted that, encoding the electronic invoice data according to the final character-to-code mapping sequence may reduce the possibility of encoding errors, thereby reducing the possibility of electronic invoice data loss.
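A minimal Python sketch of this last step is given below. It assumes that the encoded data is a string of `0` and `1` characters and that LZW is run directly over that bit string with a two-symbol initial dictionary; the text does not specify how the encoded data is fed to LZW, so this layout and the function names `encode` and `lzw_compress` are assumptions.

```python
def encode(text, mapping):
    """Replace each character by its code from the final mapping sequence."""
    table = dict(mapping)  # mapping is a list of (character, code) pairs
    return "".join(table[ch] for ch in text)


def lzw_compress(bits):
    """Textbook LZW over a bit string; returns a list of dictionary indices."""
    dictionary = {"0": 0, "1": 1}
    out, w = [], ""
    for c in bits:
        wc = w + c
        if wc in dictionary:
            w = wc  # keep extending the current phrase
        else:
            out.append(dictionary[w])
            dictionary[wc] = len(dictionary)  # learn the new phrase
            w = c
    if w:
        out.append(dictionary[w])
    return out


final_mapping = [("a", "1"), ("b", "01"), ("c", "00")]  # hypothetical mapping
bits = encode("abca", final_mapping)
indices = lzw_compress(bits)
```

For this hypothetical mapping, `abca` is encoded as `101001`, which LZW turns into four dictionary indices.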

It should be noted that, since the characters and their codes are obtained by constructing the Huffman tree in step S002, the character-code mapping sequence corresponds to the Huffman tree constructed in step S002, whereas the final character-code mapping sequence corresponds to a new Huffman tree that differs from the one constructed in step S002. Because the electronic invoice data only needs to be encoded, the encoding can be completed directly from the final character-code mapping sequence. The final character-code mapping sequence is also a two-dimensional sequence, in which the first dimension is the different characters in the character sequence and the second dimension is the code corresponding to each character.

Through the steps, the electronic invoice data processing method is completed.

Another embodiment of the present invention provides an electronic invoice data processing system, the system including a memory and a processor which, when executing a computer program stored in the memory, performs the following operations:

collecting electronic invoice data, wherein the electronic invoice data comprises a plurality of characters; obtaining a character sequence according to the electronic invoice data, and obtaining codes corresponding to each type of characters in the character sequence according to the character sequence; obtaining the types of all binary character combinations in the electronic invoice data according to the electronic invoice data, obtaining a first character sequence according to the types of all binary character combinations in the electronic invoice data, and obtaining risk coefficients of codes corresponding to any two types of characters in the character sequence according to the character sequence, codes corresponding to the characters in the character sequence and the first character sequence; obtaining a risk priority sequence according to risk coefficients of codes corresponding to any two types of characters in the character sequence, obtaining a character-code mapping sequence according to codes corresponding to each type of characters in the character sequence, wherein the character-code mapping sequence is a two-dimensional sequence, the first dimension is different characters in the character sequence, the second dimension is codes corresponding to the characters, obtaining a first character-code mapping sequence according to the risk priority sequence and the character-code mapping sequence, and obtaining a compression loss rate after adjustment of the mapping sequence according to the code length difference of the corresponding characters in the character-code mapping sequence and the first character-code mapping sequence; and obtaining a final character-code mapping sequence according to the compression loss rate after the mapping sequence is adjusted, and carrying out code compression on the electronic invoice data according to the final character-code mapping sequence.

The above description is only of the preferred embodiments of the present invention and is not intended to limit the invention, but any modifications, equivalent substitutions, improvements, etc. within the principles of the present invention should be included in the scope of the present invention.

Claims

1. An electronic invoice data processing method is characterized by comprising the following steps:

the final character-coding mapping sequence is obtained according to the compression loss rate after the mapping sequence is adjusted, and the electronic invoice data is subjected to coding compression according to the final character-coding mapping sequence;

according to the character sequence, the codes corresponding to each type of characters in the character sequence and the first character sequence, the risk coefficient of the codes corresponding to any two types of characters in the character sequence is obtained, and the specific steps are as follows:

in the formula, the quantities are as follows: the i-th bit of the code corresponding to the first type of character; the i-th bit of the code corresponding to the second type of character; the absolute-value operation; the length of the code corresponding to the first type of character, that is, the number of code bits contained in the code; the length of the code corresponding to the second type of character; the minimum of these two code lengths; the frequency of occurrence, in the first character sequence, of the binary character combination consisting of a preceding character of the first type and a following character of the second type; the frequency of occurrence of the first type of character in the character sequence; the frequency of occurrence of the second type of character in the character sequence; the exponential function with the natural constant as its base; and the risk coefficient of the codes corresponding to the first type of character and the second type of character;

the risk priority sequence is obtained according to the risk coefficient of the corresponding codes of any two types of characters in the character sequence, and the specific steps are as follows:

acquiring the risk coefficients between the code corresponding to the first type of character and the code corresponding to each type of character in the character sequence, taking the mean of these risk coefficients as the risk priority of the code corresponding to the first type of character, acquiring in the same way the risk priority of the code corresponding to each type of character in the character sequence, and sorting the risk priorities of the codes corresponding to all types of characters in the character sequence from large to small to obtain the risk priority sequence;

the first character-code mapping sequence is obtained according to the risk priority sequence and the character-code mapping sequence, and comprises the following specific steps:

marking any code in the risk priority sequence as a target code;

in the formula, the quantities are as follows: the total number of codes corresponding to the characters in the character-code mapping sequence; the index of the target code in the risk priority sequence; the preset base; and the moving step of the target code; the formula also uses the round-up operation and the logarithmic function with the preset base;

the positions of the characters in the character-code mapping sequence are fixed; the code corresponding to the first risk priority in the risk priority sequence is obtained and recorded as the first code, and the first code is moved right in the character-code mapping sequence by the number of positions given by the moving step of the first code; the code corresponding to the second risk priority in the risk priority sequence is obtained and recorded as the second code, and the second code is moved right in the character-code mapping sequence by the number of positions given by the moving step of the second code; this continues until the code corresponding to each risk priority in the risk priority sequence has been moved right in the character-code mapping sequence, and the adjusted character-code mapping sequence finally obtained is recorded as the first character-code mapping sequence;

the compression loss rate after the mapping sequence adjustment is obtained according to the coding length difference of the corresponding characters in the character-coding mapping sequence and the first character-coding mapping sequence, and the specific steps are as follows:

in the formula, the quantities are as follows: the length of the code corresponding to the j-th character in the character-code mapping sequence; the length of the code corresponding to the j-th character in the first character-code mapping sequence; the frequency corresponding to the j-th character in the character-code mapping sequence; the total number of character types in the character-code mapping sequence; the absolute-value operation; and the compression loss rate after the mapping sequence is adjusted;

the final character-coding mapping sequence is obtained by adjusting the compression loss rate according to the mapping sequence, and the method comprises the following specific steps:

presetting an empirical threshold; when the compression loss rate after the mapping sequence is adjusted is smaller than the empirical threshold, obtaining a second character-code mapping sequence according to the risk priority sequence and the first character-code mapping sequence, obtaining a new compression loss rate after adjustment of the mapping sequence according to the difference in code length of corresponding characters between the character-code mapping sequence and the second character-code mapping sequence, recording the new compression loss rate as a first compression loss rate, comparing the first compression loss rate with the preset empirical threshold, and, if the first compression loss rate is still smaller than the preset empirical threshold, continuing to obtain a new mapping sequence and a new compression loss rate and comparing the new compression loss rate with the preset empirical threshold, stopping when the new compression loss rate is greater than or equal to the preset empirical threshold, and recording the character-code mapping sequence corresponding to the previous compression loss rate when the stop condition is met as the final character-code mapping sequence; and when the compression loss rate after the mapping sequence is adjusted is greater than or equal to the empirical threshold, taking the first character-code mapping sequence as the final character-code mapping sequence.

2. The method for processing electronic invoice data according to claim 1, wherein the steps of obtaining a character sequence from the electronic invoice data, and obtaining a code corresponding to each type of character in the character sequence from the character sequence, comprise the following specific steps:

3. The method for processing electronic invoice data according to claim 1, wherein the steps of obtaining the types of all binary character combinations in the electronic invoice data according to the electronic invoice data, and obtaining the first character sequence according to the types of all binary character combinations in the electronic invoice data, comprise the following specific steps:

4. The electronic invoice data processing method according to claim 1, wherein the electronic invoice data is encoded and compressed according to a final character-to-code mapping sequence, comprising the specific steps of:

5. An electronic invoice data processing system, the system comprising a memory and a processor, wherein the processor executes a computer program stored in the memory to carry out the steps of a method of electronic invoice data processing as claimed in any one of claims 1 to 4.