CN113055018B - Semantic coding lossless compression system and method based on heuristic linear transformation - Google Patents

Semantic coding lossless compression system and method based on heuristic linear transformation

Info

Publication number
CN113055018B
CN113055018B (application CN202110289154.7A)
Authority
CN
China
Prior art keywords
compression
matrix
search
text
receiving end
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110289154.7A
Other languages
Chinese (zh)
Other versions
CN113055018A (en)
Inventor
裴正奇
王树徽
黄梓忱
朱斌斌
于秋鑫
段必超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Qianhai Heidun Technology Co ltd
Original Assignee
Shenzhen Qianhai Heidun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Qianhai Heidun Technology Co ltd filed Critical Shenzhen Qianhai Heidun Technology Co ltd
Priority to CN202110289154.7A priority Critical patent/CN113055018B/en
Publication of CN113055018A publication Critical patent/CN113055018A/en
Application granted granted Critical
Publication of CN113055018B publication Critical patent/CN113055018B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/70 Type of the data to be coded, other than image and sound
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention belongs to the technical field of semantic coding lossless compression, and particularly relates to a semantic coding lossless compression system based on heuristic linear transformation. The system and method use a deep learning language model to encode texts and obtain an encoded representation (embedding) of each text; the semantic similarity between a search sentence and each candidate text (the texts to be searched against) is computed with measures such as the cosine of the angle between vectors or the Euclidean distance, and the candidates are ranked by semantic similarity to produce the semantic search result.

Description

Semantic coding lossless compression system and method based on heuristic linear transformation
Technical Field
The invention relates to the technical field of semantic coding lossless compression, in particular to a semantic coding lossless compression system and a semantic coding lossless compression method based on heuristic linear transformation.
Background
Existing semantic search and encoding technologies cannot achieve both lossless content and a large compression ratio: the achievable compression is limited, and much of the original semantic content can be lost after compression. For example, LSH is only suitable for scenarios with low accuracy requirements on the exact ranking of output results, such as recommendation systems; where precise ranking is required, LSH is not adequate.
Existing methods also cannot be flexibly adjusted to the demands of a scenario: some scenarios care more about compression ratio and running speed, others about accuracy and losslessness, yet the model cannot be iterated precisely against the scenario's target metrics, so the scenario's demands cannot be met without bias.
Current related technologies are rigid and generic, lacking any targeting of the scenario. In fact, different scenarios call for different semantic coding compression schemes: the same text should be encoded, and its encoding compressed, differently in different scenarios (e.g. "library book retrieval", "intelligent customer service", "knowledge question answering"), so that the best quantization effect is achieved under limited computing resources.
Current related technologies also cannot iterate efficiently on real-time user feedback. As a system is used, unsatisfactory search results will inevitably occur, some caused by the semantic encoding or the search mechanism, others by the external environment (e.g. a change in the underlying knowledge points). Existing technologies cannot apply targeted, efficient updates for such "non-ideal cases" and instead have to be redesigned.
Disclosure of Invention
The invention aims to provide a semantic coding lossless compression system and a semantic coding lossless compression method based on heuristic linear transformation, so as to solve the problems in the background technology.
In order to achieve the above purpose, the present invention provides the following technical solution: the semantic coding lossless compression system based on heuristic linear transformation comprises a search text set and a candidate text set. The transmitting ends of the search text set and the candidate text set are signal-connected to the receiving end of the deep learning language model; the transmitting end of the deep learning language model is signal-connected to the receiving ends of the search text Q and the candidate text D; the transmitting end of the candidate text D is signal-connected to the receiving end of the code storage; the transmitting ends of the search text Q and the code storage are signal-connected to the receiving end of the compression matrix T; the transmitting end of the compression matrix T is signal-connected to the receiving ends of the compressed version DT and the compressed version QT; the transmitting ends of the compressed version DT and the compressed version QT are signal-connected to the receiving end of the similarity calculation function module; the transmitting end of the similarity calculation function module is signal-connected to the receiving end of the principal component analysis module; the transmitting end of the principal component analysis module is signal-connected to the receiving end of the initial compression matrix T; the transmitting end of the initial compression matrix T is signal-connected to the receiving end of the hierarchical screening system; and the transmitting end of the hierarchical screening system is signal-connected to the receiving end of the compression matrix T.
Preferably, the transmitting end of the code storage is signal-connected to the receiving end of the principal component analysis module, and both the principal component analysis module and the similarity calculation function module are provided with a transmitting end and a receiving end.
Another technical problem to be solved by the present invention is to provide a semantic coding lossless compression method based on heuristic linear transformation, so as to solve the problems set forth in the above background art;
in order to achieve the above purpose, the present invention provides the following technical solutions: the method comprises the following steps:
s1, coding processing
Encode all candidate texts with the deep learning language model, convert each title into a K-dimensional vector, and store the vectors in a suitable data form.
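For concreteness, a minimal sketch of step S1 is given below. It assumes the Hugging Face transformers package and the bert-base-chinese checkpoint; the patent itself prescribes only "a deep learning language model", so these choices, like the helper names, are illustrative.

```python
# Sketch of step S1 under assumed tooling (Hugging Face transformers, bert-base-chinese).
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

def encode(texts):
    """Encode a list of titles into K-dimensional vectors (K = 768 for BERT-base)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return out.last_hidden_state[:, 0, :].numpy()  # [CLS] vector as the text encoding

candidate_titles = ["浪潮之巅", "微分几何与广义相对论", "费马大定理"]  # illustrative titles
D = encode(candidate_titles)            # shape (N, K)
np.save("candidate_embeddings.npy", D)  # stored "in a suitable data form" (numpy file)
```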
S2, building system
Set up a search-result comparison and evaluation system whose inputs are: the candidate text D, the search text Q, and the compression matrix T, where the compression matrix T ∈ R^(K×r) and the value r is the encoding dimension after compression; the compressed encodings obtained by compressing the candidate text D and the search text Q with T are QT ∈ R^(M×r) and DT ∈ R^(N×r).
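Compression itself is a single matrix multiplication. The following shape check (all names illustrative, dimensions chosen to match the embodiment below) restates S2 in code:

```python
import numpy as np

K, r = 768, 20             # encoding dimension before and after compression
M, N = 100, 360000         # number of search texts and candidate texts
T = np.random.randn(K, r)  # compression matrix T in R^(K x r); random init, cf. S3/S4

Q = np.random.randn(M, K)  # search-text encodings, Q in R^(M x K)
D = np.random.randn(N, K)  # candidate-text encodings, D in R^(N x K)

QT = Q @ T                 # compressed search codes,    QT in R^(M x r)
DT = D @ T                 # compressed candidate codes, DT in R^(N x r)
assert QT.shape == (M, r) and DT.shape == (N, r)
```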
S3, constructing an iteration mechanism
Construct an iteration mechanism for generating the compression matrix, adjusting and optimizing the compression matrix T according to the search text matrix Q, which changes in real time.
S4, iterative upgrade
Using the iterative generation method of S3, the compression matrix T is iteratively upgraded, and T^(best) is taken as the final compression matrix.
S5, constructing a screening system
Construct a hierarchical screening system: for different compression dimensions r_a, r_b, r_c, …, different compression matrices are generated, labeled T^(r_a), T^(r_b), T^(r_c), … respectively.
The core idea of hierarchical screening is: although the search results after compression may deviate from those before compression, the magnitude of the deviation is limited. For example, a result ranked 10th before compression may rank 18th after compression, but not as "far" as 2000th. Assuming the user only cares about the top L of the ranking, the compressed safety deviation value G(L) is:
G(L) = max([sort(q_i T, DT).index(item) for item in sort(q_i, D)[:L]])
g (L) may be understood as the largest ranking bias value within the top L names; after obtaining G (L), 1.5G (L) is taken as a safety threshold value, and
Figure GDA0003900791140000032
is recomputed with their uncompressed encoded form. The concept is similar to "sea selection", where the compression matrix described above is the initial sea selection, which enables "excellent players" to be selected, but if one wants to rank these selected "excellent players" specifically, one still needs to resort to a more complete and complex selection method (uncompressed coded form). However, since "sea selection" has filtered out the vast majority of "players," even if the remaining "excellent players" are all operated on in uncompressed coded form, there is not much additional computation time cost. The expected acceleration multiple of a compression matrix is
[formula: expectTC, the expected acceleration multiple of the compression matrix]
The specific compression dimension may be set by expectTC.
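A runnable reading of G(L) and of the two-stage screening is sketched below; it assumes cosine similarity with higher scores ranked first, extends G(L) to the maximum over a set of queries, and all helper names are illustrative rather than taken from the patent.

```python
import numpy as np

def rank(q, X):
    """Row indices of X sorted by descending cosine similarity to q."""
    scores = (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q) + 1e-12)
    return list(np.argsort(-scores))

def G(L, queries, D, T):
    """Largest post-compression rank reached by any pre-compression top-L item,
    taken over the given queries (the patent states it per query q_i)."""
    worst = 0
    for q in queries:
        full, comp = rank(q, D), rank(q @ T, D @ T)
        worst = max(worst, max(comp.index(item) for item in full[:L]))
    return worst

def search(q, D, T, L, g):
    """Two-stage 'sea selection': rough screen with compressed codes, then exact
    re-ranking of the top 1.5*G(L) shortlist in uncompressed form."""
    shortlist = rank(q @ T, D @ T)[: int(1.5 * g)]
    reranked = rank(q, D[shortlist])
    return [shortlist[i] for i in reranked][:L]
```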
S6, determining a final result
Using Sim_α() and the hierarchical screening mechanism, the search text Q and the candidate text D are processed before and after compression to obtain the final semantic search result.
Preferably, in step S1, the deep learning language model may be Google's open-source BERT, and the vectors may be stored directly in system memory, or stored on the system hard disk in a file format such as numpy or pickle for subsequent reading and calling, giving quantized forms of the candidate text D and the search text Q:
D = [d_1; d_2; …; d_N] ∈ R^(N×K), Q = [q_1; q_2; …; q_M] ∈ R^(M×K)
The semantic search scenario may then be described as
sort(q_i, D) = [d_i1, d_i2, d_i3, …, d_iN],
so that for a particular similarity calculation function Sim_α() we always have
Sim_α(q_i, d_ix) ≥ Sim_α(q_i, d_i,x+1).
Preferably, the similarity calculation function Sim_α() mainly adopts either the cosine method or the Euclidean distance method, i.e.
cosine(q_i, d_j) = (q_i · d_j) / (‖q_i‖ ‖d_j‖)
euclidean(q_i, d_j) = ‖q_i − d_j‖_2
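In code, these two choices of Sim_α look as follows (a sketch; the Euclidean variant is negated so that "larger value means more similar" holds for both, which the patent's ordering condition requires):

```python
import numpy as np

def sim_cosine(q, d):
    """Cosine of the angle between the two encodings."""
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

def sim_euclidean(q, d):
    """Negated Euclidean distance, so Sim(q, d_ix) >= Sim(q, d_i,x+1) can hold."""
    return -float(np.linalg.norm(q - d))
```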
Preferably, in step S2, the system embeds the similarity calculation function Sim_α() for evaluating search results; its input is two groups of ranked results, weighted by ranking coefficients λ_i. The principle of the ranking coefficients is: the higher a result ranks, the more it matters to the presentation of search results; an error in the first-ranked result is more serious than an error in the tenth. Relying on the similarity calculation function Sim_α(), a complete search-result comparison and evaluation mechanism can be realized. For the candidate text D, the search text Q and the compression matrix T (the candidate text D does not change in the actual scenario and is a constant, so this parameter is omitted in the following calculation), the performance evaluation is computed as
[formula: performance evaluation of the compression matrix T over the search texts Q, based on Sim_α and the ranking coefficients λ_i]
wherein the evaluation value is equivalent to the degree of losslessness of the search performance brought by the compression matrix T: the higher it is, the closer the search results before and after compression, and the smaller the performance loss after compression.
Note that if the usage scenario focuses only on the top ten search results, then λ_i = 0 for i > 10.
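The evaluation formula itself is not legible in this text. Its stated ingredients (Sim_α, ranking coefficients λ_i that vanish beyond the ranks the scenario cares about, and a comparison of rankings before and after compression) admit a sketch like the one below; the exact weighting is an assumption, not the patented formula.

```python
import numpy as np

def rank(q, X):
    s = (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q) + 1e-12)
    return list(np.argsort(-s))

def perf(Q, D, T, lam):
    """Hypothetical losslessness score: lam[i] weights rank i and is 0 beyond the
    ranks that matter (e.g. lam has length 10 if only the top ten count)."""
    total = 0.0
    for q in Q:
        full = rank(q, D)            # uncompressed ranking
        comp = rank(q @ T, D @ T)    # compressed ranking
        # reward each top item for staying close to its uncompressed position
        total += sum(lam[i] / (1 + abs(i - comp.index(item)))
                     for i, item in enumerate(full[:len(lam)]))
    return total / (len(Q) * sum(lam))  # equals 1.0 iff the weighted top ranks agree exactly
```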
The compression matrix T can be randomly initialized, but the performance of a randomly generated T is mediocre; instead, T is initialized with linear-algebra methods. The main principle: when the search text Q is unknown or incomplete, the candidate text D can temporarily stand in for Q. As long as the structural relations among the encodings of the texts in D are preserved after the compression transformation T, the compression matrix T can be considered to have captured the semantic structure of the candidate text D:
[formula: structural relations among the encodings of D are preserved under compression by T]
By means of a variant of principal component analysis (PCA) from linear algebra, incremental PCA (iPCA), an optimal compressed form of the candidate text D for the compression dimension r can be obtained:
D_r = iPCA(D, r), D_r ∈ R^(N×r)
In addition, the pseudo-inverse matrix D⁺ of the candidate text D is obtained via the Moore-Penrose pseudoinverse from linear algebra, and the initialized compression matrix T^(0) is obtained by combining it with iPCA(D):
T^(0) = D⁺ · iPCA(D, r)
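Read literally, the initialization composes the Moore-Penrose pseudoinverse of D with the iPCA projection of D; the composition T^(0) = D⁺ · iPCA(D, r) is our reconstruction of the illegible formula, so the sketch below is hedged accordingly.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

def init_compression_matrix(D, r):
    """T0 such that D @ T0 approximates the optimal r-dimensional iPCA form of D
    (assumed reading of the initialization; r must not exceed min(N, K))."""
    D_r = IncrementalPCA(n_components=r).fit_transform(D)  # optimal compressed form, (N, r)
    D_pinv = np.linalg.pinv(D)                             # Moore-Penrose pseudoinverse, (K, N)
    return D_pinv @ D_r                                    # T0 in R^(K x r)
```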
Preferably, in step S3, the specific generation mechanism is as follows:
[formula: heuristic update rule for T^(t+1), combining T^(t), T^(t−1), T^(best), T^(worst) and T^(rand) with weights μ_1 … μ_6]
wherein μ_1, …, μ_6 ≥ 0; at initialization T^(best) = T^(worst) = T^(0) = T^(−1), and T^(rand) is a randomly generated compression matrix, and
[formulas: update conditions for T^(best) and T^(worst); if the performance of T^(t) exceeds that of T^(best), then T^(best) = T^(t), and analogously for T^(worst)]
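Since the update rule is illegible here, only its ingredients survive: the weights μ_1 … μ_6 ≥ 0, the two previous iterates, the best and worst matrices so far, and a random matrix T^(rand). The sketch below is therefore one plausible shape for such a heuristic update (only four of the six weights are exercised), not the patented formula:

```python
import numpy as np

def iterate_T(T, T_best, T_worst, mu, perf_fn, steps=100):
    """Hypothetical heuristic update mixing momentum, attraction to T_best,
    repulsion from T_worst, and random exploration."""
    T_prev = T.copy()  # at initialization T(best) = T(worst) = T(0) = T(-1)
    for _ in range(steps):
        T_rand = np.random.randn(*T.shape)  # randomly generated compression matrix
        T_next = (T
                  + mu[0] * (T - T_prev)    # momentum from the previous step
                  + mu[1] * (T_best - T)    # pull toward the best matrix so far
                  - mu[2] * (T_worst - T)   # push away from the worst so far
                  + mu[3] * T_rand)         # random exploration
        T_prev, T = T, T_next
        if perf_fn(T) > perf_fn(T_best):    # cf. claim 7: on improvement, T(best) = T(t)
            T_best = T
        if perf_fn(T) < perf_fn(T_worst):
            T_worst = T
    return T_best
```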
Preferably, in step S5, the expected acceleration multiple of the compression matrix is
[formula: expectTC, the expected acceleration multiple]
The specific compression dimension may be set by expectTC.
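The expectTC formula is likewise not legible. A natural reconstruction, stated here purely as an assumption, is the ratio of the uncompressed similarity cost to the two-stage cost:

```python
def expect_tc(N, K, r, g):
    """Assumed form of expectTC: brute-force cost N*K per query, divided by the
    compressed pass (N*r) plus re-ranking the 1.5*G(L) shortlist at full width."""
    return (N * K) / (N * r + 1.5 * g * K)
```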
Compared with the prior art, the invention has the beneficial effects that:
1. The semantic coding lossless compression system and method based on heuristic linear transformation use a deep learning language model to encode texts and obtain an encoded representation (embedding) of each text; the semantic similarity between a search sentence and each candidate text (the texts to be searched against) is computed with measures such as the cosine of the angle between vectors or the Euclidean distance, and the candidates are ranked by semantic similarity to produce the semantic search result.
2. In the semantic coding lossless compression system and method based on heuristic linear transformation, the semantic codes already generated for the candidate texts are cached and need not be regenerated during real-time retrieval.
3. In the semantic coding lossless compression system and method based on heuristic linear transformation, a linear transformation matrix (compression matrix) reduces the dimension of the encoded representations of the search sentence and each candidate text, achieving compression and thereby speeding up the semantic similarity calculation.
4. The semantic coding lossless compression system and method based on heuristic linear transformation calculate the degree of deviation between the search results before and after compression, forming a method for measuring the performance of a compression matrix.
5. The semantic coding lossless compression system and method based on heuristic linear transformation initialize the compression matrix using linear-algebra methods such as principal component analysis (PCA) and its incremental variant (incremental PCA), and the Moore-Penrose pseudoinverse.
6. In the semantic coding lossless compression system and method based on heuristic linear transformation, different compression dimensions are selected to initialize compression matrices, the performance of each compression matrix is evaluated one by one, and the idea of hierarchical screening is used to obtain the optimal compression dimension setting and the related parameters of the hierarchical screening method.
7. In the semantic coding lossless compression system and method based on heuristic linear transformation, the compression matrix is iteratively upgraded according to the search sentences for which the user expects lossless compression (these can be understood as "search cases"), and keeps being upgraded as more such search sentences accumulate.
8. The semantic coding lossless compression system and method based on heuristic linear transformation adopt hierarchical screening to perform a staged "sea selection" over the retrieval process, fully strengthening the degree of local lossless compression while ensuring that speed does not decrease.
Drawings
Fig. 1 is a schematic diagram of the system as a whole.
In the figure: 1. search text set; 2. deep learning language model; 3. search text Q; 4. candidate text D; 5. code storage; 6. principal component analysis module; 7. hierarchical screening system; 8. initial compression matrix T; 9. compression matrix T; 10. compressed version QT; 11. compressed version DT; 12. similarity calculation function module; 13. candidate text set.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1, the present invention provides a technical solution: the semantic coding lossless compression system based on heuristic linear transformation comprises a search text set 1 and a candidate text set 13. The transmitting ends of the search text set 1 and the candidate text set 13 are signal-connected to the receiving end of the deep learning language model 2; the transmitting end of the deep learning language model 2 is signal-connected to the receiving ends of the search text Q 3 and the candidate text D 4; the transmitting end of the candidate text D 4 is signal-connected to the receiving end of the code storage 5; the transmitting ends of the search text Q 3 and the code storage 5 are signal-connected to the receiving end of the compression matrix T 9; the transmitting end of the compression matrix T 9 is signal-connected to the receiving ends of the compressed version DT 11 and the compressed version QT 10; the transmitting ends of the compressed version DT 11 and the compressed version QT 10 are signal-connected to the receiving end of the similarity calculation function module 12; the transmitting end of the similarity calculation function module 12 is signal-connected to the receiving end of the principal component analysis module 6; the transmitting end of the principal component analysis module 6 is signal-connected to the receiving end of the initial compression matrix T 8; the transmitting end of the initial compression matrix T 8 is signal-connected to the receiving end of the hierarchical screening system 7; the transmitting end of the hierarchical screening system 7 is signal-connected to the receiving end of the compression matrix T 9; the transmitting end of the code storage 5 is signal-connected to the receiving end of the principal component analysis module 6; and both the principal component analysis module 6 and the similarity calculation function module 12 are provided with a transmitting end and a receiving end.
Example two
For a library semantic retrieval system, the candidate texts to be retrieved are book titles (about 360,000 titles in total), for example: "On Top of Tides", "Differential Geometry and General Relativity", "Fermat's Last Theorem", and so on. The system must find the best-matching titles for a user's search sentence; for example, if the user searches "what is the mathematical principle of general relativity", the system should preferentially output semantically related titles such as "Introduction to Differential Geometry and General Relativity".
All titles are encoded with a deep learning language model (e.g. Google's open-source BERT), each title becoming a 768-dimensional vector; together they form a matrix of dimensions 360000×768, which is cached as a global variable.
According to step S2, an initialized compression matrix T^(0) is obtained; the parameter settings are:
[formula image: parameter settings for the initialized compression matrix]
According to steps S3 and S4, the compression matrix T^(0) is iterated against the actual search text Q to obtain the final compression matrix T, with the parameters of the generation mechanism set as
μ_1 = 0.9
μ_2 = 0.1
μ_3 = 0.05
μ_4 = 0.05
μ_5 = 0.1
μ_6 = 0.1
The expected acceleration multiples of the T generated under different compression dimensions differ; following the scheme, different values of the performance evaluation, G(L), and expectTC are obtained.
The following table is for the case N = 10000:
[table image: compression dimension versus G(L) and expectTC for N = 10000]
Finally, 20 dimensions were chosen as the compression dimension with G(L) = G(10), and the measured speed comparison is as follows (under the condition that the top L search results before and after compression remain identical).
The following table shows the cases of different values of N:
[table image: measured speed-up for different values of N]
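Putting the embodiment together, here is a usage sketch reusing the helpers from the earlier sketches (encode, init_compression_matrix, G, search); the 20-dimensional choice and G(10) come from this example, while sample_queries and candidate_titles are assumed inputs:

```python
import numpy as np

D = np.load("candidate_embeddings.npy")     # (360000, 768), cached once in step S1
T = init_compression_matrix(D, r=20)        # r = 20 as chosen above, then iterated per S3-S4
g = G(10, sample_queries, D, T)             # safety deviation G(10) for L = 10

q = encode(["广义相对论的数学原理是什么"])[0]   # the user's search sentence from this example
top10 = search(q, D, T, L=10, g=g)          # two-stage compressed retrieval
print([candidate_titles[i] for i in top10]) # e.g. 《微分几何与广义相对论入门》 ranked first
```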
The semantic coding lossless compression system and method based on heuristic linear transformation use a deep learning language model to encode texts and obtain an encoded representation (embedding) of each text; the semantic similarity between a search sentence and each candidate text (the texts to be searched against) is computed with measures such as the cosine of the angle between vectors or the Euclidean distance, and the candidates are ranked by semantic similarity to produce the semantic search result.
The semantic codes already generated for the candidate texts are cached and need not be regenerated during real-time retrieval.
The encoded representations of the search sentence and each candidate text are reduced in dimension with a linear transformation matrix (compression matrix), achieving compression and thereby speeding up the semantic similarity calculation.
The degree of deviation between search results before and after compression is calculated, forming a method for measuring the performance of a compression matrix.
The compression matrix is initialized using linear-algebra methods such as principal component analysis (PCA) and its incremental variant (incremental PCA), and the Moore-Penrose pseudoinverse.
Different compression dimensions are selected to initialize compression matrices; the performance of each is evaluated one by one, and the idea of hierarchical screening is used to obtain the optimal compression dimension and the related parameters of the hierarchical screening method.
The compression matrix is iteratively upgraded according to the search sentences for which the user expects lossless compression (these can be understood as "search cases"), and keeps being upgraded as more such search sentences accumulate.
Hierarchical screening performs a staged "sea selection" over the retrieval process, fully strengthening the degree of local lossless compression while ensuring that speed does not decrease.
In large-scale text semantic search scenarios, with the quality deviation rate kept controllable, high-dimensional semantic encodings/vectors are reduced more than tenfold in dimension, raising retrieval speed by orders of magnitude (for example, in a search scenario with 1,000,000 candidate texts in total, retrieval after locally lossless compression is about 27 times faster than uncompressed-but-cached retrieval, and nearly three thousand times faster than uncompressed, uncached retrieval).
With the pre-/post-compression speed-up guaranteed to be at least 10×, the lossless range can be maintained at the top 10–20 (i.e., the top 10–20 search results show no change whatsoever before and after compression), which fully satisfies common semantic search scenarios.
The user can at any time keep iterating the parameters of the compression method against search sentences whose results are not ideal, ensuring that semantic search performance keeps improving as scenario cases are refined.
it is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Finally, it should be noted that the foregoing describes only preferred embodiments of the present invention and does not limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments or substitute equivalents for some of their technical features. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (8)

1. A semantic coding lossless compression system based on heuristic linear transformation, comprising a search text set (1) and a candidate text set (13), characterized in that: the transmitting ends of the search text set (1) and the candidate text set (13) are signal-connected to the receiving end of the deep learning language model (2); the transmitting end of the deep learning language model (2) is signal-connected to the receiving ends of the search text Q (3) and the candidate text D (4); the transmitting end of the candidate text D (4) is signal-connected to the receiving end of the code storage (5); the transmitting ends of the search text Q (3) and the code storage (5) are signal-connected to the receiving end of the compression matrix T (9); the transmitting end of the compression matrix T (9) is signal-connected to the receiving ends of the compressed version DT (11) and the compressed version QT (10); the transmitting ends of the compressed version DT (11) and the compressed version QT (10) are signal-connected to the receiving end of the similarity calculation function module (12); the transmitting end of the similarity calculation function module (12) is signal-connected to the receiving end of the principal component analysis module (6); the transmitting end of the principal component analysis module (6) is signal-connected to the receiving end of the initial compression matrix T (8); the transmitting end of the initial compression matrix T (8) is signal-connected to the receiving end of the hierarchical screening system (7); and the transmitting end of the hierarchical screening system (7) is signal-connected to the receiving end of the compression matrix T (9).
2. A semantic coding lossless compression system based on heuristic linear transformation according to claim 1, characterized in that: the transmitting end of the code storage (5) is signal-connected to the receiving end of the principal component analysis module (6), and both the principal component analysis module (6) and the similarity calculation function module (12) are provided with a transmitting end and a receiving end.
3. A semantic coding lossless compression method based on heuristic linear transformation is characterized by comprising the following steps:
s1, coding processing
Encoding all candidate texts with a deep learning language model, converting each title into a K-dimensional vector, and storing the vectors in a suitable data form;
s2, building system
Setting up a search-result comparison and evaluation system whose inputs are: the candidate text D (4), the search text Q (3), and the compression matrix T (9), where the compression matrix T ∈ R^(K×r) and the value r is the encoding dimension after compression; the compressed encodings obtained by compressing the candidate text D (4) and the search text Q (3) with T are QT ∈ R^(M×r) and DT ∈ R^(N×r);
S3, constructing an iteration mechanism
Constructing an iteration mechanism for generating the compression matrix, adjusting and optimizing the compression matrix T (9) according to the search text Q (3) matrix, which changes in real time;
s4, iterative upgrade
Using the iterative generation method of step S3, the compression matrix T (9) is iteratively upgraded, and T^(best) is taken as the final compression matrix;
s5, constructing a screening system
Constructing a hierarchical screening system: for different compression dimensions r_a, r_b, r_c, …, different compression matrices are generated, labeled T^(r_a), T^(r_b), T^(r_c), … respectively;
S6, determining a final result
Using Sim_α() and the hierarchical screening mechanism, the search text Q (3) and the candidate text D (4) are processed before and after compression to obtain the final semantic search result.
4. A semantic coding lossless compression method based on heuristic linear transformation according to claim 3, characterized in that: in step S1, the deep learning language model is Google's open-source BERT, and the generated word vectors are either stored directly in system memory or stored on the system hard disk in a numpy or pickle file format for subsequent reading and calling, giving quantized forms of the candidate text D (4) and the search text Q (3):
D = [d_1; d_2; …; d_N] ∈ R^(N×K), Q = [q_1; q_2; …; q_M] ∈ R^(M×K)
The semantic search scenario is then described as
sort(q_i, D) = [d_i1, d_i2, d_i3, …, d_iN],
so that for a particular similarity calculation function Sim_α() we always have
Sim_α(q_i, d_ix) ≥ Sim_α(q_i, d_i,x+1).
5. A semantic coding lossless compression method based on heuristic linear transformation according to claim 4, characterized in that: the similarity calculation function Sim_α() mainly adopts either the cosine method or the Euclidean distance method, i.e.
cosine(q_i, d_j) = (q_i · d_j) / (‖q_i‖ ‖d_j‖)
euclidean(q_i, d_j) = ‖q_i − d_j‖_2
6. A semantic coding lossless compression method based on heuristic linear transformation according to claim 3, characterized in that: in step S2, the system embeds the similarity calculation function Sim_α() for evaluating search results, whose input is two groups of ranked results weighted by ranking coefficients, wherein λ_i is a ranking coefficient.
7. A semantic coding lossless compression method based on heuristic linear transformation according to claim 3, characterized in that: in step S3, the specific generation mechanism is as follows:
[formula: heuristic update rule for T^(t+1), combining T^(t), T^(t−1), T^(best), T^(worst) and T^(rand) with weights μ_1 … μ_6]
wherein μ_1, …, μ_6 ≥ 0; at initialization T^(best) = T^(worst) = T^(0) = T^(−1), and T^(rand) is a randomly generated compression matrix; and if
[formula: the performance of T^(t) exceeds that of T^(best)]
then T^(best) = T^(t);
[formula: analogous update condition for T^(worst)]
8. a method of semantically encoded lossless compression based on heuristic linear transformation according to claim 3, wherein: in the step S5, the expected acceleration multiple of the compression matrix is
Figure FDA0004111082120000044
The specific compression dimension is set by expectTC.
CN202110289154.7A 2021-03-18 2021-03-18 Semantic coding lossless compression system and method based on heuristic linear transformation Active CN113055018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110289154.7A CN113055018B (en) 2021-03-18 2021-03-18 Semantic coding lossless compression system and method based on heuristic linear transformation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110289154.7A CN113055018B (en) 2021-03-18 2021-03-18 Semantic coding lossless compression system and method based on heuristic linear transformation

Publications (2)

Publication Number Publication Date
CN113055018A CN113055018A (en) 2021-06-29
CN113055018B true CN113055018B (en) 2023-05-12

Family

ID=76513465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110289154.7A Active CN113055018B (en) 2021-03-18 2021-03-18 Semantic coding lossless compression system and method based on heuristic linear transformation

Country Status (1)

Country Link
CN (1) CN113055018B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881334A (en) * 2020-07-15 2020-11-03 浙江大胜达包装股份有限公司 Keyword-to-enterprise retrieval method based on semi-supervised learning

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080077570A1 (en) * 2004-10-25 2008-03-27 Infovell, Inc. Full Text Query and Search Systems and Method of Use
CN106776553A (en) * 2016-12-07 2017-05-31 中山大学 A kind of asymmetric text hash method based on deep learning
US11288297B2 (en) * 2017-11-29 2022-03-29 Oracle International Corporation Explicit semantic analysis-based large-scale classification
CN110502613B (en) * 2019-08-12 2022-03-08 腾讯科技(深圳)有限公司 Model training method, intelligent retrieval method, device and storage medium
CN110825901A (en) * 2019-11-11 2020-02-21 腾讯科技(北京)有限公司 Image-text matching method, device and equipment based on artificial intelligence and storage medium
CN111382260A (en) * 2020-03-16 2020-07-07 腾讯音乐娱乐科技(深圳)有限公司 Method, device and storage medium for correcting retrieved text
CN111444320B (en) * 2020-06-16 2020-09-08 太平金融科技服务(上海)有限公司 Text retrieval method and device, computer equipment and storage medium
CN111753060B (en) * 2020-07-29 2023-09-26 腾讯科技(深圳)有限公司 Information retrieval method, apparatus, device and computer readable storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881334A (en) * 2020-07-15 2020-11-03 浙江大胜达包装股份有限公司 Keyword-to-enterprise retrieval method based on semi-supervised learning

Also Published As

Publication number Publication date
CN113055018A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
Shu et al. Compressing word embeddings via deep compositional code learning
Liu et al. Deep triplet quantization
Gueniche et al. Compact prediction tree: A lossless model for accurate sequence prediction
CN101809567B (en) Two-pass hash extraction of text strings
Rajput et al. Recommender systems with generative retrieval
CN111914062B (en) Long text question-answer pair generation system based on keywords
CN113987169A (en) Text abstract generation method, device and equipment based on semantic block and storage medium
Zhang et al. A survey for efficient open domain question answering
CN116108128B (en) Open domain question-answering system and answer prediction method
Chen et al. Continual learning for generative retrieval over dynamic corpora
Kan et al. Zero-shot learning to index on semantic trees for scalable image retrieval
Sun et al. Automatic text summarization using deep reinforcement learning and beyond
CN116151266A (en) New word discovery method and device, electronic equipment and storage medium
KR20220092776A (en) Apparatus and method for quantizing neural network models
Sathyendra et al. Extreme model compression for on-device natural language understanding
CN112580325B (en) Rapid text matching method and device
CN112732862B (en) Neural network-based bidirectional multi-section reading zero sample entity linking method and device
CN117235250A (en) Dialogue abstract generation method, device and equipment
CN113055018B (en) Semantic coding lossless compression system and method based on heuristic linear transformation
WO2020241070A1 (en) Audio signal retrieving device, audio signal retrieving method, data retrieving device, data retrieving method, and program
CN117290485A (en) LLM-based question-answer enhancement method
Qiu et al. Efficient document retrieval by end-to-end refining and quantizing BERT embedding with contrastive product quantization
Strimel et al. Statistical model compression for small-footprint natural language understanding
CN113704466B (en) Text multi-label classification method and device based on iterative network and electronic equipment
CN112966501B (en) New word discovery method, system, terminal and medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant