CN113055018A - Semantic coding lossless compression system and method based on heuristic linear transformation - Google Patents


Info

Publication number
CN113055018A
CN113055018A · CN202110289154.7A · CN202110289154A
Authority
CN
China
Prior art keywords
compression
semantic
text
matrix
linear transformation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110289154.7A
Other languages
Chinese (zh)
Other versions
CN113055018B (en)
Inventor
裴正奇
王树徽
黄梓忱
朱斌斌
于秋鑫
段必超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Qianhai Heidun Technology Co ltd
Original Assignee
Shenzhen Qianhai Heidun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Qianhai Heidun Technology Co ltd filed Critical Shenzhen Qianhai Heidun Technology Co ltd
Priority to CN202110289154.7A priority Critical patent/CN113055018B/en
Publication of CN113055018A publication Critical patent/CN113055018A/en
Application granted granted Critical
Publication of CN113055018B publication Critical patent/CN113055018B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03M: CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00: Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/70: Type of the data to be coded, other than image and sound
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention belongs to the technical field of lossless compression of semantic codes, and particularly relates to a semantic coding lossless compression system based on heuristic linear transformation. The semantic coding lossless compression system and method based on heuristic linear transformation encode texts with a deep learning language model to obtain an encoded representation (embedding) of each text, compute the semantic similarity between a retrieval sentence and each candidate text (the texts to be retrieved by the retrieval sentence) using methods such as the cosine of the included angle and the Euclidean distance, and rank the similarities to obtain the semantic search result.

Description

Semantic coding lossless compression system and method based on heuristic linear transformation
Technical Field
The invention relates to the technical field of semantic code lossless compression, and in particular to a semantic code lossless compression system and method based on heuristic linear transformation.
Background
Existing semantic search and coding technology cannot combine lossless content with a large compression ratio: the achievable compression is limited, and the original semantic content is heavily damaged after compression. For example, LSH is only suitable for scenarios with low accuracy requirements on the exact ranking of the output, such as recommendation systems; where precise ranking is required, LSH is not up to the task.
Furthermore, some scenarios weight the compression ratio and running speed more heavily, while others weight accuracy and content losslessness more heavily; existing models cannot be iterated precisely according to the index requirements of each scenario, and so cannot meet scenario requirements without bias.
Current related technologies are also relatively rigid and indiscriminate, lacking scenario specificity. In fact, different scenarios should have different semantic coding compression modes, and the same text should have different coding forms and compression mechanisms in different scenarios (such as "library book retrieval", "intelligent customer service" and "knowledge question answering"), so that the optimal quantization effect can be achieved under limited computational resources.
Finally, current related technologies cannot iterate efficiently on real-time user feedback, so unsatisfactory retrieval results inevitably appear as usage accumulates. Some are caused by the semantic coding or the retrieval mechanism, others by the external environment (such as changes in knowledge points); current technologies cannot perform targeted, efficient updates for these "unsatisfactory cases", and so need to be redesigned.
Disclosure of Invention
The invention aims to provide a semantic coding lossless compression system and method based on heuristic linear transformation, so as to solve the problems raised in the background art.
In order to achieve the above purpose, the invention provides the following technical scheme: a semantic coding lossless compression system based on heuristic linear transformation comprises a retrieval text set and a candidate text set; transmitting ends of the retrieval text set and the candidate text set are in signal connection with a receiving end of a deep learning language model; a transmitting end of the deep learning language model is in signal connection with receiving ends of a retrieval text Q and a candidate text D; a transmitting end of the candidate text D is in signal connection with a receiving end of a code storage; transmitting ends of the retrieval text Q and the code storage are in signal connection with a receiving end of a compression matrix T; a transmitting end of the compression matrix T is in signal connection with receiving ends of a compressed version DT and a compressed version QT; transmitting ends of the compressed version DT and the compressed version QT are in signal connection with a receiving end of a similarity calculation function module; a transmitting end of the similarity calculation function module is in signal connection with a receiving end of a principal component analysis module; a transmitting end of the principal component analysis module is in signal connection with a receiving end of an initial compression matrix T; a transmitting end of the initial compression matrix T is in signal connection with a receiving end of a hierarchical screening system; and a transmitting end of the hierarchical screening system is in signal connection with a receiving end of the compression matrix T.
Preferably, a transmitting end of the code storage is in signal connection with a receiving end of the principal component analysis module, and the principal component analysis module and the similarity calculation function module are each provided with a transmitting end and a receiving end.
Another technical problem to be solved by the present invention is to provide a semantic coding lossless compression method based on heuristic linear transformation, so as to solve the problems raised in the background art.
in order to achieve the purpose, the invention provides the following technical scheme: the method comprises the following steps:
s1, encoding processing
All candidate texts are encoded with a deep learning language model, each book title is converted into a K-dimensional vector, and the vectors are stored in a suitable data form.
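As a concrete illustration of this step, the sketch below encodes a few candidate titles with a BERT-style model and caches the vectors; the sentence-transformers wrapper, the model name and the file path are illustrative assumptions, not specified by the method itself.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Hypothetical encoder choice; the method only requires some deep
    # learning language model that maps a text to a K-dimensional vector.
    model = SentenceTransformer("bert-base-chinese")

    titles = ["Peak of the Wave",
              "An Introduction to Differential Geometry and General Relativity",
              "Fermat's Theorem"]

    D = model.encode(titles)            # shape (N, K); K = 768 for BERT-base
    np.save("candidate_codes.npy", D)   # cache on disk for later reading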
S2 building system
A search-result comparison and evaluation system is built, the input of which comprises: the candidate text D, the retrieval text Q and a compression matrix T, where the compression matrix T ∈ R^(K×r) and the value r represents the coding dimension after compression; the compressed codes of the retrieval text Q and the candidate text D are QT ∈ R^(M×r) and DT ∈ R^(N×r), respectively.
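A minimal sketch of the data flow just described, assuming K = 768 original dimensions and r = 20 compressed dimensions; the random matrices merely stand in for real encodings.

    import numpy as np

    K, r, M, N = 768, 20, 5, 10000
    Q = np.random.randn(M, K)   # encoded retrieval texts, Q  in R^(M x K)
    D = np.random.randn(N, K)   # encoded candidate texts,  D  in R^(N x K)
    T = np.random.randn(K, r)   # compression matrix,       T  in R^(K x r)

    QT = Q @ T                  # compressed retrieval codes, QT in R^(M x r)
    DT = D @ T                  # compressed candidate codes, DT in R^(N x r)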
S3, building an iteration mechanism
A generative compression matrix iteration mechanism is established, and the compression matrix T is adjusted and optimized according to the retrieval text Q matrix, which changes in real time.
S4, iterative upgrade
Using the above iterative generation method, the compression matrix T is iteratively upgraded, and T^(best) is taken as the final compression matrix.
S5, building a screening system
A hierarchical screening system is built; for different compression dimensions r_a, r_b, r_c, …, different compression matrices are generated, labeled T^(r_a), T^(r_b), T^(r_c), … respectively.
The core idea of hierarchical screening is as follows: although the search results after compression may deviate from those before compression, the magnitude of the deviation is limited; for example, a result ranked 10th before compression might fall to 18th after compression, but not "far away" to 2000th. Assuming the user only focuses on the top L of the ranking, the post-compression safety deviation value G(L) is:
G(L) = max([sort(q_i·T, DT).index(item) for item in sort(q_i, D)[:L]])
G(L) can be understood as the largest ranking deviation within the top L. After G(L) is obtained, 1.5·G(L) is taken as a safety threshold, and the similarity of the candidates in sort(q_i·T, DT)[:1.5·G(L)] is recalculated once in their uncompressed encoded form. The idea resembles an "open audition": the compression matrix performs the initial audition and can pick out the "strong contestants", but ranking those contestants precisely still requires the more complete, more elaborate selection method (the uncompressed encoded form). However, because the audition has already filtered out the vast majority of "contestants", running the remaining "strong contestants" in uncompressed encoded form does not add much extra computing time. The expected speed-up multiple of a compression matrix is
[expectTC formula not reproduced in the source]
The specific compression dimension may be set by expectTC.
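The following sketch implements the hierarchical screening just described: a coarse ranking with the compressed codes, the safety value G(L), and a re-ranking of the top 1.5·G(L) shortlist in uncompressed form. Only the 1.5·G(L) threshold comes from the text; the cosine ranking helper and all names are illustrative assumptions.

    import numpy as np

    def rank(q, C):
        # indices of the rows of C, ordered by descending cosine similarity to q
        sims = (C @ q) / (np.linalg.norm(C, axis=1) * np.linalg.norm(q) + 1e-12)
        return list(np.argsort(-sims))

    def G(L, q, D, T):
        # largest post-compression rank among the top-L pre-compression results;
        # in practice G(L) would be estimated in advance over many retrieval texts
        coarse = rank(q @ T, D @ T)
        return max(coarse.index(i) for i in rank(q, D)[:L])

    def hierarchical_search(q, D, T, L=10):
        g = G(L, q, D, T)                                # safety deviation value
        shortlist = rank(q @ T, D @ T)[:max(L, int(1.5 * g))]
        # re-rank the shortlist once with the uncompressed encodings
        exact = rank(q, D[np.array(shortlist)])
        return [shortlist[j] for j in exact][:L]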
S6, determining the final result
Using Sim_α() and the hierarchical screening mechanism, the retrieval text Q and the candidate text D are operated on before and after compression to obtain the final semantic search result.
Preferably, in step S1, the deep learning language model may be Google's BERT, and the vectors may be kept directly in system memory or stored on the system hard disk in file formats such as numpy or pickle for subsequent reading and calling, yielding the quantized forms of the candidate text D and the retrieval text Q:
Q = [q_1, q_2, …, q_M]^T ∈ R^(M×K)
D = [d_1, d_2, …, d_N]^T ∈ R^(N×K)
the semantic search scenario described above may then be described as
sort(q_i, D) = [d_{i1}, d_{i2}, d_{i3}, …, d_{iN}]
so that, for a particular similarity calculation function Sim_α(), we always have
Sim_α(q_i, d_{ix}) ≥ Sim_α(q_i, d_{i,x+1}).
Preferably, the similarity calculation function Sim_α() uses the cosine value (cosine) or the Euclidean distance (euclidean), i.e.
Sim_cosine(q_i, d_j) = (q_i · d_j) / (‖q_i‖ ‖d_j‖)
Sim_euclidean(q_i, d_j) = −‖q_i − d_j‖_2 (the distance negated so that larger values mean more similar)
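A hedged sketch of the two similarity options named above, plus the sort() operation used throughout (and reused by the later sketches); negating the Euclidean distance so that larger means more similar is an assumption of this sketch, not a formula quoted from the patent.

    import numpy as np

    def sim_cosine(q, d):
        return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))

    def sim_euclidean(q, d):
        # distance turned into a similarity: larger value = more similar
        return float(-np.linalg.norm(q - d))

    def sort_candidates(q, D, sim=sim_cosine):
        # sort(q_i, D): candidate indices with Sim(q, d_x) >= Sim(q, d_{x+1})
        return sorted(range(len(D)), key=lambda j: sim(q, D[j]), reverse=True)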
Preferably, in the step S2, the system is embedded with a similarity calculation function Sim_α() for evaluating the search results, whose input is two arrays:
[evaluation formula not reproduced in the source]
where λ_i is a ranking coefficient. The principle of the ranking coefficient is that the higher a result is ranked, the more it matters to the presentation of the search results: an error in the first-ranked result has more serious consequences than an error in the tenth-ranked one. On the basis of the similarity calculation function Sim_α(), a complete search-result comparison and evaluation mechanism can be realized. For the candidate text D, the retrieval text Q and the compression matrix T, the performance is evaluated as follows (since the candidate text D does not change in the actual scenario and can be treated as a constant, it can be omitted from the notation below):
[performance evaluation formula not reproduced in the source]
where the term
[formula not reproduced in the source]
is equivalent to the degree to which the compression matrix T is lossless with respect to search performance: the higher it is, the closer the search results before and after compression, and the smaller the performance loss after compression.
[formula not reproduced in the source]
Note that if the usage scenario only focuses on the top ten of the search results, then λ_{i>10} = 0.
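Since the evaluation formula itself is only given as an image in the source, the sketch below is a guessed stand-in: a λ-weighted rank-deviation score in which deviations near the top of the ranking cost more, and λ_{i>10} = 0 reproduces the "top ten only" scenario mentioned above. It reuses sort_candidates from the earlier sketch.

    def eval_compression(q, D, T, lambdas):
        # lower score = search results before and after compression agree better
        before = sort_candidates(q, D)
        after = sort_candidates(q @ T, D @ T)
        return sum(lam * abs(i - after.index(before[i]))
                   for i, lam in enumerate(lambdas))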
The compression matrix T could be initialized randomly, but the performance of a randomly generated T is mediocre, so the compression matrix T is instead initialized with linear-algebra methods. The main principle is that, while the retrieval text Q is unknown or incomplete, the candidate text D can temporarily stand in for it: as long as the structural relationships among the encodings of the texts in the candidate text D are preserved after the compression transformation by T, the compression matrix T can be considered to have "mastered" the semantic structure of the candidate text D:
[formula not reproduced in the source]
By means of incremental PCA (iPCA), a variant of principal component analysis (PCA) from the linear-algebra domain, the optimal compressed form of the candidate text D for the compressed dimension r can be obtained:
iPCA(D, r) ∈ R^(N×r)
In addition, the Moore-Penrose pseudoinverse method from the linear-algebra domain is used to obtain the pseudoinverse D^+ of the candidate text D, and the initialized compression matrix T is obtained by means of iPCA(D):
T^(0) = D^+ · iPCA(D, r)
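A sketch of this initialization: incremental PCA yields the best r-dimensional compressed form of D, and the Moore-Penrose pseudoinverse maps D onto it, giving T^(0) = D^+ · iPCA(D, r). The use of scikit-learn is an assumption; the patent names only the two linear-algebra methods.

    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    def init_compression_matrix(D, r):
        D_r = IncrementalPCA(n_components=r).fit_transform(D)  # N x r codes
        return np.linalg.pinv(D) @ D_r                          # T(0), K x r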
Preferably, in step S3, the specific generation mechanism is as follows:
[generation-mechanism formula not reproduced in the source]
where μ_1, …, μ_6 ≥ 0; at initialization, T^(best) = T^(worst) = T^(0) = T^(-1), and T^(rand) is a randomly generated compression matrix, with
[formulas not reproduced in the source]
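The generation formula appears only as an image in the source, so the loop below is a guessed hill-climbing reading of the surrounding text: candidates are mixed from T^(best), the previous iterates and a random matrix using some of the weights μ_1..μ_6, and T^(best) is kept whenever the evaluation score improves. Every detail beyond the names is an assumption; it reuses eval_compression from the previous sketch.

    import numpy as np

    def iterate_T(queries, D, T0, lambdas, mu, steps=50):
        T_best = T_prev = T_curr = T0
        best = sum(eval_compression(q, D, T0, lambdas) for q in queries)
        for _ in range(steps):
            T_rand = np.random.randn(*T0.shape)       # T^(rand)
            cand = (mu[0] * T_best + mu[1] * T_curr
                    + mu[2] * T_prev + mu[3] * T_rand)
            score = sum(eval_compression(q, D, cand, lambdas) for q in queries)
            T_prev, T_curr = T_curr, cand
            if score < best:                          # more lossless: keep it
                best, T_best = score, cand
        return T_best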
preferably, in the step S5, the expected speed-up multiple of the compression matrix is
[expectTC formula not reproduced in the source]
The specific compression dimension may be set by expectTC.
Compared with the prior art, the invention has the beneficial effects that:
1. The semantic coding lossless compression system and method based on heuristic linear transformation encode texts with a deep learning language model to obtain an encoded representation (embedding) of each text, compute the semantic similarity between a retrieval sentence and each candidate text (the texts to be retrieved by the retrieval sentence) using methods such as the cosine of the included angle and the Euclidean distance, and rank the similarities to obtain the semantic search result.
2. The semantic coding lossless compression system and method based on heuristic linear transformation cache the generated semantic codes of the candidate texts, so they need not be regenerated during real-time retrieval.
3. The semantic coding lossless compression system and method based on heuristic linear transformation use a linear transformation matrix (compression matrix) to reduce the dimension of the encoded representations of the retrieval sentence and each candidate text, achieving compression and further increasing the speed of the semantic-similarity computation.
4. The semantic coding lossless compression system and method based on heuristic linear transformation calculate the degree of deviation between the search results before and after compression, forming a method for measuring the performance of a compression matrix.
5. The semantic coding lossless compression system and method based on heuristic linear transformation initialize the compression matrix with linear-algebra methods such as principal component analysis (PCA) and its variant (incremental PCA) and the Moore-Penrose pseudoinverse.
6. The semantic coding lossless compression system and method based on heuristic linear transformation select different compression dimensions to initialize compression matrices, evaluate the performance of each one by one, and use the idea of hierarchical screening to obtain the optimal compression-dimension setting and the related parameter settings of the hierarchical screening method.
7. The semantic coding lossless compression system and method based on heuristic linear transformation iteratively upgrade the compression matrix according to the retrieval sentences (which can be understood as "retrieval cases") for which the user expects lossless compression; as such retrieval sentences accumulate, the compression matrix keeps upgrading.
8. The semantic coding lossless compression system and method based on heuristic linear transformation apply hierarchical screening to run a staged "open audition" over the retrieval process, fully strengthening the degree of local lossless compression while ensuring that the speed does not drop.
Drawings
FIG. 1 is a schematic view of the system as a whole.
In the figure: 1. retrieval text set; 2. deep learning language model; 3. retrieval text Q; 4. candidate text D; 5. code storage; 6. principal component analysis module; 7. hierarchical screening system; 8. initial compression matrix T; 9. compression matrix T; 10. compressed version DT; 11. compressed version QT; 12. similarity calculation function module; 13. candidate text set.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
Referring to fig. 1, the present invention provides a technical solution: a semantic coding lossless compression system based on heuristic linear transformation comprises a retrieval text set 1 and a candidate text set 13; the transmitting ends of the retrieval text set 1 and the candidate text set 13 are in signal connection with the receiving end of a deep learning language model 2; the transmitting end of the deep learning language model 2 is in signal connection with the receiving ends of a retrieval text Q 3 and a candidate text D 4; the transmitting end of the candidate text D 4 is in signal connection with the receiving end of a code storage 5; the transmitting ends of the retrieval text Q 3 and the code storage 5 are in signal connection with the receiving end of a compression matrix T 9; the transmitting end of the compression matrix T 9 is in signal connection with the receiving ends of a compressed version DT 11 and a compressed version QT 10; the transmitting ends of the compressed version DT 11 and the compressed version QT 10 are in signal connection with the receiving end of a similarity calculation function module 12; the transmitting end of the similarity calculation function module 12 is in signal connection with the receiving end of a principal component analysis module 6; the transmitting end of the principal component analysis module 6 is in signal connection with the receiving end of an initial compression matrix T 8; the transmitting end of the initial compression matrix T 8 is in signal connection with the receiving end of a hierarchical screening system 7; the transmitting end of the hierarchical screening system 7 is in signal connection with the receiving end of the compression matrix T 9; the transmitting end of the code storage 5 is in signal connection with the receiving end of the principal component analysis module 6; and the principal component analysis module 6 and the similarity calculation function module 12 are both provided with a transmitting end and a receiving end.
Example two
For a semantic search system in a library, the candidate texts to be searched are book titles (about 360,000 accumulated), for example: "Peak of the Wave", "An Introduction to Differential Geometry and General Relativity", "Fermat's Theorem", and so on. The system must find the best-matching book titles for the user's search sentence; for example, if the user searches "what is the mathematical principle of general relativity", the system should preferentially output semantically related titles such as "An Introduction to Differential Geometry and General Relativity".
All book titles are encoded with a deep learning language model (such as Google's open-source BERT); each title becomes a 768-dimensional vector, and together they form a 360000 × 768 matrix, which is cached in the form of a global variable.
According to step 2, the initialized compression matrix T^(0) is obtained; its parameters are set as follows:
[parameter settings not reproduced in the source]
According to steps 3 and 4, the compression matrix T^(0) is iterated according to the actual retrieval text Q to obtain the final compression matrix T, the parameters in step 5 being
μ1=0.9
μ2=0.1
μ3=0.05
μ4=0.05
μ5=0.1
μ6=0.1
The expected speed-up multiples of the T generated by different compression dimensions differ; according to step 6 of the scheme, the different T are obtained, together with their G(L) and expectTC values:
[values not reproduced in the source]
The following table shows the case where N = 10000:
[table not reproduced in the source]
Finally, 20 dimensions are selected as the compression dimension, with G(L) = G(10); the measured speed comparisons are as follows (under the condition that the top-L search results before and after compression match exactly).
The following table shows the cases where N differs:
[table not reproduced in the source]
The semantic codes generated for the candidate texts are cached, so they need not be regenerated during real-time retrieval.
A linear transformation matrix (compression matrix) reduces the dimension of the encoded representations of the retrieval sentence and each candidate text, achieving compression and further increasing the speed of the semantic-similarity computation.
The degree of deviation between the search results before and after compression is calculated, forming a method for measuring the performance of a compression matrix.
The compression matrix is initialized with linear-algebra methods such as principal component analysis (PCA) and its variant (incremental PCA) and the Moore-Penrose pseudoinverse.
Different compression dimensions are selected to initialize compression matrices, the performance of each is evaluated one by one, and the idea of hierarchical screening is used to obtain the optimal compression-dimension setting and the related parameter settings of the hierarchical screening method.
The compression matrix is iteratively upgraded according to the retrieval sentences (which can be understood as "retrieval cases") for which the user expects lossless compression; as such retrieval sentences accumulate, the compression matrix keeps upgrading.
Hierarchical screening runs a staged "open audition" over the retrieval process, fully strengthening the degree of local lossless compression while ensuring that the speed does not drop.
In a semantic search scenario over large-scale text, with the quality deviation rate kept controllable, the dimensionality of the high-dimensional semantic codes/vectors can be reduced by tens of times, so that the retrieval speed rises by several orders of magnitude (for example, in a search scenario with one million candidate texts in total, retrieval after local lossless compression can be 27 times faster than with uncompressed but cached texts, and some three thousand times faster than with uncompressed and uncached texts).
Provided that the speed-up before and after compression is no less than 10 times, losslessness can be maintained over the top 10-20 (that is, the first 10-20 search results are identical before and after compression), which fully satisfies common semantic search scenarios.
The user can continuously iterate the parameters of the compression method for retrieval sentences whose search results are unsatisfactory, so the semantic search performance keeps improving as scenario cases accumulate.
it is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A semantic code lossless compression system based on heuristic linear transformation, comprising a retrieval text set (1) and a candidate text set (13), characterized in that: transmitting ends of the retrieval text set (1) and the candidate text set (13) are in signal connection with a receiving end of a deep learning language model (2); a transmitting end of the deep learning language model (2) is in signal connection with receiving ends of a retrieval text Q (3) and a candidate text D (4); a transmitting end of the candidate text D (4) is in signal connection with a receiving end of a code storage (5); transmitting ends of the retrieval text Q (3) and the code storage (5) are in signal connection with a receiving end of a compression matrix T (9); a transmitting end of the compression matrix T (9) is in signal connection with receiving ends of a compressed version DT (11) and a compressed version QT (10); transmitting ends of the compressed version DT (11) and the compressed version QT (10) are in signal connection with a receiving end of a similarity calculation function module (12); a transmitting end of the similarity calculation function module (12) is in signal connection with a receiving end of a principal component analysis module (6); a transmitting end of the principal component analysis module (6) is in signal connection with a receiving end of an initial compression matrix T (8); a transmitting end of the initial compression matrix T (8) is in signal connection with a receiving end of a hierarchical screening system (7); and a transmitting end of the hierarchical screening system (7) is in signal connection with a receiving end of the compression matrix T (9).
2. The semantic coding lossless compression system based on the heuristic linear transformation as claimed in claim 1, wherein: a transmitting end of the code storage (5) is in signal connection with a receiving end of the principal component analysis module (6), and the principal component analysis module (6) and the similarity calculation function module (12) are each provided with a transmitting end and a receiving end.
3. A semantic coding lossless compression method based on heuristic linear transformation is characterized by comprising the following steps:
s1, encoding processing
All candidate texts are encoded with a deep learning language model, each book title is converted into a K-dimensional vector, and the vectors are stored in a suitable data form.
S2 building system
A search-result comparison and evaluation system is built, the input of which comprises: the candidate text D (4), the retrieval text Q (3) and a compression matrix T (9), where the compression matrix T ∈ R^(K×r) and the value r represents the coding dimension after compression; the compressed codes of the retrieval text Q (3) and the candidate text D (4) are QT ∈ R^(M×r) and DT ∈ R^(N×r), respectively.
S3, building an iteration mechanism
A generative compression matrix iteration mechanism is constructed, and the compression matrix T (9) is adjusted and optimized according to the retrieval text Q (3) matrix, which changes in real time.
S4, iterative upgrade
Using the above iterative generation method, the compression matrix T (9) is iteratively upgraded, and T^(best) is taken as the final compression matrix.
S5, building a screening system
A hierarchical screening system is built; for different compression dimensions r_a, r_b, r_c, …, different compression matrices are generated, labeled T^(r_a), T^(r_b), T^(r_c), … respectively.
S6, determining the final result
Using Sim_α() and the hierarchical screening mechanism, the retrieval text Q and the candidate text D are operated on before and after compression to obtain the final semantic search result.
4. The semantic coding lossless compression method based on the heuristic linear transformation as claimed in claim 3, wherein: in step S1, the deep learning language model may be Google's BERT, and the vectors may be kept directly in system memory or stored on the system hard disk in a numpy or pickle file format for subsequent reading and calling, yielding the quantized forms of the candidate text D and the retrieval text Q:
Q = [q_1, q_2, …, q_M]^T ∈ R^(M×K)
D = [d_1, d_2, …, d_N]^T ∈ R^(N×K)
the semantic search scenario described above may then be described as
sort(q_i, D) = [d_{i1}, d_{i2}, d_{i3}, …, d_{iN}]
so that, for a particular similarity calculation function Sim_α(), we always have
Sim_α(q_i, d_{ix}) ≥ Sim_α(q_i, d_{i,x+1}).
5. The semantic coding lossless compression method based on the heuristic linear transformation as claimed in claim 4, wherein: the similarity calculation function Sim_α() uses the cosine value (cosine) or the Euclidean distance (euclidean), i.e.
Sim_cosine(q_i, d_j) = (q_i · d_j) / (‖q_i‖ ‖d_j‖)
Sim_euclidean(q_i, d_j) = −‖q_i − d_j‖_2 (the distance negated so that larger values mean more similar)
6. The semantic coding lossless compression method based on the heuristic linear transformation as claimed in claim 3, wherein: in the step S2, the system is embedded with a similarity calculation function Sim_α() for evaluating the search results, whose input is two arrays:
[evaluation formula not reproduced in the source]
where λ_i is a ranking coefficient.
7. The semantic coding lossless compression method based on the heuristic linear transformation as claimed in claim 3, wherein: in step S3, the specific generation mechanism is as follows:
[generation-mechanism formulas not reproduced in the source]
where μ_1, …, μ_6 ≥ 0; at initialization, T^(best) = T^(worst) = T^(0) = T^(-1), and T^(rand) is a randomly generated compression matrix, with
[formulas not reproduced in the source]
8. the semantic coding lossless compression method based on the heuristic linear transformation as claimed in claim 3, wherein: in the step S5, the expected speed-up multiple of the compression matrix is
[expectTC formula not reproduced in the source]
The specific compression dimension may be set by expectTC.
CN202110289154.7A 2021-03-18 2021-03-18 Semantic coding lossless compression system and method based on heuristic linear transformation Active CN113055018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110289154.7A CN113055018B (en) 2021-03-18 2021-03-18 Semantic coding lossless compression system and method based on heuristic linear transformation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110289154.7A CN113055018B (en) 2021-03-18 2021-03-18 Semantic coding lossless compression system and method based on heuristic linear transformation

Publications (2)

Publication Number Publication Date
CN113055018A (en) 2021-06-29
CN113055018B (en) 2023-05-12

Family

ID=76513465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110289154.7A Active CN113055018B (en) 2021-03-18 2021-03-18 Semantic coding lossless compression system and method based on heuristic linear transformation

Country Status (1)

Country Link
CN (1) CN113055018B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080077570A1 (en) * 2004-10-25 2008-03-27 Infovell, Inc. Full Text Query and Search Systems and Method of Use
CN106776553A (en) * 2016-12-07 2017-05-31 中山大学 A kind of asymmetric text hash method based on deep learning
US20190163817A1 (en) * 2017-11-29 2019-05-30 Oracle International Corporation Approaches for large-scale classification and semantic text summarization
CN110502613A (en) * 2019-08-12 2019-11-26 腾讯科技(深圳)有限公司 A kind of model training method, intelligent search method, device and storage medium
CN110825901A (en) * 2019-11-11 2020-02-21 腾讯科技(北京)有限公司 Image-text matching method, device and equipment based on artificial intelligence and storage medium
CN111382260A (en) * 2020-03-16 2020-07-07 腾讯音乐娱乐科技(深圳)有限公司 Method, device and storage medium for correcting retrieved text
CN111444320A (en) * 2020-06-16 2020-07-24 太平金融科技服务(上海)有限公司 Text retrieval method and device, computer equipment and storage medium
CN111881334A (en) * 2020-07-15 2020-11-03 浙江大胜达包装股份有限公司 Keyword-to-enterprise retrieval method based on semi-supervised learning
CN111753060A (en) * 2020-07-29 2020-10-09 腾讯科技(深圳)有限公司 Information retrieval method, device, equipment and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Yu (李宇), "Research on Text Fragmentation Mechanisms in Document Retrieval" (《文档检索中文本片段化机制的研究》), Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》) *

Also Published As

Publication number Publication date
CN113055018B (en) 2023-05-12


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant