CN113055018B - Semantic coding lossless compression system and method based on heuristic linear transformation - Google Patents

Semantic coding lossless compression system and method based on heuristic linear transformation

Info

Publication number
CN113055018B
CN113055018B (application CN202110289154.7A)
Authority
CN
China
Prior art keywords
compression
matrix
search
text
receiving end
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110289154.7A
Other languages
Chinese (zh)
Other versions
CN113055018A (en)
Inventor
裴正奇
王树徽
黄梓忱
朱斌斌
于秋鑫
段必超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Qianhai Heidun Technology Co ltd
Original Assignee
Shenzhen Qianhai Heidun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Qianhai Heidun Technology Co ltd filed Critical Shenzhen Qianhai Heidun Technology Co ltd
Priority to CN202110289154.7A priority Critical patent/CN113055018B/en
Publication of CN113055018A publication Critical patent/CN113055018A/en
Application granted granted Critical
Publication of CN113055018B publication Critical patent/CN113055018B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/70 Type of the data to be coded, other than image and sound
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention belongs to the technical field of semantic coding lossless compression, and particularly relates to a semantic coding lossless compression system based on heuristic linear transformation. The system and method use a deep learning language model to encode texts and obtain an encoded representation (embedding) of each text; the semantic similarity between a search sentence and each candidate text (the texts to be searched against) is computed with measures such as the cosine of the angle between vectors or the Euclidean distance, and the candidates are ranked by semantic similarity to produce the semantic search result.

Description

Semantic coding lossless compression system and method based on heuristic linear transformation
Technical Field
The invention relates to the technical field of semantic coding lossless compression, in particular to a semantic coding lossless compression system and a semantic coding lossless compression method based on heuristic linear transformation.
Background
Existing semantic search and encoding technologies cannot achieve both lossless content and a large compression ratio: the achievable compression is limited, and much of the original semantic content can be lost after compression. For example, LSH is only suitable for scenarios with low accuracy requirements on the exact ranking of output results, such as recommendation systems; where precise ranking is required, LSH is not adequate.
Existing methods also cannot be flexibly adjusted to the demands of a scenario: some scenarios care more about compression ratio and running speed, others about accuracy and losslessness, yet the model cannot be iterated precisely against the scenario's target metrics, so the scenario's demands cannot be met without bias.
Current related technologies are rigid and generic, lacking any targeting of the scenario. In fact, different scenarios call for different semantic coding compression schemes: the same text should be encoded, and its encoding compressed, differently in different scenarios (e.g. "library book retrieval", "intelligent customer service", "knowledge question answering"), so that the best quantization effect is achieved under limited computing resources.
Current related technologies also cannot iterate efficiently on real-time user feedback. As a system is used, unsatisfactory search results will inevitably occur, some caused by the semantic encoding or the search mechanism, others by the external environment (e.g. a change in the underlying knowledge points). Existing technologies cannot apply targeted, efficient updates for such "non-ideal cases" and instead have to be redesigned.
Disclosure of Invention
The invention aims to provide a semantic coding lossless compression system and a semantic coding lossless compression method based on heuristic linear transformation, so as to solve the problems in the background technology.
In order to achieve the above purpose, the present invention provides the following technical solution: the semantic coding lossless compression system based on heuristic linear transformation comprises a search text set and a candidate text set. The transmitting ends of the search text set and the candidate text set are signal-connected to the receiving end of the deep learning language model; the transmitting end of the deep learning language model is signal-connected to the receiving ends of the search text Q and the candidate text D; the transmitting end of the candidate text D is signal-connected to the receiving end of the code storage; the transmitting ends of the search text Q and the code storage are signal-connected to the receiving end of the compression matrix T; the transmitting end of the compression matrix T is signal-connected to the receiving ends of the compressed version DT and the compressed version QT; the transmitting ends of the compressed version DT and the compressed version QT are signal-connected to the receiving end of the similarity calculation function module; the transmitting end of the similarity calculation function module is signal-connected to the receiving end of the principal component analysis module; the transmitting end of the principal component analysis module is signal-connected to the receiving end of the initial compression matrix T; the transmitting end of the initial compression matrix T is signal-connected to the receiving end of the hierarchical screening system; and the transmitting end of the hierarchical screening system is signal-connected to the receiving end of the compression matrix T.
Preferably, the transmitting end of the code storage is signal-connected to the receiving end of the principal component analysis module, and both the principal component analysis module and the similarity calculation function module are provided with a transmitting end and a receiving end.
Another technical problem to be solved by the present invention is to provide a semantic coding lossless compression method based on heuristic linear transformation, so as to solve the problems set forth in the above background art;
in order to achieve the above purpose, the present invention provides the following technical solutions: the method comprises the following steps:
s1, coding processing
Encode all candidate texts with the deep learning language model, convert each title into a K-dimensional vector, and store the vectors in a suitable data form.
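For concreteness, a minimal sketch of step S1 is given below. It assumes the Hugging Face transformers package and the bert-base-chinese checkpoint; the patent itself prescribes only "a deep learning language model", so these choices, like the helper names, are illustrative.

```python
# Sketch of step S1 under assumed tooling (Hugging Face transformers, bert-base-chinese).
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

def encode(texts):
    """Encode a list of titles into K-dimensional vectors (K = 768 for BERT-base)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return out.last_hidden_state[:, 0, :].numpy()  # [CLS] vector as the text encoding

candidate_titles = ["浪潮之巅", "微分几何与广义相对论", "费马大定理"]  # illustrative titles
D = encode(candidate_titles)            # shape (N, K)
np.save("candidate_embeddings.npy", D)  # stored "in a suitable data form" (numpy file)
```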
S2, building system
Set up a search-result comparison and evaluation system whose inputs are: the candidate text D, the search text Q, and the compression matrix T, where the compression matrix T ∈ R^(K×r) and the value r is the encoding dimension after compression; the compressed encodings obtained by compressing the candidate text D and the search text Q with T are QT ∈ R^(M×r) and DT ∈ R^(N×r).
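Compression itself is a single matrix multiplication. The following shape check (all names illustrative, dimensions chosen to match the embodiment below) restates S2 in code:

```python
import numpy as np

K, r = 768, 20             # encoding dimension before and after compression
M, N = 100, 360000         # number of search texts and candidate texts
T = np.random.randn(K, r)  # compression matrix T in R^(K x r); random init, cf. S3/S4

Q = np.random.randn(M, K)  # search-text encodings, Q in R^(M x K)
D = np.random.randn(N, K)  # candidate-text encodings, D in R^(N x K)

QT = Q @ T                 # compressed search codes,    QT in R^(M x r)
DT = D @ T                 # compressed candidate codes, DT in R^(N x r)
assert QT.shape == (M, r) and DT.shape == (N, r)
```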
S3, constructing an iteration mechanism
Construct an iteration mechanism for generating the compression matrix, adjusting and optimizing the compression matrix T according to the search text matrix Q, which changes in real time.
S4, iterative upgrade
Using the iterative generation method of S3, the compression matrix T is iteratively upgraded, and T^(best) is taken as the final compression matrix.
S5, constructing a screening system
Construct a hierarchical screening system: for different compression dimensions r_a, r_b, r_c, …, different compression matrices are generated, labeled T^(r_a), T^(r_b), T^(r_c), … respectively.
The core idea of hierarchical screening is: although the search results after compression may deviate from those before compression, the magnitude of the deviation is limited. For example, a result ranked 10th before compression may rank 18th after compression, but not as "far" as 2000th. Assuming the user only cares about the top L of the ranking, the compressed safety deviation value G(L) is:
G(L) = max([sort(q_i T, DT).index(item) for item in sort(q_i, D)[:L]])
g (L) may be understood as the largest ranking bias value within the top L names; after obtaining G (L), 1.5G (L) is taken as a safety threshold value, and
Figure GDA0003900791140000032
is recomputed with their uncompressed encoded form. The concept is similar to "sea selection", where the compression matrix described above is the initial sea selection, which enables "excellent players" to be selected, but if one wants to rank these selected "excellent players" specifically, one still needs to resort to a more complete and complex selection method (uncompressed coded form). However, since "sea selection" has filtered out the vast majority of "players," even if the remaining "excellent players" are all operated on in uncompressed coded form, there is not much additional computation time cost. The expected acceleration multiple of a compression matrix is
[formula: expectTC, the expected acceleration multiple of the compression matrix]
The specific compression dimension may be set by expectTC.
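A runnable reading of G(L) and of the two-stage screening is sketched below; it assumes cosine similarity with higher scores ranked first, extends G(L) to the maximum over a set of queries, and all helper names are illustrative rather than taken from the patent.

```python
import numpy as np

def rank(q, X):
    """Row indices of X sorted by descending cosine similarity to q."""
    scores = (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q) + 1e-12)
    return list(np.argsort(-scores))

def G(L, queries, D, T):
    """Largest post-compression rank reached by any pre-compression top-L item,
    taken over the given queries (the patent states it per query q_i)."""
    worst = 0
    for q in queries:
        full, comp = rank(q, D), rank(q @ T, D @ T)
        worst = max(worst, max(comp.index(item) for item in full[:L]))
    return worst

def search(q, D, T, L, g):
    """Two-stage 'sea selection': rough screen with compressed codes, then exact
    re-ranking of the top 1.5*G(L) shortlist in uncompressed form."""
    shortlist = rank(q @ T, D @ T)[: int(1.5 * g)]
    reranked = rank(q, D[shortlist])
    return [shortlist[i] for i in reranked][:L]
```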
S6, determining a final result
Using Sim_α() and the hierarchical screening mechanism, the search text Q and the candidate text D are processed before and after compression to obtain the final semantic search result.
Preferably, in step S1, the deep learning language model may be Google's open-source BERT, and the vectors may be stored directly in system memory, or stored on the system hard disk in a file format such as numpy or pickle for subsequent reading and calling, giving quantized forms of the candidate text D and the search text Q:
D = [d_1; d_2; …; d_N] ∈ R^(N×K), Q = [q_1; q_2; …; q_M] ∈ R^(M×K)
The semantic search scenario may then be described as
sort(q_i, D) = [d_i1, d_i2, d_i3, …, d_iN],
so that for a particular similarity calculation function Sim_α() we always have
Sim_α(q_i, d_ix) ≥ Sim_α(q_i, d_i,x+1).
Preferably, the similarity calculation function Sim_α() mainly adopts either the cosine method or the Euclidean distance method, i.e.
cosine(q_i, d_j) = (q_i · d_j) / (‖q_i‖ ‖d_j‖)
euclidean(q_i, d_j) = ‖q_i − d_j‖_2
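In code, these two choices of Sim_α look as follows (a sketch; the Euclidean variant is negated so that "larger value means more similar" holds for both, which the patent's ordering condition requires):

```python
import numpy as np

def sim_cosine(q, d):
    """Cosine of the angle between the two encodings."""
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

def sim_euclidean(q, d):
    """Negated Euclidean distance, so Sim(q, d_ix) >= Sim(q, d_i,x+1) can hold."""
    return -float(np.linalg.norm(q - d))
```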
Preferably, in step S2, the system embeds the similarity calculation function Sim_α() for evaluating search results; its input is two groups of ranked results, weighted by ranking coefficients λ_i. The principle of the ranking coefficients is: the higher a result ranks, the more it matters to the presentation of search results; an error in the first-ranked result is more serious than an error in the tenth. Relying on the similarity calculation function Sim_α(), a complete search-result comparison and evaluation mechanism can be realized. For the candidate text D, the search text Q and the compression matrix T (the candidate text D does not change in the actual scenario and is a constant, so this parameter is omitted in the following calculation), the performance evaluation is computed as
[formula: performance evaluation of the compression matrix T over the search texts Q, based on Sim_α and the ranking coefficients λ_i]
wherein the evaluation value is equivalent to the degree of losslessness of the search performance brought by the compression matrix T: the higher it is, the closer the search results before and after compression, and the smaller the performance loss after compression.
Note that if the usage scenario focuses only on the top ten search results, then λ_i = 0 for i > 10.
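The evaluation formula itself is not legible in this text. Its stated ingredients (Sim_α, ranking coefficients λ_i that vanish beyond the ranks the scenario cares about, and a comparison of rankings before and after compression) admit a sketch like the one below; the exact weighting is an assumption, not the patented formula.

```python
import numpy as np

def rank(q, X):
    s = (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q) + 1e-12)
    return list(np.argsort(-s))

def perf(Q, D, T, lam):
    """Hypothetical losslessness score: lam[i] weights rank i and is 0 beyond the
    ranks that matter (e.g. lam has length 10 if only the top ten count)."""
    total = 0.0
    for q in Q:
        full = rank(q, D)            # uncompressed ranking
        comp = rank(q @ T, D @ T)    # compressed ranking
        # reward each top item for staying close to its uncompressed position
        total += sum(lam[i] / (1 + abs(i - comp.index(item)))
                     for i, item in enumerate(full[:len(lam)]))
    return total / (len(Q) * sum(lam))  # equals 1.0 iff the weighted top ranks agree exactly
```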
The compression matrix T can be randomly initialized, but the performance of a randomly generated T is mediocre; instead, T is initialized with linear-algebra methods. The main principle: when the search text Q is unknown or incomplete, the candidate text D can temporarily stand in for Q. As long as the structural relations among the encodings of the texts in D are preserved after the compression transformation T, the compression matrix T can be considered to have captured the semantic structure of the candidate text D:
[formula: structural relations among the encodings of D are preserved under compression by T]
By means of a variant of principal component analysis (PCA) from linear algebra, incremental PCA (iPCA), an optimal compressed form of the candidate text D for the compression dimension r can be obtained:
D_r = iPCA(D, r), D_r ∈ R^(N×r)
In addition, the pseudo-inverse matrix D⁺ of the candidate text D is obtained via the Moore-Penrose pseudoinverse from linear algebra, and the initialized compression matrix T^(0) is obtained by combining it with iPCA(D):
T^(0) = D⁺ · iPCA(D, r)
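Read literally, the initialization composes the Moore-Penrose pseudoinverse of D with the iPCA projection of D; the composition T^(0) = D⁺ · iPCA(D, r) is our reconstruction of the illegible formula, so the sketch below is hedged accordingly.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

def init_compression_matrix(D, r):
    """T0 such that D @ T0 approximates the optimal r-dimensional iPCA form of D
    (assumed reading of the initialization; r must not exceed min(N, K))."""
    D_r = IncrementalPCA(n_components=r).fit_transform(D)  # optimal compressed form, (N, r)
    D_pinv = np.linalg.pinv(D)                             # Moore-Penrose pseudoinverse, (K, N)
    return D_pinv @ D_r                                    # T0 in R^(K x r)
```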
Preferably, in step S3, the specific generation mechanism is as follows:
[formula: heuristic update rule for T^(t+1), combining T^(t), T^(t−1), T^(best), T^(worst) and T^(rand) with weights μ_1 … μ_6]
wherein μ_1, …, μ_6 ≥ 0; at initialization T^(best) = T^(worst) = T^(0) = T^(−1), and T^(rand) is a randomly generated compression matrix, and
[formulas: update conditions for T^(best) and T^(worst); if the performance of T^(t) exceeds that of T^(best), then T^(best) = T^(t), and analogously for T^(worst)]
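Since the update rule is illegible here, only its ingredients survive: the weights μ_1 … μ_6 ≥ 0, the two previous iterates, the best and worst matrices so far, and a random matrix T^(rand). The sketch below is therefore one plausible shape for such a heuristic update (only four of the six weights are exercised), not the patented formula:

```python
import numpy as np

def iterate_T(T, T_best, T_worst, mu, perf_fn, steps=100):
    """Hypothetical heuristic update mixing momentum, attraction to T_best,
    repulsion from T_worst, and random exploration."""
    T_prev = T.copy()  # at initialization T(best) = T(worst) = T(0) = T(-1)
    for _ in range(steps):
        T_rand = np.random.randn(*T.shape)  # randomly generated compression matrix
        T_next = (T
                  + mu[0] * (T - T_prev)    # momentum from the previous step
                  + mu[1] * (T_best - T)    # pull toward the best matrix so far
                  - mu[2] * (T_worst - T)   # push away from the worst so far
                  + mu[3] * T_rand)         # random exploration
        T_prev, T = T, T_next
        if perf_fn(T) > perf_fn(T_best):    # cf. claim 7: on improvement, T(best) = T(t)
            T_best = T
        if perf_fn(T) < perf_fn(T_worst):
            T_worst = T
    return T_best
```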
Preferably, in step S5, the expected acceleration multiple of the compression matrix is
[formula: expectTC, the expected acceleration multiple]
The specific compression dimension may be set by expectTC.
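The expectTC formula is likewise not legible. A natural reconstruction, stated here purely as an assumption, is the ratio of the uncompressed similarity cost to the two-stage cost:

```python
def expect_tc(N, K, r, g):
    """Assumed form of expectTC: brute-force cost N*K per query, divided by the
    compressed pass (N*r) plus re-ranking the 1.5*G(L) shortlist at full width."""
    return (N * K) / (N * r + 1.5 * g * K)
```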
Compared with the prior art, the invention has the beneficial effects that:
1. The semantic coding lossless compression system and method based on heuristic linear transformation use a deep learning language model to encode texts and obtain an encoded representation (embedding) of each text; the semantic similarity between a search sentence and each candidate text (the texts to be searched against) is computed with measures such as the cosine of the angle between vectors or the Euclidean distance, and the candidates are ranked by semantic similarity to produce the semantic search result.
2. In the semantic coding lossless compression system and method based on heuristic linear transformation, the semantic codes already generated for the candidate texts are cached and need not be regenerated during real-time retrieval.
3. In the semantic coding lossless compression system and method based on heuristic linear transformation, a linear transformation matrix (compression matrix) reduces the dimension of the encoded representations of the search sentence and each candidate text, achieving compression and thereby speeding up the semantic similarity calculation.
4. The semantic coding lossless compression system and method based on heuristic linear transformation calculate the degree of deviation between the search results before and after compression, forming a method for measuring the performance of a compression matrix.
5. The semantic coding lossless compression system and method based on heuristic linear transformation initialize the compression matrix using linear-algebra methods such as principal component analysis (PCA) and its incremental variant (incremental PCA), and the Moore-Penrose pseudoinverse.
6. In the semantic coding lossless compression system and method based on heuristic linear transformation, different compression dimensions are selected to initialize compression matrices, the performance of each compression matrix is evaluated one by one, and the idea of hierarchical screening is used to obtain the optimal compression dimension setting and the related parameters of the hierarchical screening method.
7. In the semantic coding lossless compression system and method based on heuristic linear transformation, the compression matrix is iteratively upgraded according to the search sentences for which the user expects lossless compression (these can be understood as "search cases"), and keeps being upgraded as more such search sentences accumulate.
8. The semantic coding lossless compression system and method based on heuristic linear transformation adopt hierarchical screening to perform a staged "sea selection" over the retrieval process, fully strengthening the degree of local lossless compression while ensuring that speed does not decrease.
Drawings
Fig. 1 is a schematic diagram of the system as a whole.
In the figure: 1. search text set; 2. deep learning language model; 3. search text Q; 4. candidate text D; 5. code storage; 6. principal component analysis module; 7. hierarchical screening system; 8. initial compression matrix T; 9. compression matrix T; 10. compressed version QT; 11. compressed version DT; 12. similarity calculation function module; 13. candidate text set.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1, the present invention provides a technical solution: the semantic coding lossless compression system based on heuristic linear transformation comprises a search text set 1 and a candidate text set 13. The transmitting ends of the search text set 1 and the candidate text set 13 are signal-connected to the receiving end of the deep learning language model 2; the transmitting end of the deep learning language model 2 is signal-connected to the receiving ends of the search text Q 3 and the candidate text D 4; the transmitting end of the candidate text D 4 is signal-connected to the receiving end of the code storage 5; the transmitting ends of the search text Q 3 and the code storage 5 are signal-connected to the receiving end of the compression matrix T 9; the transmitting end of the compression matrix T 9 is signal-connected to the receiving ends of the compressed version DT 11 and the compressed version QT 10; the transmitting ends of the compressed version DT 11 and the compressed version QT 10 are signal-connected to the receiving end of the similarity calculation function module 12; the transmitting end of the similarity calculation function module 12 is signal-connected to the receiving end of the principal component analysis module 6; the transmitting end of the principal component analysis module 6 is signal-connected to the receiving end of the initial compression matrix T 8; the transmitting end of the initial compression matrix T 8 is signal-connected to the receiving end of the hierarchical screening system 7; the transmitting end of the hierarchical screening system 7 is signal-connected to the receiving end of the compression matrix T 9; the transmitting end of the code storage 5 is signal-connected to the receiving end of the principal component analysis module 6; and both the principal component analysis module 6 and the similarity calculation function module 12 are provided with a transmitting end and a receiving end.
Example two
For a library semantic retrieval system, the candidate texts to be retrieved are book titles (about 360,000 titles in total), for example: "On Top of Tides", "Differential Geometry and General Relativity", "Fermat's Last Theorem", and so on. The system must find the best-matching titles for a user's search sentence; for example, if the user searches "what is the mathematical principle of general relativity", the system should preferentially output semantically related titles such as "Introduction to Differential Geometry and General Relativity".
All titles are encoded with a deep learning language model (e.g. Google's open-source BERT), each title becoming a 768-dimensional vector; together they form a matrix of dimensions 360000×768, which is cached as a global variable.
According to step S2, an initialized compression matrix T^(0) is obtained; the parameter settings are:
[formula image: parameter settings for the initialized compression matrix]
According to steps S3 and S4, the compression matrix T^(0) is iterated against the actual search text Q to obtain the final compression matrix T, with the parameters of the generation mechanism set as
μ_1 = 0.9
μ_2 = 0.1
μ_3 = 0.05
μ_4 = 0.05
μ_5 = 0.1
μ_6 = 0.1
The expected acceleration multiples of the T generated under different compression dimensions differ; following the scheme, different values of the performance evaluation, G(L), and expectTC are obtained.
The following table is for the case N = 10000:
[table image: compression dimension versus G(L) and expectTC for N = 10000]
Finally, 20 dimensions were chosen as the compression dimension with G(L) = G(10), and the measured speed comparison is as follows (under the condition that the top L search results before and after compression remain identical).
The following table shows the cases of different values of N:
[table image: measured speed-up for different values of N]
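Putting the embodiment together, here is a usage sketch reusing the helpers from the earlier sketches (encode, init_compression_matrix, G, search); the 20-dimensional choice and G(10) come from this example, while sample_queries and candidate_titles are assumed inputs:

```python
import numpy as np

D = np.load("candidate_embeddings.npy")     # (360000, 768), cached once in step S1
T = init_compression_matrix(D, r=20)        # r = 20 as chosen above, then iterated per S3-S4
g = G(10, sample_queries, D, T)             # safety deviation G(10) for L = 10

q = encode(["广义相对论的数学原理是什么"])[0]   # the user's search sentence from this example
top10 = search(q, D, T, L=10, g=g)          # two-stage compressed retrieval
print([candidate_titles[i] for i in top10]) # e.g. 《微分几何与广义相对论入门》 ranked first
```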
The semantic coding lossless compression system and method based on heuristic linear transformation use a deep learning language model to encode texts and obtain an encoded representation (embedding) of each text; the semantic similarity between a search sentence and each candidate text (the texts to be searched against) is computed with measures such as the cosine of the angle between vectors or the Euclidean distance, and the candidates are ranked by semantic similarity to produce the semantic search result.
The semantic codes already generated for the candidate texts are cached and need not be regenerated during real-time retrieval.
The encoded representations of the search sentence and each candidate text are reduced in dimension with a linear transformation matrix (compression matrix), achieving compression and thereby speeding up the semantic similarity calculation.
The degree of deviation between search results before and after compression is calculated, forming a method for measuring the performance of a compression matrix.
The compression matrix is initialized using linear-algebra methods such as principal component analysis (PCA) and its incremental variant (incremental PCA), and the Moore-Penrose pseudoinverse.
Different compression dimensions are selected to initialize compression matrices; the performance of each is evaluated one by one, and the idea of hierarchical screening is used to obtain the optimal compression dimension and the related parameters of the hierarchical screening method.
The compression matrix is iteratively upgraded according to the search sentences for which the user expects lossless compression (these can be understood as "search cases"), and keeps being upgraded as more such search sentences accumulate.
Hierarchical screening performs a staged "sea selection" over the retrieval process, fully strengthening the degree of local lossless compression while ensuring that speed does not decrease.
In large-scale text semantic search scenarios, with the quality deviation rate kept controllable, high-dimensional semantic encodings/vectors are reduced more than tenfold in dimension, raising retrieval speed by orders of magnitude (for example, in a search scenario with 1,000,000 candidate texts in total, retrieval after locally lossless compression is about 27 times faster than uncompressed-but-cached retrieval, and nearly three thousand times faster than uncompressed, uncached retrieval).
With the pre-/post-compression speed-up guaranteed to be at least 10×, the lossless range can be maintained at the top 10–20 (i.e., the top 10–20 search results show no change whatsoever before and after compression), which fully satisfies common semantic search scenarios.
The user can at any time keep iterating the parameters of the compression method against search sentences whose results are not ideal, ensuring that semantic search performance keeps improving as scenario cases are refined.
it is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Finally, it should be noted that the foregoing describes only preferred embodiments of the present invention and does not limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments or substitute equivalents for some of their technical features. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (8)

1. A semantic coding lossless compression system based on heuristic linear transformation, comprising a search text set (1) and a candidate text set (13), characterized in that: the transmitting ends of the search text set (1) and the candidate text set (13) are signal-connected to the receiving end of the deep learning language model (2); the transmitting end of the deep learning language model (2) is signal-connected to the receiving ends of the search text Q (3) and the candidate text D (4); the transmitting end of the candidate text D (4) is signal-connected to the receiving end of the code storage (5); the transmitting ends of the search text Q (3) and the code storage (5) are signal-connected to the receiving end of the compression matrix T (9); the transmitting end of the compression matrix T (9) is signal-connected to the receiving ends of the compressed version DT (11) and the compressed version QT (10); the transmitting ends of the compressed version DT (11) and the compressed version QT (10) are signal-connected to the receiving end of the similarity calculation function module (12); the transmitting end of the similarity calculation function module (12) is signal-connected to the receiving end of the principal component analysis module (6); the transmitting end of the principal component analysis module (6) is signal-connected to the receiving end of the initial compression matrix T (8); the transmitting end of the initial compression matrix T (8) is signal-connected to the receiving end of the hierarchical screening system (7); and the transmitting end of the hierarchical screening system (7) is signal-connected to the receiving end of the compression matrix T (9).
2. A semantic coding lossless compression system based on heuristic linear transformation according to claim 1, characterized in that: the transmitting end of the code storage (5) is signal-connected to the receiving end of the principal component analysis module (6), and both the principal component analysis module (6) and the similarity calculation function module (12) are provided with a transmitting end and a receiving end.
3. A semantic coding lossless compression method based on heuristic linear transformation is characterized by comprising the following steps:
s1, coding processing
Encoding all candidate texts with a deep learning language model, converting each title into a K-dimensional vector, and storing the vectors in a suitable data form;
s2, building system
Setting up a search-result comparison and evaluation system whose inputs are: the candidate text D (4), the search text Q (3), and the compression matrix T (9), where the compression matrix T ∈ R^(K×r) and the value r is the encoding dimension after compression; the compressed encodings obtained by compressing the candidate text D (4) and the search text Q (3) with T are QT ∈ R^(M×r) and DT ∈ R^(N×r);
S3, constructing an iteration mechanism
Constructing an iteration mechanism for generating the compression matrix, adjusting and optimizing the compression matrix T (9) according to the search text Q (3) matrix, which changes in real time;
s4, iterative upgrade
Using the iterative generation method of step S3, the compression matrix T (9) is iteratively upgraded, and T^(best) is taken as the final compression matrix;
s5, constructing a screening system
Constructing a hierarchical screening system: for different compression dimensions r_a, r_b, r_c, …, different compression matrices are generated, labeled T^(r_a), T^(r_b), T^(r_c), … respectively;
S6, determining a final result
Using Sim_α() and the hierarchical screening mechanism, the search text Q (3) and the candidate text D (4) are processed before and after compression to obtain the final semantic search result.
4. A semantic coding lossless compression method based on heuristic linear transformation according to claim 3, characterized in that: in step S1, the deep learning language model is Google's open-source BERT, and the generated word vectors are either stored directly in system memory or stored on the system hard disk in a numpy or pickle file format for subsequent reading and calling, giving quantized forms of the candidate text D (4) and the search text Q (3):
D = [d_1; d_2; …; d_N] ∈ R^(N×K), Q = [q_1; q_2; …; q_M] ∈ R^(M×K)
The semantic search scenario is then described as
sort(q_i, D) = [d_i1, d_i2, d_i3, …, d_iN],
so that for a particular similarity calculation function Sim_α() we always have
Sim_α(q_i, d_ix) ≥ Sim_α(q_i, d_i,x+1).
5. A semantic coding lossless compression method based on heuristic linear transformation according to claim 4, characterized in that: the similarity calculation function Sim_α() mainly adopts either the cosine method or the Euclidean distance method, i.e.
cosine(q_i, d_j) = (q_i · d_j) / (‖q_i‖ ‖d_j‖)
euclidean(q_i, d_j) = ‖q_i − d_j‖_2
6. A semantic coding lossless compression method based on heuristic linear transformation according to claim 3, characterized in that: in step S2, the system embeds the similarity calculation function Sim_α() for evaluating search results, whose input is two groups of ranked results weighted by ranking coefficients, wherein λ_i is a ranking coefficient.
7. A semantic coding lossless compression method based on heuristic linear transformation according to claim 3, characterized in that: in step S3, the specific generation mechanism is as follows:
[formula: heuristic update rule for T^(t+1), combining T^(t), T^(t−1), T^(best), T^(worst) and T^(rand) with weights μ_1 … μ_6]
wherein μ_1, …, μ_6 ≥ 0; at initialization T^(best) = T^(worst) = T^(0) = T^(−1), and T^(rand) is a randomly generated compression matrix; and if
[formula: the performance of T^(t) exceeds that of T^(best)]
then T^(best) = T^(t);
[formula: analogous update condition for T^(worst)]
8. a method of semantically encoded lossless compression based on heuristic linear transformation according to claim 3, wherein: in the step S5, the expected acceleration multiple of the compression matrix is
Figure FDA0004111082120000044
The specific compression dimension is set by expectTC.
CN202110289154.7A 2021-03-18 2021-03-18 Semantic coding lossless compression system and method based on heuristic linear transformation Active CN113055018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110289154.7A CN113055018B (en) 2021-03-18 2021-03-18 Semantic coding lossless compression system and method based on heuristic linear transformation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110289154.7A CN113055018B (en) 2021-03-18 2021-03-18 Semantic coding lossless compression system and method based on heuristic linear transformation

Publications (2)

Publication Number Publication Date
CN113055018A CN113055018A (en) 2021-06-29
CN113055018B true CN113055018B (en) 2023-05-12

Family

ID=76513465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110289154.7A Active CN113055018B (en) 2021-03-18 2021-03-18 Semantic coding lossless compression system and method based on heuristic linear transformation

Country Status (1)

Country Link
CN (1) CN113055018B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881334A (en) * 2020-07-15 2020-11-03 浙江大胜达包装股份有限公司 Keyword-to-enterprise retrieval method based on semi-supervised learning

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080077570A1 (en) * 2004-10-25 2008-03-27 Infovell, Inc. Full Text Query and Search Systems and Method of Use
CN106776553A (en) * 2016-12-07 2017-05-31 中山大学 A kind of asymmetric text hash method based on deep learning
US11288297B2 (en) * 2017-11-29 2022-03-29 Oracle International Corporation Explicit semantic analysis-based large-scale classification
CN110502613B (en) * 2019-08-12 2022-03-08 腾讯科技(深圳)有限公司 Model training method, intelligent retrieval method, device and storage medium
CN110825901A (en) * 2019-11-11 2020-02-21 腾讯科技(北京)有限公司 Image-text matching method, device and equipment based on artificial intelligence and storage medium
CN111382260A (en) * 2020-03-16 2020-07-07 腾讯音乐娱乐科技(深圳)有限公司 Method, device and storage medium for correcting retrieved text
CN111444320B (en) * 2020-06-16 2020-09-08 太平金融科技服务(上海)有限公司 Text retrieval method and device, computer equipment and storage medium
CN111753060B (en) * 2020-07-29 2023-09-26 腾讯科技(深圳)有限公司 Information retrieval method, apparatus, device and computer readable storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881334A (en) * 2020-07-15 2020-11-03 浙江大胜达包装股份有限公司 Keyword-to-enterprise retrieval method based on semi-supervised learning

Also Published As

Publication number Publication date
CN113055018A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
Shu et al. Compressing word embeddings via deep compositional code learning
Liu et al. Deep triplet quantization
Gueniche et al. Compact prediction tree: A lossless model for accurate sequence prediction
CN101809567B (en) Two-pass hash extraction of text strings
Rajput et al. Recommender systems with generative retrieval
CN111914062B (en) Long text question-answer pair generation system based on keywords
CN113987169A (en) Text abstract generation method, device and equipment based on semantic block and storage medium
Zhang et al. A survey for efficient open domain question answering
CN116108128B (en) Open domain question-answering system and answer prediction method
Chen et al. Continual learning for generative retrieval over dynamic corpora
Kan et al. Zero-shot learning to index on semantic trees for scalable image retrieval
Sun et al. Automatic text summarization using deep reinforcement learning and beyond
CN116151266A (en) New word discovery method and device, electronic equipment and storage medium
KR20220092776A (en) Apparatus and method for quantizing neural network models
Sathyendra et al. Extreme model compression for on-device natural language understanding
CN112580325B (en) Rapid text matching method and device
CN112732862B (en) Neural network-based bidirectional multi-section reading zero sample entity linking method and device
CN117235250A (en) Dialogue abstract generation method, device and equipment
CN113055018B (en) Semantic coding lossless compression system and method based on heuristic linear transformation
WO2020241070A1 (en) Audio signal retrieving device, audio signal retrieving method, data retrieving device, data retrieving method, and program
CN117290485A (en) LLM-based question-answer enhancement method
Qiu et al. Efficient document retrieval by end-to-end refining and quantizing BERT embedding with contrastive product quantization
Strimel et al. Statistical model compression for small-footprint natural language understanding
CN113704466B (en) Text multi-label classification method and device based on iterative network and electronic equipment
CN112966501B (en) New word discovery method, system, terminal and medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant