CN113055018A - Semantic coding lossless compression system and method based on heuristic linear transformation - Google Patents


Info

Publication number
CN113055018A
CN113055018A · CN202110289154.7A · CN202110289154A
Authority
CN
China
Prior art keywords
compression
semantic
text
matrix
linear transformation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110289154.7A
Other languages
Chinese (zh)
Other versions
CN113055018B (en)
Inventor
裴正奇
王树徽
黄梓忱
朱斌斌
于秋鑫
段必超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Qianhai Heidun Technology Co ltd
Original Assignee
Shenzhen Qianhai Heidun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Qianhai Heidun Technology Co ltd filed Critical Shenzhen Qianhai Heidun Technology Co ltd
Priority to CN202110289154.7A priority Critical patent/CN113055018B/en
Publication of CN113055018A publication Critical patent/CN113055018A/en
Application granted granted Critical
Publication of CN113055018B publication Critical patent/CN113055018B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03M: CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00: Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/70: Type of the data to be coded, other than image and sound
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention belongs to the technical field of lossless compression of semantic codes, and particularly relates to a semantic coding lossless compression system based on heuristic linear transformation. The semantic coding lossless compression system and method based on heuristic linear transformation encode texts with a deep learning language model to obtain an encoded representation (embedding) of each text, compute the semantic similarity between a retrieval sentence and each candidate text (the texts to be retrieved by the retrieval sentence) using methods such as the cosine of the included angle and the Euclidean distance, and rank the similarities to obtain the semantic search result.

Description

Semantic coding lossless compression system and method based on heuristic linear transformation
Technical Field
The invention relates to the technical field of semantic code lossless compression, and in particular to a semantic code lossless compression system and method based on heuristic linear transformation.
Background
Existing semantic search and coding technology cannot combine lossless content with a large compression ratio: the achievable compression is limited, and the original semantic content is heavily damaged after compression. For example, LSH is only suitable for scenarios with low accuracy requirements on the exact ranking of the output, such as recommendation systems; where precise ranking is required, LSH is not up to the task.
Furthermore, some scenarios weight the compression ratio and running speed more heavily, while others weight accuracy and content losslessness more heavily; existing models cannot be iterated precisely according to the index requirements of each scenario, and so cannot meet scenario requirements without bias.
Current related technologies are also relatively rigid and indiscriminate, lacking scenario specificity. In fact, different scenarios should have different semantic coding compression modes, and the same text should have different coding forms and compression mechanisms in different scenarios (such as "library book retrieval", "intelligent customer service" and "knowledge question answering"), so that the optimal quantization effect can be achieved under limited computational resources.
Finally, current related technologies cannot iterate efficiently on real-time user feedback, so unsatisfactory retrieval results inevitably appear as usage accumulates. Some are caused by the semantic coding or the retrieval mechanism, others by the external environment (such as changes in knowledge points); current technologies cannot perform targeted, efficient updates for these "unsatisfactory cases", and so need to be redesigned.
Disclosure of Invention
The invention aims to provide a semantic coding lossless compression system and method based on heuristic linear transformation, so as to solve the problems raised in the background art.
In order to achieve the above purpose, the invention provides the following technical scheme: a semantic coding lossless compression system based on heuristic linear transformation comprises a retrieval text set and a candidate text set; transmitting ends of the retrieval text set and the candidate text set are in signal connection with a receiving end of a deep learning language model; a transmitting end of the deep learning language model is in signal connection with receiving ends of a retrieval text Q and a candidate text D; a transmitting end of the candidate text D is in signal connection with a receiving end of a code storage; transmitting ends of the retrieval text Q and the code storage are in signal connection with a receiving end of a compression matrix T; a transmitting end of the compression matrix T is in signal connection with receiving ends of a compressed version DT and a compressed version QT; transmitting ends of the compressed version DT and the compressed version QT are in signal connection with a receiving end of a similarity calculation function module; a transmitting end of the similarity calculation function module is in signal connection with a receiving end of a principal component analysis module; a transmitting end of the principal component analysis module is in signal connection with a receiving end of an initial compression matrix T; a transmitting end of the initial compression matrix T is in signal connection with a receiving end of a hierarchical screening system; and a transmitting end of the hierarchical screening system is in signal connection with a receiving end of the compression matrix T.
Preferably, a transmitting end of the code storage is in signal connection with a receiving end of the principal component analysis module, and the principal component analysis module and the similarity calculation function module are each provided with a transmitting end and a receiving end.
Another technical problem to be solved by the present invention is to provide a semantic coding lossless compression method based on heuristic linear transformation, so as to solve the problems raised in the background art.
in order to achieve the purpose, the invention provides the following technical scheme: the method comprises the following steps:
s1, encoding processing
All candidate texts are encoded with a deep learning language model, each book title is converted into a K-dimensional vector, and the vectors are stored in a suitable data form.
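As a concrete illustration of this step, the sketch below encodes a few candidate titles with a BERT-style model and caches the vectors; the sentence-transformers wrapper, the model name and the file path are illustrative assumptions, not specified by the method itself.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Hypothetical encoder choice; the method only requires some deep
    # learning language model that maps a text to a K-dimensional vector.
    model = SentenceTransformer("bert-base-chinese")

    titles = ["Peak of the Wave",
              "An Introduction to Differential Geometry and General Relativity",
              "Fermat's Theorem"]

    D = model.encode(titles)            # shape (N, K); K = 768 for BERT-base
    np.save("candidate_codes.npy", D)   # cache on disk for later reading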
S2 building system
A search-result comparison and evaluation system is built, the input of which comprises: the candidate text D, the retrieval text Q and a compression matrix T, where the compression matrix T ∈ R^(K×r) and the value r represents the coding dimension after compression; the compressed codes of the retrieval text Q and the candidate text D are QT ∈ R^(M×r) and DT ∈ R^(N×r), respectively.
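A minimal sketch of the data flow just described, assuming K = 768 original dimensions and r = 20 compressed dimensions; the random matrices merely stand in for real encodings.

    import numpy as np

    K, r, M, N = 768, 20, 5, 10000
    Q = np.random.randn(M, K)   # encoded retrieval texts, Q  in R^(M x K)
    D = np.random.randn(N, K)   # encoded candidate texts,  D  in R^(N x K)
    T = np.random.randn(K, r)   # compression matrix,       T  in R^(K x r)

    QT = Q @ T                  # compressed retrieval codes, QT in R^(M x r)
    DT = D @ T                  # compressed candidate codes, DT in R^(N x r)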
S3, building an iteration mechanism
A generative compression matrix iteration mechanism is established, and the compression matrix T is adjusted and optimized according to the retrieval text Q matrix, which changes in real time.
S4, iterative upgrade
Using the above iterative generation method, the compression matrix T is iteratively upgraded, and T^(best) is taken as the final compression matrix.
S5, building a screening system
A hierarchical screening system is built; for different compression dimensions r_a, r_b, r_c, …, different compression matrices are generated, labeled T^(r_a), T^(r_b), T^(r_c), … respectively.
The core idea of hierarchical screening is as follows: although the search results after compression may deviate from those before compression, the magnitude of the deviation is limited; for example, a result ranked 10th before compression might fall to 18th after compression, but not "far away" to 2000th. Assuming the user only focuses on the top L of the ranking, the post-compression safety deviation value G(L) is:
G(L) = max([sort(q_i·T, DT).index(item) for item in sort(q_i, D)[:L]])
G(L) can be understood as the largest ranking deviation within the top L. After G(L) is obtained, 1.5·G(L) is taken as a safety threshold, and the similarity of the candidates in sort(q_i·T, DT)[:1.5·G(L)] is recalculated once in their uncompressed encoded form. The idea resembles an "open audition": the compression matrix performs the initial audition and can pick out the "strong contestants", but ranking those contestants precisely still requires the more complete, more elaborate selection method (the uncompressed encoded form). However, because the audition has already filtered out the vast majority of "contestants", running the remaining "strong contestants" in uncompressed encoded form does not add much extra computing time. The expected speed-up multiple of a compression matrix is
[expectTC formula not reproduced in the source]
The specific compression dimension may be set by expectTC.
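The following sketch implements the hierarchical screening just described: a coarse ranking with the compressed codes, the safety value G(L), and a re-ranking of the top 1.5·G(L) shortlist in uncompressed form. Only the 1.5·G(L) threshold comes from the text; the cosine ranking helper and all names are illustrative assumptions.

    import numpy as np

    def rank(q, C):
        # indices of the rows of C, ordered by descending cosine similarity to q
        sims = (C @ q) / (np.linalg.norm(C, axis=1) * np.linalg.norm(q) + 1e-12)
        return list(np.argsort(-sims))

    def G(L, q, D, T):
        # largest post-compression rank among the top-L pre-compression results;
        # in practice G(L) would be estimated in advance over many retrieval texts
        coarse = rank(q @ T, D @ T)
        return max(coarse.index(i) for i in rank(q, D)[:L])

    def hierarchical_search(q, D, T, L=10):
        g = G(L, q, D, T)                                # safety deviation value
        shortlist = rank(q @ T, D @ T)[:max(L, int(1.5 * g))]
        # re-rank the shortlist once with the uncompressed encodings
        exact = rank(q, D[np.array(shortlist)])
        return [shortlist[j] for j in exact][:L]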
S6, determining the final result
Using Sim_α() and the hierarchical screening mechanism, the retrieval text Q and the candidate text D are operated on before and after compression to obtain the final semantic search result.
Preferably, in step S1, the deep learning language model may be Google's BERT, and the vectors may be kept directly in system memory or stored on the system hard disk in file formats such as numpy or pickle for subsequent reading and calling, yielding the quantized forms of the candidate text D and the retrieval text Q:
Q = [q_1, q_2, …, q_M]^T ∈ R^(M×K)
D = [d_1, d_2, …, d_N]^T ∈ R^(N×K)
the semantic search scenario described above may then be described as
sort(q_i, D) = [d_{i1}, d_{i2}, d_{i3}, …, d_{iN}]
so that, for a particular similarity calculation function Sim_α(), we always have
Sim_α(q_i, d_{ix}) ≥ Sim_α(q_i, d_{i,x+1}).
Preferably, the similarity calculation function Sim_α() uses the cosine value (cosine) or the Euclidean distance (euclidean), i.e.
Sim_cosine(q_i, d_j) = (q_i · d_j) / (‖q_i‖ ‖d_j‖)
Sim_euclidean(q_i, d_j) = −‖q_i − d_j‖_2 (the distance negated so that larger values mean more similar)
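A hedged sketch of the two similarity options named above, plus the sort() operation used throughout (and reused by the later sketches); negating the Euclidean distance so that larger means more similar is an assumption of this sketch, not a formula quoted from the patent.

    import numpy as np

    def sim_cosine(q, d):
        return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))

    def sim_euclidean(q, d):
        # distance turned into a similarity: larger value = more similar
        return float(-np.linalg.norm(q - d))

    def sort_candidates(q, D, sim=sim_cosine):
        # sort(q_i, D): candidate indices with Sim(q, d_x) >= Sim(q, d_{x+1})
        return sorted(range(len(D)), key=lambda j: sim(q, D[j]), reverse=True)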
Preferably, in the step S2, the system is embedded with a similarity calculation function Sim_α() for evaluating the search results, whose input is two arrays:
[evaluation formula not reproduced in the source]
where λ_i is a ranking coefficient. The principle of the ranking coefficient is that the higher a result is ranked, the more it matters to the presentation of the search results: an error in the first-ranked result has more serious consequences than an error in the tenth-ranked one. On the basis of the similarity calculation function Sim_α(), a complete search-result comparison and evaluation mechanism can be realized. For the candidate text D, the retrieval text Q and the compression matrix T, the performance is evaluated as follows (since the candidate text D does not change in the actual scenario and can be treated as a constant, it can be omitted from the notation below):
[performance evaluation formula not reproduced in the source]
where the term
[formula not reproduced in the source]
is equivalent to the degree to which the compression matrix T is lossless with respect to search performance: the higher it is, the closer the search results before and after compression, and the smaller the performance loss after compression.
[formula not reproduced in the source]
Note that if the usage scenario only focuses on the top ten of the search results, then λ_{i>10} = 0.
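Since the evaluation formula itself is only given as an image in the source, the sketch below is a guessed stand-in: a λ-weighted rank-deviation score in which deviations near the top of the ranking cost more, and λ_{i>10} = 0 reproduces the "top ten only" scenario mentioned above. It reuses sort_candidates from the earlier sketch.

    def eval_compression(q, D, T, lambdas):
        # lower score = search results before and after compression agree better
        before = sort_candidates(q, D)
        after = sort_candidates(q @ T, D @ T)
        return sum(lam * abs(i - after.index(before[i]))
                   for i, lam in enumerate(lambdas))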
The compression matrix T could be initialized randomly, but the performance of a randomly generated T is mediocre, so the compression matrix T is instead initialized with linear-algebra methods. The main principle is that, while the retrieval text Q is unknown or incomplete, the candidate text D can temporarily stand in for it: as long as the structural relationships among the encodings of the texts in the candidate text D are preserved after the compression transformation by T, the compression matrix T can be considered to have "mastered" the semantic structure of the candidate text D:
[formula not reproduced in the source]
By means of incremental PCA (iPCA), a variant of principal component analysis (PCA) from the linear-algebra domain, the optimal compressed form of the candidate text D for the compressed dimension r can be obtained:
iPCA(D, r) ∈ R^(N×r)
In addition, the Moore-Penrose pseudoinverse method from the linear-algebra domain is used to obtain the pseudoinverse D^+ of the candidate text D, and the initialized compression matrix T is obtained by means of iPCA(D):
T^(0) = D^+ · iPCA(D, r)
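A sketch of this initialization: incremental PCA yields the best r-dimensional compressed form of D, and the Moore-Penrose pseudoinverse maps D onto it, giving T^(0) = D^+ · iPCA(D, r). The use of scikit-learn is an assumption; the patent names only the two linear-algebra methods.

    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    def init_compression_matrix(D, r):
        D_r = IncrementalPCA(n_components=r).fit_transform(D)  # N x r codes
        return np.linalg.pinv(D) @ D_r                          # T(0), K x r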
Preferably, in step S3, the specific generation mechanism is as follows:
[generation-mechanism formula not reproduced in the source]
where μ_1, …, μ_6 ≥ 0; at initialization, T^(best) = T^(worst) = T^(0) = T^(-1), and T^(rand) is a randomly generated compression matrix, with
[formulas not reproduced in the source]
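The generation formula appears only as an image in the source, so the loop below is a guessed hill-climbing reading of the surrounding text: candidates are mixed from T^(best), the previous iterates and a random matrix using some of the weights μ_1..μ_6, and T^(best) is kept whenever the evaluation score improves. Every detail beyond the names is an assumption; it reuses eval_compression from the previous sketch.

    import numpy as np

    def iterate_T(queries, D, T0, lambdas, mu, steps=50):
        T_best = T_prev = T_curr = T0
        best = sum(eval_compression(q, D, T0, lambdas) for q in queries)
        for _ in range(steps):
            T_rand = np.random.randn(*T0.shape)       # T^(rand)
            cand = (mu[0] * T_best + mu[1] * T_curr
                    + mu[2] * T_prev + mu[3] * T_rand)
            score = sum(eval_compression(q, D, cand, lambdas) for q in queries)
            T_prev, T_curr = T_curr, cand
            if score < best:                          # more lossless: keep it
                best, T_best = score, cand
        return T_best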
preferably, in the step S5, the expected speed-up multiple of the compression matrix is
[expectTC formula not reproduced in the source]
The specific compression dimension may be set by expectTC.
Compared with the prior art, the invention has the beneficial effects that:
1. The semantic coding lossless compression system and method based on heuristic linear transformation encode texts with a deep learning language model to obtain an encoded representation (embedding) of each text, compute the semantic similarity between a retrieval sentence and each candidate text (the texts to be retrieved by the retrieval sentence) using methods such as the cosine of the included angle and the Euclidean distance, and rank the similarities to obtain the semantic search result.
2. The semantic coding lossless compression system and method based on heuristic linear transformation cache the generated semantic codes of the candidate texts, so they need not be regenerated during real-time retrieval.
3. The semantic coding lossless compression system and method based on heuristic linear transformation use a linear transformation matrix (compression matrix) to reduce the dimension of the encoded representations of the retrieval sentence and each candidate text, achieving compression and further increasing the speed of the semantic-similarity computation.
4. The semantic coding lossless compression system and method based on heuristic linear transformation calculate the degree of deviation between the search results before and after compression, forming a method for measuring the performance of a compression matrix.
5. The semantic coding lossless compression system and method based on heuristic linear transformation initialize the compression matrix with linear-algebra methods such as principal component analysis (PCA) and its variant (incremental PCA) and the Moore-Penrose pseudoinverse.
6. The semantic coding lossless compression system and method based on heuristic linear transformation select different compression dimensions to initialize compression matrices, evaluate the performance of each one by one, and use the idea of hierarchical screening to obtain the optimal compression-dimension setting and the related parameter settings of the hierarchical screening method.
7. The semantic coding lossless compression system and method based on heuristic linear transformation iteratively upgrade the compression matrix according to the retrieval sentences (which can be understood as "retrieval cases") for which the user expects lossless compression; as such retrieval sentences accumulate, the compression matrix keeps upgrading.
8. The semantic coding lossless compression system and method based on heuristic linear transformation apply hierarchical screening to run a staged "open audition" over the retrieval process, fully strengthening the degree of local lossless compression while ensuring that the speed does not drop.
Drawings
FIG. 1 is a schematic view of the system as a whole.
In the figure: 1. retrieval text set; 2. deep learning language model; 3. retrieval text Q; 4. candidate text D; 5. code storage; 6. principal component analysis module; 7. hierarchical screening system; 8. initial compression matrix T; 9. compression matrix T; 10. compressed version DT; 11. compressed version QT; 12. similarity calculation function module; 13. candidate text set.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
Referring to fig. 1, the present invention provides a technical solution: a semantic coding lossless compression system based on heuristic linear transformation comprises a retrieval text set 1 and a candidate text set 13; the transmitting ends of the retrieval text set 1 and the candidate text set 13 are in signal connection with the receiving end of a deep learning language model 2; the transmitting end of the deep learning language model 2 is in signal connection with the receiving ends of a retrieval text Q 3 and a candidate text D 4; the transmitting end of the candidate text D 4 is in signal connection with the receiving end of a code storage 5; the transmitting ends of the retrieval text Q 3 and the code storage 5 are in signal connection with the receiving end of a compression matrix T 9; the transmitting end of the compression matrix T 9 is in signal connection with the receiving ends of a compressed version DT 11 and a compressed version QT 10; the transmitting ends of the compressed version DT 11 and the compressed version QT 10 are in signal connection with the receiving end of a similarity calculation function module 12; the transmitting end of the similarity calculation function module 12 is in signal connection with the receiving end of a principal component analysis module 6; the transmitting end of the principal component analysis module 6 is in signal connection with the receiving end of an initial compression matrix T 8; the transmitting end of the initial compression matrix T 8 is in signal connection with the receiving end of a hierarchical screening system 7; the transmitting end of the hierarchical screening system 7 is in signal connection with the receiving end of the compression matrix T 9; the transmitting end of the code storage 5 is in signal connection with the receiving end of the principal component analysis module 6; and the principal component analysis module 6 and the similarity calculation function module 12 are both provided with a transmitting end and a receiving end.
Example two
For a semantic search system in a library, the candidate texts to be searched are book titles (about 360,000 accumulated), for example: "Peak of the Wave", "An Introduction to Differential Geometry and General Relativity", "Fermat's Theorem", and so on. The system must find the best-matching book titles for the user's search sentence; for example, if the user searches "what is the mathematical principle of general relativity", the system should preferentially output semantically related titles such as "An Introduction to Differential Geometry and General Relativity".
All book titles are encoded with a deep learning language model (such as Google's open-source BERT); each title becomes a 768-dimensional vector, and together they form a 360000 × 768 matrix, which is cached in the form of a global variable.
According to step 2, the initialized compression matrix T^(0) is obtained; its parameters are set as follows:
[parameter settings not reproduced in the source]
According to steps 3 and 4, the compression matrix T^(0) is iterated according to the actual retrieval text Q to obtain the final compression matrix T, the parameters in step 5 being
μ1=0.9
μ2=0.1
μ3=0.05
μ4=0.05
μ5=0.1
μ6=0.1
The expected speed-up multiples of the T generated by different compression dimensions differ; according to step 6 of the scheme, the different T are obtained, together with their G(L) and expectTC values:
[values not reproduced in the source]
The following table shows the case where N = 10000:
[table not reproduced in the source]
Finally, 20 dimensions are selected as the compression dimension, with G(L) = G(10); the measured speed comparisons are as follows (under the condition that the top-L search results before and after compression match exactly).
The following table shows the cases where N differs:
[table not reproduced in the source]
The semantic codes generated for the candidate texts are cached, so they need not be regenerated during real-time retrieval.
A linear transformation matrix (compression matrix) reduces the dimension of the encoded representations of the retrieval sentence and each candidate text, achieving compression and further increasing the speed of the semantic-similarity computation.
The degree of deviation between the search results before and after compression is calculated, forming a method for measuring the performance of a compression matrix.
The compression matrix is initialized with linear-algebra methods such as principal component analysis (PCA) and its variant (incremental PCA) and the Moore-Penrose pseudoinverse.
Different compression dimensions are selected to initialize compression matrices, the performance of each is evaluated one by one, and the idea of hierarchical screening is used to obtain the optimal compression-dimension setting and the related parameter settings of the hierarchical screening method.
The compression matrix is iteratively upgraded according to the retrieval sentences (which can be understood as "retrieval cases") for which the user expects lossless compression; as such retrieval sentences accumulate, the compression matrix keeps upgrading.
Hierarchical screening runs a staged "open audition" over the retrieval process, fully strengthening the degree of local lossless compression while ensuring that the speed does not drop.
In a semantic search scenario over large-scale text, with the quality deviation rate kept controllable, the dimensionality of the high-dimensional semantic codes/vectors can be reduced by tens of times, so that the retrieval speed rises by several orders of magnitude (for example, in a search scenario with one million candidate texts in total, retrieval after local lossless compression can be 27 times faster than with uncompressed but cached texts, and some three thousand times faster than with uncompressed and uncached texts).
Provided that the speed-up before and after compression is no less than 10 times, losslessness can be maintained over the top 10-20 (that is, the first 10-20 search results are identical before and after compression), which fully satisfies common semantic search scenarios.
The user can continuously iterate the parameters of the compression method for retrieval sentences whose search results are unsatisfactory, so the semantic search performance keeps improving as scenario cases accumulate.
it is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A semantic code lossless compression system based on heuristic linear transformation, comprising a retrieval text set (1) and a candidate text set (13), characterized in that: transmitting ends of the retrieval text set (1) and the candidate text set (13) are in signal connection with a receiving end of a deep learning language model (2); a transmitting end of the deep learning language model (2) is in signal connection with receiving ends of a retrieval text Q (3) and a candidate text D (4); a transmitting end of the candidate text D (4) is in signal connection with a receiving end of a code storage (5); transmitting ends of the retrieval text Q (3) and the code storage (5) are in signal connection with a receiving end of a compression matrix T (9); a transmitting end of the compression matrix T (9) is in signal connection with receiving ends of a compressed version DT (11) and a compressed version QT (10); transmitting ends of the compressed version DT (11) and the compressed version QT (10) are in signal connection with a receiving end of a similarity calculation function module (12); a transmitting end of the similarity calculation function module (12) is in signal connection with a receiving end of a principal component analysis module (6); a transmitting end of the principal component analysis module (6) is in signal connection with a receiving end of an initial compression matrix T (8); a transmitting end of the initial compression matrix T (8) is in signal connection with a receiving end of a hierarchical screening system (7); and a transmitting end of the hierarchical screening system (7) is in signal connection with a receiving end of the compression matrix T (9).
2. The semantic coding lossless compression system based on the heuristic linear transformation as claimed in claim 1, wherein: a transmitting end of the code storage (5) is in signal connection with a receiving end of the principal component analysis module (6), and the principal component analysis module (6) and the similarity calculation function module (12) are each provided with a transmitting end and a receiving end.
3. A semantic coding lossless compression method based on heuristic linear transformation is characterized by comprising the following steps:
s1, encoding processing
All candidate texts are encoded with a deep learning language model, each book title is converted into a K-dimensional vector, and the vectors are stored in a suitable data form.
S2 building system
A search-result comparison and evaluation system is built, the input of which comprises: the candidate text D (4), the retrieval text Q (3) and a compression matrix T (9), where the compression matrix T ∈ R^(K×r) and the value r represents the coding dimension after compression; the compressed codes of the retrieval text Q (3) and the candidate text D (4) are QT ∈ R^(M×r) and DT ∈ R^(N×r), respectively.
S3, building an iteration mechanism
A generative compression matrix iteration mechanism is constructed, and the compression matrix T (9) is adjusted and optimized according to the retrieval text Q (3) matrix, which changes in real time.
S4, iterative upgrade
Using the above iterative generation method, the compression matrix T (9) is iteratively upgraded, and T^(best) is taken as the final compression matrix.
S5, building a screening system
A hierarchical screening system is built; for different compression dimensions r_a, r_b, r_c, …, different compression matrices are generated, labeled T^(r_a), T^(r_b), T^(r_c), … respectively.
S6, determining the final result
Using Sim_α() and the hierarchical screening mechanism, the retrieval text Q and the candidate text D are operated on before and after compression to obtain the final semantic search result.
4. The semantic coding lossless compression method based on the heuristic linear transformation as claimed in claim 3, wherein: in step S1, the deep learning language model may be Google's BERT, and the vectors may be kept directly in system memory or stored on the system hard disk in a numpy or pickle file format for subsequent reading and calling, yielding the quantized forms of the candidate text D and the retrieval text Q:
Q = [q_1, q_2, …, q_M]^T ∈ R^(M×K)
D = [d_1, d_2, …, d_N]^T ∈ R^(N×K)
the semantic search scenario described above may then be described as
sort(q_i, D) = [d_{i1}, d_{i2}, d_{i3}, …, d_{iN}]
so that, for a particular similarity calculation function Sim_α(), we always have
Sim_α(q_i, d_{ix}) ≥ Sim_α(q_i, d_{i,x+1}).
5. The semantic coding lossless compression method based on the heuristic linear transformation as claimed in claim 4, wherein: the similarity calculation function Sim_α() uses the cosine value (cosine) or the Euclidean distance (euclidean), i.e.
Sim_cosine(q_i, d_j) = (q_i · d_j) / (‖q_i‖ ‖d_j‖)
Sim_euclidean(q_i, d_j) = −‖q_i − d_j‖_2 (the distance negated so that larger values mean more similar)
6. The semantic coding lossless compression method based on the heuristic linear transformation as claimed in claim 3, wherein: in the step S2, the system is embedded with a similarity calculation function Sim_α() for evaluating the search results, whose input is two arrays:
[evaluation formula not reproduced in the source]
where λ_i is a ranking coefficient.
7. The semantic coding lossless compression method based on the heuristic linear transformation as claimed in claim 3, wherein: in step S3, the specific generation mechanism is as follows:
[generation-mechanism formulas not reproduced in the source]
where μ_1, …, μ_6 ≥ 0; at initialization, T^(best) = T^(worst) = T^(0) = T^(-1), and T^(rand) is a randomly generated compression matrix, with
[formulas not reproduced in the source]
8. the semantic coding lossless compression method based on the heuristic linear transformation as claimed in claim 3, wherein: in the step S5, the expected speed-up multiple of the compression matrix is
[expectTC formula not reproduced in the source]
The specific compression dimension may be set by expectTC.
CN202110289154.7A 2021-03-18 2021-03-18 Semantic coding lossless compression system and method based on heuristic linear transformation Active CN113055018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110289154.7A CN113055018B (en) 2021-03-18 2021-03-18 Semantic coding lossless compression system and method based on heuristic linear transformation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110289154.7A CN113055018B (en) 2021-03-18 2021-03-18 Semantic coding lossless compression system and method based on heuristic linear transformation

Publications (2)

Publication Number Publication Date
CN113055018A (en) 2021-06-29
CN113055018B (en) 2023-05-12

Family

ID=76513465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110289154.7A Active CN113055018B (en) 2021-03-18 2021-03-18 Semantic coding lossless compression system and method based on heuristic linear transformation

Country Status (1)

Country Link
CN (1) CN113055018B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080077570A1 (en) * 2004-10-25 2008-03-27 Infovell, Inc. Full Text Query and Search Systems and Method of Use
CN106776553A (en) * 2016-12-07 2017-05-31 中山大学 A kind of asymmetric text hash method based on deep learning
US20190163817A1 (en) * 2017-11-29 2019-05-30 Oracle International Corporation Approaches for large-scale classification and semantic text summarization
CN110502613A (en) * 2019-08-12 2019-11-26 腾讯科技(深圳)有限公司 A kind of model training method, intelligent search method, device and storage medium
CN110825901A (en) * 2019-11-11 2020-02-21 腾讯科技(北京)有限公司 Image-text matching method, device and equipment based on artificial intelligence and storage medium
CN111382260A (en) * 2020-03-16 2020-07-07 腾讯音乐娱乐科技(深圳)有限公司 Method, device and storage medium for correcting retrieved text
CN111444320A (en) * 2020-06-16 2020-07-24 太平金融科技服务(上海)有限公司 Text retrieval method and device, computer equipment and storage medium
CN111881334A (en) * 2020-07-15 2020-11-03 浙江大胜达包装股份有限公司 Keyword-to-enterprise retrieval method based on semi-supervised learning
CN111753060A (en) * 2020-07-29 2020-10-09 腾讯科技(深圳)有限公司 Information retrieval method, device, equipment and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Yu (李宇), "Research on Text Fragmentation Mechanisms in Document Retrieval" (《文档检索中文本片段化机制的研究》), Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》) *

Also Published As

Publication number Publication date
CN113055018B (en) 2023-05-12


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant