CN104933063A - Data processing method, searching method and apparatus - Google Patents


Info

Publication number: CN104933063A
Authority: CN (China)
Keywords: word frequency, raw data, high word frequency file, vocabulary
Legal status (as listed; not a legal conclusion): Granted, active
Application number: CN201410102604.7A
Other languages: Chinese (zh)
Other versions: CN104933063B (granted publication)
Inventor: 王忻
Current assignee (as listed; may be inaccurate): Singularity Xinyuan International Technology Development (Beijing) Co.,Ltd.
Original assignee: CHONGQING XINMEI AGRICULTURAL INFORMATION TECHNOLOGY CO LTD

Abstract

The present invention provides a data processing method, a data searching method and a data processing apparatus. The data processing method comprises: calculating a compression ratio of each word in original data; carrying out compression on the words in the original data, of which the compression ratios are greater than a preset threshold value, and generating a high word frequency file, wherein the high word frequency file comprises the words and position information of the words in the original data; and after deleting the words, of which the compression ratios are greater than the preset threshold value, from the original data, compressing the original data to generate a non-high-word-frequency file. According to the embodiments of the present invention, adoption of the data processing method can enable the data to occupy a small storage space in the storing process and is beneficial for improving a transmission speed in the network transmitting process.

Description

Data processing method, searching method and device
Technical field
The present invention relates to the technical field of data processing, and in particular to a data processing method, a search method and an apparatus.
Background
At present, network services such as online shopping, information retrieval and information websites must process very large volumes of data (for example, text). The traditional approach encodes the data directly in a predetermined format, but the encoded file is still very large, which hinders later use such as storage and transmission. For example:
Space flight: also known as space flight, space flight, space travel or space shuttle. Mean the navigation activity of spacecraft at space. Some scientists once called space flight spacecraft in interplanetary navigation activity, and spacecraft is called space flight in extrasolar navigation activity, then spacecraft were referred to as space flight with extrasolar navigation activity in the solar system now. The object of solar-system operation is exploration, development and utilization space and celestial body, is mankind's services. The pacing items of space flight is the speed that spacecraft must reach enough, breaks away from the gravitation of the earth or the sun.
This passage contains 170 characters in total (including punctuation marks). If it is stored in UTF-8 (8-bit Unicode Transformation Format) form, where each character takes 3 bytes, the encoded file occupies 510 bytes. Such a file takes up considerable storage space, and because of the data volume its transmission time is long.
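The byte count above can be checked with a short sketch. The sample string is a stand-in (170 repetitions of one CJK character), not the original passage, and the sketch assumes every character falls in the range that UTF-8 encodes in 3 bytes.

```python
# Stand-in text: 170 CJK characters (an assumption, not the original passage).
sample = "航" * 170


def utf8_size(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


print(utf8_size(sample))  # 510
```

ASCII characters take 1 byte in UTF-8, so the figure of 3 bytes per character holds only for text made up of characters such as Chinese ideographs and full-width punctuation.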
In addition, applying the traditional data processing method in network services degrades the user experience. Take search as an example: in the traditional approach, raw data is stored unchanged in the local file system, which consumes considerable storage space. In distributed search in particular, the volume of search-result data is also very large and network transmission takes a long time, which slows the search. Moreover, the traditional way of creating an index is as follows: after the index server receives plaintext data, it first creates the index and then stores the plaintext data; when a user's search hits a record, the plaintext data is read back from disk and returned to the user. Disk I/O and network transmission thus easily become the bottleneck for system performance and harm the user experience.
Summary of the invention
In view of this, the present invention provides a data processing method and apparatus; data processed by the method or apparatus occupies little storage space and needs a short transmission time over a network. The present invention also provides a data search method and apparatus that improve the user's network service experience.
An embodiment of the present invention provides a data processing method for processing raw data into a high word frequency file and a non-high-word-frequency file, comprising:
calculating the compression ratio of each word in the raw data;
compressing the words in the raw data whose compression ratio is greater than a preset threshold to generate the high word frequency file, the high word frequency file comprising the words and the position information of the words in the raw data; and
after deleting the words whose compression ratio is greater than the preset threshold from the raw data, compressing the raw data to generate the non-high-word-frequency file.
Preferably, calculating the compression ratio of each word in the raw data comprises: calculating the compression ratio of each word from the number of times the word occurs in the raw data, the number of characters in the word, and the number of bytes needed to encode one character of the raw data. Specifically, the compression ratio is calculated by a formula in which Co denotes the compression ratio, W_F the number of times the word occurs in the raw data, W_L the number of characters in the word, n the number of bytes needed to encode one character of the raw data, and f the compression factor.
Preferably, before calculating the compression ratio of each word in the raw data, the method further comprises setting the value of the compression factor f according to the number of characters in the raw data. Specifically: when the number of characters in the raw data is less than or equal to 256, f is set to 1; when it is greater than 256 and less than or equal to 65536, f is set to 2; when it is greater than 65536 and less than or equal to 16777216, f is set to 3; and when it is greater than 16777216 and less than or equal to 4294967296, f is set to 4.
Preferably, after the high word frequency file and the non-high-word-frequency file are generated, the method further comprises: sending the high word frequency file and the non-high-word-frequency file to a search server, which stores them and creates from them an index used for searching.
An embodiment of the present invention also provides a data search method, comprising:
receiving a search condition entered by a user through an access client;
searching according to the received search condition and extracting the corresponding high word frequency file and non-high-word-frequency file according to the search result, wherein the high word frequency file is the file generated by compressing the words in the raw data whose compression ratio is greater than a preset threshold, and the non-high-word-frequency file is the file generated by compressing the raw data after those words have been deleted from it; and
sending the extracted high word frequency file and non-high-word-frequency file to the access client, which generates the raw data from them.
An embodiment of the present invention also provides a data processing apparatus for processing raw data to generate a high word frequency file and a non-high-word-frequency file, comprising:
a computing module, configured to calculate the compression ratio of each word in the raw data;
a high word frequency file generating module, configured to compress the words in the raw data whose compression ratio is greater than a preset threshold to generate the high word frequency file, the high word frequency file comprising the words and the position information of the words in the raw data; and
a non-high-word-frequency file generating module, configured to compress the raw data, after the words whose compression ratio is greater than the preset threshold have been deleted from it, to generate the non-high-word-frequency file.
Preferably, the computing module is configured to calculate the compression ratio of each word from the number of times the word occurs in the raw data, the number of characters in the word, and the number of bytes needed to encode one character of the raw data. Specifically, it calculates the compression ratio by a formula in which Co denotes the compression ratio, W_F the number of times the word occurs in the raw data, W_L the number of characters in the word, n the number of bytes needed to encode one character of the raw data, and f the compression factor.
Preferably, the data processing apparatus further comprises a setting module, configured to set the value of the compression factor f according to the number of characters in the raw data. Specifically: when the number of characters in the raw data is less than or equal to 256, f is set to 1; when it is greater than 256 and less than or equal to 65536, f is set to 2; when it is greater than 65536 and less than or equal to 16777216, f is set to 3; and when it is greater than 16777216 and less than or equal to 4294967296, f is set to 4.
Preferably, the data processing apparatus further comprises a sending module, configured to send the high word frequency file and the non-high-word-frequency file, after they are generated, to a search server, which stores them and creates from them an index used for searching.
An embodiment of the present invention also provides a data search apparatus, comprising:
a receiving module, configured to receive a search condition entered by a user through an access client;
a processing module, configured to search according to the received search condition and extract the corresponding high word frequency file and non-high-word-frequency file according to the search result, wherein the high word frequency file is the file generated by compressing the words in the raw data whose compression ratio is greater than a preset threshold, and the non-high-word-frequency file is the file generated by compressing the raw data after those words have been deleted from it; and
a sending module, configured to send the extracted high word frequency file and non-high-word-frequency file to the access client, which merges them to generate the raw data.
Beneficial effects of the present invention:
Storing raw data directly produces a large data volume that is inconvenient to transmit over a network. To address this, the data processing method or apparatus of the embodiments calculates the compression ratio of each word in the raw data, compresses the words whose compression ratio is greater than a preset threshold to generate a high word frequency file, deletes those words from the raw data, and generates a compressed non-high-word-frequency file. After this processing the data occupies little storage space and is therefore transmitted quickly over a network.
The data search method and apparatus of the embodiments search the high word frequency file and the non-high-word-frequency file, so the search is fast. The search result returned to the access client also consists of the high word frequency file and the non-high-word-frequency file, and the access client restores the raw data, which speeds up data transmission over the network and improves the user's network experience.
Brief description of the drawings
The invention is further described below with reference to the drawings and embodiments:
Fig. 1 is a schematic flowchart of an embodiment of the data processing method provided by the present invention.
Fig. 2 is a schematic flowchart of an embodiment of the data search method provided by the present invention.
Fig. 3 is a schematic structural diagram of an embodiment of the data processing apparatus provided by the present invention.
Fig. 4 is a schematic structural diagram of an embodiment of the data search apparatus provided by the present invention.
Embodiment
Please refer to Fig. 1, a schematic flowchart of an embodiment of the data processing method provided by the present invention. The method comprises the following steps:
Step S11: calculate the compression ratio of each word in the raw data.
In this embodiment, the compression ratio is the ratio of the number of bytes saved by compressing a word to the number of bytes the word occupies when uncompressed. Step S11 may calculate the compression ratio of a word by a formula in which Co denotes the compression ratio, W_F the number of times the word occurs in the raw data, W_L the number of characters in the word, n the number of bytes needed to encode one character of the raw data, and f the compression factor. The compression factor indicates the number of bytes occupied by each piece of position information, and the position information represents the position of the word in the raw data.
Here n depends on the encoding used. For example, if the raw data is encoded and stored in UTF-8 form, one UTF-8 encoded Chinese character needs 3 bytes, so n = 3.
The value of f may be preset from historical experience. To further improve the compression ratio of the raw data, this embodiment sets f according to the number of characters in the raw data (including punctuation marks); Table one gives the relation.
Table one:
Number of characters in raw data (L)    f
L <= 256                                1
256 < L <= 65536                        2
65536 < L <= 16777216                   3
16777216 < L <= 4294967296              4
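The lookup in Table one can be sketched as a small function. The name `position_bytes` is hypothetical; the thresholds are the powers of 256 listed in the table, which is consistent with f being the number of bytes needed to address any character position in the raw data.

```python
def position_bytes(char_count: int) -> int:
    """Return f, the bytes used per position entry, following Table one."""
    if char_count < 0:
        raise ValueError("character count must be non-negative")
    for f in (1, 2, 3, 4):
        # 256**1 = 256, 256**2 = 65536, 256**3 = 16777216, 256**4 = 4294967296
        if char_count <= 256 ** f:
            return f
    raise ValueError("Table one does not cover more than 4294967296 characters")


print(position_bytes(170))  # 1
```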
As an example, the following passage illustrates how the compression ratio of a word is calculated:
Space flight: also known as space flight, space flight, space travel or space shuttle. Mean the navigation activity of spacecraft at space. Some scientists once called space flight spacecraft in interplanetary navigation activity, and spacecraft is called space flight in extrasolar navigation activity, then spacecraft were referred to as space flight with extrasolar navigation activity in the solar system now. The object of solar-system operation is exploration, development and utilization space and celestial body, is mankind's services. The pacing items of space flight is the speed that spacecraft must reach enough, breaks away from the gravitation of the earth or the sun.
The word "space flight" occurs 4 times in total, so its W_F = 4; because the word contains only 2 characters, its W_L = 2. The compression ratio of every word in the passage can be calculated in the same way.
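The formula itself does not appear in this text, so the following is only a sketch of one form that matches the stated definition (bytes saved divided by uncompressed bytes). It assumes that a compressed word is stored once (W_L × n bytes) together with one position entry of f bytes per occurrence (W_F × f bytes); the function name and this storage layout are assumptions.

```python
def compression_ratio(w_f: int, w_l: int, n: int = 3, f: int = 1) -> float:
    """Assumed form: (uncompressed bytes - compressed bytes) / uncompressed bytes."""
    original = w_f * w_l * n        # every occurrence stored in full
    compressed = w_l * n + w_f * f  # one copy of the word plus its position entries
    return (original - compressed) / original


# "space flight": W_F = 4, W_L = 2, with n = 3 (UTF-8) and f = 1 (L <= 256)
print(round(compression_ratio(4, 2), 3))  # 0.583
# A two-character word that occurs only once yields a negative ratio.
print(compression_ratio(1, 2) < 0)        # True
```

Under this assumed form, a word that occurs only once always has a negative ratio, because storing its position costs extra bytes and nothing is saved.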
Step S12: compress the words in the raw data whose compression ratio is greater than the preset threshold to generate the high word frequency file. The high word frequency file comprises the words and the position information of the words in the raw data.
In general, compression is effective when the threshold is greater than 0, and a larger threshold gives a better compression effect; however, if the threshold is set too high, no word's compression ratio may satisfy the condition. The specific threshold can therefore be set from experience.
Step S13: after deleting the words whose compression ratio is greater than the preset threshold from the raw data, compress the raw data to generate the non-high-word-frequency file.
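Steps S11 to S13 can be sketched end to end as follows. This is a minimal illustration, not the patented implementation: it assumes the text has already been segmented into non-empty tokens, uses an assumed ratio formula (bytes saved over uncompressed bytes, with one stored copy of the word plus f bytes per position), records positions as character offsets, and leaves out the final compression of the remainder. All names are hypothetical.

```python
from collections import defaultdict


def split_by_ratio(tokens, n=3, f=1, threshold=0.0):
    """Split segmented text into (high-frequency word positions, remaining text)."""
    # Step S11 input: record the character offset of every occurrence of each token.
    positions = defaultdict(list)
    offset = 0
    for token in tokens:
        positions[token].append(offset)
        offset += len(token)

    def ratio(word):
        w_f, w_l = len(positions[word]), len(word)
        original = w_f * w_l * n
        return (original - (w_l * n + w_f * f)) / original

    # Step S12: words above the threshold go to the high word frequency file.
    high = {w: p for w, p in positions.items() if ratio(w) > threshold}
    # Step S13: what is left after deleting those words (compression omitted).
    remainder = "".join(t for t in tokens if t not in high)
    return high, remainder


high, remainder = split_by_ratio(["ab", "X", "Y", "ab", "Z", "ab"])
print(high)       # {'ab': [0, 4, 7]}
print(remainder)  # XYZ
```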
The embodiment of the present invention is described below using the passage about space flight given above.
To calculate the compression ratio of the words, the passage must first be segmented into words and the word frequency (that is, W_F) of each word counted, as shown in Table two:
Table two
Word              Positions (byte)        Word frequency
Spacecraft        x1c x2f x41 x57 x93     5
Space flight      x00 x15 x3e x6c         4
Also known as     x03                     1
Space             x06                     1
Flight            x08 x0d x17             3
Universe          x10                     1
Navigation        x12 x23 x38 x4a         4
Mean              x1a                     1
Space             x0b x20 x7f             3
Movable           x25 x4c x67             3
Have              x28                     1
The sun           x33 x45 x51 x5b x60     5
In system         x35                     1
Be called         x3c x4e x6a             3
Outside system    x47 x62                 2
Space flight      x50                     1
Now               x53                     1
Object            x74                     1
Explore           x77                     1
Exploitation      x79                     1
Utilize           x7d                     1
Celestial body    x82                     1
The mankind       x86                     1
Service           x88                     1
Substantially     x8e                     1
Condition         x90                     1
Necessary         x96                     1
Reach             x98                     1
Enough            x9a                     1
Speed             x9d                     1
Break away from   xa0                     1
The earth         xa2                     1
Gravitation       xa8                     1
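A table of this kind could be built roughly as follows. The sketch assumes the text has already been segmented into tokens by some word segmenter, uses hypothetical tokens rather than the passage itself, and counts offsets in characters; the function name is also an assumption.

```python
from collections import defaultdict


def word_positions(tokens):
    """Map each token to the list of character offsets at which it starts."""
    positions = defaultdict(list)
    offset = 0
    for token in tokens:
        positions[token].append(offset)
        offset += len(token)
    return dict(positions)


# Hypothetical segmented text; the word frequency is the length of each offset list.
table = word_positions(["ab", "cd", "ab", "e", "ab"])
for word, offsets in table.items():
    print(word, " ".join(f"x{o:02x}" for o in offsets), len(offsets))
# ab x00 x04 x07 3
# cd x02 1
# e x06 1
```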
The compression ratio of each word is then calculated, as shown in Table three.
Table three
Words whose compression ratio is greater than 0 are treated as high-frequency words. They are extracted and compressed to generate the high word frequency file, and are then deleted from the passage above to generate the non-high-word-frequency file.
The non-high-word-frequency file is as follows:
: also known as space, universe or.Mean.Some scientists once being interior, space flight, now then be interior and system.Object be explore, development and utilization and celestial body, be mankind's services.Pacing items be to reach enough speed, break away from the earth or gravitation.
After the above processing in this embodiment, the overall compression ratio of the raw data is 118/510 = 23.1%, a better compression effect than the traditional approach.
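The figure of 118 bytes can be reproduced under a few assumptions, which are not stated in this text: each word is stored once plus f bytes per position, n = 3 and f = 1, the high-frequency words are the nine words in Table two that occur more than once, and the first of them (frequency 5) has 3 characters while the others have 2. The character lengths in particular are guesses about the original Chinese words.

```python
# (word frequency from Table two, ASSUMED number of characters in the word)
high_words = [(5, 3), (4, 2), (3, 2), (4, 2), (3, 2), (3, 2), (5, 2), (3, 2), (2, 2)]
N_BYTES = 3        # bytes per character (UTF-8, Chinese text)
F_BYTES = 1        # bytes per position entry (L <= 256)
TOTAL_BYTES = 510  # size of the encoded passage


def saved_bytes(w_f: int, w_l: int, n: int = N_BYTES, f: int = F_BYTES) -> int:
    """Bytes saved by storing a word once plus one position entry per occurrence."""
    return w_f * w_l * n - (w_l * n + w_f * f)


total_saved = sum(saved_bytes(w_f, w_l) for w_f, w_l in high_words)
print(total_saved, round(total_saved / TOTAL_BYTES, 3))  # 118 0.231
```

If these assumptions hold, 118 is the number of bytes saved, and the 23.1% figure is the saving relative to the 510-byte original.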
This embodiment can be applied to web search. One possible application process is as follows: an application server crawls raw data from the network, or obtains it in some other way, and processes it into a high word frequency file and a non-high-word-frequency file according to steps S11 to S13. The application server then sends both files to a search server, which stores them and creates from them an index used for searching.
The data processing method of this embodiment uses word segmentation to divide the raw data into a high word frequency file and a non-high-word-frequency file. Compared with the raw data, the two files occupy less storage space and are transmitted faster over a network.
Please refer to Fig. 2, a schematic flowchart of an embodiment of the data search method provided by the present invention. The method comprises the following steps:
Step S21: receive a search condition entered by a user through an access client.
Step S22: search according to the received search condition, and extract the corresponding high word frequency file and non-high-word-frequency file according to the search result.
Here, the high word frequency file is the file generated by compressing the words in the raw data whose compression ratio is greater than a preset threshold.
The non-high-word-frequency file is the file generated by compressing the raw data after the words whose compression ratio is greater than the preset threshold have been deleted from it.
Step S23: send the extracted high word frequency file and non-high-word-frequency file to the access client, which merges them to generate the raw data.
In this embodiment, the method may be executed by a search server that provides a search service.
This embodiment searches the high word frequency file and the non-high-word-frequency file, so the search is fast. The search result returned to the access client also consists of the two files, and the access client restores the raw data, which speeds up data transmission over the network and improves the user's network experience.
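The merge performed by the access client can be sketched as below. The text does not specify the file layout, so this sketch assumes the high word frequency file has been decoded into a mapping from each word to its character offsets in the raw data, and the non-high-word-frequency file has been decompressed into the remaining text; the function name is hypothetical.

```python
def restore(remainder: str, high: dict) -> str:
    """Rebuild the raw text by re-inserting each word at its recorded offsets."""
    inserts = sorted((pos, word) for word, offsets in high.items() for pos in offsets)
    parts = []
    length = 0  # number of characters of raw text rebuilt so far
    used = 0    # number of characters consumed from the remainder
    for pos, word in inserts:
        gap = pos - length  # remainder characters that precede this word
        parts.append(remainder[used:used + gap])
        used += gap
        parts.append(word)
        length = pos + len(word)
    parts.append(remainder[used:])
    return "".join(parts)


print(restore("XYZ", {"ab": [0, 4, 7]}))  # abXYabZab
```

Processing the insertions in ascending offset order means each offset can be interpreted directly against the text rebuilt so far, with no shifting of later positions.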
Please refer to Fig. 3, a schematic structural diagram of an embodiment of the data processing apparatus provided by the present invention. The apparatus comprises:
A computing module 11, configured to calculate the compression ratio of each word in the raw data.
The computing module 11 calculates the compression ratio of each word from the number of times the word occurs in the raw data, the number of characters in the word, and the number of bytes needed to encode one character of the raw data. Specifically:
The compression ratio of a word is calculated by a formula in which Co denotes the compression ratio, W_F the number of times the word occurs in the raw data, W_L the number of characters in the word, n the number of bytes needed to encode one character of the raw data, and f the compression factor.
Here n depends on the encoding used. For example, if the raw data is encoded and stored in UTF-8 form, one UTF-8 encoded Chinese character needs 3 bytes, so n = 3.
The value of f depends on the number of characters in the raw data (including punctuation marks); Table one gives the relation.
Table one:
Number of characters in raw data (L)    f
L <= 256                                1
256 < L <= 65536                        2
65536 < L <= 16777216                   3
16777216 < L <= 4294967296              4
In addition, before the computing module 11 performs its calculation, the value of f can be set by a setting module 10. The setting module 10 sets the value of the compression factor f according to the number of characters in the raw data. Specifically: when the number of characters in the raw data is less than or equal to 256, f is set to 1; when it is greater than 256 and less than or equal to 65536, f is set to 2; when it is greater than 65536 and less than or equal to 16777216, f is set to 3; and when it is greater than 16777216 and less than or equal to 4294967296, f is set to 4.
A high word frequency file generating module 12, configured to compress the words in the raw data whose compression ratio is greater than the preset threshold to generate the high word frequency file. The high word frequency file comprises the words and the position information of the words in the raw data.
In general, compression is effective when the threshold is greater than 0, and a larger threshold gives a better compression effect; however, if the threshold is set too high, no word's compression ratio may satisfy the condition. The specific threshold can therefore be set from experience.
A non-high-word-frequency file generating module 13, configured to compress the raw data, after the words whose compression ratio is greater than the preset threshold have been deleted from it, to generate the non-high-word-frequency file.
A sending module 14, configured to send the high word frequency file and the non-high-word-frequency file to a search server, which stores them and creates from them an index used for searching.
The data processing apparatus of this embodiment uses word segmentation to divide the raw data into a high word frequency file and a non-high-word-frequency file. Compared with the raw data, the two files occupy less storage space and are transmitted faster over a network.
Please refer to Fig. 4, a schematic structural diagram of an embodiment of the data search apparatus provided by the present invention. The apparatus comprises:
A receiving module 21, configured to receive a search condition entered by a user through an access client.
A processing module 22, configured to search according to the received search condition and extract the corresponding high word frequency file and non-high-word-frequency file according to the search result.
Here, the high word frequency file is the file generated by compressing the words in the raw data whose compression ratio is greater than a preset threshold.
The non-high-word-frequency file is the file generated by compressing the raw data after the words whose compression ratio is greater than the preset threshold have been deleted from it.
A sending module 23, configured to send the extracted high word frequency file and non-high-word-frequency file to the access client, which generates the raw data from them, that is, merges the two files.
This embodiment searches the high word frequency file and the non-high-word-frequency file, so the search is fast. The search result returned to the access client also consists of the two files, and the access client restores the raw data, which speeds up data transmission over the network and improves the user's network experience.
Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution may be modified or replaced with equivalents without departing from its purpose and scope, and all such modifications and replacements fall within the scope of the claims of the present invention.

Claims (10)

1. A data processing method for processing raw data to generate a high word frequency file and a non-high-word-frequency file, characterized by comprising:
calculating the compression ratio of each word in the raw data;
compressing the words in the raw data whose compression ratio is greater than a preset threshold to generate the high word frequency file, the high word frequency file comprising the words and the position information of the words in the raw data; and
after deleting the words whose compression ratio is greater than the preset threshold from the raw data, compressing the raw data to generate the non-high-word-frequency file.
2. The data processing method as claimed in claim 1, characterized in that calculating the compression ratio of each word in the raw data comprises:
calculating the compression ratio of each word from the number of times the word occurs in the raw data, the number of characters in the word, and the number of bytes needed to encode one character of the raw data, this step specifically comprising: calculating the compression ratio of each word in the raw data by a formula in which Co denotes the compression ratio, W_F the number of times the word occurs in the raw data, W_L the number of characters in the word, n the number of bytes needed to encode one character of the raw data, and f the compression factor.
3. The data processing method as claimed in claim 2, characterized in that, before calculating the compression ratio of each word in the raw data, the method further comprises: setting the value of the compression factor f according to the number of characters in the raw data, this step being specifically: when the number of characters in the raw data is less than or equal to 256, f is set to 1; when it is greater than 256 and less than or equal to 65536, f is set to 2; when it is greater than 65536 and less than or equal to 16777216, f is set to 3; and when it is greater than 16777216 and less than or equal to 4294967296, f is set to 4.
4. The data processing method as claimed in any one of claims 1-3, characterized in that, after the high word frequency file and the non-high-word-frequency file are generated, the method further comprises:
sending the high word frequency file and the non-high-word-frequency file to a search server, which stores them and creates from them an index used for searching.
5. A data search method, characterized by comprising:
Receiving a search condition input by a user through an access client;
Searching according to the received search condition, and extracting the corresponding high word frequency file and non-high-word-frequency file according to the search result, wherein the high word frequency file is a file generated by compressing the words in raw data whose compression ratio is greater than a preset threshold, and the non-high-word-frequency file is a file generated by compressing the raw data after the words whose compression ratio is greater than the preset threshold have been deleted from the raw data;
Sending the extracted high word frequency file and non-high-word-frequency file to the access client, the access client generating the raw data from the extracted high word frequency file and non-high-word-frequency file.
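The client-side step of regenerating the raw data from the two files can be sketched as below. The file layouts are assumptions made only for illustration: the high word frequency file is taken to be a zlib-compressed JSON object mapping each word to its list of positions, and the non-high-word-frequency file a zlib-compressed JSON list of the remaining tokens in their original order. The function name `rebuild_raw_data` is hypothetical.

```python
import json
import zlib


def rebuild_raw_data(high_file: bytes, rest_file: bytes) -> list:
    """Merge the two files back into the original token sequence.

    Assumed layouts (not taken from the source):
      high_file: zlib(JSON {word: [positions in the raw data]})
      rest_file: zlib(JSON [remaining tokens, in original order])
    """
    high = json.loads(zlib.decompress(high_file))
    rest = json.loads(zlib.decompress(rest_file))

    total = len(rest) + sum(len(positions) for positions in high.values())
    tokens = [None] * total

    # Put each high-frequency word back at its recorded positions.
    for word, positions in high.items():
        for pos in positions:
            tokens[pos] = word

    # Fill the remaining gaps, in order, with the non-high-frequency tokens.
    remaining = iter(rest)
    return [tok if tok is not None else next(remaining) for tok in tokens]
```

Because the high word frequency file records positions, the remaining tokens only need to keep their relative order for the merge to be lossless.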
6. A data processing apparatus for processing raw data to generate a high word frequency file and a non-high-word-frequency file, characterized by comprising:
A computing module, configured to calculate the compression ratio of each word in the raw data;
A high word frequency file generating module, configured to compress the words in the raw data whose compression ratio is greater than a preset threshold and generate a high word frequency file, the high word frequency file comprising the words and the position information of the words in the raw data;
A non-high-word-frequency file generating module, configured to delete from the raw data the words whose compression ratio is greater than the preset threshold and then compress the raw data to generate a non-high-word-frequency file.
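The division of work among the three modules can be sketched as follows. This is an illustration under stated assumptions rather than the claimed implementation: the raw data is assumed to be already tokenized, the compression ratio is supplied by the caller as a placeholder function `ratio(word, count)` because the formula itself is not reproduced here, and JSON plus zlib stand in for whatever encoding and compression the apparatus actually uses. All names are hypothetical.

```python
import json
import zlib
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def split_raw_data(
    tokens: List[str],
    ratio: Callable[[str, int], float],
    threshold: float,
) -> Tuple[bytes, bytes]:
    """Return (high word frequency file, non-high-word-frequency file)."""
    # Record every position at which each word occurs in the raw data.
    positions: Dict[str, List[int]] = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)

    # Computing module: keep the words whose (placeholder) ratio exceeds the threshold.
    high = {
        word: pos for word, pos in positions.items()
        if ratio(word, len(pos)) > threshold
    }

    # Delete those words from the raw data; what is left keeps its original order.
    rest = [token for token in tokens if token not in high]

    # Each part is compressed separately (zlib and JSON are assumptions).
    high_file = zlib.compress(json.dumps(high).encode("utf-8"))
    rest_file = zlib.compress(json.dumps(rest).encode("utf-8"))
    return high_file, rest_file
```

For example, `split_raw_data("a b a c a b".split(), lambda word, count: count, 1)` uses the raw occurrence count as a stand-in score, so "a" and "b" go to the high word frequency file with their positions and only "c" remains in the other file.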
7. The data processing apparatus as claimed in claim 6, characterized in that:
The computing module is configured to calculate the compression ratio of each word according to the number of times the word occurs in the raw data, the number of characters contained in the word, and the number of bytes needed to encode one character in the raw data, and is specifically configured to:
According to the formula: calculate the compression ratio of each word in the raw data, wherein Co denotes the compression ratio, W_F denotes the number of times the word occurs in the raw data, W_L denotes the number of characters contained in the word, n is the number of bytes needed to encode one character in the raw data, and f is the compression factor.
8. The data processing apparatus as claimed in claim 7, characterized by further comprising:
A setting module, configured to set the value of the compression factor f according to the number of characters contained in the raw data, and specifically configured to: set f to 1 when the number of characters contained in the raw data is less than or equal to 256; set f to 2 when the number of characters is greater than 256 and less than or equal to 65536; set f to 3 when the number of characters is greater than 65536 and less than or equal to 16777216; and set f to 4 when the number of characters is greater than 16777216 and less than or equal to 4294967296.
9. The data processing apparatus according to any one of claims 6-8, characterized by further comprising:
A sending module, configured to send the high word frequency file and the non-high-word-frequency file to a search server after the two files have been generated, the search server storing the high word frequency file and the non-high-word-frequency file and creating, from the two files, an index used for searching.
10. A data search apparatus, characterized by comprising:
A receiving module, configured to receive a search condition input by a user through an access client;
A processing module, configured to search according to the received search condition and extract the corresponding high word frequency file and non-high-word-frequency file according to the search result, wherein the high word frequency file is a file generated by compressing the words in raw data whose compression ratio is greater than a preset threshold, and the non-high-word-frequency file is a file generated by compressing the raw data after the words whose compression ratio is greater than the preset threshold have been deleted from the raw data;
A sending module, configured to send the extracted high word frequency file and non-high-word-frequency file to the access client, the access client merging the extracted high word frequency file and non-high-word-frequency file to generate the raw data.
CN201410102604.7A 2014-03-19 2014-03-19 Data processing method, searching method and device Active CN104933063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410102604.7A CN104933063B (en) 2014-03-19 2014-03-19 Data processing method, searching method and device

Publications (2)

Publication Number Publication Date
CN104933063A (en) 2015-09-23
CN104933063B (en) 2018-08-24

Family

ID=54120231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410102604.7A Active CN104933063B (en) 2014-03-19 2014-03-19 Data processing method, searching method and device

Country Status (1)

Country Link
CN (1) CN104933063B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6309424B1 (en) * 1998-12-11 2001-10-30 Realtime Data Llc Content independent data compression method and system
WO2002039591A1 (en) * 2000-11-09 2002-05-16 Realtime Data Llc Content independent data compression method and system
CN1816182A (en) * 2005-02-02 2006-08-09 华为技术有限公司 Method of transmitting data to base station by base station controller
CN101751451A (en) * 2008-12-11 2010-06-23 高德软件有限公司 Chinese data compression method and Chinese data decompression method and related devices
CN101783788A (en) * 2009-01-21 2010-07-21 联想(北京)有限公司 File compression method, file compression device, file decompression method, file decompression device, compressed file searching method and compressed file searching device
CN102567322A (en) * 2010-12-09 2012-07-11 北京大学 Text compression method and text compression device
CN102929783A (en) * 2012-10-25 2013-02-13 华为技术有限公司 Data storage method, device and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115333685A (en) * 2022-10-10 2022-11-11 永鼎行远(南京)信息科技有限公司 Intelligent information allocation system based on big data
CN115333685B (en) * 2022-10-10 2023-02-28 永鼎行远(南京)信息科技有限公司 Intelligent information allocation system based on big data

Also Published As

Publication number Publication date
CN104933063B (en) 2018-08-24

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200605

Address after: Room 502-1, floor 5, building 2, courtyard 10, KEGU 1st Street, economic development zone, Daxing District, Beijing 100081

Patentee after: Singularity Xinyuan International Technology Development (Beijing) Co.,Ltd.

Address before: The 401121 northern New District of Chongqing municipality Mount Huangshan Road 5 south of Mercury Technology Building 1 floor office No. 3

Patentee before: A-MEDIA COMMUNICATION TECH Co.,Ltd.