CN104933063A

CN104933063A - Data processing method, searching method and apparatus

Info

Publication number: CN104933063A
Application number: CN201410102604.7A
Authority: CN
Inventors: 王忻
Original assignee: CHONGQING XINMEI AGRICULTURAL INFORMATION TECHNOLOGY CO LTD
Current assignee: Singularity Xinyuan International Technology Development (Beijing) Co.,Ltd.
Priority date: 2014-03-19
Filing date: 2014-03-19
Publication date: 2015-09-23
Anticipated expiration: 2034-03-19
Also published as: CN104933063B

Abstract

The present invention provides a data processing method, a data searching method and a data processing apparatus. The data processing method comprises: calculating a compression ratio of each word in original data; carrying out compression on the words in the original data, of which the compression ratios are greater than a preset threshold value, and generating a high word frequency file, wherein the high word frequency file comprises the words and position information of the words in the original data; and after deleting the words, of which the compression ratios are greater than the preset threshold value, from the original data, compressing the original data to generate a non-high-word-frequency file. According to the embodiments of the present invention, adoption of the data processing method can enable the data to occupy a small storage space in the storing process and is beneficial for improving a transmission speed in the network transmitting process.

Description

Data processing method, searching method and device

Technical field

The present invention relates to the technical field of data processing, and in particular to a data processing method, a searching method and corresponding apparatus.

Background art

At present, network services such as online shopping, information retrieval and information websites need to process very large amounts of data (for example, text). The traditional approach encodes the data directly in a predetermined format, but the encoded files remain very large, which hinders later uses such as storage and transmission. For example:

Space flight: also known as space flight, space flight, space travel or space shuttle. It refers to the navigation activity of spacecraft in space. Some scientists once called the navigation activity of spacecraft in interplanetary space space flight, and the navigation activity of spacecraft outside the solar system space flight; now the navigation activity of spacecraft both within and outside the solar system is collectively referred to as space flight. The purpose of space activity is to explore, develop and utilize space and celestial bodies in the service of mankind. The basic condition of space flight is that the spacecraft must reach a sufficient speed to break away from the gravitation of the earth or the sun.

This passage contains 170 characters in total (including punctuation marks). If it is stored in UTF-8 (8-bit Unicode Transformation Format), with each character occupying 3 bytes, a file occupying 510 bytes is generated. Such a file takes up considerable storage space, and because of its size it takes a long time to transmit.
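The byte count above can be checked with a short Python sketch. The sample string `航天` is an assumed stand-in for a two-character word of the passage, and `utf8_size` is a hypothetical helper name; neither appears in the original text.

```python
def utf8_size(text: str) -> int:
    """Return the number of bytes needed to store `text` in UTF-8."""
    return len(text.encode("utf-8"))


# Assumed sample: a two-character Chinese word, 3 bytes per character in UTF-8.
sample = "航天"
print(utf8_size(sample))  # 6

# 170 characters at 3 bytes each, as stated in the passage above.
print(170 * 3)  # 510
```

Characters outside the CJK range may need fewer or more bytes in UTF-8, so the factor of 3 holds only for text like the example passage.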

In addition, applying traditional data processing methods in network services degrades the user experience. Take search as an example: in the traditional approach, original data is stored unchanged in the local file system, which consumes considerable storage space. In distributed search in particular, the volume of search results is also very large and requires a long network transmission time, which slows the search. Moreover, in the traditional index creation process, the index server receives plaintext data, first creates an index and then stores the plaintext data; when a user's search hits the record, the plaintext data is read back from disk and returned to the user. Disk I/O and network transmission thus easily become the bottleneck of system performance and impair the user experience.

Summary of the invention

In view of this, the present invention provides a data processing method and apparatus; data processed by the method or apparatus occupies little storage space when stored and requires a short transmission time over a network. The invention also provides a data search method and apparatus that can improve the user's network service experience.

An embodiment of the present invention provides a data processing method for processing original data into a high word frequency file and a non-high-word-frequency file, comprising:

calculating a compression ratio of each word in the original data;

compressing the words in the original data whose compression ratios are greater than a preset threshold to generate a high word frequency file, the high word frequency file comprising the words and position information of the words in the original data;

after deleting the words whose compression ratios are greater than the preset threshold from the original data, compressing the original data to generate a non-high-word-frequency file.

Preferably, calculating the compression ratio of each word in the original data comprises: calculating the compression ratio of each word according to the number of times the word occurs in the original data, the number of characters the word contains, and the number of bytes needed to encode one character of the original data. Specifically, the compression ratio of each word in the original data is calculated according to a formula in which Co denotes the compression ratio, W_F denotes the number of times the word occurs in the original data, W_L denotes the number of characters the word contains, n is the number of bytes needed to encode one character of the original data, and f is the compression ratio factor.

Preferably, before calculating the compression ratio of each word in the original data, the method further comprises setting the value of the compression ratio factor f according to the number of characters contained in the original data. Specifically: when the number of characters contained in the original data is less than or equal to 256, f is set to 1; when it is greater than 256 and less than or equal to 65536, f is set to 2; when it is greater than 65536 and less than or equal to 16777216, f is set to 3; when it is greater than 16777216 and less than or equal to 4294967296, f is set to 4.

Preferably, after the high word frequency file and the non-high-word-frequency file are generated, the method further comprises: sending the high word frequency file and the non-high-word-frequency file to a search server, the search server storing the two files and creating an index for searching according to them.

An embodiment of the present invention further provides a data search method, comprising:

receiving a search condition input by a user through an access client;

searching according to the received search condition, and extracting the corresponding high word frequency file and non-high-word-frequency file according to the search result, wherein the high word frequency file is a file generated by compressing the words in the original data whose compression ratios are greater than a preset threshold, and the non-high-word-frequency file is a file generated by compressing the original data after the words whose compression ratios are greater than the preset threshold have been deleted from it;

sending the extracted high word frequency file and non-high-word-frequency file to the access client, the access client generating the original data from the extracted high word frequency file and non-high-word-frequency file.

An embodiment of the present invention further provides a data processing apparatus for processing original data to generate a high word frequency file and a non-high-word-frequency file, comprising:

a computing module, configured to calculate a compression ratio of each word in the original data;

a high word frequency file generating module, configured to compress the words in the original data whose compression ratios are greater than a preset threshold to generate a high word frequency file, the high word frequency file comprising the words and position information of the words in the original data;

a non-high-word-frequency file generating module, configured to compress the original data to generate a non-high-word-frequency file after the words whose compression ratios are greater than the preset threshold have been deleted from the original data.

Preferably, the computing module is configured to calculate the compression ratio of each word according to the number of times the word occurs in the original data, the number of characters the word contains, and the number of bytes needed to encode one character of the original data. Specifically, it calculates the compression ratio of each word in the original data according to a formula in which Co denotes the compression ratio, W_F denotes the number of times the word occurs in the original data, W_L denotes the number of characters the word contains, n is the number of bytes needed to encode one character of the original data, and f is the compression ratio factor.

Preferably, the data processing apparatus further comprises a setting module, configured to set the value of the compression ratio factor f according to the number of characters contained in the original data. Specifically: when the number of characters contained in the original data is less than or equal to 256, f is set to 1; when it is greater than 256 and less than or equal to 65536, f is set to 2; when it is greater than 65536 and less than or equal to 16777216, f is set to 3; when it is greater than 16777216 and less than or equal to 4294967296, f is set to 4.

Preferably, the data processing apparatus further comprises a sending module, configured to send the high word frequency file and the non-high-word-frequency file to a search server after they are generated, the search server storing the two files and creating an index for searching according to them.

An embodiment of the present invention further provides a data search apparatus, comprising:

a receiving module, configured to receive a search condition input by a user through an access client;

a processing module, configured to search according to the received search condition and to extract the corresponding high word frequency file and non-high-word-frequency file according to the search result, wherein the high word frequency file is a file generated by compressing the words in the original data whose compression ratios are greater than a preset threshold, and the non-high-word-frequency file is a file generated by compressing the original data after the words whose compression ratios are greater than the preset threshold have been deleted from it;

a sending module, configured to send the extracted high word frequency file and non-high-word-frequency file to the access client, the access client merging the extracted high word frequency file and non-high-word-frequency file to generate the original data.

Beneficial effects of the present invention:

The data processing method or apparatus of the embodiments of the present invention addresses the problems that directly stored original data is large and inconvenient to transmit over a network. The compression ratio of each word in the original data is calculated; the words whose compression ratios are greater than the preset threshold are compressed to generate the high word frequency file; and those words are deleted from the original data to generate the compressed non-high-word-frequency file. After such processing, the data occupies little storage space when stored and is therefore transmitted quickly over a network.

With the data search method and apparatus of the embodiments of the present invention, the search is performed on the high word frequency file and the non-high-word-frequency file, so the search is fast. In addition, what is returned to the access client as the search result is also the high word frequency file and the non-high-word-frequency file, and the access client restores the original data. This increases the data transmission speed over the network and improves the user's network experience.

Brief description of the drawings

The invention is further described below with reference to the drawings and embodiments:

Fig. 1 is a schematic flow chart of an embodiment of the data processing method provided by the present invention.

Fig. 2 is a schematic flow chart of an embodiment of the data search method provided by the present invention.

Fig. 3 is a structural diagram of an embodiment of the data processing apparatus provided by the present invention.

Fig. 4 is a structural diagram of an embodiment of the data search apparatus provided by the present invention.

Detailed description of the embodiments

Please refer to Fig. 1, which is a schematic flow chart of an embodiment of the data processing method provided by the present invention. The method comprises the following steps:

Step S11: calculate the compression ratio of each word in the original data.

In this embodiment, the compression ratio is the ratio of the number of bytes saved by compressing a word to the number of bytes the word occupies when uncompressed. Step S11 may calculate the compression ratio of a word according to a formula in which Co denotes the compression ratio, W_F denotes the number of times the word occurs in the original data, W_L denotes the number of characters the word contains, n is the number of bytes needed to encode one character of the original data, and f is the compression ratio factor, which indicates the number of bytes occupied by the position information; the position information represents the position of the word in the original data.

Here n depends on the specific encoding. For example, if the original data is encoded and stored in UTF-8, then because one character encoded in UTF-8 requires 3 bytes, n = 3.

The value of f may be preset according to historical experience. To further improve the compression ratio of the original data, in this embodiment the value of f is set according to the number of characters (including punctuation marks) contained in the original data, as shown in Table one.

Table one:

Number of characters contained in the original data (L)	f
L <= 256	1
256 < L <= 65536	2
65536 < L <= 16777216	3
16777216 < L <= 4294967296	4
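The mapping in Table one follows the number of bytes needed to address any character position in the data (256 = 256^1, 65536 = 256^2, and so on). A minimal Python sketch of this lookup is given below; the function name `position_bytes` is hypothetical and chosen only for illustration.

```python
def position_bytes(total_chars: int) -> int:
    """Return the factor f from Table one for data of `total_chars` characters."""
    if total_chars < 0:
        raise ValueError("total_chars must be non-negative")
    for f in (1, 2, 3, 4):
        # Upper bounds are 256, 65536, 16777216 and 4294967296.
        if total_chars <= 256 ** f:
            return f
    raise ValueError("data longer than 4294967296 characters is not covered by Table one")


print(position_bytes(170))  # 1, the case of the 170-character example passage
```

Lengths above 4294967296 are outside the table, so the sketch raises an error rather than guessing a value.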

As an example, the passage about space flight quoted in the background section is used to illustrate how the compression ratio of a word is calculated:

In that passage, the word "space flight" occurs 4 times in total, so its W_F = 4; and because the word contains only 2 characters, its W_L = 2. By analogy, the compression ratio of every word in the passage can be calculated.
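The formula itself is not reproduced in this text. Purely as an illustration, the Python sketch below assumes one formula that is consistent with the stated definition (bytes saved divided by bytes occupied when uncompressed): the word is stored once, costing W_L * n bytes, plus one f-byte position per occurrence. Both the formula and the name `compression_ratio` are assumptions, not the formula of the patent.

```python
def compression_ratio(w_f: int, w_l: int, n: int, f: int) -> float:
    """Assumed compression ratio: bytes saved / bytes occupied uncompressed.

    w_f: occurrences of the word, w_l: characters in the word,
    n: bytes per encoded character, f: bytes per stored position.
    """
    uncompressed = w_f * w_l * n      # every occurrence stored in full
    compressed = w_l * n + w_f * f    # assumption: word once + one position per occurrence
    return (uncompressed - compressed) / uncompressed


# Values from the text: W_F = 4, W_L = 2, UTF-8 (n = 3), 170 characters (f = 1).
print(compression_ratio(4, 2, 3, 1) > 0)   # True
# A two-character word that occurs only once gains nothing under this assumption.
print(compression_ratio(1, 2, 3, 1) < 0)   # True
```

Under this assumed formula a word that occurs only once always has a negative ratio, which agrees with the later use of 0 as the threshold for high-frequency words.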

Step S12: compress the words in the original data whose compression ratios are greater than the preset threshold to generate the high word frequency file. Here, the high word frequency file comprises the words and the position information of the words in the original data.

In general, a compression effect is obtained when the threshold is greater than 0, and the larger the threshold, the better the compression effect. However, if the threshold is set too high, no word's compression ratio may satisfy the condition, so the specific threshold can be set empirically.

Step S13: after deleting the words whose compression ratios are greater than the preset threshold from the original data, compress the remaining data to generate the non-high-word-frequency file.
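Steps S11 to S13 can be sketched in Python as follows. The sketch makes several assumptions that are not in the original: the text is already segmented into tokens, positions are character offsets, the compression ratio uses an assumed formula (word stored once plus one f-byte position per occurrence), and `zlib` stands in for the unspecified compressor of the remaining text. The name `split_by_word_frequency` is hypothetical.

```python
import zlib
from collections import defaultdict


def split_by_word_frequency(tokens, n=3, f=1, threshold=0.0):
    """Return (high, rest): high maps word -> positions, rest is the remaining text."""
    positions = defaultdict(list)
    offset = 0
    for tok in tokens:
        positions[tok].append(offset)  # character offset in the original text
        offset += len(tok)

    def ratio(word):
        # Assumed formula: bytes saved / bytes occupied when uncompressed.
        w_f, w_l = len(positions[word]), len(word)
        plain = w_f * w_l * n
        return (plain - (w_l * n + w_f * f)) / plain

    # Step S12: words above the threshold go to the high word frequency part.
    high = {w: p for w, p in positions.items() if ratio(w) > threshold}
    # Step S13: delete those words from the original text.
    rest = "".join(t for t in tokens if t not in high)
    return high, rest


# Hypothetical, already segmented input.
tokens = ["ab", "XY", "cd", "XY", "e"]
high, rest = split_by_word_frequency(tokens)
non_high_file = zlib.compress(rest.encode("utf-8"))  # assumed compressor
print(high)  # {'XY': [2, 6]}
print(rest)  # abcde
```

For very short inputs such as this one, a general-purpose compressor may produce output longer than its input; the sketch shows only the data flow.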

The embodiment of the present invention is described below, again taking the aforementioned passage about space flight as an example.

To calculate the compression ratio of the words, the passage must first be segmented into words and the word frequency (i.e. W_F) of each word counted, as shown in Table two:

Table two

Word	Positions of occurrence (byte)	Word frequency
Spacecraft	x1c x2f x41 x57 x93	5
Space flight	x00 x15 x3e x6c	4
Also known as	x03	1
Space	x06	1
Flight	x08 x0d x17	3
Universe	x10	1
Navigation	x12 x23 x38 x4a	4
Mean	x1a	1
Space	x0b x20 x7f	3
Movable	x25 x4c x67	3
Have	x28	1
The sun	x33 x45 x51 x5b x60	5
In system	x35	1
Be called	x3c x4e x6a	3
Outside system	x47 x62	2
Space flight	x50	1
Now	x53	1
Object	x74	1
Explore	x77	1
Exploitation	x79	1
Utilize	x7d	1
Celestial body	x82	1
The mankind	x86	1
Service	x88	1
Substantially	x8e	1
Condition	x90	1
Necessary	x96	1
Reach	x98	1
Enough	x9a	1
Speed	x9d	1
Break away from	xa0	1
The earth	xa2	1
Gravitation	xa8	1
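A table of this kind can be produced by a single pass over the segmented text. The Python sketch below assumes the segmentation has already been done by some tokenizer (none is named in the original) and records character offsets; the token list and the name `word_positions` are hypothetical.

```python
from collections import defaultdict


def word_positions(tokens):
    """Map each word to the list of character offsets at which it occurs."""
    positions = defaultdict(list)
    offset = 0
    for tok in tokens:
        positions[tok].append(offset)
        offset += len(tok)  # the next token starts right after this one
    return dict(positions)


# Hypothetical, already segmented input.
table = word_positions(["ab", "XY", "cd", "XY", "e"])
for word, pos in table.items():
    # Columns: word, positions of occurrence, word frequency (W_F).
    print(word, " ".join(f"x{p:02x}" for p in pos), len(pos), sep="\t")
```

The word frequency W_F of a word is simply the length of its position list, so the two columns never need to be stored separately.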

The compression ratio of each word is then calculated, as shown in Table three.

Table three

Words whose compression ratio is greater than 0 are regarded as high-frequency words. They are extracted and compressed to generate the high word frequency file, and are then deleted from the above passage to generate the non-high-word-frequency file.

The non-high-word-frequency file is as follows:

: also known as space, universe or. Mean. Some scientists once being interior, space flight, now then be interior and system. Object be explore, development and utilization and celestial body, be mankind's services. Pacing items be to reach enough speed, break away from the earth or gravitation.

In this embodiment, after the above processing, the overall compression ratio of the original data is 118/510 = 23.1%, and the compression effect is better than that of traditional compression methods.

This embodiment can be applied to the field of web search. The application process may be as follows: after an application server captures original data from the network, or obtains it in some other way, it processes the data into a high word frequency file and a non-high-word-frequency file according to steps S11 to S13 and sends both files to a search server. The search server stores the two files and creates an index for searching according to them.
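The original does not describe how the search server builds its index. As a rough sketch under stated assumptions, the Python below keeps the file pairs in memory, indexes the words of the high word frequency file, and falls back to a substring scan of the non-high-word-frequency text (kept uncompressed here for simplicity). The class `SearchServer` and its methods are hypothetical.

```python
from collections import defaultdict


class SearchServer:
    """Hypothetical in-memory stand-in for the search server."""

    def __init__(self):
        self.store = {}                # doc_id -> (high_file, non_high_text)
        self.index = defaultdict(set)  # word -> ids of documents containing it

    def add(self, doc_id, high_file, non_high_text):
        """Store one file pair and index its high-frequency words."""
        self.store[doc_id] = (high_file, non_high_text)
        for word in high_file:
            self.index[word].add(doc_id)

    def search(self, word):
        """Return the stored file pairs of all documents matching `word`."""
        hits = set(self.index.get(word, ()))
        # Words that were not extracted still live in the non-high text.
        hits |= {d for d, (_, rest) in self.store.items() if word in rest}
        return [self.store[d] for d in sorted(hits)]


server = SearchServer()
server.add(1, {"XY": [2, 6]}, "abcde")  # hypothetical file pair
print(server.search("XY"))  # [({'XY': [2, 6]}, 'abcde')]
```

The search returns the stored pair rather than the restored text, which mirrors the idea that the access client, not the server, rebuilds the original data.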

The data processing method of this embodiment uses word segmentation to divide the original data into a high word frequency file and a non-high-word-frequency file. Compared with the original data, the two files occupy less storage space when stored and are transmitted faster over a network.

Please refer to Fig. 2, which is a schematic flow chart of an embodiment of the data search method provided by the present invention. The method comprises the following steps:

Step S21: receive a search condition input by a user through an access client.

Step S22: search according to the received search condition, and extract the corresponding high word frequency file and non-high-word-frequency file according to the search result.

Here, the high word frequency file is a file generated by compressing the words in the original data whose compression ratios are greater than the preset threshold.

The non-high-word-frequency file is a file generated by compressing the original data after the words whose compression ratios are greater than the preset threshold have been deleted from it.

Step S23: send the extracted high word frequency file and non-high-word-frequency file to the access client; the access client merges the extracted high word frequency file and non-high-word-frequency file to generate the original data.

In this embodiment, the executing entity may be a search server that provides a search service.

In this embodiment, the search is performed on the high word frequency file and the non-high-word-frequency file, so the search is fast. In addition, what is returned to the access client as the search result is also the high word frequency file and the non-high-word-frequency file, and the access client restores the original data. This increases the data transmission speed over the network and improves the user's network experience.
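The merge performed by the access client can be sketched as follows. The sketch assumes that the high word frequency file maps each word to its character offsets in the original text and that the non-high-word-frequency text has already been decompressed; the name `restore_original` and the sample data are hypothetical.

```python
def restore_original(high, rest):
    """Rebuild the original text from word positions and the remaining text."""
    # Flatten to (position, word) pairs, ordered by position in the original.
    placements = sorted((pos, word) for word, plist in high.items() for pos in plist)
    out = []
    cursor = 0    # current offset in the text being rebuilt
    rest_idx = 0  # next unused character of the remaining text
    for pos, word in placements:
        gap = pos - cursor  # characters of remaining text that precede this word
        out.append(rest[rest_idx:rest_idx + gap])
        rest_idx += gap
        out.append(word)
        cursor = pos + len(word)
    out.append(rest[rest_idx:])  # whatever follows the last extracted word
    return "".join(out)


# Hypothetical file pair: "XY" was extracted from offsets 2 and 6.
print(restore_original({"XY": [2, 6]}, "abcde"))  # abXYcdXYe
```

Because the positions refer to offsets in the original text, the words can be re-inserted in a single left-to-right pass without searching.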

Please refer to Fig. 3, which is a structural diagram of an embodiment of the data processing apparatus provided by the present invention. The apparatus comprises:

A computing module 11, configured to calculate the compression ratio of each word in the original data.

The computing module 11 calculates the compression ratio of each word according to the number of times the word occurs in the original data, the number of characters the word contains, and the number of bytes needed to encode one character of the original data. Specifically:

The compression ratio of a word is calculated according to a formula in which Co denotes the compression ratio, W_F denotes the number of times the word occurs in the original data, W_L denotes the number of characters the word contains, n is the number of bytes needed to encode one character of the original data, and f is the compression ratio factor.

The value of f depends on the number of characters (including punctuation marks) contained in the original data, as shown in Table one above.

In addition, before the computing module 11 performs its calculation, the value of f may be set by a setting module 10. Specifically, the setting module 10 is configured to set the value of the compression ratio factor f according to the number of characters contained in the original data: when the number of characters contained in the original data is less than or equal to 256, f is set to 1; when it is greater than 256 and less than or equal to 65536, f is set to 2; when it is greater than 65536 and less than or equal to 16777216, f is set to 3; when it is greater than 16777216 and less than or equal to 4294967296, f is set to 4.

A high word frequency file generating module 12, configured to compress the words in the original data whose compression ratios are greater than the preset threshold to generate the high word frequency file. Here, the high word frequency file comprises the words and the position information of the words in the original data.

A non-high-word-frequency file generating module 13, configured to compress the remaining data to generate the non-high-word-frequency file after the words whose compression ratios are greater than the preset threshold have been deleted from the original data.

A sending module 14, configured to send the high word frequency file and the non-high-word-frequency file to a search server; the search server stores the two files and creates an index for searching according to them.

Please refer to Fig. 4, which is a structural diagram of an embodiment of the data search apparatus provided by the present invention. The apparatus comprises:

A receiving module 21, configured to receive a search condition input by a user through an access client.

A processing module 22, configured to search according to the received search condition and to extract the corresponding high word frequency file and non-high-word-frequency file according to the search result.

A sending module 23, configured to send the extracted high word frequency file and non-high-word-frequency file to the access client; the access client generates the original data from the extracted files, that is, by merging the high word frequency file and the non-high-word-frequency file.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Although the invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that modifications or equivalent substitutions may be made to the technical solution of the invention without departing from its spirit and scope, and all such modifications shall fall within the scope of the claims of the present invention.

Claims

1. A data processing method for processing original data to generate a high word frequency file and a non-high-word-frequency file, characterized by comprising:

calculating a compression ratio of each word in the original data;

2. The data processing method according to claim 1, characterized in that calculating the compression ratio of each word in the original data comprises:

calculating the compression ratio of each word according to the number of times the word occurs in the original data, the number of characters the word contains, and the number of bytes needed to encode one character of the original data, this step specifically comprising: calculating the compression ratio of each word in the original data according to a formula in which Co denotes the compression ratio, W_F denotes the number of times the word occurs in the original data, W_L denotes the number of characters the word contains, n is the number of bytes needed to encode one character of the original data, and f is the compression ratio factor.

3. The data processing method according to claim 2, characterized in that, before calculating the compression ratio of each word in the original data, the method further comprises: setting the value of the compression ratio factor f according to the number of characters contained in the original data, this step specifically being: when the number of characters contained in the original data is less than or equal to 256, f is set to 1; when it is greater than 256 and less than or equal to 65536, f is set to 2; when it is greater than 65536 and less than or equal to 16777216, f is set to 3; when it is greater than 16777216 and less than or equal to 4294967296, f is set to 4.

4. The data processing method according to any one of claims 1-3, characterized in that, after the high word frequency file and the non-high-word-frequency file are generated, the method further comprises:

sending the high word frequency file and the non-high-word-frequency file to a search server, the search server storing the high word frequency file and the non-high-word-frequency file and creating an index for searching according to them.

5. A data search method, characterized by comprising:

receiving a search condition input by a user through an access client;

6. A data processing apparatus for processing original data to generate a high word frequency file and a non-high-word-frequency file, characterized by comprising:

a non-high-word-frequency file generating module, configured to compress the original data to generate a non-high-word-frequency file after the words whose compression ratios are greater than a preset threshold have been deleted from the original data.

7. The data processing apparatus according to claim 6, characterized in that:

the computing module is configured to calculate the compression ratio of each word according to the number of times the word occurs in the original data, the number of characters the word contains, and the number of bytes needed to encode one character of the original data, and specifically to:

calculate the compression ratio of each word in the original data according to a formula in which Co denotes the compression ratio, W_F denotes the number of times the word occurs in the original data, W_L denotes the number of characters the word contains, n is the number of bytes needed to encode one character of the original data, and f is the compression ratio factor.

8. The data processing apparatus according to claim 7, characterized by further comprising:

a setting module, configured to set the value of the compression ratio factor f according to the number of characters contained in the original data, specifically: when the number of characters contained in the original data is less than or equal to 256, f is set to 1; when it is greater than 256 and less than or equal to 65536, f is set to 2; when it is greater than 65536 and less than or equal to 16777216, f is set to 3; when it is greater than 16777216 and less than or equal to 4294967296, f is set to 4.

9. The data processing apparatus according to any one of claims 6-8, characterized by further comprising:

a sending module, configured to send the high word frequency file and the non-high-word-frequency file to a search server after they are generated, the search server storing the high word frequency file and the non-high-word-frequency file and creating an index for searching according to them.

10. A data search apparatus, characterized by comprising: