CN106874518B

CN106874518B - Method and equipment for determining source of corpus and computing equipment

Info

Publication number: CN106874518B
Application number: CN201710153881.4A
Authority: CN
Inventors: 马东辰
Original assignee: Beijing Knownsec Information Technology Co Ltd
Current assignee: Beijing Knownsec Information Technology Co Ltd
Priority date: 2017-03-15
Filing date: 2017-03-15
Publication date: 2020-05-12
Anticipated expiration: 2037-03-15
Also published as: CN106874518A

Abstract

The invention discloses a method for determining the source of a corpus, which is suitable for being executed in a computing device, wherein the computing device is coupled with a corpus sample storage device, and the corpus sample storage device stores a corpus sample from at least one source, and the method comprises the following steps: obtaining a corpus sample of at least one source from a corpus sample storage device; combining the corpus sample of each source with the corpus of the source to be determined, and performing data compression according to a predetermined coding algorithm to generate a compressed file; calculating the compression rate of each compressed file; and determining the source corresponding to the compressed file with the highest compression rate in the obtained at least one compressed file as the source of the corpus of which the source is to be determined. The invention also discloses equipment and computing equipment for determining the source of the corpus.

Description

Method and equipment for determining source of corpus and computing equipment

Technical Field

The invention relates to the technical field of computers, in particular to a method, equipment and computing equipment for determining the source of a corpus.

Background

With the rapid development of network communication technology, the continuous deepening of internet applications, and the growing richness of the information it carries, the internet has become an important infrastructure of human society. As of June 2016, the number of internet users in China had reached 710 million, of whom 21.32 million were added in the first half of the year, a growth rate of 3.1%. Internet penetration reached 51.7%, exceeding the global average by 3.1 percentage points. These 710 million users generate a large amount of anonymous corpora (such as anonymous speech and anonymous malicious code) every day, which greatly affects social stability and the information security of the public. Therefore, it is necessary to determine the source of these corpora.

Typically, the source of a corpus can be determined by looking up the IP address and MAC address of the device that published it. However, this approach is costly and time-consuming, and it struggles to trace carefully disguised corpora, such as anonymous speech that a publisher posts over a network in a public place and then routes through multiple layers of proxies.

Therefore, a more advanced and effective scheme for determining the source of corpus is urgently needed.

Disclosure of Invention

To this end, the present invention provides a solution for determining the source of corpora in an attempt to solve or at least alleviate at least one of the problems presented above.

According to one aspect of the present invention, there is provided a method of determining a source of corpus, adapted to be executed in a computing device coupled to a corpus sample storage device, the corpus sample storage device storing corpus samples from at least one source, the method comprising the steps of: obtaining a corpus sample of at least one source from a corpus sample storage device; combining the corpus sample of each source with the corpus of the source to be determined, and performing data compression according to a predetermined coding algorithm to generate a compressed file; calculating the compression rate of each compressed file; and determining the source corresponding to the compressed file with the highest compression rate in the obtained at least one compressed file as the source of the corpus of which the source is to be determined.

According to another aspect of the present invention, there is provided an apparatus for determining a source of corpus, coupled to a corpus sample storage device, the corpus sample storage device storing corpus samples from at least one source, the apparatus comprising: a sample acquisition module adapted to acquire a corpus sample of at least one source from the corpus sample storage device; a corpus compression module adapted to, for the corpus sample of each source, combine the corpus sample with the corpus whose source is to be determined and perform data compression according to a predetermined coding algorithm to generate a compressed file; a ratio calculation module adapted to calculate the compression ratio of each compressed file generated by the corpus compression module; and a source determining module adapted to determine the source corresponding to the compressed file with the highest compression ratio among the obtained at least one compressed file as the source of the corpus whose source is to be determined.

According to yet another aspect of the present invention, there is provided a computing device comprising: at least one processor; and at least one memory including computer program instructions; the at least one memory and the computer program instructions are configured to, with the at least one processor, cause the computing device to perform a method of determining a source of a corpus according to the present invention.

According to the scheme for determining the source of the corpus, the collected corpus samples of the known source and the corpus of which the source is to be determined are compressed together, the compression ratio is calculated, and the source of the corpus is determined according to the compression ratio. The whole scheme is simple and quick to realize, the accuracy rate is high, and the operation experience of a user is greatly improved.

Drawings

To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.

FIG. 1 illustrates a block diagram of a computing device 100, according to an exemplary embodiment of the invention;

FIG. 2 illustrates a block diagram of an apparatus 200 for determining a source of corpus in accordance with an exemplary embodiment of the present invention; and

FIG. 3 illustrates a flow diagram of a method 300 of determining a source of corpus in accordance with an exemplary embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

FIG. 1 shows a block diagram of a computing device 100, according to an example embodiment of the present invention. The computing device 100 may be implemented as a server, such as a file server, a database server, an application server, a web server, and the like, or as a personal computer including desktop and notebook computer configurations. Moreover, computing device 100 may also be implemented as part of a small-form factor portable (or mobile) electronic device, such as a cellular telephone, a Personal Digital Assistant (PDA), a personal media player device, a wireless web-browsing device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions.

In a basic configuration 102, computing device 100 typically includes system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.

Depending on the desired configuration, the processor 104 may be any type of processor. The processor 104 may include one or more levels of cache, such as a level one cache 110 and a level two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof.

Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 106 may include an operating system 120, one or more applications 122, and program data 124. In some embodiments, application 122 may be arranged to operate with program data 124 on an operating system.

Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. The example output device 142 includes a graphics processing unit 148 and an audio processing unit 150. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communications with one or more other computing devices 162 over a network communication link via one or more communication ports 164.

A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or dedicated-line network, and various wireless media such as acoustic, Radio Frequency (RF), microwave, Infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.

In the present invention, the application 122 of the computing device 100 may include an apparatus 200 for determining the source of a corpus, which implements aspects of the present invention.

FIG. 2 is a block diagram illustrating the structure of an apparatus 200 for determining a source of corpus according to an exemplary embodiment of the present invention. As shown in FIG. 2, the apparatus 200 for determining a source of corpus is coupled to a corpus sample storage device and may include a sample obtaining module 220, a corpus compressing module 240, a ratio calculating module 260, and a source determining module 280. Here, a corpus refers to text or code, the corpus sample storage device stores corpus samples from at least one source, and the source of a corpus refers to the author of the corpus.

The sample obtaining module 220 can obtain the corpus samples of at least one source from the corpus sample storage device.

For the corpus sample of each source obtained, the corpus compression module 240 connected to the sample obtaining module 220 combines the corpus sample and the corpus of the source to be determined, and performs data compression according to a predetermined coding algorithm to generate a compressed file.

Specifically, the way of combining the corpus sample and the corpus of the source to be determined may be to add the corpus of the source to be determined into the corpus sample, where the adding position is usually the end or the beginning; it is also possible that the corpus to be sourced and the corpus sample are placed in the same electronic folder on the computing device 100.
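The first combination option described above (appending or prepending the corpus to the sample) can be sketched as follows. This is a minimal illustration only: the function name `combine`, the `position` parameter, and the newline separator are assumptions and are not specified in the text.

```python
def combine(sample: str, unknown: str, position: str = "end") -> str:
    """Combine a corpus sample with the corpus whose source is to be determined.

    `position` selects where the unknown corpus is added (hypothetical parameter).
    The newline separator is an assumption made for readability.
    """
    if position == "end":
        return sample + "\n" + unknown
    if position == "beginning":
        return unknown + "\n" + sample
    raise ValueError("position must be 'end' or 'beginning'")


combined = combine("known sample text", "text of unknown origin")
print(combined)
```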

In order to increase the compression rate of data compression, data is generally divided according to a predetermined division rule before compression.

After combining the corpus sample and the corpus to be sourced, the corpus compression module 240 may segment the combined corpus sample and corpus to be sourced according to a word or phrase or word code segment, according to an embodiment of the present invention. For example, text is segmented by words or phrases, and code is segmented by word code segments.

According to another embodiment of the present invention, the corpus sample obtained from the corpus sample storage device has been segmented according to words or phrases or word code segments, and the corpus compression module 240 may segment the corpus from which the source is to be determined in the same manner as the obtained corpus sample before combining the corpus sample and the corpus from which the source is to be determined.
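The segmentation step can be illustrated with a small sketch. The text does not name a particular tokenizer, so the regular expressions and the function names `segment_text` and `segment_code` below are assumptions; text in languages without whitespace word boundaries would need a dedicated word segmenter instead.

```python
import re


def segment_text(text):
    """Split natural-language text into word tokens, keeping punctuation as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def segment_code(source):
    """Split source code into identifiers, numbers, and single-symbol tokens."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


# Both the sample and the corpus to be attributed are segmented the same way.
sample_tokens = segment_text("beep boop beer!")
code_tokens = segment_code("x = foo(1)")
print(sample_tokens)
print(code_tokens)
```

Whichever rule is chosen, the point made in the two embodiments above is that the sample and the corpus whose source is to be determined go through the same segmentation, so that the compressor sees a consistent symbol alphabet.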

It will be appreciated that each author has a distinctive style of writing text and code, that is, certain habitual wording or code patterns. If corpora come from the same source, they contain more data redundancy, and the compression algorithm exploits this redundancy for compression.

According to an embodiment of the present invention, the corpus compression module 240 may perform data compression on the combined corpus sample and corpus whose source is to be determined according to Huffman coding.

In data processing on a computing device, Huffman coding encodes source symbols (such as the letters in a corpus) using a variable-length code table. The table is derived from the estimated occurrence probability of each source symbol: letters with a high occurrence probability receive shorter codes, and letters with a low occurrence probability receive longer codes. This reduces the expected (average) length of the encoded string and thereby achieves lossless data compression. The principle of Huffman coding is briefly described below by way of example.

Suppose the corpus to be compressed is "beep boop beer!". The corpus corresponds to the ASCII code 011000100110010101100101011100000010000001100010011011110110111101110000001000000110001001100101011001010111001000100001.
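As a quick check of the uncompressed representation, the following sketch renders each character of the example corpus as an 8-bit ASCII code. It is illustrative only; the variable names are assumptions.

```python
# Render each character of the example corpus as an 8-bit ASCII code.
text = "beep boop beer!"
ascii_bits = "".join(format(byte, "08b") for byte in text.encode("ascii"))

print(ascii_bits)
print(len(ascii_bits))  # 15 characters x 8 bits each
```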

First, the number of occurrences of each character in the corpus is counted, and the result is as follows:

Character	Number of occurrences
'b'	3
'e'	4
'p'	2
' '	2
'o'	2
'r'	1
'!'	1
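The counting step can be reproduced with a few lines of Python. This is a sketch using the standard-library `collections.Counter`; the variable names are assumptions.

```python
from collections import Counter

# Count how often each character occurs in the example corpus.
text = "beep boop beer!"
counts = Counter(text)

for char, n in counts.most_common():
    print(repr(char), n)
```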

Then, a Huffman tree can be created using a priority queue, with each left branch of the tree coded as 0 and each right branch coded as 1. Traversing the Huffman tree then yields the code of each character, for example: the code for 'b' is 00, the code for 'p' is 101, and the code for 'r' is 1000.

The resulting code table is as follows:

Character	Encoding
'b'	00
'e'	11
'p'	101
' '	011
'o'	010
'r'	1000
'!'	1001
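The priority-queue construction described above can be sketched with Python's `heapq` module. This is one possible implementation under stated assumptions, not necessarily the one used to produce the table: the function name `huffman_codes` is hypothetical, and because ties between equal frequencies may be broken differently, individual codes can differ from the table while the total encoded length stays the same.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(text):
    """Return a {symbol: bit string} Huffman code table for the characters of `text`."""
    order = count()  # tie-breaker so the heap never has to compare tree nodes
    heap = [(freq, next(order), sym) for sym, freq in Counter(text).items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # the two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(order), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: (left, right)
            walk(node[0], prefix + "0")      # left branch coded as 0
            walk(node[1], prefix + "1")      # right branch coded as 1
        else:
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


text = "beep boop beer!"
codes = huffman_codes(text)
total_bits = sum(len(codes[ch]) for ch in text)
print(codes)
print(total_bits)
```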

By referring to the code table, the Huffman encoding of the corpus is 0011111010110001001010101100111110001001, which is 40 bits long, compared with 120 bits for the ASCII representation.
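The table lookup can be checked with a short sketch that encodes the example corpus using the code table given above and decodes it again. The helper names `encode` and `decode` are assumptions; decoding works bit by bit because no code in the table is a prefix of another.

```python
# Code table copied from the example above.
CODE_TABLE = {
    "b": "00", "e": "11", "p": "101", " ": "011",
    "o": "010", "r": "1000", "!": "1001",
}


def encode(text, table):
    """Concatenate the code of every character in `text`."""
    return "".join(table[ch] for ch in text)


def decode(bits, table):
    """Read bits until they match a code, emit its character, and repeat."""
    reverse = {code: ch for ch, code in table.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in reverse:
            out.append(reverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)


encoded = encode("beep boop beer!", CODE_TABLE)
print(encoded)
print(decode(encoded, CODE_TABLE))
```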

After the corpus compression module 240 compresses the combined corpus sample and corpus whose source is to be determined and generates a compressed file, the ratio calculation module 260 connected to the corpus compression module 240 may calculate the compression ratio of each compressed file generated by the corpus compression module 240. Specifically, the ratio calculating module 260 may calculate the compression ratio of a compressed file from the size of the compressed file and the sizes of the corpus sample and the corpus whose source is to be determined that it contains.

According to one embodiment of the present invention, the formula for calculating the compression rate of a compressed file is as follows:

compression ratio = 1 - compressed file size / (corpus sample size + size of the corpus whose source is to be determined).
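The formula translates directly into code. In this sketch the function name `compression_ratio` and the byte sizes in the usage lines are hypothetical values chosen for illustration; all three sizes must be expressed in the same unit.

```python
def compression_ratio(compressed_size, sample_size, unknown_size):
    """compression ratio = 1 - compressed size / (sample size + unknown corpus size)."""
    original_size = sample_size + unknown_size
    if original_size <= 0:
        raise ValueError("combined original size must be positive")
    return 1 - compressed_size / original_size


# Hypothetical sizes in bytes: a 900-byte sample plus a 100-byte unknown corpus
# that together compress to 250 bytes.
ratio = compression_ratio(compressed_size=250, sample_size=900, unknown_size=100)
print(ratio)
```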

According to the principle of compression algorithms, when a corpus sample and the corpus whose source is to be determined are combined and then compressed, the compression ratio indicates whether the two come from the same source. Generally, if the corpus whose source is to be determined and the corpus sample come from the same source, the compression ratio is higher; otherwise, it is lower.

Therefore, the source determining module 280 connected to the ratio calculating module 260 may determine the source corresponding to the compressed file with the highest compression rate in the at least one compressed file obtained by the corpus compressing module 240 as the source of the corpus to be determined.
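The selection performed by the source determining module amounts to taking the maximum over the computed ratios. A minimal sketch follows, in which the function name `pick_source`, the source labels, and the ratio values are all hypothetical.

```python
def pick_source(ratios):
    """Return the source whose compressed file has the highest compression ratio."""
    if not ratios:
        raise ValueError("at least one compressed file is required")
    return max(ratios, key=ratios.get)


# Hypothetical compression ratios, one per known source.
ratios = {"author_a": 0.41, "author_b": 0.47, "author_c": 0.39}
print(pick_source(ratios))
```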

Further, in order to improve the accuracy of the source determination, according to an embodiment of the present invention, the source determining module 280 may further extract a part of the compressed files with a compression rate greater than a predetermined threshold from the at least one compressed file obtained by the corpus compressing module 240, and then determine a source corresponding to a compressed file with a highest compression rate from the extracted part of the compressed files as a source of the corpus of which the source is to be determined.

Thus, different from the traditional source determination method, the device 200 for determining the source of the corpus can simply, effectively and quickly determine the source of the corpus, has low cost and solves the problem that the source is difficult to trace after the corpus is disguised.

After determining the source of the corpus, the source determining module 280 may further add the corpus to the corpus sample from the same source and store the corpus sample in the corpus sample storage device, thereby realizing the accumulation of the corpus sample.

If none of the obtained compressed files has a compression rate greater than the predetermined threshold, it may be determined that none of the sources of the corpus samples obtained by the sample obtaining module 220 is the source of the corpus whose source is to be determined.
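The threshold variant, including the case where no known source qualifies, can be sketched as follows. The function name, the threshold values, and the ratio values are hypothetical; returning `None` to signal "no known source" is a design choice of this sketch, not something the text prescribes.

```python
from typing import Dict, Optional


def pick_source_with_threshold(ratios: Dict[str, float], threshold: float) -> Optional[str]:
    """Keep only ratios above `threshold`, then return the best remaining source.

    Returns None when no compressed file exceeds the threshold, meaning none of
    the sampled sources is taken to be the source of the corpus.
    """
    candidates = {src: r for src, r in ratios.items() if r > threshold}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Hypothetical compression ratios, one per known source.
ratios = {"author_a": 0.41, "author_b": 0.47, "author_c": 0.39}
print(pick_source_with_threshold(ratios, threshold=0.40))
print(pick_source_with_threshold(ratios, threshold=0.50))
```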

FIG. 3 illustrates a method 300 for determining a source of a corpus, which may be performed in a computing device 100, according to an exemplary embodiment of the invention, the computing device 100 being coupled to a corpus sample storage device, which stores corpus samples from at least one source. As shown in FIG. 3, the method 300 for determining the source of corpus begins at step S320.

In step S320, a corpus sample of at least one source is obtained from a corpus sample storage device. The corpus may include text and codes, and the source of the corpus may be an author of the corpus.

Then, in step S340, for the corpus sample of each source, the corpus sample and the corpus whose source is to be determined are combined, and data compression is performed according to a predetermined coding algorithm to generate a compressed file. The predetermined coding algorithm is typically Huffman coding.

According to an embodiment of the present invention, after the corpus sample and the corpus whose source is to be determined are combined, the combined corpus sample and corpus are segmented according to words, phrases, or word code segments.

According to another embodiment of the present invention, the obtained corpus sample is segmented according to words or phrases or word code segments, so that the corpus to be sourced can be segmented in the same way as the corpus sample before combining the corpus sample and the corpus to be sourced.

After a compressed file has been generated for the corpus sample of each source combined with the corpus whose source is to be determined, the compression rate of each compressed file is calculated in step S360. Specifically, the compression rate of a compressed file is calculated from the size of the compressed file and the sizes of the corpus sample and the corpus whose source is to be determined that it contains. For example, the formula may be as follows: compression ratio = 1 - compressed file size / (corpus sample size + size of the corpus whose source is to be determined).

After the compression rate of each compressed file is obtained, in step S380, the source corresponding to the compressed file with the highest compression rate in the obtained at least one compressed file is determined as the source of the corpus of which the source is to be determined.

According to another embodiment of the present invention, step S380 may further include: and extracting partial compressed files with the compression rate larger than a preset threshold value from the at least one obtained compressed file, and determining a source corresponding to one compressed file with the highest compression rate from the extracted partial compressed files as the source of the corpus of which the source is to be determined.

The corresponding processing of each step has already been explained in detail in the description of the apparatus 200 for determining the source of corpus with reference to FIG. 1 and FIG. 2, and repeated descriptions are omitted here.
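Steps S320 to S380 can be put together in a compact end-to-end sketch. It rests on several assumptions that are not taken from the text: corpus samples are held in an in-memory dictionary rather than a storage device, segmentation is by whitespace-separated words, the compressed size is the Huffman-coded bit count without the code table (a real compressed file would also store the table), and the names `huffman_bits` and `determine_source` and the sample data are hypothetical.

```python
import heapq
from collections import Counter


def huffman_bits(symbols):
    """Bits needed to Huffman-code `symbols`; the code table itself is not counted."""
    freqs = list(Counter(symbols).values())
    if len(freqs) == 1:
        return freqs[0]  # a single distinct symbol still costs one bit per occurrence
    heapq.heapify(freqs)
    total = 0
    while len(freqs) > 1:
        merged = heapq.heappop(freqs) + heapq.heappop(freqs)
        total += merged  # every merge adds one bit to each symbol beneath it
        heapq.heappush(freqs, merged)
    return total


def determine_source(samples, unknown):
    """Return (best_source, ratios) for the corpus `unknown`."""
    ratios = {}
    for source, sample in samples.items():          # S320: one sample per source
        tokens = (sample + " " + unknown).split()   # S340: combine, then segment by words
        compressed_bits = huffman_bits(tokens)      # S340: Huffman-coded size
        original_bits = 8 * (len(sample.encode("utf-8")) + len(unknown.encode("utf-8")))
        ratios[source] = 1 - compressed_bits / original_bits  # S360
    best = max(ratios, key=ratios.get)              # S380: highest compression ratio
    return best, ratios


# Hypothetical corpus samples keyed by author.
samples = {
    "alice": "the cat sat on the mat the cat sat",
    "bob": "def foo return bar def foo return bar baz",
}
best, ratios = determine_source(samples, "the cat sat on the mat")
print(best)
print(ratios)
```

In this toy data the unknown corpus reuses the vocabulary of the first sample, so the combined token stream has fewer distinct symbols and codes more compactly, which is the redundancy effect the method relies on.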

It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.

In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to perform the method of determining the source of the corpus of the present invention according to instructions in the program code stored in the memory.

By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embody computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and include any information delivery media. Combinations of any of the above are also included within the scope of computer readable media.

The invention also includes:

A6. The method according to any one of A1-5, wherein the step of determining the source of the corpus from which the source is to be determined further comprises: extracting a part of the obtained at least one compressed file, wherein the compression rate is larger than a preset threshold value; and determining the source corresponding to the compressed file with the highest compression rate in the extracted partial compressed files as the source of the corpus of which the source is to be determined.

A7. The method of any one of A1-6, wherein the source of said corpus comprises authors of said corpus.

A8. The method of any one of A1-7, wherein the corpus and corpus samples comprise text and code.

A9. The method of any one of A1-8, wherein the predetermined coding algorithm is Huffman coding.

B14. The apparatus of B13, wherein the formula for calculating the compression ratio of the compressed file is as follows: compression ratio = 1 - compressed file size / (corpus sample size + corpus size of the source to be determined).

B15. The device according to any of B9-14, wherein the source determining module is further adapted to extract a part of the obtained at least one compressed file with a compression rate greater than a predetermined threshold; and determining the source corresponding to the compressed file with the highest compression rate in the extracted partial compressed files as the source of the corpus of which the source is to be determined.

B16. The device as in any one of B9-15, wherein the source of the corpus comprises authors of the corpus.

B17. The device as in any one of B9-16, wherein the corpus and corpus samples comprise text and code.

B18. The apparatus according to any of B9-17, wherein the predetermined coding algorithm is Huffman coding.

It should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.

As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.

While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims

1. A method of determining a source of a corpus, adapted to be executed in a computing device coupled to a corpus sample storage device, the corpus sample storage device storing corpus samples from at least one source, the source of the corpus including an author of the corpus, the method comprising the steps of:

obtaining a corpus sample of at least one source from a corpus sample storage device;

for each of the corpus samples from each of the sources,

combining the corpus sample and the corpus of which the source is to be determined, and performing data compression according to a preset coding algorithm to generate a compressed file;

calculating the compression rate of each compressed file; and

determining the source corresponding to the compressed file with the highest compression rate in the obtained at least one compressed file as the source of the corpus of which the source is to be determined.

2. The method of claim 1, further comprising the steps of:

after combining the corpus sample and the corpus from which the source is to be determined,

segmenting the combined corpus sample and the corpus of which the source is to be determined according to words or phrases.

3. The method of claim 1, wherein the corpus samples are segmented by words or phrases, the method further comprising the steps of:

prior to combining the corpus sample and the corpus from which the source is to be determined,

segmenting the corpus of which the source is to be determined in the same manner as the corpus sample.

4. The method of claim 1, wherein the step of calculating the compression ratio of the compressed file further comprises:

calculating the compression ratio of the compressed file according to the size of the compressed file and the sizes of the corpus sample and the corpus of which the source is to be determined that are contained in the compressed file.

5. The method of claim 4, wherein the compression rate formula for calculating the compressed file is as follows:

compression ratio = 1-compressed file size/(corpus sample size + corpus size of the source to be determined).

6. The method according to any one of claims 1-5, wherein the step of determining the source of the corpus from which the source is to be determined further comprises:

extracting a part of the obtained at least one compressed file, wherein the compression rate is larger than a preset threshold value;

and determining the source corresponding to the compressed file with the highest compression rate in the extracted partial compressed files as the source of the corpus of which the source is to be determined.

7. The method of any of claims 1-5, wherein the corpus and corpus samples comprise text and code.

8. The method of any one of claims 1-5, wherein the predetermined encoding algorithm is Huffman encoding.

9. An apparatus for determining a source of corpus, coupled to a corpus sample storage device, the corpus sample storage device storing corpus samples from at least one source, the source of corpus including authors of the corpus, the apparatus for determining the source of corpus comprising:

a sample acquisition module adapted to acquire a corpus sample of at least one source from the corpus sample storage device;

the corpus compression module is suitable for combining the corpus sample and the corpus of the source to be determined for the corpus sample of each source, and performing data compression according to a preset coding algorithm to generate a compressed file;

the ratio calculation module is suitable for calculating the compression ratio of each compressed file generated by the corpus compression module; and

a source determining module adapted to determine the source corresponding to the compressed file with the highest compression rate in the obtained at least one compressed file as the source of the corpus of which the source is to be determined.

10. The apparatus of claim 9, wherein the corpus compression module is further adapted to compress the corpus

11. The apparatus of claim 9, wherein the corpus samples are segmented by words or phrases, the corpus compression module further adapted to segment the corpus samples by words or phrases

12. The apparatus of claim 9, wherein the ratio calculation module is further adapted to

13. The apparatus of claim 12, wherein a compression rate formula for calculating the compressed file is as follows:

14. The apparatus of any one of claims 9-13, wherein the source determination module is further adapted to

15. The apparatus of any of claims 9-13, wherein the corpus and corpus samples comprise text and code.

16. The apparatus of any one of claims 9-13, wherein the predetermined encoding algorithm is Huffman coding.