WO2013038527A1 - Extraction method, extraction program, extraction device, and extraction system - Google Patents
- Publication number
- WO2013038527A1 (PCT/JP2011/071028)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- information
- character
- file
- computer
- predetermined character
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/14—Details of searching files based on file metadata
- G06F16/148—File search processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7844—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
Definitions
- the present invention relates to an extraction method, an extraction program, an extraction apparatus, and an extraction system for extracting information.
- To solve the above-described problems of the prior art, an object of the present invention is to provide an extraction method, an extraction program, an extraction device, and an extraction system capable of suppressing the increase in search processing time that accompanies an increase in the number of files.
- Provided are an extraction method, an extraction program, an extraction device, and an extraction system in which storage means stores first information indicating, for each of a plurality of files, whether or not predetermined character information is included, and second information indicating whether or not the predetermined character information is included in at least one of the plurality of files, and in which, when a search request for the predetermined character information is received and the second information is detected to indicate that the predetermined character information is included, a file including the predetermined character information is extracted based on the first information.
- FIG. 1 is an explanatory diagram showing the distribution of the compression symbol map according to the present embodiment.
- FIG. 2 is an explanatory diagram showing a server storing segment groups.
- FIG. 3 is an explanatory diagram showing an example of adding a compression symbol map when a target file is added.
- FIG. 4 is an explanatory diagram showing hierarchization of appearance maps.
- FIG. 5 is an explanatory diagram showing hierarchization of the deletion map.
- FIG. 6 is an explanatory diagram showing details of the hierarchized segment group.
- FIG. 7 is an explanatory diagram showing a configuration example of a computer system in which the hierarchical segment group shown in FIG. 6 is implemented.
- FIG. 8 is an explanatory diagram showing an example of narrowing down a compressed file using a hierarchical segment group.
- FIG. 9 is a block diagram of a hardware configuration example of the computer according to the embodiment.
- FIG. 10 is an explanatory diagram of a system configuration example according to the present embodiment.
- FIG. 11 is a block diagram illustrating a functional configuration example 1 of the computer or the computer system according to the present embodiment.
- FIG. 12 is an explanatory diagram showing the flow of processing from the counting unit to the second compression unit of the computer shown in FIG.
- FIG. 13 is an explanatory diagram illustrating an example of creating the compression code map Ms by the summation unit and the creation unit.
- FIG. 14 is an explanatory diagram showing details of (1) counting the number of appearances.
- FIG. 17 is an explanatory diagram showing a correction result for each character information.
- FIG. 20 is an explanatory diagram showing a leaf structure.
- FIG. 21 is an explanatory diagram showing a structure of specific single characters.
- FIG. 22 is an explanatory diagram of a divided character code structure.
- FIG. 23 is an explanatory diagram of a basic word structure.
- FIG. 24 is an explanatory diagram of an example of generating a compression symbol map.
- FIG. 25 is a flowchart illustrating an example of a compression code map creation processing procedure by the creation unit.
- FIG. 26 is a flowchart illustrating a detailed processing procedure example of the aggregation processing (step S2501) illustrated in FIG.
- FIG. 27 is a flowchart illustrating a detailed processing procedure example of the target file aggregation processing (step S2603) illustrated in FIG.
- FIG. 28 is an explanatory diagram of a character appearance frequency tabulation table.
- FIG. 29 is a flowchart showing a detailed processing procedure example of the basic word totaling process (step S2702) shown in FIG.
- FIG. 30 is an explanatory diagram of a basic word appearance frequency totaling table.
- FIG. 31 is a flowchart showing a detailed processing procedure of the longest match search process (step S2901) shown in FIG.
- FIG. 32 is a flowchart showing a detailed processing procedure example of the map allocation number determination processing (step S2502) shown in FIG.
- FIG. 33 is a flowchart illustrating a detailed processing procedure example of the recounting process (step S2503) illustrated in FIG.
- FIG. 34 is a flowchart illustrating a detailed processing procedure example of the recalculation processing of the target file (step S3303).
- FIG. 35 is an explanatory diagram of an upper divided character code appearance frequency totaling table.
- FIG. 36 is an explanatory diagram of a lower divided character code appearance frequency totaling table.
- FIG. 37 is a flowchart showing a detailed processing procedure of the bi-gram character string specifying process (step S3406) shown in FIG.
- FIG. 38 is an explanatory diagram of a bi-gram character string appearance frequency totaling table.
- FIG. 39 is a flowchart showing a detailed processing procedure example of the Huffman tree generation processing (step S2504) shown in FIG.
- FIG. 40 is a flowchart showing a detailed processing procedure example of the branch number specifying process (step S3904) shown in FIG.
- FIG. 41 is a flowchart showing a detailed processing procedure of the construction process (step S3905) shown in FIG.
- FIG. 42 is a flowchart of a detailed process procedure of the leaf pointer generation process (step S4103) shown in FIG.
- FIG. 43 is a flowchart showing a detailed processing procedure example of the map creation processing (step S2505) shown in FIG.
- FIG. 44 is a flowchart of a detailed process procedure of the target file map creation process (step S4303) depicted in FIG.
- FIG. 45 is a flowchart showing a detailed processing procedure example of the basic word appearance map creation processing (step S4402) shown in FIG.
- FIG. 46 is a flowchart showing a detailed processing procedure example of the specific single character appearance map creation processing (step S4403) shown in FIG.
- FIG. 47 is a flowchart showing a detailed processing procedure example of the divided character code appearance map creation processing (step S4603) shown in FIG.
- FIG. 48 is a flowchart showing a detailed processing procedure example of the bi-gram character string map creation processing (step S4404) shown in FIG.
- FIG. 49 is a flowchart showing a detailed processing procedure example of the bi-gram character string appearance map generation processing (step S4803).
- FIG. 50 is an explanatory diagram illustrating a specific example of compression processing using the 2^N-branch nodeless Huffman tree H.
- FIG. 51 is a flowchart illustrating an example of a compression processing procedure for a target file group using the 2^N-branch nodeless Huffman tree H by the first compression unit.
- FIG. 52 is a flowchart (part 1) showing a detailed processing procedure of the compression processing (step S5103) shown in FIG.
- FIG. 53 is a flowchart (part 2) showing a detailed processing procedure of the compression processing (step S5103) shown in FIG.
- FIG. 54 is a flowchart (part 3) showing a detailed processing procedure of the compression processing (step S5103) shown in FIG.
- FIG. 55 is an explanatory diagram showing the relationship between the appearance rate and the appearance rate area.
- FIG. 56 is an explanatory diagram of a compression pattern table having compression patterns for each appearance rate area.
- FIG. 57 is an explanatory diagram showing compression patterns in the case of the B region and the B ′ region.
- FIG. 58 is an explanatory diagram showing compression patterns in the case of the C region and the C ′ region.
- FIG. 59 is an explanatory diagram showing compression patterns in the case of the D region and the D ′ region.
- FIG. 60 is an explanatory diagram showing compression patterns for the E region and the E ′ region.
- FIG. 61 is a flowchart showing a compression symbol map compression processing procedure.
- FIG. 62 is a block diagram showing a functional configuration example 2 of the computer or the computer system according to the present embodiment.
- FIG. 63 is an explanatory diagram of a file decompression example.
- FIG. 64 is an explanatory diagram (part 1) illustrating a specific example of the decompression process in FIG. 63.
- FIG. 65 is an explanatory diagram (part 2) of the specific example of the decompression process in FIG. 63.
- FIG. 66 is an explanatory diagram of a specific example of the file addition process.
- FIG. 67 is a flowchart of a detailed process procedure of the segment addition process.
- FIG. 68 is a flowchart (first half) showing a detailed processing procedure of the map update processing (step S6709) using the additional file shown in FIG.
- FIG. 69 is a flowchart (second half) showing a detailed processing procedure of the map update processing (step S6709) using the additional file shown in FIG.
- FIG. 70 is a flowchart showing a detailed processing procedure of the segment hierarchization processing.
- FIG. 71 is a flowchart showing a detailed processing procedure of the selected appearance map aggregation process (step S7004) shown in FIG.
- FIG. 72 is a flowchart showing a detailed processing procedure of the deletion map aggregation processing (step S7005) shown in FIG.
- FIG. 73 is a flowchart showing a search processing procedure according to the present embodiment.
- FIG. 74 is a flowchart (part 1) showing a detailed processing procedure of the pointer specifying process (step S7302) shown in FIG.
- FIG. 75 is a flowchart (part 2) showing a detailed processing procedure of the pointer specifying process (step S7302) shown in FIG.
- FIG. 76 is a flowchart of a detailed process procedure of the file narrowing process (step S7303) depicted in FIG.
- FIG. 77 is a flowchart (part 1) of a detailed process procedure example of the decompression process (step S7304) using the 2^N-branch nodeless Huffman tree H depicted in FIG. 73.
- FIG. 78 is a flowchart (part 2) of a detailed process procedure example of the decompression process (step S7304) using the 2^N-branch nodeless Huffman tree H depicted in FIG.
- character information refers to single characters, basic words, divided character codes, and the like that constitute text data.
- the target file group is, for example, electronic data such as a document file, a Web page, and an e-mail, and is, for example, electronic data in a text format, an HTML (HyperText Markup Language) format, or an XML (Extensible Markup Language) format.
- single character is a character expressed by one character code.
- the character code length of a single character varies depending on the character code type.
- For example, UTF16 (Unicode Transformation Format 16) is a 16-bit code, while ASCII (American Standard Code for Information Interchange) and Shift JIS (Japanese Industrial Standard) are 8-bit codes.
- When a Japanese character is expressed in Shift JIS code, two 8-bit codes are combined.
- “basic words” are basic words learned in elementary and junior high school, and reserved words expressed by specific character strings. Taking the English text “This is a ...” as an example, they are words such as “This”, “is”, and “a”, which are classified into a 1,000-word level, a 2,000-word level, and a several-thousand-word level and are marked in the dictionary with “***”, “**”, and “*”.
- the reserved word is a predetermined character string, and includes, for example, an HTML tag (for example, <br>).
- the “divided character code” is each code obtained by dividing a single character into an upper code and a lower code.
- a single character may be divided into an upper code and a lower code.
- the character code of a single character “turf” is represented by “9D82”, but is divided into an upper divided character code “0x9D” and a lower divided character code “0x82”.
- “gram” is a character unit. For example, for a single character, one character is 1 gram. With regard to the divided character code, the divided character code alone is 1 gram. Therefore, the single character “turf” is 2 grams.
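The division described above can be illustrated with a minimal Python sketch. The function names `split_character_code` and `gram_count` are hypothetical and chosen only for illustration; the sketch assumes a 16-bit character code such as the “9D82” example.

```python
def split_character_code(code: int) -> tuple[int, int]:
    """Split a 16-bit character code into (upper, lower) 8-bit divided codes."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("expected a 16-bit character code")
    return (code >> 8) & 0xFF, code & 0xFF


def gram_count(is_divided: bool) -> int:
    """An undivided single character is 1 gram; a divided character is 2 grams."""
    return 2 if is_divided else 1


upper, lower = split_character_code(0x9D82)
print(hex(upper), hex(lower))  # prints: 0x9d 0x82
```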
- UTF16 will be described as an example of a character code.
- In the following description, when a bit is ON its value is “1”, and when a bit is OFF its value is “0”.
- Alternatively, the value of a bit may be “0” when the bit is ON and “1” when the bit is OFF.
- “Appearance map” is an index for full-text search, and is a bit string concatenating a pointer that specifies character information and a bit string that indicates whether the character information exists in each target file. At the time of search processing, this bit string can be used as an index indicating whether or not to include character information to be searched according to ON / OFF of bits.
- As the pointer for designating character information, for example, the compression code of the character information is employed.
- the character information itself may also be used as the pointer for designating the character information.
- the “compression code map” is a bitmap in which appearance maps for each character information indicated by the pointers of the compression codes are collected.
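As a rough illustration of the appearance map and compression code map, the following Python sketch stores one bit string per character as an integer. It is a simplification under stated assumptions: the character itself is used as the pointer (which the description permits), bit i corresponds to the file with 0-based index i, and the names `build_code_map` and `files_containing_all` are hypothetical.

```python
def build_code_map(files: list[str]) -> dict[str, int]:
    """Return {character: bit string}; bit i is ON if file i contains the character."""
    code_map: dict[str, int] = {}
    for file_index, text in enumerate(files):
        for ch in set(text):
            code_map[ch] = code_map.get(ch, 0) | (1 << file_index)
    return code_map


def files_containing_all(code_map: dict[str, int], query: str, file_count: int) -> list[int]:
    """AND the appearance maps of every character in the query."""
    result = (1 << file_count) - 1  # start with every file as a candidate
    for ch in query:
        result &= code_map.get(ch, 0)
    return [i for i in range(file_count) if (result >> i) & 1]


files = ["abc", "bcd", "cab"]
code_map = build_code_map(files)
print(files_containing_all(code_map, "ab", len(files)))  # prints: [0, 2]
```

The result lists candidate files only: every queried character occurs somewhere in each listed file, but not necessarily as a contiguous string.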
- the compression code map of the bi-gram character string is a compression code string that combines the compression code of the first gram and the compression code of the second gram.
- “2-gram character string” is a character string in which 1-gram character codes are concatenated.
- the character string “doll play” includes two consecutive characters “doll”, “form turf”, and “play”. Since the two-character “doll” “person” and “shape” are single characters that are not divided, the two-character “doll” is a 2-gram character string as it is.
- When the target file group is compressed using basic words, the compression symbol map can be generated or searched with a single pass of access. If the target file group is not compressed, the character code of the character information may be used as it is as the pointer for specifying the character information.
- the “deletion map” is an index that indicates the existence or deletion of the target file as a bit string.
- the target file can be excluded from the search target by turning off the deletion map corresponding to the target file without deleting the target file itself.
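A minimal sketch of this logical deletion follows. It assumes, as an illustration only, that search hits are masked by ANDing an appearance map with the deletion map; the function names are hypothetical.

```python
def logically_delete(deletion_map: int, file_index: int) -> int:
    """Turn OFF the deletion-map bit of one file; the file itself is untouched."""
    return deletion_map & ~(1 << file_index)


def searchable_hits(appearance_map: int, deletion_map: int) -> int:
    """Keep only the hits whose deletion-map bit is still ON."""
    return appearance_map & deletion_map


deletion_map = 0b1111    # four files, all present (all bits ON)
appearance_map = 0b0110  # the character occurs in files 1 and 2
deletion_map = logically_delete(deletion_map, 2)
print(bin(searchable_hits(appearance_map, deletion_map)))  # prints: 0b10
```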
- FIG. 1 is an explanatory diagram showing the distribution of the compression symbol map according to the present embodiment.
- the segment sg0 (1) is a segment having a compression code map for compressed files f1 to fn.
- the segment sg0 (2) is a segment having a compression code map for compressed files f (n + 1) to f (2n).
- the segment sg0 (3) is a segment having a compression symbol map from the compressed files f (2n + 1) to f (3n).
- the segment sg0 (1) exists.
- the segment sg0 (2) is generated.
- the segment sg0 (3) is generated.
- the last segment sg0 (K) is the segment having the compression symbol map for the compressed files f ((K − 1) n + 1) to f (Kn), where K is the current number of segments and is an integer greater than or equal to 1.
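The relationship between file numbers and segments can be written down directly. The sketch below assumes 1-based file numbers and segment numbers, as in the description; `segment_of` and `files_in_segment` are hypothetical helper names.

```python
def segment_of(file_number: int, n: int) -> int:
    """Return the 1-based segment number that holds the given 1-based file number."""
    if file_number < 1 or n < 1:
        raise ValueError("file_number and n must be positive")
    return (file_number - 1) // n + 1


def files_in_segment(k: int, n: int) -> range:
    """Return the file numbers (k-1)n+1 .. kn held by segment sg0(k)."""
    return range((k - 1) * n + 1, k * n + 1)


print(segment_of(5, 4))                # prints: 2
print(list(files_in_segment(3, 4)))    # prints: [9, 10, 11, 12]
```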
- Each segment has management areas A1 to AK (management area group As).
- the management areas A1 to AK store a pointer to the preceding segment, a pointer to the succeeding segment, a pointer to each appearance map constituting the compression symbol map in the own segment, a pointer to the deletion map in the own segment, and a pointer to each compressed file in the own segment.
- For example, a pointer to the segment sg0 (1) (the address “00000000h” of the segment sg0 (1)) is stored as the pointer to the preceding segment of the segment sg0 (2). Further, “0FFFFFFFh” is stored as the pointer to the segment subsequent to the segment sg0 (2).
- the compression code maps M1 to MK (compression code map Ms) of each segment have appearance maps with the same character information, but have different file numbers in charge.
- the file number in charge of the compression code maps M1 to MK of each segment is the file number of the compressed file held by the segment.
- the compression symbol map MK of the segment sg0 (K) has a bit string indicating the presence / absence of the file numbers (K-1) n to Kn for the appearance map of each character information.
- the deletion map D1 to DK (deletion map Ds) of each segment also has a different file number in charge, like the compression symbol map group Ms.
- the file number assigned to each of the deletion maps D1 to DK is the file number of the compressed file held by the own segment.
- the deletion map DK of the segment sg0 (K) has a bit string indicating the presence or deletion of the file numbers (K-1) n to Kn for the appearance map of each character information.
- FIG. 2 is an explanatory diagram showing a server storing segment groups.
- the server 200 has a database 201.
- the database 201 stores an archive file 202.
- the archive file 202 includes a batch unit 211 and an adding unit 212.
- the batch unit 211 stores c segments sg0 (1) to sg0 (c) by default.
- the adding unit 212 stores the added segments sg0 (c + 1) to sg0 (K). When there is no more free space in the adding unit 212, it is stored in another server that can communicate with the server 200 via the network.
- FIG. 3 is an explanatory diagram showing an example of adding a compression symbol map when a target file is added.
- In (A), the segments sg0 (1) and sg0 (2) have been registered, and the appearance maps for the compression codes P (LT1) to P (LTz) of the character information LT1 to LTz store the index information of the file numbers 1 to 2n.
- (B) shows a state where the appearance map group is compressed from the state of (A).
- The compression method will be described later; compression is assumed to be performed, for example, when the number of files in a segment reaches a multiple of n. In this case, since the number of files is 2n, a multiple of n, the bit string serving as index information is compressed for each appearance map. The bit string is expanded when the presence or absence of the character information LT1 to LTz is checked.
- The decompression method will also be described later. Memory can thus be saved by normally storing the maps in compressed form and decompressing them only when necessary.
- (C) shows a state where a new compressed file f (2n + 1) is added from the state of (B).
- the segment sg0 (2), which is the last segment in (B), cannot store the compressed file f (2n + 1), so the segment sg0 (3) is newly set and the compressed file f (2n + 1) is saved there.
- a bit for the compression file f (2n + 1) is set for each compression code.
- “1” is set for the character information LT1, LT2, and “0” is set for LTz.
- (D) shows a state in which n compressed files f (2n + 1) to f (3n) are added in the segment sg0 (3) from the state of (C).
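The addition steps (A) to (D) can be sketched as follows. This is an illustrative model only: the `Segment` class and `add_file` function are hypothetical names, characters stand in for compression codes, and the compression of the appearance maps is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    capacity: int                      # n files per segment
    file_count: int = 0
    appearance_maps: dict[str, int] = field(default_factory=dict)

    def is_full(self) -> bool:
        return self.file_count >= self.capacity


def add_file(segments: list[Segment], text: str, n: int) -> None:
    """Append one file; open a new segment when the last one is full."""
    if not segments or segments[-1].is_full():
        segments.append(Segment(capacity=n))
    seg = segments[-1]
    position = seg.file_count          # bit position of the new file in this segment
    for ch in set(text):
        # Bit ON for characters present; absent characters keep an OFF bit.
        seg.appearance_maps[ch] = seg.appearance_maps.get(ch, 0) | (1 << position)
    seg.file_count += 1


segments: list[Segment] = []
for text in ["ab", "b", "a"]:
    add_file(segments, text, n=2)
print(len(segments))  # prints: 2
```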
- FIG. 4 is an explanatory diagram showing hierarchization of appearance maps.
- As the number of compressed files increases, the bit string serving as index information for each compression code becomes redundant.
- Because of this redundancy, the bits indicating presence or absence must be checked ON / OFF for the total number of files for each compression code, that is, for each character information; checking locations where the character information does not exist is wasteful and increases search time. Therefore, when m + 1 segments have been generated, the index information is aggregated in units of m for each compression code, that is, for each character information.
- a case where the index information of the compression code P (LTx) of the character information LTx is collected in an upper layer will be described as an example.
- X in “sgX (Y)” indicates a hierarchy number
- Y indicates a segment number. Therefore, in the case of sgX (Y), it is the Yth segment of the Xth hierarchy.
- the segments sg0 (1) to sg0 (K) described so far are the segments of the 0th layer.
- Such aggregation is not only between the 0th layer and the 1st layer, but when the number of segments in the top layer becomes m, a segment in the upper layer is newly generated. For example, when the segment is completed up to the segment sg1 (m) in the first hierarchy, the segment sg2 (1) in the second hierarchy is generated as described above.
- Although FIG. 4 shows an example up to the second layer, as the number of compressed files to be added increases, the index information is aggregated into the third and higher layers.
- FIG. 5 is an explanatory diagram showing hierarchization of the deletion map.
- the deletion map is also aggregated in the upper hierarchy in the segment unit as in FIG.
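The aggregation into upper layers can be sketched as below. The aggregation rule is an assumption made for illustration and is not stated in this section: each upper-layer bit is taken to be ON when any bit of the corresponding lower-layer segment is ON. The sketch also aggregates in one batch, including partial groups, rather than incrementally as segments are added. `aggregate` and `build_layers` are hypothetical names.

```python
def aggregate(lower_maps: list[int]) -> int:
    """Collapse up to m lower-layer bit strings into one upper-layer bit string."""
    upper = 0
    for position, bits in enumerate(lower_maps):
        if bits != 0:                  # assumed rule: ON if any lower bit is ON
            upper |= 1 << position
    return upper


def build_layers(layer0: list[int], m: int) -> list[list[int]]:
    """Aggregate groups of m segments repeatedly until one top segment remains."""
    if m < 2:
        raise ValueError("m must be at least 2")
    layers = [layer0]
    while len(layers[-1]) > 1:
        current = layers[-1]
        layers.append([aggregate(current[i:i + m]) for i in range(0, len(current), m)])
    return layers


# Four 0th-layer segments for one character, aggregated in units of m = 2.
print(build_layers([0b0000, 0b0100, 0b0000, 0b0011], m=2))
# prints: [[0, 4, 0, 3], [2, 2], [3]]
```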
- FIG. 6 is an explanatory diagram showing details of the hierarchized segment group.
- In the segment group of FIG. 6, m^2 segments sg0 (1) to sg0 (m^2) as shown in FIG. 1 are generated in the 0th hierarchy.
- segments sg1 (1) to sg1 (m) having a similar data structure are generated for the upper layer.
- In each upper-layer segment, an appearance map aggregated from the 0th layer is stored for each compression code.
- Likewise, an aggregated deletion map is stored for the deletion map.
- a pointer to the preceding segment and a pointer to the subsequent segment are set in each management area.
- a pointer to the aggregate appearance map and a pointer to the aggregate deletion map in the own segment are also stored.
- pointers to lower-layer segments are also stored. For example, the segment sg1 (1) stores pointers to the segments sg0 (1) to sg0 (m) in the lower hierarchy, so that the segments sg0 (1) to sg0 (m) can be designated. Note that compressed files are not stored in segments of the first hierarchy or higher.
- FIG. 7 is an explanatory diagram showing a configuration example of a computer system in which the hierarchical segment group SG shown in FIG. 6 is mounted.
- m segments are set as one archive file.
- “AX (Y)” is a code of the archive file, X indicates a hierarchy number, and Y indicates an archive number. Therefore, in the case of AX (Y), it is the Yth archive file in the Xth hierarchy.
- the archive file A0 (1) is a set of segments sg0 (1) to sg0 (m) in the 0th hierarchy.
- the master server MS stores an archive file of the first hierarchy or higher.
- the slave servers S1, S2,..., S (2m + 1),... Store one archive file assigned by the master server MS.
- the allocation of the archive file in FIG. 7 is an example, and the master server MS does not need to be in charge of all the archive files of the first layer and higher, and may be distributed to other servers.
- the slave servers S1, S2,..., S (2m + 1),... May be responsible for not only one archive file but also a plurality of archive files.
- FIG. 8 is an explanatory view showing an example of narrowing down the compressed file using the hierarchical segment group SG.
- the uppermost hierarchy is described as the second hierarchy.
- the solid line arrows indicate that a lower-layer segment is designated according to the AND result; the dotted line arrows indicate segments that are not actually designated and are illustrated only for comparison with the designated segments.
- FIG. 8 shows a case where “doll” is input as a search character string.
- the AND result is “0”. Accordingly, it is understood that the AND results of the segments sg1 (3) and sg1 (4) are all 0 without performing the AND operations of the segments sg1 (3) and sg1 (4).
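The top-down narrowing can be sketched as follows. This is an illustration under assumptions not fixed by this section: `layers_by_char[ch][level][segment]` is a hypothetical layout holding one bit string per segment, level 0 is the lowest layer, bit j of an upper-layer segment s refers to lower segment s*m + j, and file indices are 0-based. A lower segment is visited only when the AND result for its bit is ON.

```python
def narrow(layers_by_char: dict[str, list[list[int]]], query: str, m: int, n: int) -> list[int]:
    """Return 0-based indices of files whose bits are ON for every query character."""
    if not query or any(ch not in layers_by_char for ch in query):
        return []
    top = len(layers_by_char[query[0]]) - 1
    hits: list[int] = []

    def visit(level: int, segment: int) -> None:
        anded = -1                      # all bits ON before the first AND
        for ch in query:
            anded &= layers_by_char[ch][level][segment]
        width = n if level == 0 else m
        for j in range(width):
            if (anded >> j) & 1:
                if level == 0:
                    hits.append(segment * n + j)
                else:
                    visit(level - 1, segment * m + j)   # descend only where AND is ON

    for segment in range(len(layers_by_char[query[0]][top])):
        visit(top, segment)
    return hits


layers_by_char = {
    "x": [[0b01, 0b00, 0b10, 0b00], [0b01, 0b01], [0b11]],
    "y": [[0b11, 0b00, 0b00, 0b01], [0b01, 0b10], [0b11]],
}
print(narrow(layers_by_char, "xy", m=2, n=2))  # prints: [0]
```

In this example the AND at the first layer is zero for the second group, so the 0th-layer segments beneath it are never examined.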
- FIG. 9 is a block diagram of a hardware configuration example of the computer according to the embodiment.
- the computer includes a CPU (Central Processing Unit) 901, a ROM (Read Only Memory) 902, a RAM (Random Access Memory) 903, a magnetic disk drive 904, a magnetic disk 905, an optical disk drive 906, an optical disk 907, a display 908, an I / F (Interface) 909, a keyboard 910, a mouse 911, a scanner 912, and a printer 913.
- Each component is connected by a bus 900.
- the CPU 901 controls the entire computer.
- the ROM 902 stores programs such as a boot program.
- the ROM 902 stores a program for generating and managing the compression code map Ms and a program for performing a search using the compression code map Ms.
- the RAM 903 is used as a work area for the CPU 901, and the CPU 901 can read a program stored in the ROM 902 into the RAM 903 and execute it.
- the magnetic disk drive 904 controls reading / writing of data with respect to the magnetic disk 905 according to the control of the CPU 901.
- the magnetic disk 905 stores data written under the control of the magnetic disk drive 904.
- the optical disk drive 906 controls reading / writing of data with respect to the optical disk 907 according to the control of the CPU 901.
- the optical disk 907 stores data written under the control of the optical disk drive 906, and causes the information processing apparatus to read data stored on the optical disk 907.
- the display 908 displays data such as a document, an image, and function information as well as a cursor, an icon, or a tool box.
- As the display 908, for example, a CRT, a TFT liquid crystal display, a plasma display, or the like can be adopted.
- The interface (hereinafter abbreviated as “I / F”) 909 is connected to a network 914 such as a LAN (Local Area Network), a WAN (Wide Area Network), or the Internet through a communication line, and is connected to other devices via the network 914.
- the I / F 909 manages an internal interface with the network 914 and controls data input / output from an external device.
- a modem or a LAN adapter may be employed as the I / F 909.
- the keyboard 910 includes keys for inputting characters, numbers, various instructions, etc., and inputs data. Moreover, a touch panel type input pad or a numeric keypad may be used.
- the mouse 911 performs cursor movement, range selection, window movement, size change, and the like.
- a trackball or a joystick may be used as long as they have the same function as a pointing device.
- the scanner 912 optically reads an image and takes in the image data into the computer.
- the scanner 912 may have an OCR (Optical Character Reader) function.
- the printer 913 prints image data and document data.
- a laser printer or an inkjet printer can be employed as the printer 913.
- the computer may be a portable terminal such as a mobile phone, a smartphone, an electronic book terminal, or a notebook personal computer in addition to the above-described various servers and stationary personal computers.
- the present embodiment may be implemented by a plurality of computers.
- FIG. 10 is an explanatory diagram showing a system configuration example according to the present embodiment.
- the system includes computers 1001 to 1003, a network 1004, a switch 1005, and a radio base station 1007 that can include the hardware shown in FIG.
- the I / F included in the computer 1003 has a wireless communication function.
- the computer 1001 may execute processing for generating a compression code map for content including a plurality of files and distribute the content to the computers 1002 and 1003, and the computers 1002 and 1003 may each execute search processing on the distributed content.
- each of the computers 1001 to 1003 may be a portable terminal such as a mobile phone, a smartphone, an electronic book terminal, or a notebook personal computer in addition to the above-described various servers and stationary personal computers.
- FIG. 11 is a block diagram showing a functional configuration example 1 of the computer or computer system according to the present embodiment.
- FIG. 12 is an explanatory diagram showing the flow of processing from the totaling unit to the second compression unit of the computer or computer system shown in FIG.
- a computer or a computer system (hereinafter referred to as “computer 1100”) includes a totaling unit 1101, a first generation unit 1102, a first compression unit 1103, a creation unit 1104, a second generation unit 1105, and a second compression unit 1106.
- the totaling unit 1101 to the second compression unit 1106 realize their functions by causing the CPU 901 to execute a program stored in a storage device such as the ROM 902, the RAM 903, or the magnetic disk 905 shown in FIG. Note that each of the totaling unit 1101 to the second compression unit 1106 writes its execution result to the storage device and reads the execution results of the other units to perform its operations.
- the totaling unit 1101 to the second compression unit 1106 will be briefly described below.
- the totaling unit 1101 totals the number of appearances of character information in the target file group. Specifically, for example, the totaling unit 1101 counts the number of appearances of character information in the target file group Fs as shown in FIG. The totaling unit 1101 counts the number of appearances for each specific single character, upper divided character code, lower divided character code, bi-gram character string, and basic word. Detailed processing contents of the totaling unit 1101 will be described later.
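The counting step can be approximated with a short Python sketch. It is deliberately reduced: only single characters and bi-gram character strings are counted, while basic words and divided character codes are left out, and `count_appearances` is a hypothetical name.

```python
from collections import Counter


def count_appearances(files: list[str]) -> tuple[Counter, Counter]:
    """Count single characters and 2-character strings over all target files."""
    single: Counter = Counter()
    bigram: Counter = Counter()
    for text in files:
        single.update(text)
        bigram.update(text[i:i + 2] for i in range(len(text) - 1))
    return single, bigram


single, bigram = count_appearances(["abab", "ba"])
print(single["a"], bigram["ab"], bigram["ba"])  # prints: 3 2 2
```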
- The first generation unit 1102 generates a 2^N-branch nodeless Huffman tree H.
- The 2^N-branch nodeless Huffman tree H is a Huffman tree in which 2^N branches extend from the root and each leaf is pointed to directly by one or more branches. It has no internal nodes. Because a leaf is reached directly without traversing any node, decompression is faster than with a normal Huffman tree that has internal nodes.
- A leaf is a structure containing character information and its compression code, and is also called a leaf structure. The number of branches assigned to a leaf depends on the compression code length of the compression code in that leaf. The detailed processing of the first generation unit 1102 will be described later.
- The first compression unit 1103 compresses each target file of the target file group Fs using the 2^N-branch nodeless Huffman tree H to produce a compressed file group fs (FIG. 12(C)). The detailed processing of the first compression unit 1103 will be described later.
- The creation unit 1104 creates a compression code map Ms based on the totaling result of the totaling unit 1101 and the compression code assigned to each piece of character information in the 2^N-branch nodeless Huffman tree H.
- The creation unit 1104 creates a compression code map Ms for each of the specific single characters, upper divided character codes, lower divided character codes, bi-gram character strings, and basic words.
- When character information appears in a target file, the creation unit 1104 turns ON the bit of that file number in the compression code map Ms (FIG. 12(D)). In the initial state, all bits of the deletion map Ds, one per target file, are ON. The detailed processing of the creation unit 1104 will be described later.
- The second generation unit 1105 generates a nodeless Huffman tree h for compressing the appearance maps, based on the appearance probability of the character information (FIG. 12(E)). The detailed processing of the second generation unit 1105 will be described later.
- The nodeless Huffman tree h generated by the second generation unit 1105 of the master server MS is transmitted to the slave servers S1, S2, and so on.
- The second compression unit 1106 compresses each appearance map using the nodeless Huffman tree h generated by the second generation unit 1105 (FIG. 12(F)). The detailed processing of the second compression unit 1106 will be described later.
- In the slave servers S1, S2, and so on, the second compression unit 1106 performs this compression using the nodeless Huffman tree h generated and transmitted by the second generation unit 1105 of the master server MS.
- FIG. 13 is an explanatory diagram showing an example of the totaling by the totaling unit 1101 and the creation of the compression code map Ms by the creation unit 1104.
- The computer 1100 counts the number of appearances of the character information present in the target file group Fs.
- The totaling results are sorted in descending order of the number of appearances, and ranks are assigned in ascending order starting from the character information with the largest number of appearances.
- The computer 1100 calculates the compression code length for each piece of character information based on the totaling result obtained in (1). Specifically, the computer 1100 calculates the appearance rate of each piece of character information by dividing its number of appearances by the total number of appearances of all character information. The computer 1100 then obtains the occurrence probability corresponding to the appearance rate and derives the compression code length from the occurrence probability.
- The occurrence probability is expressed as 1/2^x, where x is the exponent.
- The compression code length is the exponent x of the occurrence probability. Specifically, the compression code length is determined by which of the following occurrence probability ranges the appearance rate falls in, where AR is the appearance rate.
- 1/2^0 > AR ≥ 1/2^1: the compression code length is 1 bit.
- 1/2^1 > AR ≥ 1/2^2: the compression code length is 2 bits.
- 1/2^2 > AR ≥ 1/2^3: the compression code length is 3 bits.
- 1/2^3 > AR ≥ 1/2^4: the compression code length is 4 bits.
- ...
- 1/2^(N-1) > AR ≥ 1/2^N: the compression code length is N bits.
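The range rule above can be sketched as a small function. This is only an illustration under stated assumptions: the function name `compression_code_length` and the `max_length` guard are hypothetical and do not come from the source.

```python
def compression_code_length(appearance_rate: float, max_length: int = 64) -> int:
    """Return the exponent x with 1/2**(x-1) > appearance_rate >= 1/2**x.

    `max_length` is a hypothetical safety bound, not a value from the source.
    """
    if not 0.0 < appearance_rate < 1.0:
        raise ValueError("appearance rate must lie strictly between 0 and 1")
    for x in range(1, max_length + 1):
        # Powers of two are exact in binary floating point, so the
        # boundary case AR == 1/2**x is classified reliably.
        if appearance_rate >= 1.0 / 2 ** x:
            return x
    raise ValueError("appearance rate is too small for max_length")
```

For instance, an appearance rate of 0.3 lies between 1/2 and 1/4 and therefore receives a 2-bit compression code length.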
- the computer 1100 specifies the number of leaves for each compression code length by counting the number of leaves for each compression code length.
- the maximum compression code length is 17 bits.
- the number of leaves is the number of types of character information. Therefore, when the number of leaves of the compression code length of 5 bits is 2, this indicates that there are two character information to which a 5-bit compression code is assigned.
- The total number of pieces of character information (number of leaves) to which a compression code with a compression code length of 11 bits is assigned is 1215, and the number of branches per leaf is 1.
- The total number of pieces of character information (number of leaves) to which a compression code with a compression code length of 6 bits is assigned is 6, and the number of branches per leaf is 32.
- That is, 32 branches are assigned to each of these leaves.
- the leaf structure is a data structure in which character information, a compression code length thereof, and a compression code corresponding to the compression code length are associated with each other.
- the compression code length of the character “0” that appears first is 6 bits, and the compression code is “000000”.
- A pointer to a leaf is a bit string formed by appending, to the compression code in the leaf structure it points to, one of the bit strings enumerated by the number of branches per leaf. For example, since the compression code “000000” assigned to the character “0” of the leaf L1 has a compression code length of 6 bits, the number of branches per leaf L1 is 32.
- That is, the first 6 bits of each pointer to the leaf L1 are the compression code “000000”.
- When the number of branches per leaf is one, there is one pointer to the leaf, and the compression code and the pointer to the leaf are the same bit string.
- The computer 1100 constructs the 2^N-branch nodeless Huffman tree H.
- Using the pointers to the leaves as the root, a 2^N-branch nodeless Huffman tree H that directly specifies the leaf structures is constructed.
- When the compression code string is an 11-bit bit string whose leading 6 bits are “000000”, the 2^N-branch nodeless Huffman tree H points to the leaf structure of the character “0” regardless of which of the 32 possible bit strings the subsequent 5 bits are.
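The lookup described above can be illustrated with a toy root table. This is a minimal sketch, not the patent's implementation: the names `build_root` and `decode`, the use of bit strings written as Python text, and the 3-bit toy code in the test are assumptions made for illustration.

```python
def build_root(leaves, n):
    """Build a root table with 2**n entries, each pointing directly at a leaf.

    `leaves` is a list of (compression_code, character) pairs, where the
    compression code is a prefix-free bit string of at most n bits.
    """
    root = [None] * (2 ** n)
    for code, character in leaves:
        free_bits = n - len(code)
        base = int(code, 2) << free_bits
        # A leaf with a short code occupies 2**free_bits consecutive branches.
        for suffix in range(2 ** free_bits):
            root[base + suffix] = (character, len(code))
    return root


def decode(bits, root, n):
    """Decode a bit string by repeated single-step lookups in the root table."""
    output = []
    position = 0
    while position < len(bits):
        # Read n bits (zero-padded at the end of the stream) and jump to the leaf.
        window = bits[position:position + n].ljust(n, "0")
        character, code_length = root[int(window, 2)]
        output.append(character)
        # Only the bits of the actual compression code are consumed.
        position += code_length
    return "".join(output)
```

Because every branch leads straight to a leaf, each decoded character costs one table access, which is the speed advantage the text attributes to the absence of internal nodes.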
- Details of the construction of the 2^N-branch nodeless Huffman tree H will be described with reference to FIG.
- FIG. 14 is an explanatory diagram showing details of (1) counting the number of appearances.
- the computer 1100 executes three phases: (A) aggregation from the target file group Fs, (B) sorting in descending order of appearance frequency, and (C) extraction up to the rank of the target appearance rate.
- The computer 1100 reads the target file group Fs and counts the appearance frequency (number of appearances) of basic words.
- The computer 1100 refers to the basic word structure and, if a character string matching a basic word in the basic word structure exists in a target file, adds 1 to the appearance frequency of that basic word (initial value 0).
- The basic word structure is a data structure in which basic words are described.
- The computer 1100 sorts the basic word appearance frequency tabulation table in descending order of appearance frequency. That is, the basic words are ranked starting from the one with the highest appearance frequency.
- The computer 1100 reads the target file group Fs and counts the appearance frequency of single characters. Specifically, the computer 1100 adds 1 to the appearance frequency (initial value 0) of a single character each time it appears.
- The computer 1100 sorts the single character appearance frequency tabulation table in descending order of appearance frequency. That is, the single characters are ranked starting from the one with the highest appearance frequency.
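The counting and ranking phases for single characters can be sketched as follows. The function name `rank_single_characters` is hypothetical, target files are modeled as plain Python strings, and ties are broken by character value only to make the result deterministic; none of these details come from the source.

```python
from collections import Counter


def rank_single_characters(files):
    """Count single characters over all files and rank them by frequency.

    Returns a list of (rank, character, appearance_count) tuples, with
    rank 1 assigned to the most frequent character.
    """
    counts = Counter()
    for text in files:
        # Adds 1 to the count of a character each time it appears.
        counts.update(text)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, ch, n) for rank, (ch, n) in enumerate(ordered, start=1)]
```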
- The computer 1100 refers to the basic word appearance frequency tabulation table after sorting (B1) and extracts the basic words ranked up to the target appearance rate Pw. Specifically, the computer 1100 uses the sum of the appearance frequencies of all basic words (total appearance frequency) as the denominator, accumulates the appearance frequencies in descending order from the first-ranked basic word, and uses the accumulated value as the numerator to calculate the appearance rate up to each rank.
- The target appearance rate Pw is 75%.
- The computer 1100 refers to the single character appearance frequency tabulation table after sorting (B2) and extracts the single characters ranked up to the target appearance rate Pc. Specifically, the computer 1100 uses the sum of the appearance frequencies of all single characters (total appearance frequency) as the denominator, accumulates the appearance frequencies in descending order from the first-ranked single character, and uses the accumulated value as the numerator to calculate the appearance rate up to each rank.
- The target appearance rate Pc is 80%.
- Single characters up to the y-th rank are extracted.
- The single characters extracted in (C21) are referred to as “specific single characters (group)” to distinguish them from the original single character group.
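The cumulative extraction can be sketched as below. This is an illustration under assumptions: the name `extract_up_to_rate` is hypothetical, and the choice to stop at the last rank whose cumulative appearance rate does not exceed the target is one possible reading of "up to the target appearance rate".

```python
def extract_up_to_rate(ranked, target_percent):
    """Extract items whose cumulative appearance rate stays within the target.

    `ranked` is a list of (item, frequency) pairs sorted in descending
    order of frequency; `target_percent` is e.g. 80 for Pc = 80%.
    """
    total = sum(frequency for _, frequency in ranked)
    extracted = []
    cumulative = 0
    for item, frequency in ranked:
        cumulative += frequency
        # Integer comparison of cumulative/total against target/100
        # avoids floating-point rounding at the boundary.
        if cumulative * 100 > target_percent * total:
            break
        extracted.append(item)
    return extracted
```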
- Since the single characters in the single character group that are excluded from the specific single character group (hereinafter, “non-specific single characters (group)”) have a lower appearance frequency than each specific single character, their character codes are divided. Specifically, the character code of a non-specific single character is divided into a character code of the upper bits and a character code of the lower bits.
- Each divided character code is expressed by a code from 0x00 to 0xFF.
- The upper-bit character code is the upper divided character code.
- The lower-bit character code is the lower divided character code.
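As a brief illustration of the division, a 16-bit character code can be split with a shift and a mask. The function name `split_character_code` is hypothetical.

```python
def split_character_code(code: int):
    """Split a 16-bit character code into (upper, lower) 8-bit divided codes."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("expected a 16-bit character code")
    upper = code >> 8      # upper divided character code, 0x00 to 0xFF
    lower = code & 0xFF    # lower divided character code, 0x00 to 0xFF
    return upper, lower
```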
- The character information table in FIG. 15 reflects the totaling result of (1) in FIG. 13 and has a rank item, a decompression type item, a code item, a character item, an appearance count item, a total count item, an appearance rate item, a pre-correction occurrence probability item, and a compression code length item. Of these, the information from the rank item through the total count item is obtained from the re-sorting result.
- In the rank item, ranks are written in ascending order, starting from the character information with the largest number of appearances.
- In the decompression type item, the type of the character information is written. “16” indicates a 16-bit code (a single character). “8” indicates an 8-bit divided character code. “Base” indicates a basic word.
- In the code item, the specific single character code or the divided character code is written; it is left blank for basic words. In the character item, the character or the basic word is written; it is left blank for divided character codes. In the appearance count item, the number of appearances of the character information in the target file group Fs is written. In the total count item, the total number of appearances of all character information is written.
- In the appearance rate item, the value obtained by dividing the number of appearances by the total number is written as the appearance rate.
- In the pre-correction occurrence probability item, the occurrence probability corresponding to the appearance rate is written.
- In the compression code length item, the compression code length corresponding to the occurrence probability, that is, the exponent y of the occurrence probability 1/2^y, is written.
- The number of leaves (total number of types of character information) for each compression code length in the character information table of FIG. 15 is counted as the pre-correction number of leaves in FIG.
- Correction A consolidates the leaves assigned to compression code lengths equal to or greater than the upper limit length N of the compression code length (that is, the exponent N of the maximum number of branches 2^N of the 2^N-branch nodeless Huffman tree H) into the upper limit length N of the compression code length.
- The computer 1100 determines whether the total occurrence probability is 1 or less.
- Correction B updates the number of leaves without changing the group of compression code lengths (5 to 12 bits) of correction A. Specifically, it is performed when the total occurrence probability after correction A does not satisfy the condition of being at least the threshold t and at most 1. More specifically, there are two types of correction B.
- Correction B+: when the total occurrence probability is less than the threshold t, this correction increases the total occurrence probability until it reaches the maximum value not exceeding 1, for example, until it converges to the maximum asymptotic value.
- Correction B-: when the total occurrence probability is greater than 1, this correction reduces the total occurrence probability to 1 or less and continues until the maximum value not exceeding 1 is obtained, for example, until it converges to the maximum asymptotic value.
- In the first round of correction B- (correction B-1), the number of leaves is updated by dividing the number of leaves of correction A for each compression code length by the total occurrence probability of the previous correction (in this case, correction A: 1.146). The fractional part may be rounded down or rounded off.
- For the upper limit length N of the compression code length, the number of leaves is obtained by subtracting the total number of leaves of the other compression code lengths in correction B-1 from the total number of leaves (1305).
- The computer 1100 determines whether the total occurrence probability after correction B-1 has converged to the maximum asymptotic value not exceeding 1. If it has not converged, the process proceeds to the second round of correction B- (correction B-2). If it has converged, the process does not proceed to correction B-2, and the number of leaves for each compression code length at that point is adopted. Since the total occurrence probability “1.042” updated by correction B-1 is greater than 1, it has not converged to the maximum asymptotic value, and the process proceeds to correction B-2.
- In correction B-2, the number of leaves is updated by dividing the number of leaves of correction B-1 for each compression code length by the total occurrence probability of the previous correction (in this case, correction B-1: 1.042). The fractional part may be rounded down or rounded off.
- For the upper limit length N of the compression code length, the number of leaves is obtained by subtracting the total number of leaves of the other compression code lengths in correction B-2 from the total number of leaves (1305). In this case, the result is 1215.
- The computer 1100 obtains the total occurrence probability after correction B-2 by the same calculation as for correction B-1 and determines whether it has converged to the maximum asymptotic value not exceeding 1. If it has not converged, the process proceeds to the third round of correction B- (correction B-3). If it has converged, the process does not proceed to correction B-3, and the number of leaves for each compression code length at that point is adopted. The total occurrence probability “0.982” updated by correction B-2 is 1 or less, but since it is not yet known whether it has converged to the maximum asymptotic value, the process proceeds to correction B-3.
- In correction B-3, the number of leaves is updated by dividing the number of leaves of correction B-2 for each compression code length by the total occurrence probability of the previous correction (in this case, correction B-2: 0.982). The fractional part may be rounded down or rounded off.
- For the upper limit length N of the compression code length, the number of leaves is obtained by subtracting the total number of leaves of the other compression code lengths in correction B-3 from the total number of leaves (1305). In this case, the result is 1215.
- The computer 1100 obtains the total occurrence probability after correction B-3 by the same calculation as for correction B-2 and determines whether it has converged to the maximum asymptotic value not exceeding 1. If it has not converged, the process proceeds to the fourth round of correction B- (correction B-4). If it has converged, the process does not proceed to correction B-4, and the number of leaves for each compression code length at that point is adopted.
- The total occurrence probability “0.982” updated by correction B-3 is the same value as the total occurrence probability “0.982” updated by correction B-2. That is, the number of leaves of each compression code length in correction B-3 is the same as in correction B-2. In this case, the computer 1100 determines that the total occurrence probability has converged to the maximum asymptotic value and fixes the number of leaves.
- Correction B- is thus repeated until the number of leaves is fixed.
- In this example, the number of leaves for each compression code length is fixed by correction B-3.
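The repeated division described for correction B- can be sketched as a fixed-point loop. This is a minimal sketch under stated assumptions: the names `total_occurrence_probability` and `correct_b_minus` and the `max_rounds` guard are hypothetical, fractions are rounded down (one of the two options the text allows), and the sample leaf counts in the usage note are invented for illustration because the leaf counts of correction A are not given in this passage.

```python
import math


def total_occurrence_probability(leaves):
    """Sum of (number of leaves) / 2**(compression code length)."""
    return sum(count / 2 ** length for length, count in leaves.items())


def correct_b_minus(leaves, max_rounds=100):
    """Repeat correction B- until the leaf counts stop changing.

    `leaves` maps compression code length to number of leaves. The
    largest key is treated as the upper limit length N, whose leaf
    count absorbs the remainder so that the total stays constant.
    """
    upper_limit = max(leaves)
    total_leaves = sum(leaves.values())
    current = dict(leaves)
    for _ in range(max_rounds):
        total = total_occurrence_probability(current)
        # Divide each count by the previous total, rounding down.
        updated = {
            length: math.floor(count / total)
            for length, count in current.items()
            if length != upper_limit
        }
        updated[upper_limit] = total_leaves - sum(updated.values())
        # Converged: same counts as the previous round and total <= 1.
        if updated == current and total <= 1:
            return current
        current = updated
    raise RuntimeError("correction B- did not converge within max_rounds")
```

With the invented input `{6: 7, 7: 21, 8: 27, 9: 27, 10: 23, 11: 1200}`, the total occurrence probability starts above 1, drops below 1 after two rounds, and the third round leaves the counts unchanged, mirroring the B-1, B-2, B-3 progression in the text.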
- the computer 1100 calculates the number of branches per leaf for each compression code length.
- the number of branches per leaf will be assigned as 5 and 26 .
- The subtotal of the number of branches for each compression code length is the product of the number of branches per leaf and the fixed number of leaves for that compression code length.
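The branch arithmetic can be checked with a short sketch. The leaf counts for 6, 7, 9, 10, and 11 bits are the ones quoted in this description; the count of 23 for 8 bits is not stated here and is derived as an assumption from the total of 1305 leaves. The names `branches_per_leaf` and `branch_subtotals` are hypothetical.

```python
N = 11  # upper limit length of the compression code length

# Leaves per compression code length. The 8-bit entry is derived from
# 1305 - (6 + 18 + 23 + 20 + 1215) and is an assumption of this sketch.
LEAVES = {6: 6, 7: 18, 8: 23, 9: 23, 10: 20, 11: 1215}


def branches_per_leaf(length, n=N):
    """A leaf with a shorter code occupies 2**(n - length) branches."""
    return 2 ** (n - length)


def branch_subtotals(leaves, n=N):
    """Branches per leaf multiplied by the number of leaves, per length."""
    return {
        length: branches_per_leaf(length, n) * count
        for length, count in leaves.items()
    }


SUBTOTALS = branch_subtotals(LEAVES)
TOTAL_BRANCHES = sum(SUBTOTALS.values())
```

Under these counts the subtotals add up to 2011 of the 2048 available branches, and 2011/2048 rounds to 0.982, which agrees with the total occurrence probability reported for corrections B-2 and B-3.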
- FIG. 17 is an explanatory diagram showing a correction result for each piece of character information.
- The correction results of correction A and corrections B-1 to B-2 are added to the character information table.
- Shorter compression code lengths are assigned starting from the first-ranked character information in the rank item.
- For example, the number of leaves is 6 when the compression code length is 6 bits, 18 when it is 7 bits, ..., and 1215 when it is 11 bits.
- Therefore, a compression code length of 6 bits is assigned to the character information ranked 1st to 6th (6 leaves), 7 bits to the character information ranked 7th to 24th (18 leaves), ..., and 11 bits to the character information ranked 91st to 1305th (1215 leaves).
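The rank-to-length assignment can be sketched as follows. The function name `assign_lengths_by_rank` is hypothetical, and the leaf count of 23 for the 8-bit length is an assumption derived from the total of 1305 leaves rather than a value stated in this passage.

```python
# Leaves per compression code length; the 8-bit entry is an assumption
# derived from the total of 1305 leaves.
LEAVES_PER_LENGTH = {6: 6, 7: 18, 8: 23, 9: 23, 10: 20, 11: 1215}


def assign_lengths_by_rank(leaves):
    """Return a list whose element at index r-1 is the length for rank r.

    The shortest lengths go to the highest-ranked character information.
    """
    lengths = []
    for length in sorted(leaves):
        lengths.extend([length] * leaves[length])
    return lengths


LENGTH_BY_RANK = assign_lengths_by_rank(LEAVES_PER_LENGTH)
```

With these counts, ranks 1 to 6 receive 6 bits, ranks 7 to 24 receive 7 bits, and ranks 91 to 1305 receive 11 bits, matching the ranges given above.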
- The computer 1100 generates the leaf structures by assigning a compression code to each piece of character information based on the character information, the compression code length assigned to it, and the number of leaves for each compression code length. For example, since the single character “0”, which ranks first in appearance rate, is assigned a compression code length of 6 bits, its compression code is “000000”. Therefore, the structure of the leaf L1 containing the compression code “000000”, the compression code length “6”, and the character information “0” is generated.
- In the example above, the compression code length ranges from 5 bits to 11 bits.
- Because the compression code map M of bi-gram character strings may be divided, the compression code lengths may be corrected to an even number of bits. Specifically, for example, character information with a compression code length of 5 or 7 bits is corrected to 6 bits, 9 bits to 8 bits, and 11 bits to 10 bits.
- FIG. 18 shows a pointer to a leaf when the upper limit N of the compression code length is 11 bits.
- N = 11.
- Since the number of leaves with a compression code length of 6 bits is 6, the compression codes “000000” to “000101” are assigned.
- In each pointer to a leaf, the first 6 bits are the compression code and the subsequent 5 bits take one of 32 bit strings. Therefore, 32 pointers to the leaf are generated for each compression code with a compression code length of 6 bits.
- Since the compression code length is 7 bits and the number of leaves is 18, the compression codes “0001100” to “0011111” are assigned.
- In each pointer to a leaf, the first 7 bits are the compression code and the subsequent 4 bits take one of 16 bit strings. Therefore, 16 pointers to the leaf are generated for each compression code with a compression code length of 7 bits.
- Since the compression code length is 9 bits and the number of leaves is 23, the compression codes “010101110” to “011000100” are assigned.
- In each pointer to a leaf, the first 9 bits are the compression code and the subsequent 2 bits take one of four bit strings. Therefore, four pointers to the leaf are generated for each compression code with a compression code length of 9 bits.
- Since the compression code length is 10 bits and the number of leaves is 20, the compression codes “01100001001” to “0110011101” are assigned.
- In each pointer to a leaf, the first 10 bits are the compression code and the subsequent 1 bit takes one of two bit strings. Therefore, two pointers to the leaf are generated for each compression code with a compression code length of 10 bits.
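The enumeration of pointers to a leaf can be sketched as below. The function name `leaf_pointers` is hypothetical, and compression codes are modeled as text bit strings for readability.

```python
def leaf_pointers(compression_code: str, n: int = 11):
    """Enumerate every n-bit pointer that leads to the leaf with this code.

    The leading bits are the compression code itself; the remaining
    n - len(code) bits run through all possible bit strings.
    """
    free_bits = n - len(compression_code)
    if free_bits < 0:
        raise ValueError("compression code is longer than the upper limit n")
    if free_bits == 0:
        # The pointer and the compression code are the same bit string.
        return [compression_code]
    return [
        compression_code + format(suffix, f"0{free_bits}b")
        for suffix in range(2 ** free_bits)
    ]
```

For the 6-bit code “000000” this yields 32 pointers, from “00000000000” to “00000011111”.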
- the root structure stores a pointer to a leaf.
- the pointer to the leaf can specify the structure of the leaf pointed to.
- 32 pointers to the leaf are generated as shown in FIG. 18 for the leaf structure in which the compression code having a compression code length of 6 bits is stored. Accordingly, for the structure of the leaf L1, 32 pointers L1P (1) to L1P (32) to the leaf L1 are stored in the root structure. The same applies to the structure of the leaf L2 to the structure of the leaf L6.
- the structures after the leaf L7 are as shown in FIG.
- FIG. 20 is an explanatory view showing a leaf structure.
- the leaf structure is a data structure having a first area to a fourth area.
- a compression code and a compression code length thereof are stored in the first area.
- The second area stores a leaf label, a decompression type (see FIG. 15), and an appearance rate (see FIG. 15).
- The third area stores, according to the decompression type, a 16-bit character code of a specific single character, an 8-bit divided character code obtained by dividing the character code of a non-specific single character, or a pointer to a basic word.
- A basic word in the basic word structure is specified by the pointer to the basic word.
- A collation flag is also stored. The collation flag is “0” by default. When it is “0”, the decompressed character is written to the decompression buffer as is; when it is “1”, the decompressed character is enclosed between a <color> tag and a </color> tag and then written to the decompression buffer.
- The appearance rate of the stored character information and the appearance rate area of the appearance map are also stored.
- the appearance rate is the appearance rate of the character information shown in FIG.
- the appearance rate area of the appearance map will be described with reference to FIGS. 55 and 56.
- The code type and the code classification are stored in the third area.
- The code type is information identifying whether a character code corresponds to a numeral, an alphabetic character, a special symbol, katakana, hiragana, or kanji, or whether it is a pointer to a basic word.
- The code classification is information identifying whether the character code is 16 bits or 8 bits. For a 16-bit character code or a reserved word, “1” is assigned as the code classification; for an 8-bit divided character code, “0” is assigned.
- The information in the first area through the fourth area is stored in the construction process (step S3905) described later.
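A leaf structure along these lines could be modeled as a small record. This is only a sketch: the class name `Leaf`, its field names, the flat layout (the four areas are not modeled separately), and the helper `write_decompressed` are assumptions for illustration, and the appearance map area is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Leaf:
    compression_code: str            # e.g. "000000"
    compression_code_length: int     # e.g. 6
    label: str                       # leaf label, e.g. "L1"
    decompression_type: str          # "16", "8", or "Base"
    appearance_rate: float
    character_code: Optional[int] = None   # 16-bit or 8-bit divided code
    basic_word: Optional[str] = None       # stands in for the pointer to a basic word
    code_classification: int = 1           # 1: 16-bit or word, 0: 8-bit divided code
    collation_flag: int = 0                # "0" by default


def write_decompressed(leaf: Leaf, text: str, buffer: List[str]) -> None:
    """Write decompressed text, wrapped in color tags when the flag is 1."""
    if leaf.collation_flag == 1:
        buffer.append("<color>" + text + "</color>")
    else:
        buffer.append(text)
```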
- FIG. 21 is an explanatory diagram showing a structure of specific single characters.
- The specific single character structure 2100 is a data structure that stores a specific single character code e# and a pointer to its leaf L#.
- The computer 1100 stores the specific single character code e# in the specific single character structure 2100 when the totaling result from the target file group Fs is obtained. Then, when the 2^N-branch nodeless Huffman tree H is constructed, the computer 1100 stores, in each leaf structure of the 2^N-branch nodeless Huffman tree H, a pointer to the specific single character code e# in the specific single character structure 2100 that corresponds to the compression code stored in that leaf structure.
- The computer 1100 also stores the pointer to the leaf corresponding to each specific single character code e# in the 2^N-branch nodeless Huffman tree H in association with the corresponding specific single character code e# in the specific single character structure 2100. The specific single character structure 2100 is thereby generated.
- FIG. 22 is an explanatory diagram of a divided character code structure.
- The divided character code structure 2200 stores a divided character code and a pointer to its leaf L#. Specifically, for example, when the computer 1100 obtains the totaling result from the target file group Fs, it stores the divided character code in the divided character code structure 2200. Then, when the 2^N-branch nodeless Huffman tree H is constructed, the computer 1100 stores, in each leaf structure of the 2^N-branch nodeless Huffman tree H, a pointer to the divided character code in the divided character code structure 2200 that corresponds to the compression code stored in that leaf structure.
- The computer 1100 also stores the pointer to the leaf corresponding to each divided character code in the 2^N-branch nodeless Huffman tree H in association with the corresponding divided character code in the divided character code structure 2200. The divided character code structure 2200 is thereby generated.
- FIG. 23 is an explanatory diagram of a basic word structure.
- The basic word structure 2300 is a data structure that stores a basic word and a pointer to its leaf L#.
- Basic words are stored in the basic word structure 2300 in advance.
- When the 2^N-branch nodeless Huffman tree H is constructed, the computer 1100 stores, in each leaf structure of the 2^N-branch nodeless Huffman tree H, a pointer to the basic word in the basic word structure 2300 that corresponds to the compression code stored in that leaf structure.
- The computer 1100 also stores the pointer to the leaf corresponding to each basic word in the 2^N-branch nodeless Huffman tree H in association with the corresponding basic word in the basic word structure 2300.
- The creation unit 1104 creates a compression code map Ms for single characters, a compression code map Ms for upper divided character codes, a compression code map Ms for lower divided character codes, a compression code map Ms for basic words, and a compression code map Ms for bi-gram character strings.
- A detailed example of creating the compression code map Ms for single characters, the compression code map Ms for upper divided character codes, the compression code map Ms for lower divided character codes, and the compression code map Ms for bi-gram character strings is described below.
- The creation of the compression code map Ms for basic words is omitted because it is performed in the same manner as for the compression code map Ms for single characters.
- FIG. 24 is an explanatory diagram showing an example of generating the compression code map Ms.
- a character string “Ryoma has been removed” is described in the target file Fi.
- (A) First, the first character “dragon” is the target character. Since the target character “dragon” is a specific single character, the compression code of the specific single character “dragon” is obtained by accessing the 2^N-branch nodeless Huffman tree H, and the appearance map of the specific single character “dragon” is identified. If the appearance map has not yet been generated, an appearance map of the specific single character “dragon” is generated, with the compression code of the specific single character “dragon” as its pointer and with the bit string indicating presence in each target file set to all 0s. Then, the bit of the target file Fi is turned ON (“0” → “1”) in the appearance map of the specific single character “dragon”.
- (B) Next, the target character is shifted by 1 gram, and the target character becomes “horse”. Since the target character “horse” is a specific single character, the compression code of the specific single character “horse” is obtained by accessing the 2^N-branch nodeless Huffman tree H, and the appearance map of the specific single character “horse” is identified. If the appearance map has not yet been generated, an appearance map of the specific single character “horse” is generated, with the compression code of the specific single character “horse” as its pointer and with the bit string indicating presence in each target file set to all 0s. Then, the bit of the target file Fi is turned ON (“0” → “1”) in the appearance map of the specific single character “horse”.
- (C) Next, the target character is shifted by 1 gram, and the target character becomes “ha”.
- The target character “ha” is processed in the same manner as in (B), so that the bit of the target file Fi is turned ON (“0” → “1”) in the appearance map of the specific single character “ha”. Similarly, the bit of the target file Fi is turned ON (“0” → “1”) in the appearance map of the bi-gram character string “Hamaha”.
- (D) Next, the target character is shifted by 1 gram, and the target character becomes “O”. Since the target character “O” is not a specific single character, its character code “0x8131” is divided into the upper divided character code “0x81” and the lower divided character code “0x31”. The target character is then set to the upper divided character code “0x81”. The upper divided character code “0x81” is processed in the same manner as a specific single character, so that the bit of the target file Fi is turned ON (“0” → “1”) in the appearance map of the upper divided character code “0x81”. Similarly, the bit of the target file Fi is turned ON (“0” → “1”) in the appearance map of the bi-gram character string “ha 0x81”.
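The bit-setting walk through a target file can be sketched in simplified form. This is not the patent's procedure in full: every character is treated as a specific single character, the maps are keyed by the characters themselves rather than by compression codes, divided character codes are omitted, and the names `build_appearance_maps` and `candidate_files` are hypothetical. Each appearance map is modeled as an integer whose bit i is ON when the key appears in file number i.

```python
from collections import defaultdict


def build_appearance_maps(files):
    """Build appearance maps for single characters and bi-gram strings."""
    maps = defaultdict(int)  # a missing map starts with all bits 0
    for file_number, text in enumerate(files):
        bit = 1 << file_number
        for i, character in enumerate(text):
            maps[character] |= bit            # single-character map
            if i + 1 < len(text):
                maps[text[i:i + 2]] |= bit    # bi-gram map
    return dict(maps)


def candidate_files(maps, keys, n_files):
    """AND the appearance maps of all keys to narrow down candidate files."""
    result = (1 << n_files) - 1
    for key in keys:
        result &= maps.get(key, 0)
    return [i for i in range(n_files) if (result >> i) & 1]
```

ANDing the maps of the characters and bi-grams of a search string leaves ON only the bits of files that can contain it, so only those files need to be opened.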
- FIG. 25 is a flowchart showing an example of a compression code map creation processing procedure by the creation unit 1104.
- the computer 1100 executes tabulation processing (step S2501), map allocation number determination processing (step S2502), recounting processing (step S2503), Huffman tree generation processing (step S2504), and map creation processing (step S2505).
- The computer 1100 executes the tabulation process (step S2501) through the recounting process (step S2503) by means of the totaling unit 1101.
- The first generation unit 1102 executes the Huffman tree generation process (step S2504), and the creation unit 1104 executes the map creation process (step S2505).
- The tabulation process (step S2501) counts the number of appearances (also referred to as appearance frequency) of single characters and basic words in the target file group Fs.
- The map allocation number determination process (step S2502) determines the map allocation number for the single characters and basic words counted in the tabulation process (step S2501). The single characters and basic words whose appearance ranks fall within the map allocation number become the specific single characters and the basic words, respectively.
- The recounting process (step S2503) divides the non-specific single characters, that is, the single characters other than the specific single characters, into upper divided character codes and lower divided character codes and counts the number of appearances of each. The recounting process (step S2503) also counts the number of appearances of bi-gram character strings.
- The Huffman tree generation process (step S2504) generates the 2^N-branch nodeless Huffman tree H as shown in FIGS.
- The map creation process (step S2505) generates a compression code map M for the specific single characters, basic words, upper divided character codes, lower divided character codes, and bi-gram character strings.
- FIG. 26 is a flowchart illustrating a detailed processing procedure example of the aggregation processing (step S2501) illustrated in FIG.
- If i > n is not satisfied (step S2604: NO), the computer 1100 increments i (step S2605) and returns to step S2602. On the other hand, if i > n is satisfied (step S2604: YES), the computer 1100 ends the aggregation process (step S2501) and proceeds to the map allocation number determination process (step S2502) shown in FIG. 25. According to this aggregation process (step S2501), the tabulation process of the target file Fi (step S2603) can be executed for each target file Fi.
- FIG. 27 is a flowchart illustrating a detailed processing procedure example of the target file Fi counting process (step S2603) illustrated in FIG.
- the computer 1100 sets the target character as the first character of the target file Fi (step S2701), and executes basic word tabulation processing (step S2702). Details of the basic word totaling process (step S2702) will be described with reference to FIG. Thereafter, the computer 1100 increments the appearance count of the target character by 1 in the character appearance frequency tabulation table (step S2703).
- FIG. 28 is an explanatory diagram showing a character appearance frequency tabulation table.
- the character appearance frequency totaling table 2800 is stored in a storage device such as the RAM 903 and the magnetic disk 905, and increases the number of appearances by one each time a corresponding character appears.
- The computer 1100 determines whether the target character is the last character of the target file Fi (step S2704). If the target character is not the last character of the target file Fi (step S2704: No), the computer 1100 shifts the target character by one character toward the end (step S2705) and returns to step S2702.
- On the other hand, if the target character is the last character of the target file Fi (step S2704: Yes), the computer 1100 ends the tabulation process of the target file Fi (step S2603) and proceeds to step S2604. According to the tabulation process of the target file Fi (step S2603), the appearance frequencies of the basic words and single characters existing in the target file group Fs can be totaled.
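As a rough illustration of the tabulation loop described above, the following Python sketch counts single-character appearances over a file group. The function name `tabulate_characters` and the use of in-memory strings as target files are assumptions made for illustration, and the basic word counting of step S2702 is left out.

```python
from collections import Counter


def tabulate_characters(files):
    """Count how often each single character appears in a group of files.

    `files` is a list of strings, one per target file (a simplification:
    real target files would be read from storage).
    """
    table = Counter()              # stands in for the character appearance frequency table
    for text in files:             # outer loop over the target files Fi
        for ch in text:            # shift the target character one position at a time
            table[ch] += 1         # increment the appearance count of the target character
    return table
```

Sorting the resulting table by count gives the appearance ranks that the later map allocation step relies on.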
- FIG. 29 is a flowchart showing a detailed processing procedure example of the basic word totaling process (step S2702) shown in FIG.
- the computer 1100 executes the longest match search process (step S2901), and determines whether there is a longest matching basic word (step S2902). Details of the longest match search process (step S2901) will be described with reference to FIG. If there is a longest matching basic word (step S2902: Yes), the computer 1100 increments the longest matching basic word appearance count in the basic word appearance frequency tabulation table by 1 (step S2903), and the process proceeds to step S2703.
- FIG. 30 is an explanatory diagram showing a basic word appearance frequency tabulation table.
- the basic word appearance frequency totaling table 3000 is stored in a storage device such as the RAM 903 and the magnetic disk 905, and increases the number of appearances by one each time the corresponding basic word appears.
- On the other hand, if there is no longest matching basic word (step S2902: No), the process proceeds to step S2703. The basic word totaling process (step S2702) thereby ends. According to this basic word totaling process (step S2702), basic words are counted through the longest match search process (step S2901), so basic words with longer character strings are counted preferentially.
- FIG. 31 is a flowchart showing a detailed processing procedure of the longest match search process (step S2901) shown in FIG.
- the computer 1100 performs a binary search for a basic word that matches forward with the target character string from the target character to the c-th character (step S3102). Then, the computer 1100 determines whether or not there is a basic word by the search (step S3103). When the basic word is not hit by the binary search (step S3103: No), the process proceeds to step S3106.
- On the other hand, when a basic word is hit by the binary search (step S3103: Yes), the computer 1100 determines whether the hit basic word completely matches the target character string (step S3104). If they do not completely match (step S3104: No), the process proceeds to step S3106. If they completely match (step S3104: Yes), the computer 1100 holds the hit basic word in the storage device as a longest match candidate (step S3105) and proceeds to step S3106.
- In step S3106, the computer 1100 determines whether the binary search has been completed for the target character string (step S3106). Specifically, the computer 1100 determines whether the binary search has been performed up to the last basic word. If the binary search has not ended (step S3106: No), the computer 1100 returns to step S3102 and continues until the binary search ends.
- On the other hand, if the binary search has ended (step S3106: Yes), the computer 1100 determines whether the c-th character is the last character of the target file Fi (step S3107).
- If the c-th character is the last character of the target file Fi (step S3107: Yes), the process proceeds to step S3110.
- If the c-th character is not the last character of the target file Fi (step S3107: No), the computer 1100 determines whether c > cmax is satisfied (step S3108).
- Here, cmax is a preset value that sets the upper limit on the number of characters in the target character string.
- If c > cmax is not satisfied (step S3108: NO), the computer 1100 increments c (step S3109) and returns to step S3102.
- If c > cmax is satisfied (step S3108: YES), the computer 1100 determines whether there is a longest match candidate (step S3110). Specifically, the computer 1100 determines whether at least one longest match candidate has been held in the storage device in step S3105.
- If there is a longest match candidate (step S3110: Yes), the computer 1100 determines the longest character string among the longest match candidates as the longest matching basic word (step S3111), and the process proceeds to step S2902. On the other hand, if there is no longest match candidate (step S3110: No), the process proceeds to step S2902. The longest match search process (step S2901) thereby ends. According to the longest match search process (step S2901), the longest of the character strings that completely match a basic word in the basic word structure can be retrieved as the basic word.
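The longest match search can be sketched in Python as follows. This is a simplified variant under stated assumptions, not the flowchart itself: the basic words are held in a sorted list, `bisect_left` plays the role of the binary search for a forward match, and the loop stops early once no basic word starts with the current prefix (the flowchart instead continues until c exceeds cmax). The names `longest_match`, `basic_words`, and `cmax` as a function parameter are hypothetical.

```python
from bisect import bisect_left


def longest_match(text, start, basic_words, cmax):
    """Return the longest basic word beginning at text[start], or None.

    basic_words must be a sorted list of strings; cmax caps the length of
    the target character string that is tried.
    """
    best = None
    for c in range(1, cmax + 1):
        if start + c > len(text):
            break                                  # ran past the last character of the file
        candidate = text[start:start + c]          # target character string of length c
        i = bisect_left(basic_words, candidate)    # binary search for a forward match
        if i == len(basic_words) or not basic_words[i].startswith(candidate):
            break                                  # no basic word has this prefix
        if basic_words[i] == candidate:
            best = candidate                       # complete match: hold as longest candidate
    return best
```

Because each complete match overwrites the previous candidate and candidates only grow in length, the value held at the end is the longest matching basic word.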
- FIG. 32 is a flowchart showing a detailed processing procedure example of the map allocation number determination processing (step S2502) shown in FIG.
- Aw is the total number of appearances of the aggregated basic words.
- If the above formula (1) is not satisfied (step S3204: NO), the computer 1100 increments the appearance rank Rw (step S3205) and returns to step S3203. That is, the appearance rank Rw is lowered repeatedly until the above formula (1) is satisfied.
- The map allocation number Nw is the number of basic words allocated to the basic word appearance map generated in the map creation process (step S2505), and corresponds to the number of records (number of lines) in the basic word appearance map.
- If the above formula (2) is not satisfied (step S3209: NO), the computer 1100 increments the appearance rank Rc (step S3210) and returns to step S3208. That is, the appearance rank Rc is lowered repeatedly until the above formula (2) is satisfied.
- If the above formula (2) is satisfied (step S3209: YES), the map allocation number Nc is determined.
- The map allocation number Nc is the number of specific single characters allocated to the specific single character appearance map generated in the map creation process (step S2505), and corresponds to the number of records (number of lines) of the specific single character appearance map. Thereafter, the map allocation number determination process (step S2502) ends, and the process proceeds to the recounting process (step S2503).
- According to the map allocation number determination process (step S2502), the basic word appearance map can be generated in the map creation process (step S2505) for only the number of basic words corresponding to the target appearance rate Pw. Therefore, it is not necessary to perform map assignment for all basic words, and the map size can be optimized because it is determined according to the target appearance rate Pw.
- Likewise, the compression code map M of specific single characters can be generated in the map creation process (step S2505) for only the number of single characters corresponding to the target appearance rate Pc. Therefore, it is not necessary to perform map assignment for all single characters, and the map size can be optimized because it is determined according to the target appearance rate Pc.
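The rank-lowering loop can be illustrated with a short Python sketch. Formulas (1) and (2) are not reproduced in this passage, so the stopping condition used here is an assumption: the rank is lowered until the cumulative number of appearances of the top-ranked items reaches the target appearance rate times the total number of appearances. The function name `map_allocation_number` is hypothetical.

```python
def map_allocation_number(counts, target_rate):
    """Return the smallest rank whose top-ranked items cover target_rate of all appearances.

    counts maps each item (basic word or single character) to its number of
    appearances; target_rate is a fraction such as 0.85 (standing in for Pw or Pc).
    The stopping condition is an assumed reading of formulas (1) and (2).
    """
    ranked = sorted(counts.values(), reverse=True)   # descending appearance order
    total = sum(ranked)                              # total appearances (Aw for basic words)
    cumulative = 0
    for rank, count in enumerate(ranked, start=1):   # lower the rank one step at a time
        cumulative += count
        if cumulative >= target_rate * total:
            return rank                              # map allocation number (Nw or Nc)
    return len(ranked)
```

Under this reading, a lower target rate yields a smaller allocation number and therefore fewer rows in the corresponding appearance map.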
- FIG. 33 is a flowchart illustrating a detailed processing procedure example of the recounting process (step S2503) illustrated in FIG.
- If i > n is not satisfied (step S3304: NO), the computer 1100 increments i (step S3305) and returns to step S3302. On the other hand, if i > n is satisfied (step S3304: YES), the computer 1100 ends the recounting process (step S2503) and proceeds to the Huffman tree generation process (step S2504) shown in FIG. 25. According to the recounting process (step S2503), the recounting process of the target file Fi (step S3303) can be executed for each target file Fi.
- FIG. 34 is a flowchart illustrating a detailed processing procedure example of the recounting process of the target file Fi (step S3303).
- The computer 1100 sets the target character to the first character of the target file Fi (step S3401), and determines whether the target character is a specific single character (step S3402). If it is a specific single character (step S3402: Yes), the process proceeds to step S3404 without dividing the character code.
- On the other hand, if it is not a specific single character (step S3402: NO), the computer 1100 divides the character code of the target character into an upper divided character code and a lower divided character code (step S3403). Then, the process proceeds to step S3404.
- In step S3404, the computer 1100 adds 1, in the upper divided character code appearance frequency tabulation table, to the number of appearances of the divided character code identical to the upper divided character code obtained in step S3403 (step S3404).
- FIG. 35 is an explanatory diagram of an upper divided character code appearance frequency tabulation table.
- the upper divided character code appearance frequency totaling table 3500 is stored in a storage device such as the RAM 903 and the magnetic disk 905, and increases the number of appearances by one each time the corresponding upper divided character code appears.
- The computer 1100 adds 1, in the lower divided character code appearance frequency tabulation table, to the number of appearances of the divided character code identical to the lower divided character code obtained in step S3403 (step S3405).
- FIG. 36 is an explanatory diagram showing a lower divided character code appearance frequency tabulation table.
- the lower divided character code appearance frequency totaling table 3600 is stored in a storage device such as the RAM 903 and the magnetic disk 905, and increases the number of appearances by one each time the corresponding lower divided character code appears.
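The division and counting of steps S3403 to S3405 can be sketched as follows. The sketch assumes each character has a 16-bit character code (here taken from Python's `ord`, so characters outside the Basic Multilingual Plane are not handled) and splits it into an upper and a lower 8-bit code. The function name `count_divided_codes` is hypothetical.

```python
from collections import Counter


def count_divided_codes(text, specific_chars):
    """Count upper and lower divided character codes of non-specific characters.

    Assumes 16-bit character codes: the upper divided code is the high byte
    and the lower divided code is the low byte.
    """
    upper_table = Counter()   # stands in for the upper divided code frequency table
    lower_table = Counter()   # stands in for the lower divided code frequency table
    for ch in text:
        if ch in specific_chars:
            continue                      # specific single characters are not divided
        code = ord(ch)
        upper_table[code >> 8] += 1       # upper divided character code
        lower_table[code & 0xFF] += 1     # lower divided character code
    return upper_table, lower_table
```

Since each table has at most 256 possible keys, its size stays bounded no matter how many distinct rare characters appear.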
- the computer 1100 executes a bi-gram character string specifying process (step S3406).
- In the bi-gram character string specifying process (step S3406), a bi-gram character string having the target character as its base point is specified. Details of the bi-gram character string specifying process (step S3406) will be described with reference to FIG.
- The computer 1100 adds 1, in the bi-gram character string appearance frequency tabulation table, to the number of appearances of the bi-gram character string specified in the bi-gram character string specifying process (step S3406) (step S3407).
- FIG. 37 is a flowchart showing a detailed processing procedure of the bi-gram character string specifying process (step S3406) shown in FIG.
- The computer 1100 determines whether the target character has been divided (step S3701). That is, the computer 1100 determines whether the target character is a divided character code. If it has not been divided (step S3701: NO), that is, if it is a single character, the computer 1100 determines whether there is a previous character (step S3702).
- If there is a previous character (step S3702: Yes), the computer 1100 determines whether the previous character has been divided (step S3703). That is, the computer 1100 determines whether the previous character is a divided character code.
- If the previous character has not been divided (step S3703: No), that is, if it is a single character, the computer 1100 determines the character string consisting of the single character immediately preceding the target character and the target character (single character) as a bi-gram character string (step S3704). The process then proceeds to step S3407.
- On the other hand, if the previous character has been divided (step S3703: YES), the computer 1100 determines the character string consisting of the lower divided character code, which is the previous character, and the target character as a bi-gram character string. The process then proceeds to step S3407.
- If there is no previous character in step S3702 (step S3702: No), only the target character is available, so the process proceeds to step S3407 without determining a bi-gram character string.
- If the target character has been divided in step S3701 (step S3701: YES), that is, if it is a divided character code, the computer 1100 determines whether the divided character code is an upper divided character code or a lower divided character code (step S3706).
- If it is an upper divided character code (step S3706: upper), the computer 1100 determines whether the previous character has been divided (step S3707). That is, it determines whether the previous character is a divided character code. If the previous character has not been divided (step S3707: No), that is, if it is a single character, the computer 1100 determines the character string consisting of the single character preceding the target character and the upper divided character code divided from the target character as a bi-gram character string (step S3708). The process then proceeds to step S3407.
- On the other hand, if the previous character has been divided (step S3707: Yes), the computer 1100 determines the character string consisting of the lower divided character code, which is the previous character, and the upper divided character code divided from the target character as a bi-gram character string (step S3709). The process then proceeds to step S3407.
- If it is a lower divided character code (step S3706: lower), the computer 1100 determines the character string consisting of the upper divided character code and the lower divided character code divided from the target character as a bi-gram character string (step S3710). The process then proceeds to step S3407.
- According to this bi-gram character string specifying process (step S3406), a bi-gram character string can be specified even when the target character has been divided. Further, since the bi-gram character string is specified along with the one-character shift, it can be generated simultaneously with the compression code map M of the basic words and the compression code map M of the specific single characters.
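The case analysis above can be condensed into a short Python sketch. Rather than branching on each case, the sketch first expands the text into a sequence of grams (specific single characters stay as they are, other characters become an upper and a lower divided code) and then pairs each gram with its predecessor. This restructuring, the 16-bit code assumption, the tuple tags `"upper"` and `"lower"`, and the function names are all assumptions made for illustration.

```python
def to_grams(text, specific_chars):
    """Expand text into grams: specific characters, or tagged divided codes."""
    grams = []
    for ch in text:
        if ch in specific_chars:
            grams.append(ch)                          # single character, not divided
        else:
            code = ord(ch)                            # assumed 16-bit character code
            grams.append(("upper", code >> 8))        # upper divided character code
            grams.append(("lower", code & 0xFF))      # lower divided character code
    return grams


def bigrams(text, specific_chars):
    """Return every pair of adjacent grams as a bi-gram character string."""
    grams = to_grams(text, specific_chars)
    return list(zip(grams, grams[1:]))                # (previous gram, target gram)
```

Pairing adjacent grams yields exactly the five combinations listed above: single with single, lower code with single, single with upper code, lower code with upper code, and upper code with lower code. The first gram has no predecessor, so no bi-gram is produced for it.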
- In this way, the number of basic words and the number of single characters for which maps are created are limited by the target appearance rates Pw and Pc, so map size optimization can be realized at the same time.
- In addition, a plurality of types of maps can be created concurrently while shifting one character at a time, which improves the efficiency of creating the plurality of types of maps used for high-precision search.
- FIG. 38 is an explanatory view showing a 2-gram character string appearance frequency totaling table.
- the bi-gram character string appearance frequency tabulation table 3800 is stored in a storage device such as the RAM 903 and the magnetic disk 905, and increases the number of appearances by one each time the corresponding bi-gram character string appears.
- The computer 1100 determines whether a subsequent character of the target character exists in the target file Fi (step S3408). If there is a subsequent character (step S3408: Yes), the computer 1100 sets the subsequent character as the target character (step S3409) and returns to step S3402. On the other hand, if there is no subsequent character (step S3408: No), the computer 1100 ends the recounting process of the target file Fi (step S3303) and proceeds to step S3304.
- According to the recounting process of the target file Fi (step S3303), the numbers of appearances of the upper divided character codes, the lower divided character codes, and the bi-gram character strings existing in the target file Fi can be totaled.
- FIG. 39 is a flowchart showing a detailed processing procedure example of the Huffman tree generation processing (step S2504) shown in FIG.
- the computer 1100 determines the upper limit length N of the compression code length (step S3901).
- the computer 1100 executes correction processing (step S3902).
- the correction processing is processing for correcting the occurrence probability and the compression code length for each character information using the upper limit length N of the compression code length, as described with reference to FIGS.
- The computer 1100 generates a leaf structure for each piece of character information. Then, the computer 1100 executes the branch number specifying process (step S3904). In the branch number specifying process (step S3904), the number of branches per leaf is specified for each compression code length. Details of the branch number specifying process (step S3904) will be described with reference to FIG.
- The computer 1100 then executes the construction process (step S3905). Since the number of branches for each leaf structure has been specified by the branch number specifying process (step S3904), the computer 1100 first generates, for each leaf structure, as many pointers to the leaf as the number of branches. The pointer groups generated for the leaf structures are then aggregated to form the root structure. As a result, a 2^N-branch nodeless Huffman tree H is generated. The generated 2^N-branch nodeless Huffman tree H is stored in a storage device (such as the RAM 903 or the magnetic disk 905) in the computer 1100. Thereafter, the process proceeds to the map creation process (step S2505) in FIG.
- FIG. 40 is a flowchart showing a detailed processing procedure example of the branch number specifying process (step S3904) shown in FIG.
- the computer 1100 calculates the total branch number B (L) of the compression code length CL (step S4005).
- The computer 1100 increments j, decrements the compression code length CL (step S4006), returns to step S4003, and determines whether the incremented j satisfies j > D.
- N 11
- FIG. 41 is a flowchart showing a detailed processing procedure of the construction process (step S3905) shown in FIG.
- The computer 1100 determines whether there is an unselected leaf with the compression code length CL (step S4102). If there is an unselected leaf (step S4102: Yes), the computer 1100 executes the leaf pointer generation process (step S4103) and returns to step S4102.
- In the leaf pointer generation process (step S4103), a group of pointers to the leaf, as many as the number of branches corresponding to the compression code length CL, is generated for each leaf structure. Details of the leaf pointer generation process (step S4103) will be described with reference to FIG.
- On the other hand, if there is no unselected leaf (step S4102: No), the computer 1100 determines whether CL > N is satisfied (step S4104). If CL > N is not satisfied (step S4104: NO), the computer 1100 increments CL (step S4105) and returns to step S4102. On the other hand, if CL > N is satisfied (step S4104: Yes), the 2^N-branch nodeless Huffman tree H has been constructed, and the process proceeds to step S2505. The information in the first area to the fifth area is stored in this construction process (step S3905).
- FIG. 42 is a flowchart showing a detailed processing procedure of the leaf pointer generation processing (step S4103) shown in FIG.
- The computer 1100 sets the bit length of the subsequent bit string of the pointer PL(k) to the selected leaf to the difference obtained by subtracting the compression code length CL of the selected leaf from the maximum compression code length N, and sets the initial value of the subsequent bit string to all 0s (step S4204).
- When k = 1, the subsequent bit string is all 0s, so the subsequent bit string is the 5-bit string “00000”.
- The computer 1100 stores the pointer PL(k) to the selected leaf in the root structure (step S4205). Thereafter, the computer 1100 determines whether k > b(CL) is satisfied (step S4206).
- Here, b(CL) is the number of branches per leaf for the compression code length CL of the selected leaf. If k > b(CL) is not satisfied (step S4206: NO), pointers to the leaf have not yet been generated for all the branches assigned to the selected leaf, so the computer 1100 increments k (step S4207).
- The computer 1100 then increments the current subsequent bit string and concatenates the incremented subsequent bit string to the end of the preceding bit string, thereby newly generating a pointer PL(k) to the selected leaf (step S4208).
- The computer 1100 stores the pointer PL(k) to the selected leaf in the root structure (step S4209), and returns to step S4206.
- By repeating steps S4206 to S4209, a group of pointers to the leaf corresponding to the number of branches per leaf is generated.
- If k > b(CL) is satisfied (step S4206: YES), the process proceeds to step S4102.
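The pointer generation above can be sketched in Python under a few assumptions: compression codes are given as bit strings forming a complete prefix code, the root structure is a plain list of 2^N entries, and each entry holds the symbol of its leaf instead of a real pointer. The function names `build_root` and `decode` are hypothetical.

```python
def build_root(codes, n):
    """Build a 2**n-entry root table from a prefix code.

    codes maps each symbol to its compression code (a bit string of at most
    n bits). A leaf with code length CL receives 2**(n - CL) entries.
    """
    root = [None] * (1 << n)
    for symbol, code in codes.items():
        tail_len = n - len(code)               # bit length of the subsequent bit string
        base = int(code, 2) << tail_len        # compression code followed by all-zero tail
        for tail in range(1 << tail_len):      # increment the subsequent bit string
            root[base + tail] = symbol         # store a pointer to the leaf in the root
    return root


def decode(bits, codes, root, n):
    """Decode a bit string by reading n bits at a time and jumping straight to a leaf."""
    out, pos = [], 0
    while pos < len(bits):
        window = bits[pos:pos + n].ljust(n, "0")   # pad the final window with zeros
        symbol = root[int(window, 2)]              # one lookup, no internal nodes
        out.append(symbol)
        pos += len(codes[symbol])                  # consume only the code's own length
    return out
```

Filling every value of the subsequent bit string is what lets the decoder reach a leaf with a single table lookup, at the cost of a root structure with 2^N entries.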
- The maximum branch number 2^N of the 2^N-branch nodeless Huffman tree H can be set to an optimum number, so the size of the 2^N-branch nodeless Huffman tree H can be optimized.
- A 2^N-branch nodeless Huffman tree H with good compression efficiency can thereby be generated.
- The computer 1100 associates the structure of each leaf of the 2^N-branch nodeless Huffman tree H with the basic word structure, the specific character code structure, and the divided character code structure by referring to the character information table of FIG. Specifically, as described above, the leaf structure stores a specific character corresponding to the compression code stored in the leaf, a divided character code, a pointer to the leaf, and a pointer to the basic word.
- the computer 1100 stores a pointer to the leaf storing the corresponding compression code for each basic word of the basic word structure.
- the computer 1100 stores a pointer to a leaf that stores a corresponding compression code for each specific character of the specific character code structure.
- the computer 1100 stores a pointer to a leaf storing a corresponding compression code for each divided character code of the divided character code structure.
- FIG. 43 is a flowchart showing a detailed processing procedure example of the map creation processing (step S2505) shown in FIG.
- If i > ⁇ is not satisfied (step S4304: NO), the computer 1100 increments i (step S4305) and returns to step S4302. On the other hand, if i > ⁇ is satisfied (step S4304: YES), the map creation process (step S2505) ends. According to this map creation process (step S2505), the map creation process of the target file Fi (step S4303) can be executed for each target file Fi.
- FIG. 44 is a flowchart showing a detailed processing procedure of the map creation process (step S4303) of the target file Fi shown in FIG.
- The computer 1100 sets the target character to the first character of the target file Fi (step S4401), and executes a basic word appearance map creation process (step S4402), a specific single character appearance map creation process (step S4403), and a bi-gram character string appearance map creation process (step S4404).
- Details of the basic word appearance map creation process (step S4402) will be described with reference to FIG. Details of the specific single character appearance map creation process (step S4403) will be described with reference to FIG. Details of the bi-gram character string appearance map creation process (step S4404) will be described with reference to FIG.
- The computer 1100 determines whether the target character is the last character of the target file Fi (step S4405). If the target character is not the last character of the target file Fi (step S4405: No), the computer 1100 shifts the target character by one character toward the end (step S4406) and returns to step S4402. On the other hand, if the target character is the last character of the target file Fi (step S4405: Yes), the map creation process of the target file Fi (step S4303) ends, and the process proceeds to step S4304.
- According to the map creation process of the target file Fi (step S4303), the basic word appearance map, the specific single character appearance map, and the bi-gram character string appearance map can be generated in parallel while shifting the target character one character at a time.
- FIG. 45 is a flowchart showing a detailed processing procedure example of the basic word appearance map creation processing (step S4402) shown in FIG.
- the computer 1100 executes the longest match search process for the target character (step S4501).
- the detailed processing procedure of the longest match search process (step S4501) is the same as the longest match search process (step S2901) shown in FIG.
- the computer 1100 determines whether or not there is a longest matching basic word, that is, a basic word (step S4502). If there is no longest matching basic word (step S4502: No), the process proceeds to a specific single character appearance map creation process (step S4403). On the other hand, when there is a longest matching basic word (step S4502: Yes), the computer 1100 determines whether a basic word appearance map has been set for the longest matching basic word (step S4503).
- If it has already been set (step S4503: YES), the process proceeds to step S4506. On the other hand, if it has not been set (step S4503: No), the computer 1100 accesses the leaf of the longest matching basic word in the 2^N-branch nodeless Huffman tree H and acquires its compression code (step S4504). The computer 1100 sets the acquired compression code as a pointer to the basic word appearance map for the longest matching basic word (step S4505), and proceeds to step S4506. Thereafter, in step S4506, the computer 1100 turns ON the bit of the target file Fi in the basic word appearance map for the longest matching basic word (step S4506).
- This completes the basic word appearance map creation process (step S4402), and the process proceeds to the specific single character appearance map creation process (step S4403).
- According to the basic word appearance map creation process (step S4402), the basic word appearance map can be created using the longest matching basic word for each target character.
- FIG. 46 is a flowchart showing a detailed processing procedure example of the specific single character appearance map creation processing (step S4403) shown in FIG.
- The computer 1100 performs a binary search for the target character on the structure of the specific single characters (step S4601), and determines whether there is a match (step S4602). If there is no matching single character (step S4602: NO), the computer 1100 executes a divided character code appearance map creation process (step S4603), and proceeds to the bi-gram character string appearance map creation process (step S4404). Details of the divided character code appearance map creation process (step S4603) will be described with reference to FIG.
- On the other hand, if there is a matching single character (step S4602: Yes), the computer 1100 accesses the leaf of the binary-searched single character in the 2^N-branch nodeless Huffman tree H and acquires its compression code (step S4604). Then, the computer 1100 determines whether a specific single character appearance map has been set for the acquired compression code (step S4605). If already set (step S4605: YES), the process proceeds to step S4607.
- On the other hand, if it has not been set (step S4605: No), the computer 1100 sets the acquired compression code as a pointer to the specific single character appearance map for the binary-searched single character (step S4606), and proceeds to step S4607. Thereafter, in step S4607, the computer 1100 turns ON the bit of the target file Fi in the specific single character appearance map for the binary-searched single character (step S4607).
- This completes the specific single character appearance map creation process (step S4403), and the process proceeds to the bi-gram character string appearance map creation process (step S4404).
- According to the specific single character appearance map creation process (step S4403), the specific single character appearance map can be created with the binary-searched target character treated as a specific single character.
- FIG. 47 is a flowchart showing a detailed processing procedure example of the divided character code appearance map creation processing (step S4603) shown in FIG.
- The computer 1100 divides the target character (step S4701), accesses the leaf of the upper divided character code in the 2^N-branch nodeless Huffman tree H, and acquires its compression code (step S4702). Then, the computer 1100 determines whether an upper divided character code appearance map has been set for the acquired compression code (step S4703).
- If it has been set (step S4703: YES), the process proceeds to step S4705. On the other hand, if it has not been set (step S4703: No), the computer 1100 sets the acquired compression code as a pointer to the appearance map of the upper divided character code (step S4704), and proceeds to step S4705. Thereafter, in step S4705, the computer 1100 turns ON the bit of the target file Fi in the appearance map of the upper divided character code divided from the target character (step S4705).
- The computer 1100 then accesses the leaf of the lower divided character code in the 2^N-branch nodeless Huffman tree H, and acquires its compression code (step S4706). Then, the computer 1100 determines whether an appearance map of the lower divided character code has been set for the acquired compression code (step S4707). If already set (step S4707: YES), the process proceeds to step S4709.
- On the other hand, if it has not been set (step S4707: No), the computer 1100 sets the acquired compression code as a pointer to the appearance map of the lower divided character code (step S4708), and proceeds to step S4709. Thereafter, in step S4709, the computer 1100 turns ON the bit of the target file Fi in the appearance map of the lower divided character code divided from the target character (step S4709).
- This completes the divided character code appearance map creation process (step S4603), and the process proceeds to the bi-gram character string appearance map creation process (step S4404).
- Single characters ranked below the rank corresponding to the target appearance rate Pc have low appearance frequencies, so most bits in their appearance maps would be OFF.
- Therefore, no specific single character appearance map is generated for the single characters ranked below the rank corresponding to the target appearance rate Pc, which optimizes the map size of the compression code map Ms of the specific single characters.
- In addition, by being divided, these characters are assigned to fixed-size maps such as the compression code map Ms of the upper divided character codes and the compression code map Ms of the lower divided character codes. Therefore, regardless of the appearance rate set as the target appearance rate Pc, an increase in map size can be prevented and memory can be saved.
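The choice between the specific single character appearance map and the divided character code appearance maps can be sketched as below. The sketch makes several simplifying assumptions: each appearance map row is a Python integer used as a bit string with one bit per file, rows are keyed directly by character or code value instead of by compression code, and character codes are 16 bits. The function name `mark_characters` is hypothetical.

```python
def mark_characters(file_index, text, specific_chars, char_map, upper_map, lower_map):
    """Turn ON the bit of file `file_index` in the appropriate appearance maps.

    Each map is a dict whose values are integers used as bit strings:
    bit i is 1 when the key appears in file i.
    """
    bit = 1 << file_index
    for ch in text:
        if ch in specific_chars:
            # Frequent character: it has its own row in the specific character map.
            char_map[ch] = char_map.get(ch, 0) | bit
        else:
            # Rare character: divide its code and mark the two fixed-size maps instead.
            code = ord(ch)
            upper_map[code >> 8] = upper_map.get(code >> 8, 0) | bit
            lower_map[code & 0xFF] = lower_map.get(code & 0xFF, 0) | bit
```

With this layout the upper and lower maps can never exceed 256 rows each, which is one way to read the fixed map size described above.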
- FIG. 48 is a flowchart showing a detailed processing procedure example of the bi-gram character string map creation processing (step S4404) shown in FIG.
- the computer 1100 executes bigram character string specifying processing (step S4801).
- the detailed processing procedure of the bi-gram character string specifying process (step S4801) is the same as the bi-gram character string specifying process (step S4806) shown in FIG.
- The computer 1100 then determines whether a bi-gram character string has been identified by the bi-gram character string specifying process (step S4801) (step S4802). If none has been specified (step S4802: NO), the process proceeds to step S4405 in FIG.
- If one has been specified (step S4802: YES), the computer 1100 executes the bi-gram character string appearance map generation process (step S4803) and proceeds to step S4405.
- FIG. 49 is a flowchart showing a detailed processing procedure example of the bi-gram character string appearance map generation processing (step S4803).
- For the first gram (a specific single character or a divided character code) of the bi-gram character string specified by the bi-gram character string specifying process (step S4801), the computer 1100 accesses the leaf of the 2^N-branch nodeless Huffman tree H and acquires its compression code (step S4901).
- The computer 1100 likewise accesses the leaf of the 2^N-branch nodeless Huffman tree H for the second gram (a specific single character or a divided character code) and acquires its compression code (step S4902).
- The computer 1100 then concatenates the compression code of the first gram and the compression code of the second gram (step S4903), and determines whether an appearance map having the concatenated compression code as a pointer has already been set (step S4904). If it has (step S4904: YES), the process proceeds to step S4906.
- If it has not been set (step S4904: NO), the computer 1100 sets the concatenated compression code as a pointer to the appearance map of the identified bi-gram character string (step S4905). In step S4906, the computer 1100 turns ON the bit of the target file Fi in the appearance map of the identified bi-gram character string.
- In this way, the appearance map of a bi-gram character string can be designated directly by the concatenated compression code of that bi-gram character string.
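As a rough sketch of steps S4901 to S4906, the concatenated compression code can serve directly as the key of the bi-gram appearance map. The names `record_bigram`, `code_of`, and `bigram_maps`, the string representation of compression codes, and the sample code values in the usage note are assumptions for illustration only.

```python
def record_bigram(first_gram, second_gram, file_no, code_of, bigram_maps):
    """Turn ON the bit of file `file_no` in the appearance map of a bi-gram.

    code_of: hypothetical dict from a gram to its compression code (a bit string),
             standing in for a leaf lookup in the Huffman tree H.
    bigram_maps: dict keyed by the concatenated compression code; each value is
             an int used as a bitmap, where bit i = file number i.
    """
    pointer = code_of[first_gram] + code_of[second_gram]   # concatenate the two codes
    if pointer not in bigram_maps:                         # set the pointer if not yet set
        bigram_maps[pointer] = 0
    bigram_maps[pointer] |= 1 << file_no                   # turn ON the bit of the target file
    return pointer
```

For instance, with made-up codes `"0101"` and `"1100"` for the two grams, the appearance map is reached through the key `"01011100"` without any additional index.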
- FIG. 50 is an explanatory diagram illustrating a specific example of compression processing using the 2 N- branch nodeless Huffman tree H.
- the computer 1100 acquires the compression target character code of the first character from the target file group Fs, and holds the position on the target file Fi. Then, the computer 1100 performs a binary tree search on the basic word structure 2300. Since the basic word is a character code string of two or more characters, when the compression target character code of the first character is hit, the second character code is acquired as the compression target character code.
- The character code of the second character is searched for from the position where the compression target character code of the first character was hit. From the third character onward, the binary tree search is repeated until a compression target character code that does not match appears.
- When a matching basic word ra (a is a leaf number) is found, the computer 1100 accesses the structure of the leaf La through the pointer to the leaf La associated with the basic word structure 2300. The computer 1100 then looks up the compression code of the basic word ra stored in the structure of the accessed leaf La and stores it in the compression buffer 5000.
- When no basic word matches, the computer 1100 sets the compression target character code of the first character in the register again and performs a binary tree search on the specific single character structure 2100.
- When a matching character code eb is found, the computer 1100 accesses the structure of the leaf Lb through the pointer to the leaf Lb. The computer 1100 then looks up the compression code of the character code eb stored in the structure of the accessed leaf Lb and stores it in the compression buffer 5000.
- When no specific single character matches, the computer 1100 divides the compression target character code into its upper 8 bits and lower 8 bits. The computer 1100 then performs a binary tree search on the divided character code structure 2200 for the upper 8-bit divided character code.
- When a matching divided character code Dc1 (c1 is a leaf number) is found, the computer 1100 accesses the structure of the leaf Lc1 through the pointer to the leaf Lc1. The computer 1100 then looks up the compression code of the divided character code Dc1 stored in the structure of the accessed leaf Lc1 and stores it in the compression buffer 5000.
- Next, the computer 1100 performs a binary tree search on the divided character code structure for the lower 8-bit divided character code.
- When a matching divided character code Dc2 (c2 is a leaf number) is found, the computer 1100 accesses the structure of the leaf Lc2 through the pointer to the leaf Lc2. The computer 1100 then looks up the compression code of the divided character code Dc2 stored in the structure of the accessed leaf Lc2 and stores it in the compression buffer 5000.
- In this way, the target file Fi is compressed.
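The three-stage lookup (basic word, then specific single character, then divided character codes) can be sketched as below. This is a simplified illustration: plain dictionaries replace the binary tree searches over the structures 2300, 2100, and 2200, longest-match selection of basic words is an assumption, and the function and parameter names are hypothetical.

```python
def compress(char_codes, basic_words, single_chars, divided):
    """Return the list of compression codes for a sequence of 16-bit character codes.

    basic_words:  dict from a tuple of character codes (2 or more) to a compression code
    single_chars: dict from a 16-bit character code to a compression code
    divided:      dict from an 8-bit divided character code to a compression code
    """
    out = []                      # plays the role of the compression buffer
    i = 0
    max_len = max((len(w) for w in basic_words), default=0)
    while i < len(char_codes):
        # 1) try the longest basic word starting at position i
        for length in range(min(max_len, len(char_codes) - i), 1, -1):
            word = tuple(char_codes[i:i + length])
            if word in basic_words:
                out.append(basic_words[word])
                i += length
                break
        else:
            code = char_codes[i]
            if code in single_chars:
                # 2) specific single character
                out.append(single_chars[code])
            else:
                # 3) split into upper and lower 8 bits
                out.append(divided[(code >> 8) & 0xFF])
                out.append(divided[code & 0xFF])
            i += 1
    return out
```

Because every 16-bit code that is not a specific single character falls through to the 256 divided codes, the lookup never fails for 16-bit input.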
- FIG. 51 is a flowchart illustrating an example of a compression processing procedure for the target file group Fs using the 2 N- branch nodeless Huffman tree H by the first compression unit 1103.
- the computer 1100 executes a compression process (step S5103) and increments the file number: p (step S5104). Details of the compression processing (step S5103) will be described with reference to FIG.
- The computer 1100 then determines whether the file number p exceeds the total number of files in the target file group Fs (step S5105).
- If it does not (step S5105: NO), the process returns to step S5102. If it does (step S5105: YES), the compression processing of the target file group Fs ends.
- FIG. 52 is a flowchart (part 1) showing a detailed processing procedure of the compression processing (step S5103) shown in FIG.
- the computer 1100 determines whether or not there is a compression target character code in the target file group Fs (step S5201). If there is (step S5201: YES), the computer 1100 acquires the compression target character code and sets it in the register (step S5202). Then, the computer 1100 determines whether or not it is the first compression target character code (step S5203).
- Here, the first compression target character code is the character code of the first uncompressed character.
- If it is the first (step S5203: YES), the computer 1100 acquires a pointer that is the position (head position) of the compression target character code in the target file group Fs (step S5204) and proceeds to step S5205.
- If it is not the first (step S5203: NO), the process proceeds to step S5205 without acquiring the head position.
- In step S5205, the computer 1100 performs a binary tree search on the basic word structure 2300. If the compression target character codes match (step S5206: YES), the computer 1100 determines whether the consecutively matched character code string corresponds to a basic word (step S5207). If it does not (step S5207: NO), the computer 1100 returns to step S5202 and acquires the subsequent character code as the compression target character code. In this case, since the subsequent character code is not the first, the head position is not acquired.
- If it corresponds to a basic word (step S5207: YES), the computer 1100 accesses the structure of the leaf L# through the pointer to the leaf L# of the corresponding basic word (step S5208). The computer 1100 then extracts the compression code of the basic word stored in the structure of the pointed-to leaf L# (step S5209).
- The computer 1100 stores the extracted compression code in the compression buffer 5000 (step S5210) and returns to step S5201.
- This loop is the flow of the basic word compression process.
- If there is no compression target character code in step S5201 (step S5201: NO), the computer 1100 outputs from the compression buffer 5000 the compressed file fp obtained by compressing the target file Fp and stores it (step S5211). The process then proceeds to step S5104. If no match is found in step S5206 (step S5206: NO), the process enters the 16-bit character code compression processing loop.
- FIG. 53 is a flowchart (part 2) showing a detailed processing procedure of the compression processing (step S5103) shown in FIG.
- the computer 1100 refers to the pointer at the head position acquired in step S5204, acquires the compression target character code from the target file group Fs, and sets it in the register (step S5301).
- the computer 1100 performs a binary tree search for the specific single character structure 2100 for the compression target character code (step S5302). If they match (step S5303: YES), the computer 1100 accesses the structure of the leaf L # with a pointer to the leaf L # of the corresponding character (step S5304). Then, the computer 1100 extracts the compression code of the character code to be compressed stored in the structure of the pointed leaf L # (step S5305).
- The computer 1100 then stores the retrieved compression code in the compression buffer 5000 (step S5306) and returns to step S5201.
- This loop is the flow of the 16-bit character code compression process.
- If they do not match (step S5303: NO), the process enters the compression process for divided character codes.
- FIG. 54 is a flowchart (part 3) showing a detailed processing procedure of the compression processing (step S5103) shown in FIG.
- the computer 1100 divides the compression target character code into upper 8 bits and lower 8 bits (step S5401), and extracts upper 8 bits of divided character codes (step S5402). Then, the computer 1100 performs a binary tree search for the divided character code structure 2200 (step S5403).
- the computer 1100 accesses the structure of the leaf L # by using the pointer to the leaf L # of the searched divided character code (step S5404). Then, the computer 1100 extracts the compression code of the divided character code stored in the structure of the pointed leaf L # (step S5405). Thereafter, the computer 1100 stores the searched compression code in the compression buffer 5000 (step S5406).
- The computer 1100 then determines whether the lower 8 bits have already been searched (step S5407). If they have not (step S5407: NO), the computer 1100 extracts the lower 8-bit divided character code (step S5408) and executes steps S5403 to S5406. If the lower 8 bits have already been searched (step S5407: YES), the process returns to step S5301 to enter the basic word compression processing loop.
- As described above, the structure of the leaf L# in which the compression target character code is stored can be identified immediately from the basic word structure, the specific single character structure, and the divided character code structure. It is therefore unnecessary to search the leaves of the 2^N-branch nodeless Huffman tree H, and the compression process can be sped up. Furthermore, by dividing each lower-ranked character code into an upper bit code and a lower bit code, non-specific single characters can be compressed using the compression codes of only 256 types of divided character codes. The compression rate can therefore be improved.
- the second compression unit 1106 compresses the appearance map in the compression area, and does not compress the appearance map in the non-compression area.
- the bit string from file number (2n + 1) to ⁇ is an uncompressed area and is not compressed.
- The number of consecutive “0”s in a bit string increases as the total number of files increases.
- For this reason, appearance rate areas are set. An appearance rate area is a range of appearance rates, and the Huffman tree h for appearance map compression is assigned according to the appearance rate area.
- FIG. 55 is an explanatory diagram showing the relationship between the appearance rate and the appearance rate area. If the appearance rate is in the range of 0 to 100%, as shown in FIG. 55, the area can be divided into areas A to E and areas A ′ to E ′. Accordingly, the Huffman tree h for appearance map compression is assigned as a compression pattern in accordance with the appearance rate area specified in the areas A to E and A ′ to E ′.
- FIG. 56 is an explanatory diagram showing a compression pattern table having compression patterns for each appearance rate area. Since the appearance rate is stored in the fifth area of the structure of the leaf L# as shown in FIG. 20, designating the structure of the leaf L# allows the compression pattern to be identified by referring to the compression pattern table 5600. Note that the A region and the A′ region are not compressed, so there is no Huffman tree that serves as their compression pattern.
- FIG. 57 is an explanatory diagram showing compression patterns in the case of the B region and the B ′ region.
- the compression pattern 5700 is a Huffman tree h having 16 types of leaves.
- FIG. 58 is an explanatory diagram showing compression patterns in the case of the C region and the C ′ region.
- The compression pattern 5800 is a Huffman tree h with 16 + 1 types of leaves.
- In the compression pattern 5800, runs of consecutive “0”s or consecutive “1”s are probabilistically more frequent than in the B region and the B′ region. Therefore, the code word “00” is assigned to a bit string of 16 consecutive “0”s.
- FIG. 59 is an explanatory diagram showing compression patterns in the case of the D region and the D′ region.
- The compression pattern 5900 is a Huffman tree h with 16 + 1 types of leaves.
- In the compression pattern 5900, the code word “00” is assigned to a bit string of 32 consecutive “0”s.
- FIG. 60 is an explanatory diagram showing compression patterns for the E region and the E′ region.
- The compression pattern 6000 is a Huffman tree h with 16 + 1 types of leaves. In the compression pattern 6000, runs of consecutive “0”s or consecutive “1”s are probabilistically more frequent than in the D region and the D′ region. Therefore, the code word “00” is assigned to a bit string of 64 consecutive “0”s. As described above, the number of consecutive “0”s, which indicate that the character code is absent, increases in accordance with the appearance rate area, so the compression efficiency of the compression code map Ms can be improved according to the appearance rate of the character code.
- compression symbol map compression processing is a process for compressing a bit string in a compression area. Specifically, the bit string in the compression area of the compression code map Ms is compressed using the compression pattern table 5600 shown in FIG. 56 and the compression patterns 5700 to 6000 (Huffman tree h) shown in FIGS. Hereinafter, the compression symbol map compression processing procedure will be described.
- FIG. 61 is a flowchart showing a compression symbol map compression processing procedure.
- The computer 1100 determines whether the compression code map Ms contains a pointer to an unselected appearance map (step S6101). If there is an unselected address (step S6101: YES), the computer 1100 selects it and accesses the structure of the leaf L# (step S6102), then acquires the character code from the first area of the structure of the leaf L# (step S6103). The computer 1100 then acquires the appearance rate area from the fifth area of the structure of the accessed leaf L#, thereby identifying the appearance rate area of the acquired character code (step S6104).
- Next, the computer 1100 refers to the compression pattern table 5600 of FIG. 56 to determine whether the identified appearance rate area is an uncompressed area (for example, the appearance rate areas A and A′) (step S6105). If it is an uncompressed area (step S6105: YES), the process returns to step S6101 and the next address is selected.
- If it is not an uncompressed area (step S6105: NO), the computer 1100 uses the identified appearance rate area to select the corresponding compression pattern (Huffman tree h) from the compression patterns 5700 to 6000 (step S6106). The computer 1100 then extracts the bit string of the compression area in the appearance map of the acquired character code (step S6107).
- Next, the computer 1100 determines whether the appearance rate of the acquired character code is 50% or more (step S6108).
- The appearance rate is the number of files in which the character information exists (numerator) divided by the total number of files in the target file group Fs (denominator). Since the appearance rate area is determined by the appearance rate (see FIG. 55), when the appearance rate area is A to E, the computer 1100 determines that the appearance rate of the acquired character code is less than 50%. When the appearance rate area is A′ to E′, the computer 1100 determines that the appearance rate of the acquired character code is 50% or more.
- If the appearance rate is 50% or more (step S6108: YES), the computer 1100 inverts the bit string extracted in step S6107 in order to increase the compression efficiency (step S6109). For example, when the extracted bit string is “1110”, it is changed to “0001”, which increases the number of “0”s. The computer 1100 then compresses the inverted bit string using the Huffman tree selected in step S6106 and stores the compressed bit string in a storage device (for example, flash memory or magnetic disk 205) (step S6110). The process then returns to step S6101. Because of this bit string inversion, no Huffman tree h needs to be prepared for the appearance rate areas A′ to E′, so memory can be saved.
- If the appearance rate is less than 50% (step S6108: NO), the computer 1100 skips the bit string inversion (step S6109), compresses the bit string extracted in step S6107 using the Huffman tree selected in step S6106 (step S6110), and returns to step S6101. If there is no unselected address in step S6101 (step S6101: NO), the compression code map compression process ends.
- As described above, the bit string in the compression area is compressed according to the appearance rate of each piece of character information.
- The number of consecutive “0”s, which indicate that the character information is absent, increases in accordance with the appearance rate area, so the compression efficiency of the compression code map Ms can be improved in accordance with the appearance rate of the character information.
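The inversion and zero-run coding can be illustrated with the toy sketch below. Only three things are taken from the description: inversion at an appearance rate of 50% or more, the example “1110” becoming “0001”, and the code word “00” for a run of 16 consecutive “0”s. Everything else is an assumption made so the example runs: the remaining 16 leaves are modeled as 4-bit groups with the made-up code words `"1" + group`, the appearance rate is computed from the bit string itself rather than read from the leaf structure, and the name `compress_bits` is hypothetical.

```python
def compress_bits(bits, zero_run=16):
    """Toy run coding of an appearance-map bit string ('0'/'1' characters).

    Returns (inverted, code). len(bits) must be a multiple of 4.
    """
    if len(bits) % 4 != 0:
        raise ValueError("bit string length must be a multiple of 4")
    rate = bits.count("1") / len(bits) if bits else 0.0
    inverted = rate >= 0.5
    if inverted:
        # invert so that "0" becomes the frequent symbol, e.g. "1110" -> "0001"
        bits = "".join("1" if b == "0" else "0" for b in bits)
    out = []
    i = 0
    while i < len(bits):
        if bits[i:i + zero_run] == "0" * zero_run:
            out.append("00")                    # one code word for a whole zero run
            i += zero_run
        else:
            out.append("1" + bits[i:i + 4])     # assumed code word for a 4-bit group
            i += 4
    return inverted, "".join(out)
```

The same encoder serves both halves of the appearance rate range, which mirrors the point that no separate Huffman tree is needed for the areas A′ to E′.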
- When a target file is added later and the added target file is compressed, a bit string indicating the presence or absence of characters must also be added to the compression code map Ms.
- the bit string of the appearance map with the file numbers: 1 to ⁇ is compressed by the compression patterns 5700 to 6000, and the code length is different for each record. That is, since it has a variable length, it becomes a compression region.
- the beginning (file number k side) of the compression code string is aligned, but the end (file number 1 side) is not aligned.
- If the bit strings were arranged in ascending order of file number starting from the pointer to the compression code map Ms (the compression code of the character information), the bit string of an added file would be inserted at the end of the compression code string.
- In that case, the compression code string and the bit string of the added file would be discontinuous. Therefore, the bit strings in the compression area of the compression code map Ms are arranged in advance in descending order of the file number p of the target file group Fs from the head position to the tail position.
- An uncompressed area is set between the pointer to the appearance map (the compression code of the character information) and the compression area.
- Among the file numbers 1 to 2n + 1, the file number 2n + 1 is assigned to the side on which the compression code string is aligned.
- As a result, the bit strings remain contiguous in file number order even when the uncompressed bit strings for the file numbers 2n + 1 to 3n are inserted.
- The target files can therefore be narrowed down accurately.
- FIG. 62 is a block diagram showing a functional configuration example 2 of the computer or the computer system according to the present embodiment.
- The computer 1100 includes a designation unit 6201, a first decompression unit 6202, a first compression unit 1103, an input unit 6203, an extraction unit 6204, a second decompression unit 6205, a specifying unit 6206, and a segment generation unit 6207.
- The designation unit 6201 through the segment generation unit 6207 realize their functions by causing the CPU 901 to execute programs stored in a storage device such as the ROM 902, the RAM 903, or the magnetic disk 905 shown in FIG.
- Each of the designation unit 6201 through the segment generation unit 6207 writes its execution results to the storage device and reads the execution results of the other units to perform its own processing.
- The designation unit 6201 through the segment generation unit 6207 are briefly described below.
- The designation unit 6201 accepts an open designation for any target file in the target file group Fs. Specifically, when the user operates a keyboard, a mouse, or a touch panel, the designation unit 6201 receives an open designation for the target file Fi. When the open designation is accepted, a pointer to the compressed file fi associated with the file number i of the target file Fi designated for opening is designated in the compression code map Ms. As a result, the compressed file fi of the target file Fi designated for opening, which is stored at the address the pointer refers to, is read out.
- The segment whose segment number matches the quotient obtained by dividing the file number i of the designated target file Fi by the number K of segments in the 0th layer is identified. The compressed file fi can then be designated from the identified segment.
- The designation unit 6201 also accepts an addition designation for a target file Fi. Specifically, when the user operates a keyboard, a mouse, or a touch panel, the designation unit 6201 receives an addition designation for the target file Fi. When the addition designation is accepted, the target file Fi designated for addition is compressed by the first compression unit 1103 using the 2^N-branch nodeless Huffman tree H and stored as the compressed file fi in the last segment of the 0th layer.
- The designation unit 6201 also accepts a segment aggregation designation. Specifically, when the user operates a keyboard, a mouse, or a touch panel, the designation unit 6201 accepts the segment aggregation designation.
- The segment aggregation designation may also be issued by a timer at a predetermined time or at predetermined time intervals.
- The first decompressing unit 6202 decompresses the compressed file fi of the target file Fi using the 2^N-branch nodeless Huffman tree H. Specifically, for example, the first decompressing unit 6202 decompresses the compressed file fi of the target file Fi designated by the designation unit 6201 using the 2^N-branch nodeless Huffman tree H. The compressed file of the target file Fi identified by the specifying unit 6206, which will be described later, is also decompressed using the 2^N-branch nodeless Huffman tree H. A specific example of decompression will be described later.
- The input unit 6203 accepts input of a search character string. Specifically, when the user operates a keyboard, a mouse, or a touch panel, the input unit 6203 receives the input of the search character string. The input unit 6203 may also receive the search character string from a client device via a network.
- The extraction unit 6204 extracts the compression code of the character information in the search character string input through the input unit 6203 from the 2^N-branch nodeless Huffman tree H. Specifically, for example, the extraction unit 6204 extracts from the search character string the character information that corresponds to a specific single character, an upper divided character code, a lower divided character code, a bi-gram character string, or a basic word.
- The extraction unit 6204 can identify the compression code of the extracted character information using the 2^N-branch nodeless Huffman tree H and point to the corresponding appearance map in the compression code map Ms. For example, the compressed appearance map of the specific single character “人” (person), the compressed appearance map of “形” (shape), and the compressed appearance map of the bi-gram character string “人形” (doll) may be pointed to.
- In the case of a system, the master server MS extracts character information with the extraction unit 6204 and acquires the compression code of the extracted character information using the 2^N-branch nodeless Huffman tree H. Since the acquired compression code serves as a pointer to the appearance map, it is transmitted to the slave servers S1, S2, ....
- The second decompressing unit 6205 decompresses the compressed appearance maps extracted by the extraction unit 6204. Specifically, since the appearance rate area can be identified from the appearance rate of the character information, the second decompressing unit 6205 decompresses each compressed appearance map using the Huffman tree for maps that corresponds to the identified appearance rate area. In the above example, for all archive files (see FIG. 7), the compressed appearance map of the specific single character “人” (person), the compressed appearance map of “形” (shape), and the compressed appearance map of the bi-gram character string “人形” (doll) are decompressed.
- In the case of a system, the decompression processing by the second decompressing unit 6205 is executed in each of the master server MS and the slave servers S1, S2, ....
- The specifying unit 6206 identifies, from the compressed file group, the compressed files of the target files that contain the character information in the search character string by performing an AND operation on the appearance maps decompressed by the second decompressing unit 6205 and the deletion map D.
- In the above example, the specifying unit 6206 performs an AND operation on the appearance map of the specific single character “人” (person), the appearance map of “形” (shape), the appearance map of the bi-gram character string “人形” (doll), and the deletion map D.
- This AND operation is executed starting from the highest-layer segment, narrows the candidates down to the 0th-layer segments, and is finally executed on the narrowed-down 0th-layer segments.
- The appearance map of the bi-gram character string “人形” (doll) is omitted for simplification.
- the master server MS narrows down the segment from the highest hierarchy to the first hierarchy by the specifying unit 6206, and manages the file number of the target file including the search character string.
- the slave server that has received the file number narrows down the compressed file by performing an AND operation on the appearance map and the deletion map by the specifying process by the specifying unit 6206.
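The narrowing-down by AND operation can be sketched as follows. This is a flat, single-layer illustration that ignores the segment hierarchy. It assumes that each map is a Python integer whose bit i corresponds to file number i, and that an ON bit in the deletion map means the file has not been deleted; the function name `narrow_down` is hypothetical.

```python
def narrow_down(appearance_maps, deletion_map):
    """AND all appearance maps with the deletion map and list surviving file numbers.

    appearance_maps: iterable of ints used as bitmaps (bit i = file number i)
    deletion_map:    int bitmap; a bit is assumed ON when the file still exists
    """
    result = deletion_map
    for bitmap in appearance_maps:
        result &= bitmap                      # keep only files present in every map
    return [i for i in range(result.bit_length()) if (result >> i) & 1]
```

A file survives only if every character of the query appears in it and it has not been deleted, which is why a single OFF bit in any map is enough to exclude it.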
- The first decompressing unit 6202 decompresses the compressed files identified by the specifying unit 6206 (the compressed files f3 and f19 in the above example) using the 2^N-branch nodeless Huffman tree H.
- In the case of a system, the compressed files narrowed down by the slave server are transmitted to the master server MS.
- The master server MS decompresses the compressed files received from the slave server using the 2^N-branch nodeless Huffman tree H with the first decompressing unit 6202.
- The decompressed target files (F3 and F19 in the above example) are displayed on a display device such as a display.
- In the case of a system, the master server MS transmits the decompressed target files (F3 and F19 in the above example) to the client device as the search result. If no compressed file is identified by the specifying unit 6206, a search result to that effect is returned.
- The segment generation unit 6207 determines whether the current total number of files is a multiple of the number n of files per segment. If it is a multiple of n, the last segment has no free area in which the compressed file of the target file designated for addition can be stored, so a new segment is generated in the 0th layer. When a segment is newly generated, the management areas are associated with each other as shown in FIG. 1 and FIG. 6. The added compressed files are then stored sequentially in the new segment.
- In the case of a system, an instruction to generate a new 0th-layer segment is transmitted to the slave server that holds the last segment. If that slave server has no free area for the new segment, the instruction to generate a new 0th-layer segment is transmitted to another slave server. When the new segment has been generated, the master server MS sequentially transmits the added compressed files. The added compressed files are thus stored sequentially in the new segment.
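The multiple-of-n check can be sketched as follows for a single machine. Modeling each 0th-layer segment as a Python list is an assumption, the name `add_compressed_file` is hypothetical, and the sketch assumes that compressed files are never physically removed from a segment, so that a total count divisible by n implies a full last segment.

```python
def add_compressed_file(segments, compressed_file, files_per_segment):
    """Store a compressed file in the last 0th-layer segment, creating one if needed.

    segments: list of lists; each inner list models one 0th-layer segment.
    """
    total = sum(len(segment) for segment in segments)
    if total % files_per_segment == 0:
        # total is a multiple of n: the last segment (if any) is full
        segments.append([])
    segments[-1].append(compressed_file)
    return segments
```

With n = 4, adding nine files one at a time yields three segments holding 4, 4, and 1 files.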
- The segment generation unit 6207 also aggregates the appearance maps and the deletion map. Specifically, for example, as shown in FIG. 4, the segment generation unit 6207 aggregates each appearance map into the upper layer. Then, as shown in FIG. 4, the segment generation unit 6207 associates the management area of the upper-layer segment serving as the aggregation destination (for example, segment sg1(1)) with the management areas of the lower-layer segment group serving as the aggregation source (for example, segments sg0(1) to sg0(m)). This aggregation process is executed in the same way for the deletion map.
- FIG. 63 is an explanatory diagram of a file decompression example.
- the processing shown in the file decompression example is executed by the input unit 6203, the extraction unit 6204, the second decompression unit 6205, the specifying unit 6206, and the first decompression unit 6202.
- (G1) First, when the search character string “人形” (doll) is input through the input unit 6203, a binary search of the specific single character structure 2100 is performed for the characters “人” (person) and “形” (shape) that make up the search character string, and the specific single characters “人” and “形” are found.
- The specific single character structure 2100 is associated with pointers to the leaves (specific single characters) of the 2^N-branch nodeless Huffman tree H. Therefore, when a hit occurs in the specific single character structure, the leaf of the 2^N-branch nodeless Huffman tree H can be specified directly.
- The extracted compressed file f3 is collated while still compressed and then decompressed, which opens the decompressed target file F3.
- The collation flag is ON in the leaf structures of “人” and “形”, so when “人” and “形” are decompressed, they are decompressed with character string replacement.
- For example, “人” and “形”, whose collation flags are ON, are displayed in bold by being decompressed between <B></B> tags. Characters whose collation flag is OFF are decompressed as they are, without being enclosed in <B></B> tags.
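The effect of the collation flag during decompression can be sketched as below. The `leaves` dictionary (compression code to character and flag), the sample code values in the usage note, and the name `expand_with_highlight` are assumptions for illustration; the real leaf structure holds more fields.

```python
def expand_with_highlight(codes, leaves):
    """Decompress a sequence of compression codes, wrapping collated characters in <B></B>.

    leaves: hypothetical dict from a compression code to (character, collation_flag).
    """
    out = []                                   # plays the role of the decompression buffer
    for code in codes:
        char, collation_flag = leaves[code]
        out.append("<B>" + char + "</B>" if collation_flag else char)
    return "".join(out)
```

Given leaves `{"0101": ("人", True), "1100": ("形", True), "111": ("の", False)}`, the code sequence `["0101", "1100", "111"]` yields `<B>人</B><B>形</B>の`, so collation and highlighting happen in the same pass as decompression.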
- In the decompression, a compression code string is set in a register, and a compression code is extracted using a mask pattern.
- The extracted compression code is looked up from the root of the 2^N-branch nodeless Huffman tree H in one pass (an access of one branch). The character code stored in the structure of the accessed leaf L# is then read and stored in the decompression buffer.
- The mask position of the mask pattern is then offset.
- the initial value of the mask pattern is “0xFFF00000”.
- This mask pattern is a bit string in which the first 12 bits are “1” and the subsequent 20 bits are “0”.
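The mask-based extraction can be sketched as follows for a 32-bit register. Treating the bit offset as a right shift of the initial mask and right-aligning the masked result are assumptions of this sketch; the names `INITIAL_MASK` and `extract_code` are hypothetical.

```python
INITIAL_MASK = 0xFFF00000   # first 12 bits are 1, remaining 20 bits are 0

def extract_code(register, bit_offset):
    """Extract 12 bits from a 32-bit register, starting `bit_offset` bits from the top.

    bit_offset is assumed to be in the range 0..7 (a position within the first byte).
    """
    mask = INITIAL_MASK >> bit_offset          # offset the mask position
    return (register & mask) >> (20 - bit_offset)
```

For example, `extract_code(0xABCDEF12, 0)` gives `0xABC`, and with a bit offset of 2 the mask becomes `0x3FFC0000` and the result is `0xAF3`.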
- FIGS. 64 and 65 are explanatory diagrams showing a specific example of the decompression process in FIG. 63.
- FIG. 64 shows a decompression example (A) for the specific single character “人” (person).
- the CPU calculates a bit address abi, a byte offset byos, and a bit offset bios.
- the bit address abi is a value indicating the bit position of the extracted compression code
- the current bit address abi is a value obtained by adding the compression code length leg of the previously extracted compression code to the previous bit address abi.
- a block in the memory indicates a 1-byte bit string, and an internal number indicates a byte position that is a byte boundary.
- the compression code string of 4 bytes (shaded in the figure) from the beginning of the compression code string held in the memory is set in the register.
- Since the character code “0xBA4E” is stored in the structure of the leaf L97, the character code “0xBA4E” is extracted and stored in the decompression buffer. In this case, since the collation flag is ON, the character code “0xBA4E” is enclosed in <B></B> tags and stored.
- the mask pattern is “0x3FFC0000”. Therefore, an AND result is obtained by performing an AND operation on the compression code string set in the register and the mask pattern “0x3FFC0000”.
- Since the character code “0x625F” is stored in the structure of the leaf L105, the character code “0x625F” is extracted and stored in the decompression buffer.
- In the file decompression example (G1), the character code is stored in the decompression buffer as it is, but in the file decompression example (G2), since the collation flag is ON, the character code “0x625F” is enclosed in <B></B> tags and stored.
- the segment generation unit 6207 adds the target file F (n + 1) to be added and updates the compression code map Ms without expanding the compressed compression code map Ms.
- FIG. 66 is an explanatory diagram showing a specific example of the file addition process.
- A case where the target file F (n + 1) is added will be described as an example.
- the compressed file f3 is expanded from the compressed file group fs and the expanded target file F3 is written on the main memory (for example, the RAM 903) in the file expansion example of FIG.
- the file is changed to the target file F (n + 1) and a new save instruction is given.
- the newly assigned file number n + 1 is assigned to the target file F (n + 1) on the main memory. That is, since the segment sg0 (1) has no free space, the segment sg0 (2) is set and is associated with the segment sg0 (1).
- the target file F (n + 1) is compressed with the 2^N-branch nodeless Huffman tree H to form a compressed file f (n + 1), which is stored in the segment sg0 (2).
- the presence / absence of character information can be detected by counting the character information of the target file F (n + 1) on the main memory by the counting unit 1101. Therefore, the newly assigned bit of the file number n + 1 is added to the appearance map of each character information (default is OFF), and the bit where the character information appears is turned ON. Further, the bit of the file number n + 1 is added to the deletion map D (default is ON).
- The master server MS sends the compressed file f (m × n + 1) of the target file F (m × n + 1) to the slave server S2 that is the allocation destination. It is assumed that the allocation-destination slave server is determined in advance.
- The slave server S2 generates the segment sg0 (m + 1) as the segment following the segment sg0 (m) of the slave server S1, and stores the compressed file f (m × n + 1) received from the master server MS.
- FIG. 67 is a flowchart showing a detailed processing procedure of the segment addition processing.
- the computer 1100 waits for designation of file addition by the designation unit 6201 (step S6701: No).
- the computer 1100 specifies the storage target segment sg0 (K) (step S6702). Specifically, a segment having the same number as the quotient obtained by dividing the number of files i by the number of files n per segment is set as a storage target segment sg0 (K).
- the computer 1100 increments the number of files i (step S6703), and determines whether i> Kn is satisfied (step S6704). If i> Kn is not satisfied (step S6704: NO), since the compressed file can still be stored in the current segment sg0 (K), the computer 1100 executes map update processing using the additional file (step S6709). Details of the map update processing using the additional file (step S6709) will be described later.
- the computer 1100 compresses the additional file with the 2^N-branch nodeless Huffman tree H (step S6710), and stores the compressed additional file in the storage target segment sg0 (K) (step S6711). Then, the computer 1100 associates a pointer to the compressed additional file with the management area AK of the storage target segment sg0 (K) (step S6712). Thereby, the segment addition process is terminated.
- If i > Kn in step S6704 (step S6704: Yes), the compressed additional file cannot be stored in the current segment sg0 (K), so the computer 1100 compresses the compression code map of the current segment sg0 (K) with the Huffman tree (step S6705). Then, the computer 1100 secures a new segment area (step S6706) and increments the segment number K (step S6707). Thereafter, the computer 1100 executes a pointer linking process between the incremented segment sg0 (K) and the preceding segment (step S6708). As a result, as shown in FIG. 1, the new segment is associated with the preceding segment. Thereafter, the process proceeds to step S6709.
- FIG. 68 is a flowchart (first half) showing a detailed processing procedure of the map update processing (step S6709) using the additional file shown in FIG. 67.
- the computer 1100 sets the bit of the file number of the additional file in the compression code map Ms and the deletion map Ds (step S6801). Specifically, for the appearance map, the OFF bit is set for the file number of the additional file, and for the deletion map D, the ON bit is set for the file number of the additional file.
- the computer 1100 sets the first character in the additional file as the target character (step S6802), and executes the longest match search process for the target character (step S6803).
- the longest match search process is the same process as the process shown in FIG.
- The computer 1100 determines whether or not the longest matching basic word is in the basic word structure 2300 (step S6804). If not (step S6804: NO), the process proceeds to step S6901 in FIG. 69. On the other hand, if there is one (step S6804: Yes), the computer 1100 identifies the compression code of the longest matching basic word from the 2^N-branch nodeless Huffman tree H, and designates the appearance map of the longest matching basic word by that compression code (step S6805). Then, the computer 1100 turns on the bit corresponding to the file number of the additional file in the designated appearance map (step S6806). Thereafter, the process proceeds to step S6901 in FIG. 69.
- FIG. 69 is a flowchart (second half) showing a detailed processing procedure of the map update processing (step S6709) using the additional file shown in FIG. 67.
- the computer 1100 determines whether or not the target character is a specific single character (step S6901). Specifically, for example, the computer 1100 determines whether or not the target character has hit with a structure of a specific single character.
- When the target character is a specific single character (step S6901: Yes), the computer 1100 identifies the compression code of the hit specific single character from the 2^N-branch nodeless Huffman tree H, and designates the appearance map of the hit specific single character by that compression code (step S6902). Then, the computer 1100 turns on the bit corresponding to the file number of the additional file in the designated appearance map (step S6903). Thereafter, the process proceeds to step S6909.
- When the target character is not a specific single character (step S6901: No), the computer 1100 divides the target character into an upper divided character code and a lower divided character code (step S6904). Then, the computer 1100 identifies the compression code of the upper divided character code hit in the divided character code structure from the 2^N-branch nodeless Huffman tree H, and designates the appearance map of the hit upper divided character code by that compression code (step S6905). Then, the computer 1100 turns on the bit corresponding to the file number of the additional file in the designated appearance map (step S6906).
- Similarly, the computer 1100 identifies the compression code of the lower divided character code hit in the divided character code structure from the 2^N-branch nodeless Huffman tree H, and designates the appearance map of the hit lower divided character code by that compression code (step S6907).
- the computer 1100 turns on the bit corresponding to the file number of the additional file in the designated appearance map (step S6908). Thereafter, the process proceeds to step S6909.
- In step S6909, the computer 1100 executes a bi-gram character string specifying process.
- the bi-gram character string specifying process (step S6909) is the same as the process shown in FIG.
- the computer 1100 concatenates the compression code of the first gram character (for example, “人”) and the compression code of the last gram character (for example, “形”) in the 2-gram character string (for example, “人形”) (step S6910).
- the computer 1100 designates the appearance map of the bi-gram character string using the concatenated compression code (step S6911).
- the computer 1100 turns on the bit corresponding to the file number of the additional file in the designated appearance map (step S6912), and ends the series of processes.
- the segment hierarchization process is a process of aggregating the index information group of the lower layer segment group into the index information of the upper layer.
- the segment generation processing is executed by the segment generation unit 6207.
- FIG. 70 is a flowchart showing a detailed processing procedure of segment stratification processing.
- the computer 1100 waits for segment aggregation designation by the designation unit (step S7001: No).
- The computer 1100 determines whether or not there is an unselected compression code, which serves as a pointer for designating an appearance map (step S7002).
- If there is one (step S7002: Yes), the computer 1100 selects one unselected compression code (step S7003), and executes a selected appearance map aggregation process (step S7004). Details of the selected appearance map aggregation process (step S7004) will be described later.
- After the selected appearance map aggregation process (step S7004), the process returns to step S7002. If there is no unselected compression code (step S7002: No), the computer 1100 executes a deletion map aggregation process (step S7005). Details of the deletion map aggregation processing (step S7005) will be described later. Thereby, the segment hierarchization process is completed.
- FIG. 71 is a flowchart showing a detailed processing procedure of the selected appearance map aggregation processing (step S7004) shown in FIG. 70.
- f is a target hierarchy number
- h is a hierarchy number of an upper hierarchy of the target hierarchy.
- j is a segment number.
- the computer 1100 determines whether or not there is a free area for compressed files in the segments sgf (j) to sgf (j + m − 1) of the f-th layer that is the target layer (step S7102).
- m is the number of segments that can be aggregated.
- If there is no free area (step S7102: No), compressed files are stored to capacity in each of the segments sgf (j) to sgf (j + m − 1) of the f-th layer, so the computer 1100 determines whether or not these segments have already been aggregated into a segment sgh (j) of the upper h-th layer (step S7103). Specifically, for example, the computer 1100 determines whether or not the h-th layer segment sgh (j) exists.
- If they have not been aggregated (step S7103: NO), the computer 1100 sets the upper segment sgh (j) and secures an index area for the selected compression code in the upper segment sgh (j) (step S7104).
- a = j is set (step S7105).
- a is a variable that identifies the target segment sgh (j).
- The computer 1100 determines whether or not the bit string that is the index information of the extracted segment sgf (a) is all 0, that is, whether the character information for the selected compression code is absent from every compressed file in the segment sgf (a) (step S7108).
- If the bit string is all 0 (step S7108: Yes), the computer 1100 writes “0” as the aggregation result in the upper segment sgh (a) (step S7109), and proceeds to step S7111.
- Otherwise (step S7108: No), “1” is written in the upper segment sgh (a) as the aggregation result (step S7110), and the process proceeds to step S7111.
- In step S7111, a is incremented, and the process returns to step S7106.
- FIG. 72 is a flowchart showing a detailed processing procedure of the deletion map aggregation processing (step S7005) shown in FIG. 70.
- f is a target hierarchy number
- h is a hierarchy number of an upper hierarchy of the target hierarchy
- j is a segment number.
- the computer 1100 determines whether or not there is a free area for compressed files in the segments sgf (j) to sgf (j + m − 1) of the f-th layer that is the target layer (step S7202).
- m is the number of segments that can be aggregated.
- If there is no free area (step S7202: No), compressed files are stored to capacity in each of the segments sgf (j) to sgf (j + m − 1) of the f-th layer, so the computer 1100 determines whether or not these segments have already been aggregated into a segment sgh (j) of the upper h-th layer (step S7203). Specifically, for example, the computer 1100 determines whether or not the h-th layer segment sgh (j) exists.
- a is a variable that identifies the target segment sgh (j).
- the computer 1100 determines whether the bit string that is the index information of the extracted segment sgf (a) is all 0, that is, whether the compressed file group in the segment sgf (a) has been deleted (step S7208).
- If the bit string is all 0 (step S7208: Yes), the computer 1100 writes “0” as the aggregation result in the upper segment sgh (a) (step S7209), and proceeds to step S7211.
- Otherwise (step S7208: No), the computer 1100 writes “1” as the aggregation result in the upper segment sgh (a) (step S7210), and proceeds to step S7211.
- In step S7211, a is incremented, and the process returns to step S7206.
- segment hierarchization is realized as shown in FIGS. Therefore, it is possible to construct a hierarchical structure of archive files as shown in FIG.
- <Search processing procedure> Next, a search processing procedure according to this embodiment will be described. Specifically, for example, the processing procedure for the file decompression example shown in FIG. 63 is performed.
- FIG. 73 is a flowchart showing a search processing procedure according to the present embodiment.
- the computer 1100 waits for input of a search character string (step S7301: No).
- When a search character string is input (step S7301: Yes), the computer 1100 executes the pointer specifying process (step S7302), the file narrowing process (step S7303), and the decompression process (step S7304).
- In the file narrowing process (step S7303), the compressed files fi of the target files Fi containing the character information constituting the search character string are narrowed down from the hierarchical structure segment group. Details of the file narrowing process (step S7303) will be described with reference to FIG. 76.
- The decompression process (step S7304) collates the compression code string being decompressed with the compressed character string of the search character string while decompressing the compressed files fi narrowed down by the file narrowing process (step S7303). Details of the decompression process (step S7304) will be described with reference to FIGS. 77 and 78.
- FIG. 74 is a flowchart (part 1) showing a detailed processing procedure of the pointer specifying process (step S7302) shown in FIG. 73.
- the computer 1100 sets a search character string as a target character string (step S7401), and executes a longest match search process (step S7402).
- the longest match search process (step S7402) is the same process as the longest match search process (step S2901) shown in FIG. 29.
- The computer 1100 performs a binary search of the basic word structure for the longest match search result obtained in the longest match search process (step S7402) (step S7403).
- If the longest match search result is found in the basic word structure (step S7403: Yes), the compression code of the found basic word is acquired from the 2^N-branch nodeless Huffman tree H and stored in the search buffer (step S7404).
- The computer 1100 then determines whether or not there is a subsequent character string (step S7405). If there is one (step S7405: Yes), the computer 1100 sets the subsequent character string as a target character string (step S7406), and returns to the longest match search process (step S7402). On the other hand, when there is none (step S7405: No), the pointer specifying process (step S7302) ends, and the process proceeds to the file narrowing process (step S7303).
- If the longest match search result is not found in the basic word structure in step S7403 (step S7403: No), the process proceeds to step S7501 in FIG. 75. Specifically, this applies when the longest match search result is not registered in the basic word structure, or when the longest match search produced no longest match candidate.
- FIG. 75 is a flowchart (part 2) showing a detailed processing procedure of the pointer specifying process (step S7302) shown in FIG. 73.
- the computer 1100 sets the first character of the target character string as the target character (step S7501).
- The computer 1100 performs a binary search of the specific single character structure for the target character (step S7502). If the target character is found (step S7503: YES), the computer 1100 acquires the compression code of the specific single character from the 2^N-branch nodeless Huffman tree H and stores it in the search buffer (step S7504).
- If the target character is not found in step S7503 (step S7503: No), the computer 1100 divides the target character into upper 8 bits and lower 8 bits (step S7505). Then, the computer 1100 acquires the compression code of the upper divided character code from the 2^N-branch nodeless Huffman tree H and stores it in the search buffer (step S7506).
- the computer 1100 acquires the compression code of the lower divided character code from the 2^N-branch nodeless Huffman tree H and stores it in the search buffer (step S7507). Further, the computer 1100 accesses the leaves of the 2^N-branch nodeless Huffman tree H for the target character and the divided character codes divided in step S7505, and turns on the collation flag (step S7508). Thereafter, the computer 1100 executes the bi-gram character string specifying process (step S7509).
- the bi-gram character string specifying process (step S7509) is the same process as the bi-gram character string specifying process (step S3406) shown in FIG.
- step S7509 step S7510: No
- In the case of step S7510: Yes, the computer 1100 acquires the compression code of the bi-gram character string from the 2^N-branch nodeless Huffman tree H and saves it in the search buffer (step S7510).
- In step S7511, the computer 1100 obtains and concatenates the compression code of the first gram and the compression code of the second gram by accessing the 2^N-branch nodeless Huffman tree H, and acquires, from the compression code map M of the 2-gram character string, the appearance map designated by the concatenated compression code. Then, the process returns to step S7405 of FIG. 74.
- FIG. 76 is a flowchart showing a detailed processing procedure of the file narrowing process (step S7303) shown in FIG. 73.
- the computer 1100 determines whether or not the segment sgH (j) exists (step S7603).
- the computer 1100 designates an appearance map and a deletion map for each compression code in the search buffer (step S7604).
- the computer 1100 extracts index information of the target segment sgH (j) from the specified appearance map and deletion map (step S7605).
- In the case of step S7607: Yes, each position of “1” in the AND result is a file number, so the compressed file with that number is stored in the search buffer (step S7609), and the process proceeds to step S7610.
- In step S7610, the segment number j is incremented, and the process returns to step S7603. At this time, since the subsequent segment is associated by a pointer, the subsequent segment can be specified by incrementing the segment number.
- If the target segment sgH (j) does not exist in step S7603 (step S7603: No), the computer 1100 decrements the hierarchy number h (step S7611) and determines whether h < 0 (step S7612). If h < 0 is not satisfied (step S7612: NO), the process returns to step S7602. On the other hand, if h < 0 (step S7612: Yes), since the compressed files to be decompressed have been identified in step S7609, the process proceeds to the decompression process (step S7304).
- FIG. 77 is a flowchart (part 1) of a detailed process procedure example of the decompression process (step S7304) using the 2^N-branch nodeless Huffman tree H depicted in FIG. 73.
- the computer 1100 sets the compression code string from the position of the byte offset byos in the register r1 (step S7704).
- the computer 1100 shifts the mask pattern set in the register r2 by the bit offset bios toward the end (step S7705), and performs an AND operation with the compression code string set in the register r1 (step S7706). Thereafter, the computer 1100 calculates the register shift number rs (step S7707), and shifts the register r2 after the AND operation to the end by the register shift number rs (step S7708).
- FIG. 78 is a flowchart (part 2) of a detailed process procedure example of the decompression process (step S7304) using the 2^N-branch nodeless Huffman tree H depicted in FIG. 73.
- the computer 1100 extracts the last N bits from the shifted register r2 as a target bit string (step S7801).
- The computer 1100 specifies a pointer to the leaf L# from the root structure of the 2^N-branch nodeless Huffman tree H (step S7802), and accesses the structure of the pointed-to leaf L# in one pass (step S7803). Thereafter, the computer 1100 determines whether or not the collation flag of the structure of the access destination leaf L# is ON (step S7804).
- If the collation flag is ON (step S7804: YES), the computer 1100 writes the replacement character for the character information in the structure of the access destination leaf L# into the decompression buffer (step S7805), and proceeds to step S7807. On the other hand, if the collation flag is OFF (step S7804: NO), the computer 1100 writes the character information (decompressed character) in the structure of the access destination leaf L# into the decompression buffer (step S7806), and proceeds to step S7807.
- In step S7807, the computer 1100 extracts the compression code length leg from the structure of the access destination leaf L#, and then updates the bit address abi (step S7808). Thereafter, the computer 1100 determines whether or not there is a compression code string in the memory, specifically, whether or not there is a compression code string that has not yet been subjected to mask processing using the mask pattern (step S7809). For example, the determination is made based on whether there is a byte position corresponding to the byte offset byos. If there is a compression code string (step S7809: YES), the process returns to step S7702 in FIG. 77. On the other hand, if there is no compression code string (step S7809: NO), the decompression process (step S7304) is terminated.
- With the decompression process (step S7304), collation and decompression can be performed in the compressed state, and the decompression speed can be increased.
- Since the search target file group is divided into a plurality of segments and stored, the search can be performed using index information in segment units. Therefore, even if the size of the index information increases as the number of search target files increases, an increase in search processing time can be suppressed.
- The hierarchical structure segment group SG is constructed by aggregating segment groups of the same hierarchy to generate upper layer segments. Therefore, by narrowing down sequentially from the index information of the segments of the highest hierarchy of the hierarchical structure segment group SG, segments in which the search character string does not exist, and the compressed files under those segments, can be excluded from the narrowing targets. Since unnecessary narrowing is thus avoided, the search speed can be improved.
- The master server narrows down to the first layer, and transmits each narrowed-down file number to the slave server that holds that file. Slave servers that receive no file number need not execute the narrowing process, so wasted searches are suppressed and search efficiency is improved.
- the master server MS decompresses the compressed file fi
- the master server MS transmits the 2^N-branch nodeless Huffman tree H to each slave server in advance.
- decompression processing of the compressed file fi may be executed in each slave server.
- the slave server that has received the file number i from the master server MS decompresses the compressed file fi of the file number i and returns the decompressed target file Fi to the master server MS.
- load concentration on the master server MS can be suppressed and load distribution can be achieved.
- the narrowing target may be an uncompressed target file Fi.
- the above-described compression processing and decompression processing are not necessary.
- the pointer to the appearance map may be a pointer that specifies character information, not a compression code.
- bit string in the compression area of the compression code map Ms is arranged in advance in descending order of the file number p of the target file group Fs from the head position to the tail position.
- Since all bits of the compression code map Ms corresponding to those file numbers then fall within the compression area, the compression code map Ms is compressed with the Huffman tree h. Memory can thereby be saved.
- Since compression is performed in units of a predetermined number of files n (for example, 256), the calculation load can be reduced and memory saved at the same time.
- the information processing method described in this embodiment can be realized by executing a program prepared in advance on a computer 1100 such as a personal computer or a workstation.
- the information processing program is recorded on a recording medium readable by the computer 1100 such as a hard disk, a flexible disk, a CD-ROM, an MO, and a DVD, and is executed by being read from the recording medium by the computer 1100.
- the information processing program may be distributed through a network such as the Internet.
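The segment hierarchization and file narrowing steps described above can be illustrated with a minimal two-layer sketch. It assumes that each appearance map and the deletion map are held as per-segment lists of 0/1 bits, that a deletion-map bit of 1 means the file still exists, and that file numbers are 1-based; the names `aggregate` and `narrow` are hypothetical and do not come from the specification.

```python
def aggregate(segment_maps):
    """One upper-layer bit per lower segment: 0 if its bit string is all 0, else 1."""
    return [1 if any(bits) else 0 for bits in segment_maps]


def narrow(appearance, deletion, chars, n):
    """Return file numbers whose bits are 1 in every appearance map and in the deletion map.

    appearance[ch][j][k] is the bit of character ch for file k of segment j.
    deletion[j][k] is the deletion-map bit for the same file.
    n is the number of files per segment.
    """
    upper = {ch: aggregate(appearance[ch]) for ch in chars}
    hits = []
    for j in range(len(deletion)):
        # Skip a whole segment when the upper layer shows a character is absent.
        if not all(upper[ch][j] for ch in chars):
            continue
        for k in range(len(deletion[j])):
            if deletion[j][k] and all(appearance[ch][j][k] for ch in chars):
                hits.append(j * n + k + 1)
    return hits


appearance = {
    "人": [[1, 0, 0, 1], [0, 0, 0, 0]],
    "形": [[1, 1, 0, 0], [0, 1, 0, 0]],
}
deletion = [[1, 1, 1, 1], [1, 1, 1, 1]]
print(narrow(appearance, deletion, ["人", "形"], 4))  # [1]
```

The second segment is never scanned file by file, because the aggregated bit for “人” is 0 there; this is the saving that the hierarchy is meant to provide.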
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Library & Information Science (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
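FIG. 9 is a block diagram showing an example hardware configuration of the computer according to the embodiment. In FIG. 9, the computer includes a CPU (Central Processing Unit) 901, a ROM (Read Only Memory) 902, a RAM (Random Access Memory) 903, a magnetic disk drive 904, a magnetic disk 905, an optical disk drive 906, an optical disk 907, a display 908, an I/F (Interface) 909, a keyboard 910, a mouse 911, a scanner 912, and a printer 913. The components are connected to one another by a bus 900.
FIG. 11 is a block diagram showing functional configuration example 1 of the computer or computer system according to the present embodiment, and FIG. 12 is an explanatory diagram showing the flow of processing from the counting unit to the second compression unit of the computer or computer system shown in FIG. 11. In FIG. 11, the computer or computer system (hereinafter, "computer 1100") includes a counting unit 1101, a first generation unit 1102, a first compression unit 1103, a creation unit 1104, a second generation unit 1105, and a second compression unit 1106.
Next, the counting by the counting unit 1101 and the creation of the compression code map Ms by the creation unit 1104 will be described in detail. Before the compression code map Ms is created, the counting unit 1101 must count the number of appearances of character information in the target file group Fs, and the first generation unit 1102 must generate the 2^N-branch nodeless Huffman tree H.
First, the computer 1100 counts the number of appearances of the character information present in the target file group Fs. The counting results are sorted in descending order of the number of appearances and ranked in ascending order starting from the largest number of appearances. Here, the total number of kinds of character information is assumed to be 1305 (< 2048 (= 2^11)) as an example. (1) Details of counting the number of appearances will be described with reference to FIG. 7.
Next, based on the counting results obtained in (1), the computer 1100 calculates the compression code length of each piece of character information. Specifically, the computer 1100 calculates the appearance rate of each piece of character information. The appearance rate is obtained by dividing the number of appearances of the character information by the total number of appearances of all character information. The computer 1100 then obtains the occurrence probability corresponding to the appearance rate and derives the compression code length from the occurrence probability.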
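1/2^0 > AR ≥ 1/2^1 ... the compression code length is 1 bit.
1/2^1 > AR ≥ 1/2^2 ... the compression code length is 2 bits.
1/2^2 > AR ≥ 1/2^3 ... the compression code length is 3 bits.
1/2^3 > AR ≥ 1/2^4 ... the compression code length is 4 bits.
...
1/2^(N-1) > AR ≥ 1/2^N ... the compression code length is N bits.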
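The range table above amounts to choosing the smallest k for which AR ≥ 1/2^k. A minimal sketch follows, assuming AR is given as an appearance count over a total count so that the comparison can be done in integers; the function name `code_length` is hypothetical, and the table does not cover AR = 1, for which this sketch simply returns 1.

```python
def code_length(count: int, total: int) -> int:
    """Return the smallest k >= 1 such that count / total >= 1 / 2**k."""
    if not 0 < count <= total:
        raise ValueError("count must satisfy 0 < count <= total")
    k = 1
    # count / total >= 1 / 2**k  is equivalent to  count * 2**k >= total
    while count * (1 << k) < total:
        k += 1
    return k


print(code_length(1, 2))     # AR = 1/2  falls in 1/2^0 > AR >= 1/2^1
print(code_length(3, 8))     # AR = 3/8  falls in 1/2^1 > AR >= 1/2^2
print(code_length(1, 1000))  # AR = 1/1000 falls in 1/2^9 > AR >= 1/2^10
```

Integer arithmetic avoids floating-point rounding at the range boundaries, where AR is exactly a power of 1/2.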
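Next, the computer 1100 identifies the number of leaves for each compression code length by counting the leaves per compression code length. Here, the maximum compression code length is 17 bits. The number of leaves is the number of kinds of character information. Therefore, when the number of leaves with a compression code length of 5 bits is 2, two pieces of character information are assigned 5-bit compression codes.
Next, the computer 1100 corrects the number of leaves. Specifically, the computer 1100 makes a correction so that the exponent N of the upper limit 2^N of the number of branches becomes the maximum compression code length. For example, when the exponent N = 11, the sum of the numbers of leaves with compression code lengths of 11 to 17 bits becomes the number of leaves with the corrected compression code length of 11 bits. The computer 1100 then assigns the number of branches per leaf for each compression code length. Specifically, the numbers of branches per leaf are determined as 2^0, 2^1, 2^2, 2^3, 2^4, 2^5, 2^6 for the corrected compression code lengths taken in descending order.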
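The leaf-count correction and the branch assignment can be sketched as follows. This is only an illustration of the counting step: it merges every length above N into N and computes 2^(N − length) branches per leaf, and it does not attempt the accompanying correction of occurrence probabilities. The function names and the sample counts are hypothetical.

```python
def correct_leaf_counts(leaf_counts, n):
    """Merge the leaf counts of all code lengths above n into length n."""
    corrected = {}
    for length, leaves in leaf_counts.items():
        clamped = min(length, n)
        corrected[clamped] = corrected.get(clamped, 0) + leaves
    return corrected


def branches_per_leaf(length, n):
    """A leaf with a code of `length` bits is reached by 2**(n - length) branches."""
    return 1 << (n - length)


counts = {5: 2, 11: 10, 12: 20, 17: 5}      # made-up leaf counts per code length
print(correct_leaf_counts(counts, 11))      # {5: 2, 11: 35}
print([branches_per_leaf(length, 11) for length in range(11, 4, -1)])
# [1, 2, 4, 8, 16, 32, 64]
```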
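Next, the computer 1100 generates leaf structures. A leaf structure is a data structure in which character information, its compression code length, and the compression code of that length are associated with one another. For example, the compression code length of the character "0", whose appearance rank is first, is 6 bits, and its compression code is "000000". In the example of FIG. 13, since the number of kinds of character information (the number of leaves) is 1305, the structures of leaves L1 to L1305 are generated. Details of (3) leaf count identification through (5) leaf structure generation (N = 11) will be described with reference to FIG. 16.
Next, the computer 1100 generates pointers to the leaf for each leaf structure. A pointer to a leaf is a bit string obtained by concatenating, to the compression code in the pointed-to leaf structure, a bit string corresponding to one of the numbers up to the number of branches per leaf. For example, since the compression code length of the compression code "000000" assigned to the character "0", which is leaf L1, is 6 bits, the number of branches per leaf L1 is 32.
Finally, the computer 1100 constructs the 2^N-branch nodeless Huffman tree H. Specifically, by making the pointers to the leaves the root, a 2^N-branch nodeless Huffman tree H that directly designates the leaf structures is constructed. When the compression code string is an 11-bit bit string whose first 6 bits are "000000", the 2^N-branch nodeless Huffman tree H can point to the structure of leaf L1 for the character "0" regardless of which of the 32 kinds of bit strings the subsequent 5 bits are. Details of (7) construction of the 2^N-branch nodeless Huffman tree H will be described with reference to FIG. 19.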
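One way to picture the nodeless tree is as a flat table of 2^N entries in which every N-bit pattern that begins with a leaf's compression code refers to that leaf. The sketch below is an assumption about the layout, not the structure defined in the specification; `Leaf` and `build_table` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    char: str
    code: str  # compression code as a bit string, e.g. "000000"


def build_table(leaves, n):
    """Return a list of 2**n entries; entry i is the leaf whose code is a prefix of i."""
    table = [None] * (1 << n)
    for leaf in leaves:
        spare = n - len(leaf.code)              # bits that follow the code
        base = int(leaf.code, 2) << spare
        for suffix in range(1 << spare):        # 2**spare branches per leaf
            table[base | suffix] = leaf
    return table


leaf_l1 = Leaf("0", "000000")
table = build_table([leaf_l1], 11)
print(sum(1 for entry in table if entry is leaf_l1))  # 32 branches point to leaf L1
print(table[0b00000010101].char)                      # any 5-bit tail reaches "0"
```

Because the lookup is a single index operation on the next N bits, no internal nodes are traversed, which is what allows a leaf to be reached in one access.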
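When the first generation unit 1102 has generated the 2^N-branch nodeless Huffman tree H, the creation unit 1104 creates a compression code map Ms of single characters, a compression code map Ms of upper divided character codes, a compression code map Ms of lower divided character codes, a compression code map Ms of words, and a compression code map Ms of 2-gram character strings. Detailed creation examples of the compression code maps Ms of single characters, upper divided character codes, lower divided character codes, and 2-gram character strings are described below. The compression code map Ms of basic words is created in the same way as the compression code map Ms of single characters, so its description is omitted.
Next, an example of the compression code map creation processing procedure by the creation unit 1104 will be described.
FIG. 26 is a flowchart showing a detailed processing procedure example of the counting process (step S2501) shown in FIG. 25. First, the computer 1100 sets the file number i to i = 1 (step S2601) and reads the target file Fi (step S2602). The computer 1100 then executes the counting process for the target file Fi (step S2603). Details of the counting process for the target file Fi (step S2603) will be described with reference to FIG. 27. Thereafter, the computer 1100 determines whether the file number i satisfies i > n (n is the total number of target files F1 to Fn) (step S2604).
FIG. 27 is a flowchart showing a detailed processing procedure example of the counting process for the target file Fi (step S2603) shown in FIG. 26. First, the computer 1100 sets the first character of the target file Fi as the target character (step S2701) and executes the basic word counting process (step S2702). Details of the basic word counting process (step S2702) will be described with reference to FIG. 29. Thereafter, the computer 1100 increments by 1 the number of appearances of the target character in the character appearance frequency count table (step S2703).
FIG. 29 is a flowchart showing a detailed processing procedure example of the basic word counting process (step S2702) shown in FIG. 27. First, the computer 1100 executes the longest match search process (step S2901) and determines whether there was a longest-matching basic word (step S2902). Details of the longest match search process (step S2901) will be described with reference to FIG. 31. If there was a longest-matching basic word (step S2902: Yes), the computer 1100 increments by 1 the number of appearances of that basic word in the basic word appearance frequency count table (step S2903) and proceeds to step S2703.
FIG. 31 is a flowchart showing a detailed processing procedure of the longest match search process (step S2901) shown in FIG. 29. First, the computer 1100 sets c = 1 (step S3101). c is the number of characters counted from the target character (including the target character). When c = 1, only the target character is included. Next, the computer 1100 performs a binary search for basic words that prefix-match the target character string from the target character to the c-th character (step S3102). The computer 1100 then determines whether the search found a basic word (step S3103). If the binary search hit no basic word (step S3103: No), the process proceeds to step S3106.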
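The idea of extending the target string one character at a time and binary-searching for prefix matches can be sketched as below. The steps after S3103 are not reproduced in this passage, so the loop's continuation (remember the longest exact hit, stop when no basic word shares the prefix) is an assumption, as are the name `longest_match` and the sample word list.

```python
from bisect import bisect_left


def longest_match(words, text, pos):
    """Return the longest word in sorted `words` that starts at text[pos], or None."""
    best = None
    c = 1  # number of characters taken from the target character
    while pos + c <= len(text):
        prefix = text[pos:pos + c]
        i = bisect_left(words, prefix)          # first word >= prefix
        if i == len(words) or not words[i].startswith(prefix):
            break                               # no basic word shares this prefix
        if words[i] == prefix:
            best = prefix                       # exact hit; keep the longest one
        c += 1
    return best


basic_words = sorted(["in", "inform", "information", "into"])
print(longest_match(basic_words, "informal", 0))  # inform
print(longest_match(basic_words, "xyz", 0))       # None
```

In a sorted list, all words sharing a prefix form one contiguous run that starts at the `bisect_left` position, so a single probe per length is enough to tell whether any candidate remains.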
図32は、図25に示したマップ割当数決定処理(ステップS2502)の詳細な処理手順例を示すフローチャートである。まず、コンピュータ1100は、集計処理(ステップS2501)による基礎単語ごとの出現頻度を示す基礎単語出現頻度集計テーブル3000と単一文字ごとの出現頻度を示す文字出現頻度集計テーブル2800を出現頻度の高い順にソートする(ステップS3201)。そして、コンピュータ1100は、ソート後の基礎単語出現頻度集計テーブル3000を参照して、基礎単語の出現順位RwをRw=1とし(ステップS3202)、出現順位Rwまでの累積出現回数Arwを計数する(ステップS3203)。そして、コンピュータ1100は、下記式(1)を満たすか否かを判断する(ステップS3204)。
Awは集計された基礎単語の総出現回数である。
Acは集計された単一文字の総出現回数である。
図33は、図25に示した再集計処理(ステップS2503)の詳細な処理手順例を示すフローチャートである。まず、コンピュータ1100は、ファイル番号iをi=1に設定し(ステップS3301)、対象ファイルFiを読み込む(ステップS3302)。そして、コンピュータ1100は、対象ファイルFiの再集計処理を実行する(ステップS3303)。対象ファイルFiの再集計処理(ステップS3303)の詳細については、図33で説明する。このあと、コンピュータ1100は、ファイル番号iがi>n(nは対象ファイルF1~Fnの総数)であるか否かを判断する(ステップS3304)。
図34は、対象ファイルFiの再集計処理(ステップS3303)の詳細な処理手順例を示すフローチャートである。まず、コンピュータ1100は、対象文字を対象ファイルFiの先頭文字とし(ステップS3401)、対象文字が特定単一文字であるか否かを判断する(ステップS3402)。特定単一文字である場合(ステップS3402:Yes)、分割せずにステップS3404に移行する。
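FIG. 39 is a flowchart showing a detailed example of the Huffman tree generation process (step S2504) shown in FIG. 25. In FIG. 39, the computer 1100 determines the upper limit length N of the compression code length (step S3901). Next, the computer 1100 executes the correction process (step S3902). As described with FIGS. 15 to 17, the correction process corrects the occurrence probability and the compression code length of each piece of character information by using the upper limit length N of the compression code length.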
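FIG. 43 is a flowchart showing a detailed example of the map creation process (step S2505) shown in FIG. 25. First, the computer 1100 sets the file number i to i=1 (step S4301) and reads the target file Fi (step S4302). The computer 1100 then executes the map creation process for the target file Fi (step S4303); the details of this process are described with FIG. 44. After that, the computer 1100 determines whether the file number i satisfies i>α, where α is the total number of files in the target file group Fs (step S4304).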
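Next, a specific example of the compression process for the target file Fi is described. As stated above, once the compression code map Ms has been generated, the compression code string obtained by compressing a search character string can be used to point to the corresponding appearance map in the compression code map Ms.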
Next, the procedure of the compression process that the first compressing unit 1103 performs on the target file group Fs is described.
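Next, a specific example of how the second compressing unit 1106 compresses each appearance map in the compression code map Ms is described. The second compressing unit 1106 compresses the part of an appearance map that lies in the compression region and leaves the part in the non-compression region uncompressed. When file numbers 1 to α have been assigned, the compression region is the bit string of the appearance map up to file number n×(quotient of α/n). For example, when n=256 bits and the current number of target files is α=600, the quotient of α/n is 2, so the bit string of the appearance map for file numbers 1 to 2n (1 to 512) is the compression region. The bit string for file numbers (2n+1) to α (513 to 600) is the non-compression region and is not compressed.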
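The boundary between the two regions follows directly from the integer quotient of α/n. The short sketch below shows the arithmetic; the function name `split_appearance_map` and the representation of an appearance map as a Python list of bits are assumptions made for illustration.

```python
def split_appearance_map(bits, n=256):
    """Split an appearance map into (compression region, non-compression region).

    bits[k] is the bit for file number k + 1; n is the block size in bits.
    """
    alpha = len(bits)            # current number of target files
    boundary = n * (alpha // n)  # last file number in the compression region
    return bits[:boundary], bits[boundary:]


compressed, uncompressed = split_appearance_map([0] * 600)
print(len(compressed), len(uncompressed))
```

With α=600 and n=256 the boundary is 512, so the first 512 bits fall in the compression region and the remaining 88 bits stay uncompressed.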
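Next, the compression code map compression process is described. This process compresses the bit strings in the compression region. Specifically, it compresses the bit strings in the compression region of the compression code map Ms by using the compression pattern table 5600 shown in FIG. 56 and the compression patterns 5700 to 6000 (Huffman trees h) shown in FIGS. 57 to 60. The procedure of the compression code map compression process is described below.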
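FIG. 62 is a block diagram showing functional configuration example 2 of the computer or computer system according to the present embodiment. In FIG. 62, the computer 1100 includes a specifying unit 6201, a first decompressing unit 6202, the first compressing unit 1103, an input unit 6203, an extracting unit 6204, a second decompressing unit 6205, an identifying unit 6206, and a segment generating unit 6207. Specifically, the functions of the specifying unit 6201 through the segment generating unit 6207 are implemented, for example, by causing the CPU 901 to execute programs stored in a storage device such as the ROM 902, the RAM 903, or the magnetic disk 905 shown in FIG. 9. Each of these units writes its execution results to the storage device and reads the execution results of the other units to perform its own computation. The specifying unit 6201 through the segment generating unit 6207 are briefly described below.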
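FIG. 63 is an explanatory diagram showing a file decompression example. The processing in this example is executed by the input unit 6203, the extracting unit 6204, the second decompressing unit 6205, the identifying unit 6206, and the first decompressing unit 6202. (G1) First, when the input unit 6203 receives the search character string "人形", a binary search of the structure 2100 of specific single characters is performed for each of the characters "人" and "形" that make up the search character string, and the specific single characters "人" and "形" are found. The structure 2100 of specific single characters is associated with pointers to the leaves (specific single characters) of the 2^N-branch nodeless Huffman tree H. A hit in the structure of specific single characters therefore makes it possible to specify the corresponding leaf of the 2^N-branch nodeless Huffman tree H directly.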
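Next, a specific example of the decompression process in FIG. 63 is described. The example decompresses the compressed file fi while collating it against the compression code string of the search character string "人形". In this example, the compression code of the specific single character "人" is "1100010011" (10 bits) and the compression code of the specific single character "形" is "0100010010" (10 bits).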
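Next, a specific example of the file addition process is described. Here, the segment generating unit 6207 adds the target file F(n+1) to be added and updates the compression code map Ms without decompressing the already compressed compression code map Ms.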
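Next, the segment hierarchization process is described. As shown in FIGS. 4 and 5, the segment hierarchization process aggregates the index information of a group of lower-layer segments into the index information of an upper layer. The segment generating unit 6207 executes the segment hierarchization process.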
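The aggregation idea can be illustrated with a small sketch. It assumes that each lower-layer segment holds, per character, a list of bits with one bit per file, and that the upper-layer bit for a segment is the logical OR of that segment's bits (the claims only say the bit is obtained by an operation on the bits and indicates whether at least one file contains the character, so OR is an assumption). The names `build_upper_layer` and `search` are hypothetical.

```python
def build_upper_layer(lower_segments):
    """Aggregate lower-layer appearance maps into upper-layer index information.

    lower_segments: list of dicts mapping a character to its per-file bit list.
    Returns a dict mapping a character to one bit per lower-layer segment.
    """
    chars = set()
    for seg in lower_segments:
        chars.update(seg)
    return {
        ch: [1 if any(seg.get(ch, ())) else 0 for seg in lower_segments]
        for ch in chars
    }


def search(ch, lower_segments, upper):
    """Return (segment number, file number) pairs whose file contains ch."""
    hits = []
    for seg_no, flag in enumerate(upper.get(ch, [])):
        if not flag:
            continue  # no file in this segment contains ch: skip the segment
        for file_no, bit in enumerate(lower_segments[seg_no][ch]):
            if bit:
                hits.append((seg_no, file_no))
    return hits


segments = [
    {"人": [0, 0, 0], "形": [0, 1, 0]},
    {"人": [1, 0, 1], "形": [0, 0, 0]},
]
upper = build_upper_layer(segments)
print(search("人", segments, upper))
```

The point of the upper layer is that a 0 bit lets the search skip an entire segment without reading its per-file bits; only segments flagged 1 are examined file by file.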
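Next, the search procedure according to the present embodiment is described. Specifically, this is, for example, the procedure for the file decompression example shown in FIG. 63.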
6202 First decompressing unit
6203 Input unit
6204 Extracting unit
6205 Second decompressing unit
6206 Identifying unit
6207 Segment generating unit
Claims (11)
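- An extraction method characterized by causing a computer to execute:
storing, in storage means, first information indicating, for each of a plurality of files, whether the file includes predetermined character information, and second information indicating whether at least one of the plurality of files includes the predetermined character information; and
upon receiving a search request for the predetermined character information, when the second information is detected to indicate that the predetermined character information is included, extracting, based on the first information, a file that includes the predetermined character information.
- The extraction method according to claim 1, characterized by further causing the computer to execute:
upon receiving the search request, when the second information is detected not to indicate that the predetermined character information is included, determining that the plurality of files do not include the predetermined character information.
- The extraction method according to claim 1 or 2, characterized in that:
the first information is a bit string made up of bits each indicating, for a respective one of the plurality of files, whether the file includes the predetermined character information; and
the second information is a bit obtained by performing an operation on the bits included in the bit string.
- The extraction method according to any one of claims 1 to 3, characterized by causing the computer to execute:
further storing, in the storage means, third information indicating, for each of the plurality of files, whether the file is a search target; and
upon receiving the search request, when the second information is detected to indicate that the predetermined character information is included, extracting, based on the first information and the third information, a file that is a search target and includes the predetermined character information.
- The extraction method according to any one of claims 1 to 3, characterized by causing the computer to execute:
further storing, in the storage means, fourth information indicating whether at least one of the plurality of files is a search target; and
upon receiving the search request, when it is detected that the second information indicates that the predetermined character information is included and that the fourth information indicates that at least one of the plurality of files is a search target, extracting, based on the first information, a file that includes the predetermined character information.
- An extraction method characterized by causing a computer to execute:
storing, in storage means, first information indicating, for each of a plurality of files, whether the file includes predetermined character information, and second information indicating, for some files among the plurality of files, whether at least one of the some files includes the predetermined character information;
upon receiving a search request for the predetermined character information, when the second information is detected to indicate that the predetermined character information is included, extracting, based on the first information, a file that includes the predetermined character information from the plurality of files; and
upon receiving a search request for the predetermined character information, when the second information is detected not to indicate that the predetermined character information is included, extracting, based on the first information, a file that includes the predetermined character information from those files among the plurality of files that are not included in the some files.
- An extraction program characterized by causing a computer to execute:
storing, in storage means, first information indicating, for each of a plurality of files, whether the file includes predetermined character information, and second information indicating whether at least one of the plurality of files includes the predetermined character information; and
upon receiving a search request for the predetermined character information, when the second information is detected to indicate that the predetermined character information is included, extracting, based on the first information, a file that includes the predetermined character information.
- An extraction program characterized by causing a computer to execute:
storing, in storage means, first information indicating, for each of a plurality of files, whether the file includes predetermined character information, and second information indicating, for some files among the plurality of files, whether at least one of the some files includes the predetermined character information;
upon receiving a search request for the predetermined character information, when the second information is detected to indicate that the predetermined character information is included, extracting, based on the first information, a file that includes the predetermined character information from the plurality of files; and
upon receiving a search request for the predetermined character information, when the second information is detected not to indicate that the predetermined character information is included, extracting, based on the first information, a file that includes the predetermined character information from those files among the plurality of files that are not included in the some files.
- An extraction device characterized by comprising:
storage means that stores first information indicating, for each of a plurality of files, whether the file includes predetermined character information, and second information indicating whether at least one of the plurality of files includes the predetermined character information; and
extracting means that, upon receiving a search request for the predetermined character information, when the second information is detected to indicate that the predetermined character information is included, extracts, based on the first information, a file that includes the predetermined character information.
- An extraction device characterized by comprising:
storage means that stores first information indicating, for each of a plurality of files, whether the file includes predetermined character information, and second information indicating, for some files among the plurality of files, whether at least one of the some files includes the predetermined character information; and
extracting means that, upon receiving a search request for the predetermined character information, when the second information is detected to indicate that the predetermined character information is included, extracts, based on the first information, a file that includes the predetermined character information from the plurality of files, and that, upon receiving a search request for the predetermined character information, when the second information is detected not to indicate that the predetermined character information is included, extracts, based on the first information, a file that includes the predetermined character information from those files among the plurality of files that are not included in the some files.
- An extraction system including a plurality of computers and an allocation device, characterized in that:
the allocation device includes:
holding means that holds, for each of a plurality of file groups obtained by dividing a plurality of files, information indicating whether at least one file included in the file group includes predetermined character information; and
allocating means that, upon receiving a search request for the predetermined character information, allocates the plurality of file groups to the respective computers according to the number of file groups that the information held in the holding means indicates as including at least one file that includes the predetermined character information; and
each of the plurality of computers includes:
storage means that stores, for each of the plurality of file groups, index information indicating which of the files included in the file group include the predetermined character information; and
extracting means that extracts a file that includes the predetermined character information based on, among the index information stored in the storage means for each of the plurality of file groups, the index information for the file groups allocated by the allocation device.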
Priority Applications (7)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2011377004A AU2011377004B2 (en) | 2011-09-14 | 2011-09-14 | Extraction method, extraction program, extraction device, and extraction system |
CN201180073519.7A CN103797480B (zh) | 2011-09-14 | 2011-09-14 | 提取方法、提取程序、提取装置、以及提取系统 |
KR1020147006760A KR101560109B1 (ko) | 2011-09-14 | 2011-09-14 | 추출 방법, 추출 프로그램을 기록한 컴퓨터 판독 가능한 기록 매체, 추출 장치, 및 추출 시스템 |
PCT/JP2011/071028 WO2013038527A1 (ja) | 2011-09-14 | 2011-09-14 | 抽出方法、抽出プログラム、抽出装置、および抽出システム |
JP2013533401A JP5741699B2 (ja) | 2011-09-14 | 2011-09-14 | 抽出方法、抽出プログラム、抽出装置、および抽出システム |
EP11872214.9A EP2757488B1 (en) | 2011-09-14 | 2011-09-14 | Extraction method, extraction program, extraction device, and extraction system |
US14/202,429 US9916314B2 (en) | 2011-09-14 | 2014-03-10 | File extraction method, computer product, file extracting apparatus, and file extracting system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2011/071028 WO2013038527A1 (ja) | 2011-09-14 | 2011-09-14 | 抽出方法、抽出プログラム、抽出装置、および抽出システム |
Related Child Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/202,429 Continuation US9916314B2 (en) | 2011-09-14 | 2014-03-10 | File extraction method, computer product, file extracting apparatus, and file extracting system |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2013038527A1 (ja) | 2013-03-21 |
Family
ID=47882786
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2011/071028 WO2013038527A1 (ja) | 2011-09-14 | 2011-09-14 | 抽出方法、抽出プログラム、抽出装置、および抽出システム |
Country Status (7)
Country | Link |
---|---|
US (1) | US9916314B2 (ja) |
EP (1) | EP2757488B1 (ja) |
JP (1) | JP5741699B2 (ja) |
KR (1) | KR101560109B1 (ja) |
CN (1) | CN103797480B (ja) |
AU (1) | AU2011377004B2 (ja) |
WO (1) | WO2013038527A1 (ja) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2014215839A (ja) * | 2013-04-25 | 2014-11-17 | 富士通株式会社 | 検索システム、情報処理装置および検索方法 |
US9793920B1 (en) | 2016-04-19 | 2017-10-17 | Fujitsu Limited | Computer-readable recording medium, encoding device, and encoding method |
EP3236369A1 (en) | 2016-04-18 | 2017-10-25 | Fujitsu Limited | Index generation program, index generation device and index generation method, search program |
EP3236367A2 (en) | 2016-04-18 | 2017-10-25 | Fujitsu Limited | Encoding program, encoding method, encoding device, retrieval program, retrieval method, and retrieval device |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10540404B1 (en) * | 2014-02-07 | 2020-01-21 | Amazon Technologies, Inc. | Forming a document collection in a document management and collaboration system |
US9450601B1 (en) * | 2015-04-02 | 2016-09-20 | Microsoft Technology Licensing, Llc | Continuous rounding of differing bit lengths |
KR102006245B1 (ko) * | 2017-09-15 | 2019-08-06 | 주식회사 인사이너리 | 바이너리 파일에 기초하여 오픈소스 소프트웨어 패키지를 식별하는 방법 및 시스템 |
CN109993025B (zh) * | 2017-12-29 | 2021-07-06 | 中移(杭州)信息技术有限公司 | 一种关键帧提取方法及设备 |
US10541708B1 (en) * | 2018-09-24 | 2020-01-21 | Redpine Signals, Inc. | Decompression engine for executable microcontroller code |
CN109855566B (zh) * | 2019-02-28 | 2021-12-03 | 易思维(杭州)科技有限公司 | 一种槽孔特征的提取方法 |
US12072940B2 (en) * | 2022-04-15 | 2024-08-27 | UltraEdit, Inc. | Systems and methods for simultaneously viewing and modifying multiple segments of one or more files |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2986865B2 (ja) | 1989-07-24 | 1999-12-06 | 株式会社日立製作所 | データ検索方法および装置 |
JP3263963B2 (ja) * | 1991-12-25 | 2002-03-11 | 株式会社日立製作所 | 文書検索方法及び装置 |
JP2009048352A (ja) * | 2007-08-17 | 2009-03-05 | Nippon Telegr & Teleph Corp <Ntt> | 情報検索装置、情報検索方法および情報検索プログラム |
JP2011100320A (ja) * | 2009-11-06 | 2011-05-19 | Fujitsu Ltd | 情報処理プログラム、情報検索プログラム、情報処理装置、および情報検索装置 |
JP2011138230A (ja) * | 2009-12-25 | 2011-07-14 | Fujitsu Ltd | 情報処理プログラム、情報検索プログラム、情報処理装置、および情報検索装置 |
Family Cites Families (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5469354A (en) * | 1989-06-14 | 1995-11-21 | Hitachi, Ltd. | Document data processing method and apparatus for document retrieval |
US5748953A (en) * | 1989-06-14 | 1998-05-05 | Hitachi, Ltd. | Document search method wherein stored documents and search queries comprise segmented text data of spaced, nonconsecutive text elements and words segmented by predetermined symbols |
EP0437615B1 (en) | 1989-06-14 | 1998-10-21 | Hitachi, Ltd. | Hierarchical presearch-type document retrieval method, apparatus therefor, and magnetic disc device for this apparatus |
US5745745A (en) * | 1994-06-29 | 1998-04-28 | Hitachi, Ltd. | Text search method and apparatus for structured documents |
US5778361A (en) * | 1995-09-29 | 1998-07-07 | Microsoft Corporation | Method and system for fast indexing and searching of text in compound-word languages |
JPH09138809A (ja) * | 1995-11-15 | 1997-05-27 | Oki Electric Ind Co Ltd | 全文検索方法 |
CA2340531C (en) * | 2001-03-12 | 2006-10-10 | Ibm Canada Limited-Ibm Canada Limitee | Document retrieval system and search method using word set and character look-up tables |
US7149748B1 (en) * | 2003-05-06 | 2006-12-12 | Sap Ag | Expanded inverted index |
CN1567174A (zh) * | 2003-06-09 | 2005-01-19 | 吴胜远 | 对象表示和处理的方法及其装置 |
US20100005072A1 (en) * | 2004-09-09 | 2010-01-07 | Pitts William M | Nomadic File Systems |
US8504565B2 (en) * | 2004-09-09 | 2013-08-06 | William M. Pitts | Full text search capabilities integrated into distributed file systems— incrementally indexing files |
WO2006123429A1 (ja) * | 2005-05-20 | 2006-11-23 | Fujitsu Limited | 情報検索方法、装置、プログラム、該プログラムを記録した記録媒体 |
WO2009063925A1 (ja) * | 2007-11-15 | 2009-05-22 | Nec Corporation | 文書管理・検索システムおよび文書の管理・検索方法 |
CN101452465A (zh) * | 2007-12-05 | 2009-06-10 | 高德软件有限公司 | 大批量文件数据存放和读取方法 |
US8266179B2 (en) * | 2009-09-30 | 2012-09-11 | Hewlett-Packard Development Company, L.P. | Method and system for processing text |
2011
- 2011-09-14 KR KR1020147006760A patent/KR101560109B1/ko active IP Right Grant
- 2011-09-14 CN CN201180073519.7A patent/CN103797480B/zh active Active
- 2011-09-14 EP EP11872214.9A patent/EP2757488B1/en active Active
- 2011-09-14 AU AU2011377004A patent/AU2011377004B2/en active Active
- 2011-09-14 JP JP2013533401A patent/JP5741699B2/ja active Active
- 2011-09-14 WO PCT/JP2011/071028 patent/WO2013038527A1/ja active Application Filing
2014
- 2014-03-10 US US14/202,429 patent/US9916314B2/en active Active
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2014215839A (ja) * | 2013-04-25 | 2014-11-17 | 富士通株式会社 | 検索システム、情報処理装置および検索方法 |
EP3236369A1 (en) | 2016-04-18 | 2017-10-25 | Fujitsu Limited | Index generation program, index generation device and index generation method, search program |
EP3236367A2 (en) | 2016-04-18 | 2017-10-25 | Fujitsu Limited | Encoding program, encoding method, encoding device, retrieval program, retrieval method, and retrieval device |
US10521414B2 (en) | 2016-04-18 | 2019-12-31 | Fujitsu Limited | Computer-readable recording medium, encoding method, encoding device, retrieval method, and retrieval device |
EP3770770A1 (en) | 2016-04-18 | 2021-01-27 | Fujitsu Limited | Index generation program, index generation device and index generation method, search program |
US11080234B2 (en) | 2016-04-18 | 2021-08-03 | Fujitsu Limited | Computer readable recording medium for index generation |
US9793920B1 (en) | 2016-04-19 | 2017-10-17 | Fujitsu Limited | Computer-readable recording medium, encoding device, and encoding method |
Also Published As
Publication number | Publication date |
---|---|
JP5741699B2 (ja) | 2015-07-01 |
US9916314B2 (en) | 2018-03-13 |
CN103797480A (zh) | 2014-05-14 |
EP2757488A1 (en) | 2014-07-23 |
EP2757488B1 (en) | 2019-02-20 |
EP2757488A4 (en) | 2015-04-22 |
JPWO2013038527A1 (ja) | 2015-03-23 |
KR20140061450A (ko) | 2014-05-21 |
US20140229484A1 (en) | 2014-08-14 |
KR101560109B1 (ko) | 2015-10-13 |
AU2011377004A1 (en) | 2014-03-27 |
AU2011377004B2 (en) | 2015-11-12 |
CN103797480B (zh) | 2017-11-28 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
JP5741699B2 (ja) | 抽出方法、抽出プログラム、抽出装置、および抽出システム | |
JP4707198B2 (ja) | 情報検索プログラム、該プログラムを記録した記録媒体、情報検索方法、および情報検索装置 | |
JP5062131B2 (ja) | 情報処理プログラム、情報処理装置、および情報処理方法 | |
JPWO2012150637A1 (ja) | 抽出方法、情報処理方法、抽出プログラム、情報処理プログラム、抽出装置、および情報処理装置 | |
JP4893805B2 (ja) | 情報処理プログラム、情報検索プログラム、および情報処理装置 | |
JP5605288B2 (ja) | 出現マップ生成方法、ファイル抽出方法、出現マップ生成プログラム、ファイル抽出プログラム、出現マップ生成装置、およびファイル抽出装置 | |
WO2006123448A1 (ja) | 情報検索プログラム | |
JP6609404B2 (ja) | 圧縮プログラム、圧縮方法および圧縮装置 | |
JP5505524B2 (ja) | 生成プログラム、生成装置、および生成方法 | |
JP5621906B2 (ja) | 検索プログラム、検索装置、および検索方法 | |
JP2016149160A (ja) | 情報生成方法、およびインデックス情報 | |
JP6931442B2 (ja) | 符号化プログラム、インデックス生成プログラム、検索プログラム、符号化装置、インデックス生成装置、検索装置、符号化方法、インデックス生成方法および検索方法 | |
JP2016149160A5 (ja) | ||
US10318483B2 (en) | Control method and control device | |
JP5494860B2 (ja) | 情報管理プログラム、情報管理装置および情報管理方法 | |
Allen | A file index for document storage and retrieval utilizing descriptor fragments | |
JP2018060425A (ja) | インデックス生成プログラム、インデックス生成装置、インデックス生成方法、検索プログラム、検索装置および検索方法 | |
JPS62131348A (ja) | マルチインデツクスフアイルアクセス方式 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 11872214; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2013533401; Country of ref document: JP; Kind code of ref document: A |
| | WWE | WIPO information: entry into national phase | Ref document number: 2011872214; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 20147006760; Country of ref document: KR; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2011377004; Country of ref document: AU; Date of ref document: 20110914; Kind code of ref document: A |