CN1932816A - Full text search system based on ciphertext - Google Patents

Publication number
CN1932816A
Authority
CN
China
Prior art keywords
module
user
information
index
document
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200610124691
Other languages
Chinese (zh)
Other versions
CN100424704C (en)
Inventor
李瑞轩
卢正鼎
宋伟
左翠华
张茂元
文坤梅
何云天
万宇涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Application filed by Huazhong University of Science and Technology
Priority to CNB2006101246911A (patent CN100424704C)
Publication of CN1932816A
Application granted
Publication of CN100424704C
Legal status: Expired - Fee Related

Abstract

A ciphertext-based full-text retrieval system comprises a database, a login module, a query module, a result set processing module, an electronic document processing module, an index module, an audit management module, a user management module, and a permission management module. The query module comprises a query word segmentation module, a query encryption module, a logical combination module, a query submodule, an access control module, and a result set sorting module. The result set processing module comprises a summary module and a snapshot module. The index module comprises an index word segmentation module, an index encryption module, and an index submodule. Because the index does not store the positions of index terms in the original text, the system proposes a word segmentation strategy that combines Chinese semantic segmentation with automatic segmentation. When a user accesses documents, access control restricts the user's permissions and protects sensitive information. The system performs full-text retrieval over ciphertext and keeps sensitive information secure, offering strong security and high execution efficiency.

Description

Ciphertext-based full-text retrieval system
Technical field
The invention belongs to the technical field of computer search and specifically relates to a ciphertext-based full-text retrieval system.
Background
The information age has produced a vast amount of digital information, of which text is the most basic and most commonly used form. To find what they need in this ocean of text, people urgently need an efficient retrieval tool. How to store and query such unstructured text data efficiently is a problem well worth studying, and full-text retrieval and full-text database technology have become a research focus for scholars in China and abroad.
Full-text retrieval takes text data as its main object of processing, is built on a full-text index, and lets users search in natural language. It has long been a relatively complex problem in information retrieval. Unlike the structured queries of ordinary database retrieval, full-text retrieval mainly queries unstructured data. Compared with retrieval over catalog-style indexes, it offers new and powerful search functions that let users exploit information resources from many angles.
Full-text retrieval technology is now fairly mature, and full-text retrieval software developed abroad was the first to be deployed. Chinese full-text retrieval follows the same principles as retrieval for Western languages, but the characteristics of Chinese characters make Chinese information processing systems more complex to implement than their Western-language counterparts. As a result, many mature foreign full-text retrieval systems are hard to apply directly to Chinese text. China's independently developed Chinese full-text retrieval technology has reached a high level and holds a large share of the traditional market. Work concentrates on Chinese-character full-text retrieval, hypertext full-text retrieval, and full-text retrieval in network environments.
Advances in computer networking have made computer applications ever more widespread, and have also made the security problems of networked applications increasingly complex and prominent. Encrypting data is one of the main ways to guard against information leakage today. In the computer systems of classified and security departments, data is stored as ciphertext, which protects the system and its data as far as possible. Although full-text retrieval technology and cryptographic algorithms are both mature and have good commercial products, full-text retrieval over ciphertext remains a gap in research and in products, both in China and abroad. Combining encryption with full-text indexing poses several difficulties. First, to keep index information secure, the index entries must be encrypted; once encrypted, they can no longer be matched with plaintext matching techniques, so encrypted text cannot be plugged directly into existing full-text retrieval mechanisms. Second, existing full-text retrieval systems build full-text indexes whose data volume is often very large, and encryption increases that volume further. For a practical system, the efficiency cost of introducing encryption into full-text retrieval must therefore be taken seriously.
Summary of the invention
The object of the present invention is to provide a ciphertext-based full-text retrieval system that offers strong security and high execution efficiency.
The ciphertext-based full-text retrieval system provided by the invention comprises a database, a login module, a query module, a result set processing module, an electronic document processing module, an index module, an audit management module, a user management module, and a permission management module, wherein:
The database stores information about users and user permissions. It comprises a user information store, a department information store, a department group information store, a level information store, and an audit information store.
The login module receives the service request entered by the user and verifies it by exchanging information with the database. If verification succeeds, the user is allowed into the system; if it fails, the user is refused. A user who logs in as an administrator can select and operate the audit management module, the user management module, and the permission management module. A user who logs in as an ordinary user enters the query module.
The query module receives the retrieval information entered by the user and applies word segmentation, encryption, and logical combination to it to obtain a query statement. It then matches the query statement against the index database, returns information on all documents that match the query statement and that the user is entitled to access, sorts this result set by how well each document matches, and passes the sorted result set to the result set processing module.
The result set processing module receives the result set from the query module, builds summary and snapshot information for the result set from the contents of the ciphertext store, and stores an audit record in the database whenever a user views snapshot information.
The electronic document processing module preprocesses the electronic documents to be archived: it converts files in specific formats into plain-text files, then encrypts these plain-text files to build the ciphertext store. It also supplies the index module with the content and title information of all plain-text files.
The index module receives the content and title information of the plain-text files from the electronic document processing module, segments them using the strategy that combines Chinese semantic segmentation with automatic segmentation to obtain index terms, encrypts the index terms, and then builds the index database from the encrypted index terms and the associated document information.
The audit management module receives the query entered by the user and, by exchanging information with the database, looks up user operations by IP address, user name, and time range, returning all records that satisfy the query conditions.
The user management module receives operation requests from the administrator and manages user information through the database: displaying, adding, deleting, and modifying user information. It also records the administrator's operations in the database.
The permission management module receives operation requests from the administrator, manages department permissions and department groups through the database, and records the administrator's operations in the database.
Because the index database does not contain the positions of index terms in the original text, the system proposes a word segmentation strategy that combines Chinese semantic segmentation with automatic segmentation. It also applies access control while users access documents, restricting user permissions to protect sensitive information. The system thus supports full-text retrieval under ciphertext conditions and secure retrieval of sensitive data. Specifically, the invention has the following advantages:
(1) Strong security: security is achieved mainly through encryption, access control, and auditing. All information stored on the server is ciphertext, which protects sensitive information, and to prevent statistical attacks on the encrypted index terms, the index database does not record the positions of index terms in the original text. During a query, only users whose level permits viewing a document can retrieve it, which prevents information from leaking to lower-level users. The audit component records key operations of all users so that they can be traced later, further strengthening the security of the system.
(2) High execution efficiency: the system is used mainly for full-text retrieval over ciphertext and therefore needs to execute efficiently. Efficiency is taken into account from the index-building stage: the strategy that combines Chinese semantic segmentation with automatic segmentation keeps the index as small as possible while still covering as many meaningful words as possible. In addition, before displaying the retrieved information, the system sorts it so that users can reach the information they want as quickly as possible.
Description of the drawings
Fig. 1 is the architecture diagram of the system of the invention;
Fig. 2 is a structural diagram of the system of the invention;
Fig. 3 is the flow chart of the login module;
Fig. 4 is the flow chart of the query module;
Fig. 5 is the flow chart of the result set processing module;
Fig. 6 is a structural diagram of term index construction;
Fig. 7 is the flow chart of the index module.
Embodiments
The invention is described in further detail below with reference to the drawings and examples.
As shown in Fig. 1, the functions of the system fall into three groups: building the ciphertext index, ciphertext full-text query, and back-end management. The system comprises database 100, login module 200, query module 300, result set processing module 400, electronic document processing module 500, index module 600, audit management module 700, user management module 800, and permission management module 900.
Database 100 stores information about users and user permissions.
Login module 200 receives the service request entered by the user and verifies it by exchanging information with database 100. If verification succeeds, the user is allowed into the system, and login module 200 fetches the user's information from database 100 and keeps it in the session. A user who logs in as an administrator is taken to the back-end management home page and can select and operate audit management module 700, user management module 800, and permission management module 900. A user who logs in as an ordinary user enters query module 300. If verification fails, the user is refused entry. Whether or not the login succeeds, the login attempt is recorded in database 100 so that it can be traced later.
Query module 300 receives the retrieval information entered by the user and applies word segmentation, encryption, and logical combination to it to obtain a query statement. It then matches the query statement against the index database, returns information on all documents that match the query statement and that the user is entitled to access (the result set), sorts the result set by how well each document matches, and passes the sorted result set to result set processing module 400.
Result set processing module 400 receives the result set from query module 300, builds summary and snapshot information for the result set from the contents of the ciphertext store, and stores a record in database 100 whenever a user views snapshot information.
Electronic document processing module 500 preprocesses the electronic documents to be archived: it converts files in specific formats (such as PDF and DOC) into plain-text TXT files, then encrypts these plain-text files to build the ciphertext store. In addition, it supplies index module 600 with the content and title information of all plain-text files.
Index module 600 receives the content and title information of the plain-text files from electronic document processing module 500, segments them using the strategy that combines Chinese semantic segmentation with automatic segmentation to obtain index terms, encrypts the index terms, and finally builds the index database from the encrypted index terms and the associated document information (such as the document level and the departments allowed to consult the document).
Audit management module 700 mainly provides query functions over all user operations, which can be looked up by IP address, user name, and time range. It receives the query entered by the user and, by exchanging information with database 100, obtains all records that satisfy the query conditions. These records mainly cover front-end users' login operations and snapshot-viewing operations, and back-end operations that add, delete, or modify users and departments.
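The audit lookup by IP address, user name, and time range can be illustrated with a minimal Python sketch. The `AuditRecord` fields and the `query_audit` function are hypothetical names chosen for illustration; the patent does not specify a record layout or a query interface, and a real deployment would query the audit information store rather than an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class AuditRecord:
    """One audited operation (field names are assumptions, not from the patent)."""
    user_name: str
    ip_address: str
    timestamp: datetime
    operation: str  # e.g. "login", "view_snapshot", "add_user"


def query_audit(records: Iterable[AuditRecord],
                user_name: Optional[str] = None,
                ip_address: Optional[str] = None,
                start: Optional[datetime] = None,
                end: Optional[datetime] = None) -> List[AuditRecord]:
    """Return the records that satisfy every supplied filter (None means no filter)."""
    matches = []
    for rec in records:
        if user_name is not None and rec.user_name != user_name:
            continue
        if ip_address is not None and rec.ip_address != ip_address:
            continue
        if start is not None and rec.timestamp < start:
            continue
        if end is not None and rec.timestamp > end:
            continue
        matches.append(rec)
    return matches
```

Leaving a filter unset returns records regardless of that attribute, so the same function serves queries by any combination of the three criteria.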
User management module 800 receives operation requests from the administrator, manages user information accordingly, and interacts with database 100. It implements displaying, adding, deleting, and modifying user information, and records the administrator's operations in database 100.
Permission management module 900 receives operation requests from the administrator, manages department permissions and department groups accordingly, and interacts with database 100. Department permission management implements displaying, adding, deleting, and modifying department permission information; department group management implements displaying, adding, deleting, and modifying department groups. In addition, permission management module 900 records the administrator's operations in database 100.
Each module is described in further detail below.
As shown in Fig. 2, the data stored in database 100 comprises user information store 110, department information store 120, department group information store 130, level information store 140, and audit information store 150.
User information store 110 stores user information such as user name, password, department, user permission level, sex, e-mail address, postal address, and telephone number.
Department information store 120 stores department information such as the department name, the department's permission level, and the set of department groups the department belongs to.
Department group information store 130 stores the names of department groups.
Level information store 140 stores level names and the corresponding permission levels. This information is normally predefined by the administrator and seldom changes.
Audit information store 150 stores information about user operations, such as the user name, the time of the operation, the user's IP address, and the operation performed.
Database 100 receives query requests from login module 200, matches them against user information store 110, and returns the result to login module 200; at the same time it adds a record of the login operation to audit information store 150. It receives query requests from audit management module 700, matches them against audit information store 150, and returns the result to audit management module 700. It receives query, add, modify, and delete requests from user management module 800, processes them in user information store 110, and returns the result to user management module 800. It receives query, add, modify, and delete requests from permission management module 900, processes them in department information store 120, department group information store 130, and level information store 140, and returns the result to permission management module 900.
Login module 200 is the entry point of the whole system. It comprises password authentication module 210 and verification module 220.
Password authentication module 210 fetches the user's password from user information store 110 of database 100, decrypts it, and compares it with the password the user entered to determine whether the entered password is correct.
Verification module 220 checks whether the password stored in the database has been maliciously altered. If a user's stored password has been tampered with, a malicious attacker still cannot enter the system with that user name and the tampered password, because the integrity check on the password fails. This further strengthens the security of the system.
As shown in Fig. 3, login module 200 proceeds as follows:
(1) It receives the login information entered by the user and submits it to the system. The system looks up the user name in user information store 110 of database 100. If the user name does not exist, go to step (6); otherwise the other information for this user name (such as password, department, user level, and check information) is fetched from user information store 110 and kept in the session.
(2) The password information obtained from the database is decrypted.
(3) The password entered by the user is compared with the password decrypted in step (2). If they differ, go to step (6).
(4) The password information in the database is integrity-checked. If the check fails, go to step (6).
(5) The login succeeds and the user enters the system (a user who logged in as an ordinary user enters the query module; one who logged in as an administrator enters back-end management), and a record of this login is added to audit information store 150 of the database.
(6) The login fails and the user must log in again; a record of this login attempt is added to audit information store 150 of the database.
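The login flow of Fig. 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the patent names neither the password cipher nor the check scheme, so the sketch takes the decryption routine as a caller-supplied `decrypt` function and assumes the check information is an HMAC-SHA256 digest over the stored password ciphertext. The names `login`, `check_value`, `password_ct`, and `check` are hypothetical.

```python
import hashlib
import hmac
from typing import Callable, Dict, List, Tuple


def check_value(password_ct: bytes, integrity_key: bytes) -> bytes:
    """Keyed digest over the stored password ciphertext (assumed check scheme)."""
    return hmac.new(integrity_key, password_ct, hashlib.sha256).digest()


def login(user_name: str, password: str,
          users: Dict[str, dict],
          decrypt: Callable[[bytes], bytes],
          integrity_key: bytes,
          audit_log: List[Tuple[str, str, bool]]) -> bool:
    """Return True if the login succeeds; every attempt is audited."""
    ok = False
    record = users.get(user_name)                    # step (1): look up the user
    if record is not None:
        stored = decrypt(record["password_ct"])      # step (2): decrypt stored password
        if hmac.compare_digest(stored, password.encode("utf-8")):  # step (3): compare
            expected = check_value(record["password_ct"], integrity_key)
            ok = hmac.compare_digest(expected, record["check"])    # step (4): integrity
    audit_log.append((user_name, "login", ok))       # steps (5)/(6): record the attempt
    return ok
```

With this arrangement, an attacker who overwrites the stored password ciphertext but cannot recompute the keyed check value still fails at step (4), which is the behavior the verification module is described as providing.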
Query module 300 is the module through which the system lets users search for information. It comprises query word segmentation module 310, query encryption module 320, logical combination module 330, query submodule 340, access control module 350, and result set sorting module 360.
Query word segmentation module 310 receives the retrieval command from the user, segments it using the strategy that combines Chinese semantic segmentation with automatic segmentation, and sends the resulting search terms to query encryption module 320.
Query word segmentation module 310 performs lexical analysis on the user's retrieval command and accommodates document sources in different languages and retrieval commands in different forms. It converts the character string in the input stream into a series of tokens, which are the basic units for building the index; for Chinese, for example, the Chinese character is the basic index unit. Filters can also be defined to remove Chinese and English stop words. No single lexicon is currently comprehensive enough to cover every subject area, so to include as many meaningful words as possible the system adopts the strategy that combines Chinese semantic segmentation with automatic segmentation. This strategy considerably reduces the size of the index and gives matching during retrieval a semantic basis, improving the recall and precision of ciphertext full-text retrieval. The scheme is as follows.
So that the words users may query are included in the ciphertext index database as far as possible, an automatic segmentation algorithm is designed first. A value K is defined, the full text is traversed, and every run of consecutive characters whose length is at most K is selected for indexing. The algorithm is described as Algorithm 1.
Algorithm 1
Input: full text f to be indexed; maximum word length K
Output: segmentation sequence of the full text
s = empty sequence;    // segmentation sequence to return
for (int i = 1; i <= K; i++)
{
    move the current position of f to the start of the document;
    while (at least i characters remain from the current position of f)
    {
        append the i-character string at the current position of f to the end of s;
        advance the current position of f by one character;
    }
}
return s;
End of algorithm.
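Algorithm 1 can be sketched in Python as follows. The function name `ngram_segmentation` is illustrative and not from the patent; the sample string is the university department name used as the running example in this description, written in the original Chinese (11 characters).

```python
from typing import List


def ngram_segmentation(text: str, k: int = 5) -> List[str]:
    """Return every run of 1..k consecutive characters, shortest lengths first."""
    segments = []
    for length in range(1, k + 1):
        # Restart from the beginning of the text for each length.
        for start in range(0, len(text) - length + 1):
            segments.append(text[start:start + length])
    return segments


sample = "华中科技大学计算机学院"
segments = ngram_segmentation(sample, k=5)
print(len(segments))   # 45 = 11 + 10 + 9 + 8 + 7
print(segments[-1])    # 计算机学院
```

For a text of n characters the output holds roughly K times n segments, which is why this exhaustive approach inflates the index.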
The ciphertext-based full-text retrieval system mainly processes Chinese ciphertext documents. Most Chinese words have 2 characters, words of 3 or 4 characters are less common, and words of 5 or more characters are rare. To cover the query words a user may enter as far as possible, the system uses K=5. After segmentation with the algorithm above, every character string in the document whose length is at most K (here K=5) is included in the ciphertext index database. Experiments show, however, that this adds many meaningless strings to the ciphertext index database, strings that can be assumed never to be queried. For example, for the original text "华中科技大学计算机学院" (the computer school of Huazhong University of Science and Technology), Algorithm 1 produces the segmentation sequence {华, 中, 科, 技, 大, 学, 计, 算, 机, 学, 院, 华中, 中科, 科技, 技大, 大学, 学计, 计算, 算机, 机学, 学院, 华中科, 中科技, 科技大, 技大学, 大学计, 学计算, 计算机, 算机学, 机学院, 华中科技, 中科技大, 科技大学, 技大学计, 大学计算, 学计算机, 计算机学, 算机学院, 华中科技大, 中科技大学, 科技大学计, 技大学计算, 大学计算机, 学计算机学, 计算机学院}. Many of these segments are meaningless strings, which makes the index database very large. Tests show that with this algorithm the ciphertext index database occupies 10 to 20 times the size of the plain-text documents being indexed, which greatly reduces search efficiency. Using semantic information can cut this redundancy substantially. Therefore, building on Algorithm 1 and incorporating Chinese semantic segmentation and stop words, a segmentation algorithm that combines Chinese semantic segmentation with automatic segmentation is proposed, described as Algorithm 2.
Algorithm 2
Input: full text f to be indexed; maximum word length K; stop word sequence t
Output: segmentation sequence of the full text
s = empty sequence;    // segmentation sequence to return
g = ChineseSemanticSegmentation(f);    // sequence of semantic words to combine
while (the end of sequence g has not been reached)
{
    if (the word at the current position of g is a stop word in t)
    {
        skip it without producing any segment;
        advance the current position of g by one word;
        continue;
    }
    append to the end of s every combination of consecutive words starting at the current position of g whose length is at most K characters;
    advance the current position of g by one word;
}
return s;
End of algorithm.
Algorithm 2 removes many meaningless redundant segments. For the same original text, "华中科技大学计算机学院" (the computer school of Huazhong University of Science and Technology), semantic segmentation first yields the sequence of words to be combined: {华中, 科技, 大学, 计算机, 学院}. After processing with the improved algorithm, the segmentation sequence is {华中, 华中科技, 科技, 科技大学, 大学, 大学计算机, 计算机, 计算机学院, 学院}. Compared with Algorithm 1, the algorithm that combines Chinese semantic segmentation with automatic segmentation greatly reduces the number of segments. The entries that enter the ciphertext index database are all meaningful, which shrinks the index database and improves search efficiency. In tests with the improved Algorithm 2, the index database occupies 3 to 7 times the size of the plain-text documents being indexed.
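The combination step of Algorithm 2 can be sketched in Python as follows. The sketch assumes the Chinese semantic segmenter is an external component and takes its output (a list of words) as input. The function name `combined_segmentation` is illustrative, and stopping a combination when it reaches a stop word is an assumption, since the patent does not say whether combinations may span stop words.

```python
from typing import FrozenSet, List


def combined_segmentation(words: List[str], k: int = 5,
                          stop_words: FrozenSet[str] = frozenset()) -> List[str]:
    """Combine consecutive semantic words into index terms of at most k characters."""
    result = []
    for start, word in enumerate(words):
        if word in stop_words:
            continue  # stop words never start a segment
        phrase = ""
        for following in words[start:]:
            if following in stop_words:
                break  # assumption: combinations do not span stop words
            phrase += following
            if len(phrase) > k:
                break  # longer combinations from this start are longer still
            result.append(phrase)
    return result


semantic_words = ["华中", "科技", "大学", "计算机", "学院"]
print(combined_segmentation(semantic_words, k=5))
# ['华中', '华中科技', '科技', '科技大学', '大学', '大学计算机', '计算机', '计算机学院', '学院']
```

On the running example the sketch produces 9 index terms, against 45 for exhaustive character n-grams with the same K.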
Query encryption module 320 encrypts the search terms produced by word segmentation and sends the encrypted search terms to logical combination module 330. For speed, a symmetric encryption algorithm is preferred.
Logical combination module 330 combines the encrypted search terms logically, for example with "or" and "and" relations, and sends the logical combination to query submodule 340.
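How protected search terms can be matched and combined may be illustrated with the following Python sketch. It rests on two assumptions. First, matching by equality requires that the same term always map to the same protected value under a given key. Second, because Python's standard library has no symmetric cipher, the sketch substitutes a keyed HMAC-SHA256 digest for the symmetric encryption the text prefers; this is a stand-in, not the patent's algorithm. The names `protect_term`, `build_index`, and `search` are hypothetical.

```python
import hashlib
import hmac
from typing import Dict, Iterable, List, Set


def protect_term(term: str, key: bytes) -> str:
    """Deterministic keyed transform of a term (stand-in for symmetric encryption)."""
    return hmac.new(key, term.encode("utf-8"), hashlib.sha256).hexdigest()


def build_index(docs: Dict[str, Iterable[str]], key: bytes) -> Dict[str, Set[str]]:
    """Map each protected index term to the IDs of documents containing it.

    No term positions are stored, in line with the description of the index database.
    """
    index: Dict[str, Set[str]] = {}
    for doc_id, terms in docs.items():
        for term in terms:
            index.setdefault(protect_term(term, key), set()).add(doc_id)
    return index


def search(index: Dict[str, Set[str]], terms: List[str], key: bytes,
           mode: str = "and") -> Set[str]:
    """Combine the posting sets of the protected query terms with AND or OR."""
    postings = [index.get(protect_term(t, key), set()) for t in terms]
    if not postings:
        return set()
    if mode == "and":
        return set.intersection(*postings)
    if mode == "or":
        return set.union(*postings)
    raise ValueError("mode must be 'and' or 'or'")
```

The index keys are protected values only, so the server-side store never holds a plaintext term; the logical combination operates entirely on sets of document IDs.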
Query submodule 340 uses the logical combination to find the information on all matching documents in the index database and has access control module 350 screen it: from the matching documents, those that satisfy the access control requirements are selected as the result set, which is sent to result set sorting module 360.
Access control module 350 screens the matching document information that query submodule 340 found in the index database, so that each user can retrieve only documents within the user's permissions. Every legitimate user who has logged in carries department information and a personal permission level. If the user's department is within the set of departments a document was published to, and the user's permission level is at least as high as the level at which the document was published, the document satisfies the access control requirements and is added to the result set. Otherwise the document is not added to the result set even if it meets the retrieval criteria. The specific policy is as follows.
In the ciphertext-based full-text retrieval system, users and documents are each described by a department attribute and a permission level attribute, and all permission levels in the system form a partially ordered set by rank. For example, let the full set of departments be A = {D1, D2, D3, D4}. Table 1 describes the partially ordered set of permissions; a smaller level number denotes higher authority. Table 2 describes the documents and Table 3 the users; the last column of Table 3 lists all documents each user can access, obtained by comparison. The access control policy allows each document to be published to multiple departments but only at one definite permission level. Each user belongs to exactly one department and has exactly one permission level.
Table 1 Permission description
Rank name | Rank level
R1 | 0
R2 | 1
R3 | 2
R4 | 2
R5 | 3
Table 2 Document description
Document name | Departments published to | Rank level at which the document is published
S1 | D1, D2, D4 | 2
S2 | D2, D3 | 1
S3 | D1 | 2
S4 | D3, D4 | 2
S5 | D1, D2, D3, D4 | 3
Table 3 User permission description
User name | Department | User rank level | Documents the user may access
U1 | D1 | 2 | S1, S3, S5
U2 | D2 | 1 | S1, S2, S5
U3 | D2 | 2 | S1, S5
U4 | D3 | 2 | S1, S4, S5
U5 | D4 | 3 | S5
Under the access control policy of the ciphertext-based full-text retrieval system, a user has permission to access a document only when the user's department attribute is contained in the set of departments to which the document may be published and the user's rank level is not greater than the rank level at which the document may be published (the smaller the level number, the greater the permission). This policy matches fairly well the document access control requirements of departments that currently handle classified material. When the system is deployed more widely, a corresponding access control policy can be designed to suit the needs of each department.
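The rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names (`Document`, `User`, `can_access`, `accessible`) are hypothetical and do not come from the patent; the data is taken from Table 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    name: str
    departments: frozenset  # departments the document is published to
    level: int              # rank level at which the document is published


@dataclass(frozen=True)
class User:
    name: str
    department: str         # a user belongs to exactly one department
    level: int              # a user holds exactly one rank level


def can_access(user: User, doc: Document) -> bool:
    """A smaller level number means a higher permission."""
    return user.department in doc.departments and user.level <= doc.level


# Document data as listed in Table 2.
DOCS = [
    Document("S1", frozenset({"D1", "D2", "D4"}), 2),
    Document("S2", frozenset({"D2", "D3"}), 1),
    Document("S3", frozenset({"D1"}), 2),
    Document("S4", frozenset({"D3", "D4"}), 2),
    Document("S5", frozenset({"D1", "D2", "D3", "D4"}), 3),
]


def accessible(user: User, docs=DOCS) -> list:
    """Names of all documents the user may access, in table order."""
    return [d.name for d in docs if can_access(user, d)]
```

With this data, `accessible(User("U1", "D1", 2))` gives S1, S3 and S5, as in the U1 row of Table 3. For U4 (department D3, level 2) the rule as stated yields S4 and S5; S1 is not included because D3 is not among the departments listed for S1 in Table 2.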
The result set sorting module 360 ranks the result set from the query submodule 340 by priority and sends the sorted result set to the result set processing module 400. Documents with the highest match strength are placed at the front of the result set. Match strength here is measured by the length of the matched search term; the weight of the search term could also be taken into account. For simplicity, only the length of the hit word is used for sorting here, a choice that is closely tied to the word segmentation strategy of the system.
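A minimal sketch of such a ranking follows. It assumes the matched search terms are known per document at query time; the function name `rank_results` and the input shape are hypothetical. The length of the longest hit word is the primary key, and the number of hits is used as a tie-breaker, which is an assumption about how the two criteria might be combined.

```python
def rank_results(hits: dict) -> list:
    """Order document ids by longest hit word, then by number of hits.

    hits maps a document id to the non-empty list of search terms
    that matched that document.
    """
    def strength(item):
        _doc_id, terms = item
        return (max(len(t) for t in terms), len(terms))

    ordered = sorted(hits.items(), key=strength, reverse=True)
    return [doc_id for doc_id, _terms in ordered]
```

For example, a document matched only by "people's republic" is placed ahead of one matched by both "China" and "republic", because its longest hit word is longer.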
As shown in Fig. 4, the processing flow of the query module is: (1) the user enters search information, and the system segments it using the word segmentation strategy that combines Chinese semantics with automatic segmentation, obtaining the search terms; (2) the server encrypts the search terms; (3) the server combines the encrypted search terms logically, according to the logical relations contained in the search information the user entered, to form a query statement; (4) ciphertext matching is performed in the index database according to the query statement, and the access control restriction is applied to the matches before the result set is returned. That is, a hit document is added to the result set and returned to the user only if the user's department is within the set of departments that may access the document and the user's rank level is not greater than the document's level; (5) the resulting result set is sorted, mainly by hit word length and hit count, with documents having longer hit words and more hits placed at the front of the result set.
The result set processing module 400 is the interface through which query results are shown to the user; it comprises a digest module 410 and a snapshot module 420.
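Steps (1) to (4) of this flow can be sketched as follows. Several parts are assumptions made only for illustration: the patent does not name the term encryption algorithm, so a deterministic keyed hash (HMAC-SHA256 from the Python standard library) stands in for it; `segment` is a whitespace placeholder for the Chinese word segmentation strategy; and `run_query`, `KEY`, `index` and `doc_meta` are hypothetical names.

```python
import hashlib
import hmac

KEY = b"demo-key"  # hypothetical key, for illustration only


def encrypt_term(term: str, key: bytes = KEY) -> str:
    """Deterministic token for a term (stand-in for the unspecified cipher)."""
    return hmac.new(key, term.encode("utf-8"), hashlib.sha256).hexdigest()


def segment(text: str) -> list:
    """Placeholder segmenter: splits on whitespace."""
    return text.split()


def run_query(text, op, index, doc_meta, user_dept, user_level):
    """Return the ids of matching documents the user is allowed to see.

    index maps an encrypted term to the set of document ids containing it;
    doc_meta maps a document id to its 'departments' set and 'level'.
    """
    tokens = [encrypt_term(t) for t in segment(text)]      # steps (1)-(2)
    postings = [index.get(tok, set()) for tok in tokens]   # ciphertext match
    if not postings:
        return []
    combine = set.intersection if op == "AND" else set.union  # step (3)
    matched = combine(*postings)                           # step (4)
    allowed = [
        d for d in matched
        if user_dept in doc_meta[d]["departments"]
        and user_level <= doc_meta[d]["level"]              # access control
    ]
    return sorted(allowed)
```

Because the same keyed function is applied to index terms and search terms, matching is a plain equality lookup on ciphertext tokens, and the access control filter is applied before anything is returned to the user.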
The digest module 410 displays, for each document in the sorted result set, digest information containing the search terms. A single document may contain the search terms at many different positions, and the first N digest passages can be selected for display. Each digest passage contains the search terms highlighted, similar to the search results shown by Baidu.
The snapshot module 420 displays the full plaintext of a document in the sorted result set with the search terms highlighted for ease of reading, and adds a record that the user has read the document to the audit information store 150. Because all text stored on the server is ciphertext, the ciphertext must first be decrypted; the snapshot information is then returned to the user over an encrypted communication channel.
As shown in Fig. 5, the processing flow of the result set processing module is: (1) receive the result set from the query module 300; (2) obtain the digest information of the result set from the ciphertext store; (3) obtain the snapshot information of the result set from the ciphertext store; (4) when the user requests snapshot information, return it to the user and add a record of this operation to the audit information store 150 of the database.
The electronic document processing module 500 is the pre-processing module of the whole system. Its processing flow is: (1) convert each archived file into a plain-text txt file saved in a designated directory, send the content of the plain-text document to the index word segmentation module 610, and send the address and rank level of the plain-text document and the departments permitted to consult it to the index module 600; (2) encrypt the plain-text documents and build the ciphertext store.
The index module 600 is one of the more important parts of the system; it comprises an index word segmentation module 610, an index encryption module 620 and an index submodule 630.
The index word segmentation module 610 performs word segmentation on the content of all plain-text documents to obtain index terms, and sends the segmented index terms to the index encryption module 620. The segmentation strategy is the one described for the query word segmentation module 310.
The index encryption module 620 encrypts the index terms and the address information of the plain-text documents, and sends the encrypted index terms and document addresses to the index submodule 630. The index terms are encrypted with the same algorithm as in the query encryption module, while the document addresses are encrypted with an asymmetric encryption algorithm that offers a higher level of security.
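A small sketch of digest extraction is given below. It operates on already decrypted text and is not taken from the patent: the function name `digests`, the context width and the bracket markers used for highlighting are all assumptions.

```python
import re


def digests(text, terms, n=3, context=10, mark=("[", "]")):
    """Return up to n passages around term occurrences, term highlighted.

    Only the occurrence that anchors each passage is marked.
    """
    if not terms:
        return []
    # Try longer terms first so that they win over their own substrings.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    passages = []
    for m in pattern.finditer(text):
        start = max(0, m.start() - context)
        end = min(len(text), m.end() + context)
        passages.append(
            text[start:m.start()] + mark[0] + m.group() + mark[1] + text[m.end():end]
        )
        if len(passages) == n:
            break
    return passages
```

Here `n` plays the role of N in the description: a document with many occurrences yields only its first `n` passages.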
The index submodule 630 builds the index database from the encrypted index terms together with each document's address, rank level and the departments permitted to consult it. As shown in Fig. 6, each ciphertext index term is associated with the locations of all documents containing that term, and each document has its own internal structure, which includes the document's level information and the set of departments that may access it. Suppose there are three documents: document 1, document 2 and document 3. The content of document 1 is "People's Republic of China, I love the People's Republic of China"; the content of document 2 is "I love China"; the content of document 3 is "College of Computer Science". Suppose the index terms obtained after word segmentation include "China". The ciphertext of "China" then serves as ciphertext index term 1 in Fig. 6, and it is associated with the locations of all documents containing "China", namely location DID1 of document 1 and location DID2 of document 2. Within the index database each document in turn has its own internal structure. Taking document 1 as an example, suppose the index terms obtained after word segmentation are: China, Chinese people, people, republic, people's republic. Then in the index database document 1 contains the ciphertexts of these five index terms together with the rank level and department set of document 1. When building the index database, the system does not record the position of index terms in the original text, in order to guard against statistical attacks and the like. Precisely because no index term carries positional information, positional information cannot be used to compose words, so a word segmentation strategy combining Chinese semantics with automatic segmentation is adopted so that the index database contains as many meaningful words as possible for retrieval.
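The index layout described above can be sketched as two mappings: one from each ciphertext index term to the documents containing it, and one from each document to its own record. The sketch makes several assumptions: HMAC-SHA256 stands in for the term encryption, which the text does not specify; document locations are shown as plain placeholders such as DID1 rather than encrypted addresses; and `build_index` and the field names are hypothetical.

```python
import hashlib
import hmac


def encrypt_term(term: str, key: bytes) -> str:
    """Deterministic token for a term (stand-in for the unspecified cipher)."""
    return hmac.new(key, term.encode("utf-8"), hashlib.sha256).hexdigest()


def build_index(docs, key: bytes):
    """Build (postings, records) from already segmented documents.

    Each element of docs is a dict with 'doc_id', 'terms', 'level' and
    'departments'. No term positions are stored anywhere.
    """
    postings = {}  # ciphertext index term -> set of document ids
    records = {}   # document id -> per-document record
    for doc in docs:
        tokens = {encrypt_term(t, key) for t in doc["terms"]}
        records[doc["doc_id"]] = {
            "tokens": tokens,
            "level": doc["level"],
            "departments": frozenset(doc["departments"]),
        }
        for tok in tokens:
            postings.setdefault(tok, set()).add(doc["doc_id"])
    return postings, records
```

Fed with the five index terms of document 1 and a document 2 that contains "China", the posting set for the token of "China" holds DID1 and DID2, which corresponds to ciphertext index term 1 in the example. The levels and department sets passed in would be whatever the electronic document processing module supplies.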
As shown in Fig. 7, the processing flow of the index module 600 is: (1) receive the information of all archived plain-text txt files from the electronic document processing module 500; (2) perform word segmentation on the plain-text information to obtain all index terms; (3) encrypt the index terms, using the same encryption algorithm as in the query module 300; (4) build the index database from the encrypted index terms together with each document's address, rank level and the departments permitted to consult it.
The audit management module 700 mainly provides a query function over all user operations; user operations can be queried by IP address, user name and time range.
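A query of this kind can be sketched as a filter over audit records. The record layout (`ip`, `user`, `time`, `action`) and the function name `query_audit` are assumptions for illustration; the patent does not describe the schema of the audit information store.

```python
from datetime import datetime


def query_audit(records, ip=None, user=None, start=None, end=None):
    """Return audit records matching every criterion that is given.

    A criterion left as None is not applied; start and end are inclusive.
    """
    return [
        r for r in records
        if (ip is None or r["ip"] == ip)
        and (user is None or r["user"] == user)
        and (start is None or r["time"] >= start)
        and (end is None or r["time"] <= end)
    ]


# Hypothetical sample records.
SAMPLE_LOG = [
    {"ip": "10.0.0.1", "user": "U1", "time": datetime(2006, 9, 1, 9, 0), "action": "snapshot"},
    {"ip": "10.0.0.2", "user": "U2", "time": datetime(2006, 9, 2, 9, 0), "action": "snapshot"},
    {"ip": "10.0.0.1", "user": "U1", "time": datetime(2006, 9, 3, 9, 0), "action": "login"},
]
```

Calling `query_audit(SAMPLE_LOG, user="U1", start=datetime(2006, 9, 2))` keeps only the last sample record.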
The user management module 800 is the module the administrator uses to manage user information. Its processing flow is: (1) the administrator views user information: following the administrator's instruction, the user management module 800 reads the user information store 110 in the database 100 and displays all user information; (2) the administrator fills in the information of a new user to be added: the user management module 800 first checks whether the user name already exists in the user information store 110 of the database 100; if it exists, an error prompt is returned; otherwise the record is added to the user information store 110 and a record of the successful addition is added to the audit information store 150 of the database; (3) the administrator deletes user information: following the administrator's instruction, the user management module 800 deletes the relevant information from the user information store 110 in the database 100 and adds a record of the successful deletion to the audit information store 150 of the database; (4) the administrator modifies user information: following the administrator's instruction, the user management module 800 modifies the corresponding information in the user information store 110 in the database 100 and adds a record of the successful modification to the audit information store 150 of the database.
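The add and delete steps of this flow can be sketched with in-memory stand-ins for the two stores. The class name `UserManager`, its attributes and the tuple format of the audit entries are hypothetical; a real system would use the database 100.

```python
class UserManager:
    def __init__(self):
        self.users = {}  # stand-in for the user information store 110
        self.audit = []  # stand-in for the audit information store 150

    def add_user(self, name, department, level):
        """Add a user unless the name exists; log only successful additions."""
        if name in self.users:
            return False  # the caller would show an error prompt
        self.users[name] = {"department": department, "level": level}
        self.audit.append(("add_user", name))
        return True

    def delete_user(self, name):
        """Delete a user if present; log only successful deletions."""
        if self.users.pop(name, None) is None:
            return False
        self.audit.append(("delete_user", name))
        return True
```

Only successful operations reach the audit list, mirroring the description, where a duplicate user name produces an error prompt rather than an audit record.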
The permission management module 900 is the module the administrator uses to manage permission information. Its processing flow is: (1) the administrator views department information: following the administrator's instruction, the system reads the information in the department information store 120 of the database; (2) the administrator adds new department information: the permission management module 900 first checks whether the department already exists in the department information store 120 of the database; if it exists, an error prompt is returned; otherwise the record is added to the department information store 120 and a record of the successful addition is added to the audit information store 150 of the database; (3) the administrator deletes department information: following the instruction, the permission management module 900 deletes the relevant record from the department information store 120 of the database, cascade-deletes the information of the users belonging to that department, and adds a record of the successful deletion of the department and user information to the audit information store 150 of the database; (4) the administrator modifies department information: the permission management module 900 uses the new information entered by the administrator to update the department information store 120 of the database, updates the corresponding information in the user information store 110 at the same time, and adds a record of the successful modification of the department and user information to the audit information store 150 of the database; (5) the administrator views department information: following the administrator's instruction, the system reads the information in the department information store 120 of the database; (6) the administrator adds new department group information: the permission management module 900 first checks whether the department group already exists in the department group information store 130 of the database; if it exists, an error prompt is returned; otherwise the record is added to the department group information store 130 and a record of the successful addition is added to the audit information store 150 of the database; (7) the administrator deletes department group information: following the instruction, the permission management module 900 deletes the relevant record from the department group information store 130 of the database and adds a record of the successful deletion to the audit information store 150 of the database; (8) the administrator modifies department group information: the permission management module 900 uses the new information entered by the administrator to update the department group information store 130 of the database, and adds a record of the successful modification to the audit information store 150 of the database.
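Step (3), the cascading deletion of a department, might look like the following sketch. The stores are modelled as a set, a dict and a list; the function name `delete_department` and the audit entry format are hypothetical.

```python
def delete_department(departments: set, users: dict, audit: list, dept: str) -> list:
    """Delete a department and cascade to its users.

    Returns the names of the users that were removed. Nothing is changed
    or logged if the department does not exist.
    """
    if dept not in departments:
        return []
    departments.discard(dept)
    removed = [name for name, info in users.items() if info["department"] == dept]
    for name in removed:
        del users[name]
    # One audit entry covers the department and the users removed with it.
    audit.append(("delete_department", dept, tuple(removed)))
    return removed
```

The names to remove are collected before any deletion so that the user dictionary is not modified while it is being iterated.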

Claims (5)

1. A ciphertext-based full-text retrieval system, comprising a database (100), a login module (200), a query module (300), a result set processing module (400), an electronic document processing module (500), an index module (600), an audit management module (700), a user management module (800) and a permission management module (900); wherein,
The database (100) stores information about users and user permissions, and comprises a user information store (110), a department information store (120), a department group information store (130), a level information store (140) and an audit information store (150);
The login module (200) receives service requests entered by users and verifies each request by exchanging information with the database (100); if verification succeeds the user is allowed to enter the system, and if verification fails the user is refused entry; when a user logs in successfully as an administrator, the user selects among the audit management module (700), the user management module (800) and the permission management module (900) to perform management; when a user logs in successfully as an ordinary user, the user enters the query module (300);
The query module (300) receives the search information entered by the user, applies word segmentation, encryption and logical combination to the search information to obtain a query statement, then performs query matching in the index database according to the query statement, returns all document information that matches the query statement and that the user has the right to access, sorts the result set according to the degree of matching of the documents, and passes the sorted result set to the result set processing module (400) for processing;
The result set processing module (400) receives the result set from the query module (300), builds the digest information and snapshot information of the result set from the information in the ciphertext store, and stores a record of the user viewing snapshot information in the database (100);
The electronic document processing module (500) pre-processes the electronic document files to be archived, converts files of specific formats into plain-text files, then encrypts these plain-text files to build the ciphertext store, and provides the index module (600) with the content and title information of all plain-text files;
The index module (600) receives the content and title information of the plain-text files from the electronic document processing module (500), performs word segmentation on the content and title information using the word segmentation strategy that combines Chinese semantics with automatic segmentation to obtain index terms, then encrypts the index terms, and builds the index database from the encrypted index terms and the related document information;
The audit management module (700) receives query information entered by the user and, by exchanging information with the database (100), queries user operations by IP address, user name and time range to obtain all records that satisfy the query conditions;
The user management module (800) receives operation requests from the administrator, manages user information and interacts with the database (100) to display, add, delete and modify user information, and records the administrator's operations in the database (100);
The permission management module (900) receives operation requests from the administrator, manages department permissions and department groups, interacts with the database (100), and records the administrator's operations in the database (100).
2. The full-text retrieval system according to claim 1, characterized in that: the login module (200) comprises a password authentication module (210) and a verification module (220);
The password authentication module (210) retrieves the user's password from the user information store (110) of the database (100), decrypts it, and compares it with the password entered by the user to determine whether the entered password is correct;
The verification module (220) checks whether the password stored in the database has been maliciously altered.
3. The full-text retrieval system according to claim 1, characterized in that: the query module (300) comprises a query word segmentation module (310), a query encryption module (320), a logical combination module (330), a query submodule (340), an access control module (350) and a result set sorting module (360);
The query word segmentation module (310) receives the user's search command, segments the search command using the word segmentation strategy that combines Chinese semantics with automatic segmentation, and sends the segmented search terms to the query encryption module (320);
The query encryption module (320) encrypts the segmented search terms and sends the encrypted search terms to the logical combination module (330);
The logical combination module (330) combines the encrypted search terms logically and sends the logical combination information to the query submodule (340);
The query submodule (340) uses the logical combination information to look up all matching document information in the index database, uses the access control module (350) to screen the matching document information, selects from it the part that satisfies the access control requirements as the result set, and sends the result set to the result set sorting module (360);
The access control module (350) screens the matching document information that the query submodule (340) found in the index database using the logical combination information, so that each user can retrieve only documents within the scope of his or her permissions;
The result set sorting module (360) ranks the result set from the query submodule (340) by priority and sends the sorted result set to the result set processing module (400).
4. The full-text retrieval system according to claim 1, characterized in that: the result set processing module (400) comprises a digest module (410) and a snapshot module (420); wherein,
The digest module (410) displays, for the documents in the sorted result set, the digest information containing the search terms;
The snapshot module (420) displays the full plaintext of a document in the sorted result set with the search terms highlighted for ease of reading, and adds a record that the user has read the document to the audit information store (150).
5. The full-text retrieval system according to claim 1, characterized in that: the index module (600) comprises an index word segmentation module (610), an index encryption module (620) and an index submodule (630);
The index word segmentation module (610) performs word segmentation on the content of all plain-text documents using the same segmentation method as the query word segmentation module (310) to obtain index terms, and sends the segmented index terms to the index encryption module (620);
The index encryption module (620) encrypts the index terms and the address information of the plain-text documents using the same encryption algorithm as the query encryption module (320), and sends the encrypted index terms and document address information to the index submodule (630);
The index submodule (630) builds the index database from the encrypted index terms together with each document's address, rank level and the departments permitted to consult it.
CNB2006101246911A 2006-09-30 2006-09-30 Full text search system based on ciphertext Expired - Fee Related CN100424704C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006101246911A CN100424704C (en) 2006-09-30 2006-09-30 Full text search system based on ciphertext

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2006101246911A CN100424704C (en) 2006-09-30 2006-09-30 Full text search system based on ciphertext

Publications (2)

Publication Number Publication Date
CN1932816A true CN1932816A (en) 2007-03-21
CN100424704C CN100424704C (en) 2008-10-08

Family

ID=37878651

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006101246911A Expired - Fee Related CN100424704C (en) 2006-09-30 2006-09-30 Full text search system based on ciphertext

Country Status (1)

Country Link
CN (1) CN100424704C (en)

Also Published As

Publication number Publication date
CN100424704C (en) 2008-10-08

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20081008

Termination date: 20120930

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20070321

Assignee: Wuhan Huagong Anding Information Technology Co., Ltd.

Assignor: Huazhong University of Science and Technology

Contract record no.: 2011420000102

Denomination of invention: Full text search system based on ciphertext

Granted publication date: 20081008

License type: Exclusive License

Record date: 20110527

LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model