CN108520002A - Data processing method, server and computer storage media

Info

Publication number: CN108520002A
Application number: CN201810198710.8A
Authority: CN
Inventors: 张师琲; 侯丽
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2018-03-12
Filing date: 2018-03-12
Publication date: 2018-09-11
Also published as: WO2019174132A1

Abstract

The invention discloses a data processing method, which includes: building an index over text data or other types of data and generating an index file; storing the index file in an index database; searching the index file in the index database according to a query object; scoring the search results according to a search evaluation method; ranking the search results by score; and outputting and displaying, in a preset manner, the search results whose scores are higher than a preset threshold. The invention also provides a server and a computer-readable storage medium. The data processing method, server, and computer-readable storage medium provided by the invention allow searches on fuzzy text to be performed quickly and achieve fast fuzzy matching.

Description

Data processing method, server and computer storage media

Technical field

The present invention relates to the technical field of data analysis, and in particular to a data processing method, a server, and a computer storage medium.

Background art

In the current era of information explosion, every organization and individual contributes to the rapid growth of information. The types of information also keep expanding, and more and more unstructured information keeps emerging, including enterprises' reports, bills, electronic documents, and so on. This unstructured information is stored in databases, which often need to be searched; for searches on fuzzy text, however, querying the database directly is very inefficient. How to improve retrieval efficiency for fuzzy-text searches is therefore a problem that urgently needs to be solved.

Summary of the invention

In view of this, the present invention proposes a data processing method, a server, and a computer storage medium, to solve the problem of how to improve retrieval efficiency for fuzzy-text searches.

First, to achieve the above object, the present invention proposes a data processing method comprising the steps of:

Obtaining text data or other types of data from a database, and processing the text data or other types of data in the database;

Based on the Lucene search engine, building an index over the processed text data or other types of data, generating an index file, and storing the index file in an index database;

Receiving query information input by a user, processing the query information to generate a query object, searching the index file in the index database according to the query object, and scoring the search results with a preset search evaluation model; and

Sorting the search results by score from high to low, and outputting and displaying, in a preset manner, the search results whose scores are higher than a preset threshold;

Wherein the preset manner is to generate a bar chart of the scores and to output and display the scores as percentages, and the preset threshold is 40%.

Preferably, the other types of data include pdf file data and office file data, and the step of processing the text data or other types of data in the database includes:

Converting the other types of data into text data;

Performing word segmentation processing on the converted text data and on the text data in the database, following the steps of word segmentation, part-of-speech tagging, and word filtering; and

Generating a word segmentation result, taking the filtered words as the final word segmentation result, and taking the final word segmentation result as the processed text data or other types of data.

Preferably, the step of "building an index over the processed text data or other types of data based on the Lucene search engine and generating an index file" includes:

Constructing an index database and setting its location, which is where the index is stored;

Constructing an index creator for creating the index; and

Building an index over the segmented text data or other types of data, creating a corresponding document description according to the file type, and setting the content of the corresponding attribute fields.

Preferably, the step of processing the query information to generate a query object includes:

Performing word segmentation processing on the query information to obtain a segmented word set, the word segmentation processing including: word segmentation, part-of-speech tagging, and word filtering;

Performing synonym and near-synonym conversion on the words in the segmented word set to obtain a synonym and near-synonym set of the segmented word set; and

Taking the words in the segmented word set and in the synonym and near-synonym set as the query object.

Preferably, scoring the search results with the search evaluation model includes the following steps:

Obtaining a first score for the search according to a first scoring formula;

Obtaining a second score for the search according to the minimum edit distance method; and

Taking the average of the first score and the second score as the final score of the search.

Preferably, the first scoring formula is:

where Score is the first score, q is the query information, t is each term obtained by segmenting the query information, and d is the document being matched; the function tf(t in d) is the frequency with which term t occurs in the document; the function idf(t)² reflects the frequency with which term t occurs across all documents; boost(t.field in d) is the boost factor; the value of boost(t.field in d) * lengthNorm(t.field in d) reflects the total number of terms contained in the given field of the search result; coord(q, d) means that the more search terms a document contains, the higher that document's score; and queryNorm(q) computes the sum of variances of the individual query terms.
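The formula itself is not reproduced in this text. A reconstruction that is consistent with the factor definitions given here, assuming the formula follows the classic Lucene TF-IDF scoring function (an assumption, not confirmed by the source), would be:

```latex
\mathrm{Score}(q, d) = \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \Big( \mathrm{tf}(t \text{ in } d) \cdot \mathrm{idf}(t)^{2} \cdot \mathrm{boost}(t.\mathrm{field} \text{ in } d) \cdot \mathrm{lengthNorm}(t.\mathrm{field} \text{ in } d) \Big)
```

Each factor in the sum is evaluated per query term t, while coord(q, d) and queryNorm(q) are applied once per document and once per query respectively.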

Preferably, the value of the function tf(t in d) is set to 1, which removes the influence of repeated words on the first score.

Preferably, the step of "obtaining a second score for the search according to the minimum edit distance method" includes:

Calculating the edit distance between the query object and each search result;

Obtaining the minimum edit distance; and

Taking the value of the minimum edit distance as the second score.

In addition, to achieve the above object, the present invention also provides a server comprising a memory, a processor, and a data processing system stored in the memory and executable on the processor, wherein the data processing system, when executed by the processor, implements the steps of the data processing method described above.

Further, to achieve the above object, the present invention also provides a computer-readable storage medium storing a data processing system, wherein the data processing system is executable by at least one processor to cause the at least one processor to perform the steps of the data processing method described above.

Compared with the prior art, the data processing method, server, and computer-readable storage medium proposed by the present invention first obtain text data or other types of data from a database and process them, build an index over the text data or other types of data based on the Lucene search engine and generate an index file, write weights into the index while the index is being built, and store the index file in an index database; second, receive query information input by a user, process the query information to generate a query object, search the index file in the index database according to the query object, and score the search results according to a search evaluation method; and finally, sort the search results by score in a preset manner and output and display, in a preset manner, the search results whose scores are higher than a preset threshold. With the proposed data processing method, server, and computer-readable storage medium, searches on fuzzy text can be performed quickly and fuzzy matching achieved quickly; compared with the prior art, this is more convenient, faster, and more accurate, and greatly improves retrieval efficiency.

Description of the drawings

Fig. 1 is a schematic diagram of an optional hardware architecture of the server of the present invention;

Fig. 2 is a program module diagram of the first embodiment of the data processing system of the present invention;

Fig. 3 is a flow diagram of the first embodiment of the data processing method of the present invention;

Fig. 4 is a flow diagram of the second embodiment of the data processing method of the present invention;

Fig. 5 is a flow diagram of the third embodiment of the data processing method of the present invention;

Fig. 6 is a flow diagram of the fourth embodiment of the data processing method of the present invention;

Fig. 7 is a flow diagram of the fifth embodiment of the data processing method of the present invention.

The realization of the object, the functional features, and the advantages of the present invention are further described below with reference to the embodiments and the accompanying drawings.

Detailed description

To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

It should be noted that descriptions involving "first", "second", and the like in the present invention are for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with one another, provided that the combination can be implemented by a person of ordinary skill in the art; where a combination of technical solutions is contradictory or cannot be implemented, the combination should be regarded as not existing and as falling outside the protection scope claimed by the present invention.

Referring to Fig. 1, a schematic diagram of an optional hardware architecture of the server 1 of the present invention is shown.

In this embodiment, the server 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13, which can be communicatively connected to one another through a system bus. It should be noted that Fig. 1 shows only the server 1 with components 11-13; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.

The server 1 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server, and may be an independent server or a server cluster composed of multiple servers.

The memory 11 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, and the like. In some embodiments, the memory 11 may be an internal storage unit of the server 1, such as the hard disk or internal memory of the server 1. In other embodiments, the memory 11 may be an external storage device of the server 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the server 1. Of course, the memory 11 may also include both an internal storage unit of the server 1 and an external storage device. In this embodiment, the memory 11 is generally used to store the operating system and the various application software installed on the server 1, such as the program code of the data processing system 2. In addition, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.

In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 12 is generally used to control the overall operation of the server 1. In this embodiment, the processor 12 is used to run the program code stored in the memory 11 or to process data, for example to run the data processing system 2.

The network interface 13 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the server 1 and other electronic devices.

The hardware structure and functions of the relevant device of the present invention have now been described in detail. The embodiments of the present invention are presented below on that basis.

First, the present invention proposes a data processing system 2.

Referring to Fig. 2, a program module diagram of the first embodiment of the data processing system 2 of the present invention is shown.

In this embodiment, the data processing system 2 includes a series of computer program instructions stored in the memory 11; when these computer program instructions are executed by the processor 12, the data processing operations of the embodiments of the present invention can be implemented. In some embodiments, the data processing system 2 may be divided into one or more modules according to the specific operations implemented by the respective parts of the computer program instructions. For example, in Fig. 2, the data processing system 2 may be divided into an index building module 21, a search scoring module 22, and a sorting output module 23, where:

The index building module 21 is used to obtain text data or other types of data from the database, process the text data or other types of data in the database, build an index over the text data or other types of data based on the Lucene search engine and generate an index file, write weights into the index while the index is being built, and store the index file in an index database.

Specifically, Lucene is an open-source library for full-text indexing and search, supported and provided by the Apache Software Foundation. It offers a simple yet powerful application programming interface for full-text indexing and search. Lucene is a high-performance, scalable information retrieval library.

Specifically, the database implementation differs from one specialized company to another; the main database type is Oracle, and databases of other types, such as PostgreSQL and MySQL, may also be used.

Specifically, weights are written into the index at indexing time and read back at query time, where they are applied as multipliers to raise the score of certain search results.

Specifically, the text data or other types of data in the database can be processed in a variety of ways; for example, documents of non-text types can be converted to another type so that they can be indexed more smoothly.

Specifically, building the index includes the steps of constructing an index database, constructing an index creator, and building the index with the index creator.

Specifically, an index database directory is constructed to hold the index, and the location of the index database, that is, the location where the index is stored, is set.

Specifically, an index creator IndexWriter is constructed. The file index created by the index creator is stored at the location of the index database; if the index database contains no index yet, the index is created in create mode; otherwise it is created in append mode.

Specifically, an index is built over the obtained text data or other types of data: a corresponding document description (Document) is created according to the file type, and the content of the corresponding attribute fields (Field) is set, such as file name, file path, and file content.
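The document-and-field structure described above can be illustrated with a small in-memory inverted index. This is a minimal sketch in plain Python, not the Lucene API: the function `build_index`, the field names `file_name`, `file_path`, and `content`, and the `boost` key are hypothetical names chosen for illustration, and the content is assumed to be already segmented into space-separated words.

```python
from collections import defaultdict

def build_index(documents):
    """Build a toy inverted index: term -> {doc_id: occurrence count}.

    `documents` is a list of dicts with the hypothetical keys
    'file_name', 'file_path', 'content' and an optional 'boost'.
    """
    postings = defaultdict(dict)   # term -> {doc_id: term frequency}
    stored = {}                    # doc_id -> stored fields and weight
    for doc_id, doc in enumerate(documents):
        stored[doc_id] = {
            "file_name": doc["file_name"],
            "file_path": doc["file_path"],
            "boost": doc.get("boost", 1.0),  # weight written at index time
        }
        for term in doc["content"].split():
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    return postings, stored

docs = [
    {"file_name": "a.txt", "file_path": "/data/a.txt",
     "content": "shoes clothing", "boost": 2.0},
    {"file_name": "b.txt", "file_path": "/data/b.txt",
     "content": "shoes shoes hat"},
]
postings, stored = build_index(docs)
print(postings["shoes"])  # postings list for the term "shoes"
```

Storing the weight alongside each document mirrors the idea of writing the weight at indexing time and reading it back as a multiplier at query time.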

The search scoring module 22 is used to receive query information input by a user, process the query information to generate a query object, search the index file in the index database according to the query object, and score the search results with a preset search evaluation model.

Specifically, the query information input by the user may be a character string, a number, a sentence, or even a paragraph. Query information in such non-uniform formats can be processed so that it meets the format requirements of the search scoring module 22; for example, the query information can be segmented, filtered, and subjected to synonym and near-synonym conversion, so that it is converted into a conforming query object.

Specifically, scoring the retrieved content is the focus of the present invention. The search evaluation model scores the search results in two ways, namely a scoring formula based on the Lucene engine and the minimum edit distance; each scores the retrieved content separately, and the two scores are then combined using weighting factors determined for each to obtain the final score.

The scoring formula based on the Lucene engine is:

where q is the query statement, t is each term obtained by segmenting q, and d is the document being matched.

Specifically, the functions in the Lucene-based scoring formula work as follows:

tf(t in d): this function gives the frequency with which term t occurs in the field of the document; in the example of the figure above, this is how often a term obtained from segmentation occurs in the record. The more often it occurs, the larger the returned value, which reflects the importance of the document. To ensure the accuracy of the search results, the value of tf is set to 1, for the following reason. Suppose the search is for "China Ping An" and the matching results are (1) "Ping An Group", (2) "China Ping An", and (3) "China Ping An Nanjing Ping An Branch". Under the original scoring rule, the third result would have the highest matching degree, because "Ping An" occurs twice in it. By ordinary logic, however, the result with the highest matching degree should be the exact match, "China Ping An". To avoid this, the value of tf is changed to 1, so that repeated occurrences of the same word do not affect the score. The fuzzy matching here is performed on customer information, which is usually a very short phrase, so repeated words should not be a basis for a higher score. This gives a higher matching degree for short phrases.
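The effect of fixing tf at 1 can be sketched with three candidates of the kind described above. This is an illustrative simplification, not the full scoring formula: the function `term_frequency_score` is a hypothetical helper that sums term frequencies only, all other factors are deliberately left out, and the company names are assumed to be already segmented into the lists shown.

```python
def term_frequency_score(query_terms, doc_terms, clamp_tf=False):
    """Sum of term frequencies of the query terms in one document.

    All other scoring factors are left out so that only the effect of tf
    is visible. With clamp_tf=True, tf is fixed at 1 for any term that
    occurs at least once.
    """
    score = 0
    for term in set(query_terms):
        tf = doc_terms.count(term)
        if clamp_tf and tf > 0:
            tf = 1
        score += tf
    return score

query = ["China", "Ping An"]
results = {
    "Ping An Group": ["Ping An", "Group"],
    "China Ping An": ["China", "Ping An"],
    "China Ping An Nanjing Ping An Branch":
        ["China", "Ping An", "Nanjing", "Ping An", "Branch"],
}

raw = {name: term_frequency_score(query, terms)
       for name, terms in results.items()}
clamped = {name: term_frequency_score(query, terms, clamp_tf=True)
           for name, terms in results.items()}
print(raw)      # the longest name leads because "Ping An" is counted twice
print(clamped)  # repetition no longer gives the longest name an advantage
```

With tf fixed at 1, the exact match and the longer name tie on this factor alone; in the full model it would fall to the remaining factors, such as length normalization and the edit-distance score, to separate them.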

idf(t): this function appears twice in the formula, which corresponds to idf(t)^2. It is called the inverse document frequency and reflects how frequently term t occurs across all documents: the more documents it occurs in, the less important term t is.

boost(t.field in d) is the boost factor, recorded when the index is created, whereas the value of lengthNorm(t.field in d) is calculated during the query. The value of boost(t.field in d) * lengthNorm(t.field in d) reflects the total number of terms contained in the given field of the search result: the larger that total, the lower the score. For example, if document A contains 1000 terms and the keyword occurs 10 times, while document B contains 20 terms and the same keyword occurs 8 times, document B should clearly receive the higher score.

coord(q, d): a single search may contain several search terms, and a document may likewise contain several of them; this factor means that the more of the search terms a document contains, the higher the document's score.

queryNorm(q): this computes the sum of variances of the individual query terms. This value does not affect the ranking; it only makes scores comparable across different query objects.
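How the factors above combine can be sketched in a few lines of Python. This is a hedged illustration, not Lucene's implementation: the function name `lucene_like_score` is hypothetical, and the concrete definitions of idf, lengthNorm, coord, and queryNorm are assumed to follow the classic Lucene defaults, which the source does not spell out.

```python
import math

def lucene_like_score(query_terms, doc_terms, all_docs, boost=1.0, fixed_tf=True):
    """Score one document against a query with the factors named in the text.

    Assumed factor definitions (classic Lucene defaults, not from the source):
      idf(t)      = 1 + ln(N / (df + 1))
      lengthNorm  = 1 / sqrt(number of terms in the document)
      coord(q, d) = matched query terms / total query terms
      queryNorm   = 1 / sqrt(sum of idf(t)^2 over the query terms)
    """
    unique_terms = list(dict.fromkeys(query_terms))
    if not unique_terms or not doc_terms:
        return 0.0
    n_docs = len(all_docs)

    def idf(term):
        df = sum(1 for d in all_docs if term in d)
        return 1.0 + math.log(n_docs / (df + 1))

    length_norm = 1.0 / math.sqrt(len(doc_terms))
    query_norm = 1.0 / math.sqrt(sum(idf(t) ** 2 for t in unique_terms))

    total, matched = 0.0, 0
    for t in unique_terms:
        freq = doc_terms.count(t)
        if freq == 0:
            continue
        matched += 1
        tf = 1.0 if fixed_tf else math.sqrt(freq)  # tf fixed at 1, as in the text
        total += tf * idf(t) ** 2 * boost * length_norm
    coord = matched / len(unique_terms)
    return coord * query_norm * total

corpus = [
    ["China", "Ping An"],
    ["Ping An", "Group"],
    ["China", "Ping An", "Nanjing", "Ping An", "Branch"],
]
query = ["China", "Ping An"]
scores = [lucene_like_score(query, d, corpus) for d in corpus]
print(scores)
```

Under these assumptions the exact match ranks first, the longer name that contains both terms second, and the document matching only one term last.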

Specifically, the edit distance, also known as the Levenshtein distance, is the minimum number of edit operations needed to transform one string into another. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. In general, the smaller the edit distance, the more similar the two strings. By calculating the edit distance between each search result and the query object, the second score of the search result can be obtained, and the search result with the minimum edit distance is the one most similar to the query object.
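A minimal sketch of the edit distance computation, using the standard dynamic-programming formulation with the three operations named above. The helper `closest_result` is a hypothetical name for picking the search result with the minimum distance to the query.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]

def closest_result(query, results):
    """Return (result, distance) for the result with the minimum edit distance."""
    return min(((r, edit_distance(query, r)) for r in results),
               key=lambda pair: pair[1])

print(edit_distance("kitten", "sitting"))
print(closest_result("中国平安", ["平安集团", "中国平安", "中国平安南京平安分公司"]))
```

Only two rows of the distance table are kept at a time, so memory use grows with the length of the shorter loop dimension rather than with the product of both lengths.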

The sorting output module 23 is used to sort the search results by score in a preset manner and to output and display, in a preset manner, the search results whose scores are higher than a preset threshold.

Specifically, the preset manner may be a percentage format: the final scores are output as percentages in descending order, which helps the user understand the degree of matching. A bar chart can also be generated, which is more intuitive.

Specifically, the preset threshold is set in order to filter out the most valuable group of search results; for example, the preset threshold can be set to 40%.
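The sorting, thresholding, and percentage display described above can be sketched as follows. The function names `rank_and_filter` and `render_bar_chart` are hypothetical, the scores are assumed to be already normalized to the range 0 to 1, and a plain-text bar stands in for whatever chart the actual system draws.

```python
def rank_and_filter(scored_results, threshold=0.40):
    """Sort (name, score) pairs from high to low and keep scores above the threshold.

    Scores are assumed to be normalized to the range 0..1.
    """
    ranked = sorted(scored_results, key=lambda pair: pair[1], reverse=True)
    return [(name, score) for name, score in ranked if score > threshold]

def render_bar_chart(ranked, width=20):
    """Render each result as a text bar with its score shown as a percentage."""
    lines = []
    for name, score in ranked:
        bar = "#" * round(score * width)
        lines.append(f"{name:<12} {bar:<{width}} {score:.0%}")
    return "\n".join(lines)

results = [("result A", 0.35), ("result B", 0.92), ("result C", 0.58)]
top = rank_and_filter(results)
print(render_bar_chart(top))
```

Filtering after sorting keeps the two concerns separate, so the threshold can be changed without touching the ordering logic.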

In addition, the present invention also proposes a data processing method.

Referring to Fig. 3, a flow diagram of the first embodiment of the data processing method of the present invention is shown. In this embodiment, the execution order of the steps in the flowchart shown in Fig. 3 may be changed according to different requirements, and certain steps may be omitted.

Step S110: obtain the text data or other types of data in the database, process the text data or other types of data in the database, build an index over the text data or other types of data based on the Lucene search engine and generate an index file, write weights into the index while the index is being built, and store the index file in an index database.

Step S120: receive query information input by a user, process the query information to generate a query object, search the index file in the index database according to the query object, and score the search results according to a search evaluation method.

Specifically, the query information input by the user may be a character string, a number, a sentence, or even a paragraph. Query information in such non-uniform formats can be processed so that it can be searched by the search scoring module 22; for example, the query information can be segmented, filtered, and subjected to synonym and near-synonym conversion, so that it is converted into a conforming query object.

Step S130: sort the search results by score in a preset manner, and output and display, in a preset manner, the search results whose scores are higher than a preset threshold.

As shown in Fig. 4, a flow diagram of the second embodiment of the data processing method of the present invention is shown. In this embodiment, the method in step S110 of processing the text data and other types of data in the database includes the following steps:

Step S210: convert the other types of data into text data.

Specifically, other types of data are converted into text data. For example, some data is stored on the server in the form of pdf or office documents, and the text is extracted from the office and pdf documents with tools such as Apache POI and Apache PDFBox.

Step S220: perform word segmentation processing on the text data, following the steps of word segmentation, part-of-speech tagging, and word filtering.

Specifically, the text data obtained in the previous step (including both the text data in the database and the converted text data) undergoes word segmentation processing, which includes word segmentation, part-of-speech tagging, and word filtering. Word segmentation cuts a sentence into words mainly by using context, so as to avoid incorrect segmentation, because different ways of cutting the same sentence often yield different meanings. For example, 鞋子和服装 ("shoes and clothing") should be cut as 鞋子/和/服装 ("shoes / and / clothing"); if it is cut as 鞋子/和服/装 ("shoes / kimono / outfit"), the meaning is clearly far from what was intended. After word segmentation, the resulting words can be tagged with their parts of speech using a rule-based and statistical method, which may be a hidden Markov model; for example, part-of-speech tagging classifies "shoes" and "clothing" as nouns and "and" as a conjunction. Part-of-speech tagging is followed by word filtering, whose purpose is to remove unimportant words; this slims down the index database and improves retrieval efficiency. For example, the nouns "shoes" and "clothing" are kept, and the conjunction "and" is filtered out.
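The word-filtering step can be sketched as a simple filter over words that have already been segmented and tagged. This is a minimal illustration only: the tag letters and the set `FILTERED_TAGS` are hypothetical, and the segmentation and tagging themselves, which the text attributes to a context-based segmenter and a hidden Markov model, are assumed to have been done upstream.

```python
# Hypothetical part-of-speech tags: "n" = noun, "c" = conjunction, "u" = particle.
FILTERED_TAGS = {"c", "u"}

def filter_words(tagged_words, filtered_tags=FILTERED_TAGS):
    """Keep only the words whose part-of-speech tag is not in filtered_tags."""
    return [word for word, tag in tagged_words if tag not in filtered_tags]

tagged = [("鞋子", "n"), ("和", "c"), ("服装", "n")]  # shoes / and / clothing
kept = filter_words(tagged)
print(kept)
```

Only the two nouns survive, which is the set of words that would go on to be indexed.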

Step S230: generate the word segmentation result, taking the filtered words as the final word segmentation result.

Specifically, the filtered words include the segmented words of the text data and other types of data in the database, as well as the synonyms and near-synonyms of the segmented words; these words serve as the processing result for the next step, so that the system can retrieve more effectively.

The execution order of the steps in the flowchart shown in Fig. 4 may be changed, and certain steps may be omitted.

As shown in Fig. 5, a flow diagram of the third embodiment of the data processing method of the present invention is shown. In this embodiment, the method in step S110 of building an index over the text data or other types of data based on the Lucene search engine includes the following steps:

Step S310: construct an index database and set its location, which is where the index is stored.

Step S320: construct an index creator for creating the index.

Step S330: build an index over the obtained text data or other types of data, create a corresponding document description according to the file type, and set the content of the corresponding attribute fields.

The execution order of the steps in the flowchart shown in Fig. 5 may be changed, and certain steps may be omitted.

As shown in Fig. 6, a flow diagram of the fourth embodiment of the data processing method of the present invention is shown. In this embodiment, the method in step S120 of processing the query information to generate a query object includes the following steps:

Step S410: perform word segmentation processing on the query information.

Specifically, word segmentation processing is performed on the query information, and includes word segmentation, part-of-speech tagging, and word filtering. Word segmentation cuts a sentence into words mainly by using context, so as to avoid incorrect segmentation, because different ways of cutting the same sentence often yield different meanings. For example, 鞋子和服装 ("shoes and clothing") should be cut as 鞋子/和/服装 ("shoes / and / clothing"); if it is cut as 鞋子/和服/装 ("shoes / kimono / outfit"), the meaning is clearly far from what was intended. After word segmentation, the resulting words can be tagged with their parts of speech using a rule-based and statistical method, which may be a hidden Markov model; for example, part-of-speech tagging classifies "shoes" and "clothing" as nouns and "and" as a conjunction. Part-of-speech tagging is followed by word filtering, whose purpose is to remove unimportant words; this slims down the index database and improves retrieval efficiency. For example, the nouns "shoes" and "clothing" are kept, and the conjunction "and" is filtered out. The words that remain after filtering form the segmented word set.

Step S420: perform synonym and near-synonym conversion on the words in the segmented word set to obtain the synonym and near-synonym set of the segmented word set.

Specifically, synonym and near-synonym conversion is performed on the words in the segmented word set to obtain its synonym and near-synonym set, and the words in the segmented word set and in the synonym and near-synonym set are taken as the query object. The words in the segmented word set are expanded conceptually into their synonyms, near-synonyms, or hypernyms and hyponyms; some of the expansion words are extracted according to a similarity-priority algorithm, or the expansion words selected by the user are accepted; finally, the words in the segmented word set and the selected expansion words are passed together as query conditions to the retrieval module as the query object. For example, if the user enters "How is China's economic situation this year?", the system obtains the two query words "China" and "economy"; the retrieval information processing module can then obtain expansion words for "China", such as "mainland", "inland", and "country", and expansion words for "economy", such as "GDP", "trade", "business", "finance and economics", and "finance".
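The query expansion step can be sketched with a lookup table of synonyms. This is a minimal illustration: the function `expand_query`, the table `SYNONYMS`, and the cap `max_expansions` are hypothetical, and taking the first few table entries stands in for the similarity-priority selection, which the source does not detail.

```python
def expand_query(segmented_words, synonym_table, max_expansions=3):
    """Return the segmented words followed by up to max_expansions synonyms of each."""
    query_terms = list(segmented_words)
    for word in segmented_words:
        for synonym in synonym_table.get(word, [])[:max_expansions]:
            if synonym not in query_terms:
                query_terms.append(synonym)
    return query_terms

# Hypothetical synonym table, loosely based on the expansion words in the example.
SYNONYMS = {
    "China": ["mainland", "inland", "country"],
    "economy": ["GDP", "trade", "business", "finance"],
}
query_object = expand_query(["China", "economy"], SYNONYMS)
print(query_object)
```

The original query words stay at the front of the list, so a later scoring stage could still treat them differently from the expansion words if that were wanted.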

Step S430 collects the participle, and the word that synonym, near synonym are concentrated is as query object

Specifically, it is converted after Query Information input by user being segmented to obtain synonym, the near synonym of participle, profit With word segmentation result and its synonym, near synonym carry out inquiry to the content in index database can be more comprehensive, accurate and rapid, also more Meet the definition of fuzzy search.

The execution order of the steps in the flowchart shown in Fig. 6 may be changed, and certain steps may be omitted.

Fig. 7 is a flow diagram of the fourth embodiment of the data processing method of the present invention. In this embodiment, the evaluation and scoring of the search results according to the search evaluation method in step S210 of the data processing method comprises the following steps:

Step S510: obtain the first score of this search according to the first scoring formula.

Specifically, the preset search evaluation method scores the search results using a search scoring model that comprises the first scoring formula and the minimum edit distance method, wherein the first scoring formula is:

Score(q, d) = coord(q, d) × queryNorm(q) × Σ_{t in q} [ tf(t in d) × idf(t)² × boost(t.field in d) × lengthNorm(t.field in d) ]

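Specifically, the role of each function in the first scoring formula is as follows: Score is the first score; q is the query information; t is each term obtained after the query information is segmented; d is the document being matched; tf(t in d) is the frequency with which term t occurs in the document; idf(t)² reflects how frequently term t occurs across all documents; boost(t.field in d) is a boost factor; the value of boost(t.field in d) × lengthNorm(t.field in d) reflects the total number of terms contained in the given field for this search result; coord(q, d) expresses that the more search terms a document contains, the higher that document's score; and queryNorm(q) computes the sum of the variances of the individual query terms.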
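A tf-idf style score of the kind the first scoring formula describes can be sketched as below. The concrete definitions of each factor (square-root tf, logarithmic idf, length normalisation by the inverse square root of the document length, coord as the matched fraction of query terms, queryNorm as an inverse root of summed squared idf values) are assumptions loosely modelled on Lucene's classic similarity, not definitions taken from this document; the `fixed_tf` switch illustrates setting tf to 1 so that repeated words do not affect the score.

```python
import math
from typing import List


def first_score(query_terms: List[str], doc: List[str],
                corpus: List[List[str]], fixed_tf: bool = False,
                boost: float = 1.0) -> float:
    """Illustrative tf-idf score of one tokenised document for a query."""
    matched = [t for t in query_terms if t in doc]
    if not matched:
        return 0.0

    def idf(term: str) -> float:
        # Assumed idf: rarer terms across the corpus get larger values.
        doc_freq = sum(1 for d in corpus if term in d)
        return 1.0 + math.log(len(corpus) / (doc_freq + 1))

    coord = len(matched) / len(query_terms)       # share of query terms hit
    query_norm = 1.0 / math.sqrt(sum(idf(t) ** 2 for t in query_terms))
    length_norm = 1.0 / math.sqrt(len(doc))       # shorter fields score higher

    total = 0.0
    for t in matched:
        tf = 1.0 if fixed_tf else math.sqrt(doc.count(t))
        total += tf * idf(t) ** 2 * boost * length_norm
    return coord * query_norm * total


corpus = [["shoes", "clothes"], ["shoes", "hat"], ["book", "pen"]]
query = ["shoes", "clothes"]
scores = [first_score(query, d, corpus) for d in corpus]
```

With this toy corpus the document containing both query terms scores highest, the document containing one scores lower, and the document containing none scores zero.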

Step S520: obtain the second score of this search according to the minimum edit distance method.

Specifically, the edit distance, also known as the Levenshtein distance, is the minimum number of edit operations needed to transform one string into another. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. In general, the smaller the edit distance, the greater the similarity between the two strings. The second score of a search result is obtained by computing the edit distance between the search result and the query object, and the search result with the smallest edit distance is the one most similar to the query object.

Specifically, the step of "obtaining the second score of this search according to the minimum edit distance method" comprises:

Obtaining the minimum edit distance; and

Using the value of the minimum edit distance as the second score.
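The edit distance used for the second score can be computed with the standard dynamic-programming recurrence. The sketch below is a generic Levenshtein implementation with unit cost for each of the three permitted operations; the function names are illustrative only.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions from a to b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string needs i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def second_score(search_result: str, query: str) -> int:
    """Use the edit distance value directly as the second score."""
    return edit_distance(search_result, query)
```

Keeping only two rows of the table limits memory to the length of the shorter dimension, which is convenient when many search results are scored against one query.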

Specifically, different weight factors can be assigned to the first score and the second score. Each score is multiplied by its weight factor and the products are added to give the evaluation score of the search result. For example, the formula can be: score = weight factor A × first score + weight factor B × second score. The values of weight factor A and weight factor B are set as required; for example, if the mean of the two scores is wanted, both weight factor A and weight factor B can be set to 0.5.
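The weighted combination, followed by the high-to-low ordering the method applies to the scoring results, can be sketched as below. The function names and the (name, first score, second score) tuple layout are assumptions made for illustration.

```python
from typing import List, Tuple


def combined_score(first: float, second: float,
                   weight_a: float = 0.5, weight_b: float = 0.5) -> float:
    """score = weight factor A * first score + weight factor B * second score."""
    return weight_a * first + weight_b * second


def rank_results(scored: List[Tuple[str, float, float]],
                 weight_a: float = 0.5,
                 weight_b: float = 0.5) -> List[Tuple[str, float]]:
    """Combine both scores per result and sort from highest to lowest."""
    ranked = [(name, combined_score(first, second, weight_a, weight_b))
              for name, first, second in scored]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


ranking = rank_results([("doc_a", 1.0, 1.0), ("doc_b", 4.0, 2.0)])
```

With both weights left at 0.5 the combined score is the mean of the two scores, as in the example given in the text.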

The execution order of the steps in the flowchart shown in Fig. 7 may be changed, and certain steps may be omitted.

The above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

1. A data processing method applied to a server, characterized in that the method comprises the steps of:

Obtaining text data or other types of data in a database, and processing the text data or other types of data in the database;

Based on the Lucene search engine, building an index for the processed text data or other types of data and generating an index file, and storing the index file in an index database;

Receiving query information entered by a user, processing the query information to generate a query object, searching the index file in the index database according to the query object, and evaluating and scoring the search results with a preset search evaluation model; and

Sorting the search results by score from high to low according to the scoring results, and outputting and displaying, in a preset manner, the search results whose score is higher than a preset threshold;

Wherein the preset manner is to generate a bar chart from the scoring results and to output and display the scoring results as percentages, and the preset threshold is 40%.

2. The data processing method of claim 1, characterized in that the other types of data include PDF file data and Office file data, and the step of processing the text data or other types of data in the database comprises:

Converting the other types of data into text data;

Performing word segmentation processing on the text data in the database and on the converted text data according to the steps of word segmentation, part-of-speech tagging, and word filtering; and

Generating a word segmentation result, taking the filtered words as the final word segmentation result, and taking the final word segmentation result as the processed text data or other types of data.

3. The data processing method of claim 2, characterized in that the step of "building an index for the processed text data or other types of data based on the Lucene search engine and generating an index file" comprises:

Constructing an index creator for creating the index; and

Building an index for the segmented text data or other types of data, creating a corresponding document description according to the different file types, and setting the content of the corresponding attribute fields.

4. The data processing method of any one of claims 1-3, characterized in that the step of processing the query information to generate a query object comprises:

Performing word segmentation processing on the query information, the word segmentation processing comprising: word segmentation, part-of-speech tagging, and word filtering;

5. The data processing method of claim 4, characterized in that the scoring of the search results by the search evaluation model comprises the following steps:

6. The data processing method of claim 5, characterized in that the first scoring formula is:

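Score(q, d) = coord(q, d) × queryNorm(q) × Σ_{t in q} [ tf(t in d) × idf(t)² × boost(t.field in d) × lengthNorm(t.field in d) ],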

Wherein Score is the first score; q is the query information; t is each term obtained after the query information is segmented; d is the document being matched; the function tf(t in d) is the frequency with which term t occurs in the document; the function idf(t)² reflects how frequently term t occurs across all documents; boost(t.field in d) is a boost factor; the value of boost(t.field in d) × lengthNorm(t.field in d) reflects the total number of terms contained in the given field for this search result; coord(q, d) expresses that the more search terms a document contains, the higher that document's score; and queryNorm(q) computes the sum of the variances of the individual query terms.

7. The data processing method of claim 6, characterized in that the value of the function tf(t in d) is set to 1, so as to remove the influence of repeated words on the first score.

8. The data processing method of claim 7, characterized in that the step of "obtaining the second score of this search according to the minimum edit distance method" comprises:

Obtaining the minimum edit distance; and

Using the value of the minimum edit distance as the second score.

9. A server, characterized in that the server comprises a memory, a processor, and a data processing system stored on the memory and executable on the processor, wherein the data processing system, when executed by the processor, implements the steps of the data processing method of any one of claims 1-8.

10. A computer-readable storage medium, wherein the computer-readable storage medium stores a data processing system executable by at least one processor, so as to cause the at least one processor to execute the steps of the data processing method of any one of claims 1-8.