CN108520002A - Data processing method, server and computer storage media - Google Patents

Info

Publication number
CN108520002A
Authority
CN
China
Prior art keywords
data
index
score
word
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810198710.8A
Other languages
Chinese (zh)
Inventor
张师琲
侯丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201810198710.8A
Priority to PCT/CN2018/089335 (WO2019174132A1)
Publication of CN108520002A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data processing method comprising: building an index for text data or other types of data and generating an index file; storing the index file in an index library; searching the index file in the index library according to a query object; scoring the search results according to a search evaluation method; sorting the search results by score; and outputting and displaying, in a preset manner, the search results whose score is higher than a preset threshold. The invention also provides a server and a computer-readable storage medium. The data processing method, server and computer-readable storage medium provided by the invention make it possible to search fuzzy text quickly and to achieve fuzzy matching quickly.

Description

Data processing method, server and computer storage media
Technical field
The present invention relates to the technical field of data analysis, and in particular to a data processing method, a server and a computer storage medium.
Background
In the current era of information explosion, every organization and individual contributes to the rapid growth of information. The types of information also keep expanding, and more and more unstructured information keeps emerging, including enterprises' reports, bills, electronic documents and so on. This unstructured information is stored in databases, and it often has to be retrieved from them; for searches involving fuzzy text, querying the database directly is very inefficient. Therefore, how to improve retrieval efficiency for fuzzy-text searches is a major problem that urgently needs to be solved.
Summary of the invention
In view of this, the present invention proposes a data processing method, a server and a computer storage medium to solve the problem described above.
First, to achieve the above object, the present invention proposes a data processing method comprising the steps of:
obtaining text data or other types of data from a database, and processing the text data or other types of data from the database;
based on the Lucene search engine, building an index for the processed text data or other types of data, generating an index file, and storing the index file in an index library;
receiving query information input by a user, processing the query information to generate a query object, searching the index file in the index library according to the query object, and scoring the search results with a preset search evaluation model; and
sorting the search results by score from high to low, and outputting and displaying, in a preset manner, the search results whose score is higher than a preset threshold;
wherein the preset manner is to generate a bar chart from the scores and to output and display the scores as percentages, and the preset threshold is 40%.
Preferably, the other types of data include PDF file data and Office file data, and the step of processing the text data or other types of data from the database includes:
converting the other types of data into text data;
performing word segmentation processing on the converted text data and on the text data from the database, which consists of word splitting, part-of-speech tagging and word filtering; and
generating a segmentation result, taking the filtered words as the final segmentation result, and taking the final segmentation result as the processed text data or other types of data.
Preferably, the step of "based on the Lucene search engine, building an index for the processed text data or other types of data and generating an index file" includes:
constructing an index library and setting its location, which is where the index is stored;
constructing an index creator for creating the index; and
building the index for the segmented text data or other types of data, creating a corresponding document description for each file type, and setting the content of the corresponding field.
Preferably, the step of processing the query information to generate a query object includes:
performing word segmentation processing on the query information, which consists of word splitting, part-of-speech tagging and word filtering;
converting the words in the segmented word set into synonyms and near-synonyms, to obtain a synonym and near-synonym set for the segmented word set; and
taking the words in the segmented word set and in the synonym and near-synonym set as the query object.
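The expansion of a segmented word set into a query object can be illustrated with a minimal sketch. It assumes the query has already been segmented and filtered; the `SYNONYMS` table and the function name `build_query_object` are hypothetical names chosen for illustration and are not taken from the patent.

```python
# Hypothetical synonym / near-synonym table; a real system would load a thesaurus.
SYNONYMS = {
    "clothes": ["clothing", "apparel"],
    "shoes": ["footwear"],
}


def build_query_object(segmented_terms, synonyms=SYNONYMS):
    """Return the segmented terms plus their synonyms, in order, without duplicates."""
    query_terms = []
    seen = set()
    for term in segmented_terms:
        # Each original term is followed by its synonyms and near-synonyms, if any.
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in seen:
                seen.add(candidate)
                query_terms.append(candidate)
    return query_terms


print(build_query_object(["shoes", "clothes"]))
```

Keeping the original terms ahead of their synonyms preserves the user's wording while still widening the match.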
Preferably, the search evaluation model scores the search results through the following steps:
obtaining a first score for the search according to a first scoring formula;
obtaining a second score for the search according to the minimum edit distance method; and
obtaining the average of the first score and the second score, and taking the average as the final score of the search.
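The averaging step can be sketched as follows. The patent does not say how the two scores are brought onto a common scale, so this sketch assumes both are already in the range 0 to 1; the helper `edit_distance_to_similarity` is one assumed way of turning an edit distance into such a value and is not described in the source.

```python
def edit_distance_to_similarity(distance, query_len, result_len):
    """Assumed normalization: 1.0 for identical strings, 0.0 when nothing matches."""
    longest = max(query_len, result_len)
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - distance / longest


def final_score(first_score, second_score):
    """Average of the two scores; both are assumed to lie in [0, 1]."""
    return (first_score + second_score) / 2.0


# A result one edit away from a 4-character query, with a first score of 0.5.
second = edit_distance_to_similarity(1, 4, 4)
print(final_score(0.5, second))
```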
Preferably, the first scoring formula is:
Score(q, d) = coord(q, d) × queryNorm(q) × Σ_{t in q} [ tf(t in d) × idf(t)² × boost(t.field in d) × lengthNorm(t.field in d) ]
where Score is the first score, q is the query information, t is each term obtained by segmenting the query information, and d is the document to be matched. The function tf(t in d) is the frequency with which term t occurs in the document; idf(t)² reflects the frequency with which term t occurs across all documents; boost(t.field in d) is a boost factor; the value of boost(t.field in d) × lengthNorm(t.field in d) reflects the total number of terms contained in the given field of the search result; coord(q, d) means that the more of the search terms a document contains, the higher the document's score; and queryNorm(q) is computed from the sum of the squares of the query term weights.
Preferably, the value of the function tf(t in d) is set to 1, which removes the influence of repeated words on the first score.
Preferably, the step of "obtaining a second score for the search according to the minimum edit distance method" includes:
calculating the edit distance between the query object and each search result;
obtaining the minimum edit distance; and
taking the value of the minimum edit distance as the second score.
In addition, to achieve the above object, the present invention also provides a server comprising a memory, a processor, and a data processing system that is stored in the memory and can run on the processor; when executed by the processor, the data processing system implements the steps of the data processing method described above.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium storing a data processing system that can be executed by at least one processor, so that the at least one processor performs the steps of the data processing method described above.
Compared with the prior art, the data processing method, server and computer-readable storage medium proposed by the present invention first obtain text data or other types of data from a database, process them, build an index for them based on the Lucene search engine and generate an index file, write weights into the index while it is being built, and store the index file in an index library; second, receive query information input by a user, process the query information to generate a query object, search the index file in the index library according to the query object, and score the search results according to a search evaluation method; and finally, sort the search results by score in a preset manner, and output and display, in a preset manner, the search results whose score is higher than a preset threshold. With the data processing method, server and computer-readable storage medium proposed by the present invention, fuzzy text can be searched quickly and fuzzy matching can be achieved quickly; compared with the prior art, this is more convenient, faster and more accurate, and greatly improves retrieval efficiency.
Description of the drawings
Fig. 1 is a schematic diagram of an optional hardware architecture of the server of the present invention;
Fig. 2 is a schematic diagram of the program modules of the first embodiment of the data processing system of the present invention;
Fig. 3 is a flow diagram of the first embodiment of the data processing method of the present invention;
Fig. 4 is a flow diagram of the second embodiment of the data processing method of the present invention;
Fig. 5 is a flow diagram of the third embodiment of the data processing method of the present invention;
Fig. 6 is a flow diagram of the fourth embodiment of the data processing method of the present invention;
Fig. 7 is a flow diagram of the fifth embodiment of the data processing method of the present invention.
The realization of the object, the functional features and the advantages of the present invention are further described below with reference to the embodiments and the accompanying drawings.
Detailed description
To make the purpose, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
It should be noted that descriptions involving "first", "second" and the like in the present invention are for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with one another, but only on the basis that those of ordinary skill in the art can implement the combination; when a combination of technical solutions is contradictory or cannot be implemented, that combination should be regarded as nonexistent and outside the protection scope claimed by the present invention.
Referring to Fig. 1, a schematic diagram of an optional hardware architecture of the server 1 of the present invention is shown.
In this embodiment, the server 1 may include, but is not limited to, a memory 11, a processor 12 and a network interface 13 that can communicate with one another over a system bus. It should be pointed out that Fig. 1 shows only the server 1 with components 11-13; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
The server 1 may be a computing device such as a rack server, a blade server, a tower server or a cabinet server; it may be a standalone server or a server cluster composed of multiple servers.
The memory 11 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc and the like. In some embodiments, the memory 11 may be an internal storage unit of the server 1, such as its hard disk or internal memory. In other embodiments, the memory 11 may be an external storage device of the server 1, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card or a flash card (Flash Card) fitted to the server 1. The memory 11 may of course include both an internal storage unit of the server 1 and an external storage device. In this embodiment, the memory 11 is generally used to store the operating system and the various application software installed on the server 1, such as the program code of the data processing system 2. The memory 11 may also be used to temporarily store various data that has been output or is to be output.
In some embodiments, the processor 12 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 12 is generally used to control the overall operation of the server 1. In this embodiment, the processor 12 is used to run the program code stored in the memory 11 or to process data, for example to run the data processing system 2.
The network interface 13 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the server 1 and other electronic devices.
The hardware structure and functions of the devices related to the present invention have now been described in detail. The embodiments of the present invention are proposed below on that basis.
First, the present invention proposes a data processing system 2.
Referring to Fig. 2, a program module diagram of the first embodiment of the data processing system 2 of the present invention is shown.
In this embodiment, the data processing system 2 includes a series of computer program instructions stored in the memory 11; when these instructions are executed by the processor 12, the data processing operations of the embodiments of the present invention can be carried out. In some embodiments, the data processing system 2 may be divided into one or more modules according to the specific operations implemented by each part of the computer program instructions. For example, in Fig. 2, the data processing system 2 may be divided into an index building module 21, a search scoring module 22 and a sorting output module 23, where:
The index building module 21 is used to obtain text data or other types of data from a database, process the text data or other types of data from the database, build an index for the text data or other types of data based on the Lucene search engine and generate an index file, write weights into the index while it is being built, and store the index file in an index library.
Specifically, Lucene is an open-source library for full-text indexing and search, supported and provided by the Apache Software Foundation. It offers a simple yet powerful application programming interface for full-text indexing and searching. Lucene is a high-performance, scalable information retrieval library.
Specifically, as for the database, implementations differ from company to company; the main database type is Oracle, and databases of other types such as PostgreSQL and MySQL may also be used.
Specifically, the weight is written into the index at indexing time and read back at query time, where it is applied as a multiplier to boost the score of certain retrieval results.
Specifically, the text data or other types of data in the database can be processed in a variety of ways; for example, documents that are not text data can be converted to another type so that they can be indexed more smoothly.
Specifically, building the index includes the steps of constructing an index library, constructing an index creator, and building the index with the index creator.
Specifically, an index library (directory) is constructed for storing the index, and the location of the index library, that is, where the index is stored, is set.
Specifically, an index creator, IndexWriter, is constructed. The file index it creates is stored at the location of the index library; if the index library contains no index, the index is created in create mode; otherwise append mode is used.
Specifically, an index is built for the obtained text data or other types of data: a corresponding document description (Document) is created for each file type, and the content of the corresponding field (Field) is set, such as file name, file path and file content.
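The relationship between documents, fields and the index can be illustrated with a small in-memory sketch. This is not Lucene's API: `ToyIndex`, its methods and the sample file paths are hypothetical, and the sketch only mimics the idea of storing per-document fields (file name, file path, content) alongside an inverted index from terms to documents.

```python
from collections import defaultdict


class ToyIndex:
    """In-memory stand-in for an index library; illustrative only, not Lucene."""

    def __init__(self):
        self.documents = []               # doc_id -> dict of fields
        self.postings = defaultdict(set)  # term -> set of doc_ids containing it

    def add_document(self, file_name, file_path, content_terms):
        """Store the document's fields and index every term of its content."""
        doc_id = len(self.documents)
        self.documents.append({
            "file_name": file_name,
            "file_path": file_path,
            "content": list(content_terms),
        })
        for term in content_terms:
            self.postings[term].add(doc_id)
        return doc_id

    def search(self, term):
        """Return the sorted ids of documents whose content contains the term."""
        return sorted(self.postings.get(term, set()))


index = ToyIndex()
index.add_document("a.txt", "/data/a.txt", ["shoes", "clothes"])
index.add_document("b.pdf", "/data/b.pdf", ["clothes"])
print(index.search("clothes"))
```

The content terms are assumed to be the already segmented and filtered words, which is why the sketch takes a list of terms rather than raw text.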
The search scoring module 22 is used to receive query information input by a user, process the query information to generate a query object, search the index file in the index library according to the query object, and score the search results with a preset search evaluation model.
Specifically, the query information input by the user may be a character string, a number, a sentence, or even a passage of text. Query information in these differing formats can be processed so that it meets the format requirements of the search scoring module 22; for example, the query information can be segmented, filtered, and converted into synonyms and near-synonyms, so that it becomes a query object that meets the requirements.
Specifically, scoring the retrieved content is the focus of the present invention. The search evaluation model scores the search results in two ways: with the scoring formula of the Lucene engine and with the minimum edit distance. Each of the two scores the retrieved content separately, and the two scores are then combined, using specified weight factors, into the final score.
The scoring formula based on the Lucene engine is:
Score(q, d) = coord(q, d) × queryNorm(q) × Σ_{t in q} [ tf(t in d) × idf(t)² × boost(t.field in d) × lengthNorm(t.field in d) ]
where q is the query statement, t is each term obtained by segmenting q, and d is the document to be matched.
Specifically, the functions in the Lucene-based scoring formula act as follows:
tf(t in d): this function gives the frequency with which term t occurs in the field of the document. Applied to the example in the figure above, it is the frequency with which a term obtained by segmentation occurs in the record. The more often the term occurs, the larger the returned value, which reflects the importance of the document. To ensure the accuracy of the search results, the value of tf is set to 1, for the following reason. Suppose the search is for "China Ping An" and the matching results are (1) Ping An Group, (2) China Ping An, and (3) China Ping An Nanjing Ping An Branch. Under the original scoring, the result with the highest matching degree would be the third, because "Ping An" occurs twice. By ordinary logic, however, the best match is certainly the exact match, "China Ping An". To avoid this, the value of tf is changed to 1, so that repeated occurrences of the same word do not affect the score. Because the information to be fuzzy-matched is customer information, which is often a very short phrase, term repetition should not raise the score. This gives short phrases a higher matching degree.
idf(t): this function appears twice, which corresponds to idf(t)^2 in the formula. It is called the inverse document frequency and reflects how often term t occurs across all documents: the more documents it occurs in, the less important term t is.
boost(t.field in d) is a boost factor that is recorded when the index is created, whereas the value of lengthNorm(t.field in d) is calculated during the query. The value of boost(t.field in d) * lengthNorm(t.field in d) reflects the total number of terms contained in the given field of the search result; the larger that number, the lower the score. For example, if document A contains 1000 terms and the keyword occurs 10 times, while document B contains 20 terms and the same keyword occurs 8 times, document B should clearly receive the higher score.
coord(q, d): a search may contain several search terms, and a document may also contain several of them; this factor means that the more of the search terms a document contains, the higher that document's score.
queryNorm(q): this factor is computed from the sum of the squares of the query term weights; it does not affect the ordering and only makes scores from different query objects comparable.
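The effect of fixing tf at 1 can be tried out with a simplified sketch of the scoring terms described above. It is not Lucene's implementation: the helper names are hypothetical, boost is assumed to be 1, queryNorm is left out because it does not change the ordering, and the component formulas (tf as the square root of the term count, idf as 1 + ln(N / (df + 1)), lengthNorm as 1 / sqrt(number of terms)) follow the conventions of Lucene's classic TF-IDF similarity. The sample documents are pre-segmented versions of the "China Ping An" example.

```python
import math


def idf(term, docs):
    """Inverse document frequency: rarer terms get a larger value."""
    doc_freq = sum(1 for doc in docs if term in doc)
    return 1.0 + math.log(len(docs) / (doc_freq + 1))


def score(query_terms, doc, docs, fix_tf=True):
    """Simplified score of one document; query terms are assumed to be unique."""
    matched = [t for t in query_terms if t in doc]
    if not matched:
        return 0.0
    coord = len(matched) / len(query_terms)  # share of query terms found
    length_norm = 1.0 / math.sqrt(len(doc))  # shorter fields score higher
    total = 0.0
    for term in matched:
        # With fix_tf, repeated occurrences of a term no longer raise the score.
        tf = 1.0 if fix_tf else math.sqrt(doc.count(term))
        total += tf * idf(term, docs) ** 2 * length_norm
    return coord * total


docs = [
    ["Ping An", "Group"],
    ["China", "Ping An"],
    ["China", "Ping An", "Nanjing", "Ping An", "Branch"],
]
query = ["China", "Ping An"]
ranked = sorted(range(len(docs)), key=lambda i: score(query, docs[i], docs), reverse=True)
print(ranked)
```

With tf fixed at 1, the second "Ping An" in the third document contributes nothing extra, so the exact match "China Ping An" ranks first in this toy corpus.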
Specifically, the edit distance (Edit Distance), also known as the Levenshtein distance, is the minimum number of edit operations needed to transform one string into another. The permitted edit operations are substituting one character for another, inserting a character, and deleting a character. In general, the smaller the edit distance, the more similar the two strings. By calculating the edit distance between a search result and the query object, the second score of the search result is obtained, and the search result with the minimum edit distance is the one most similar to the query object.
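The edit distance described here can be computed with the standard dynamic-programming recurrence. The sketch below is a generic Levenshtein implementation; the function names and the sample strings are illustrative and do not come from the patent.

```python
def edit_distance(source, target):
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def closest_result(query, results):
    """Return the search result with the minimum edit distance to the query."""
    return min(results, key=lambda result: edit_distance(query, result))


print(edit_distance("kitten", "sitting"))
print(closest_result("ping an", ["ping an group", "pingan", "china ping an"]))
```

Only two rows of the table are kept at a time, so memory use grows with the length of the target string rather than with the product of both lengths.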
The sorting output module 23 is used to sort the search results by score in a preset manner and to output and display, in a preset manner, the search results whose score is higher than a preset threshold.
Specifically, the preset manner may be a percentage: the final scores are output as percentages in descending order, which helps the user understand how good each match is; a bar chart may also be generated, which is more intuitive.
Specifically, the purpose of the preset threshold is to filter out the most valuable group of search results; for example, the preset threshold may be set to 40%.
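The filtering, sorting and display steps can be sketched in a few lines. The sketch assumes the final scores are already normalized to the range 0 to 1, which the patent does not state; `render_results`, the text-based bar and the sample names are hypothetical.

```python
def render_results(scored_results, threshold=0.40, bar_width=20):
    """Keep results above the threshold, sort them high to low, format as text bars."""
    kept = [(name, s) for name, s in scored_results if s > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    lines = []
    for name, s in kept:
        bar = "#" * round(s * bar_width)  # bar length is proportional to the score
        lines.append(f"{name:<20} {s:>4.0%} {bar}")
    return lines


lines = render_results([("result A", 0.35), ("result B", 0.9), ("result C", 0.5)])
for line in lines:
    print(line)
```

A real display would more likely use a charting component; the text bars only show how a percentage and a bar can be derived from the same score.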
In addition, the present invention also proposes a data processing method.
Referring to Fig. 3, a flow diagram of the first embodiment of the data processing method of the present invention is shown. In this embodiment, the execution order of the steps in the flow chart shown in Fig. 3 may be changed, and some steps may be omitted, according to different requirements.
Step S110: obtain text data or other types of data from a database, process the text data or other types of data from the database, build an index for the text data or other types of data based on the Lucene search engine and generate an index file, write weights into the index while it is being built, and store the index file in an index library.
Specifically, Lucene is an open-source library for full-text indexing and search, supported and provided by the Apache Software Foundation. It offers a simple yet powerful application programming interface for full-text indexing and searching. Lucene is a high-performance, scalable information retrieval library.
Specifically, as for the database, implementations differ from company to company; the main database type is Oracle, and databases of other types such as PostgreSQL and MySQL may also be used.
Step S120: receive query information input by a user, process the query information to generate a query object, search the index file in the index library according to the query object, and score the search results according to a search evaluation method.
Specifically, the query information input by the user may be a character string, a number, a sentence, or even a passage of text. Query information in these differing formats can be processed so that the search scoring module 22 can search with it; for example, the query information can be segmented, filtered, and converted into synonyms and near-synonyms, so that it becomes a query object that meets the requirements.
Step S130: sort the search results by score in a preset manner, and output and display, in a preset manner, the search results whose score is higher than a preset threshold.
Specifically, the preset manner may be a percentage: the final scores are output as percentages in descending order, which helps the user understand how good each match is; a bar chart may also be generated, which is more intuitive.
Specifically, the purpose of the preset threshold is to filter out the most valuable group of search results; for example, the preset threshold may be set to 40%.
As shown in Fig. 4, a flow diagram of the second embodiment of the data processing method of the present invention is shown. In this embodiment, the method of processing the text data and other types of data from the database in step S110 of the data processing method includes the following steps:
Step S210: convert the other types of data into text data.
Specifically, the other types of data are converted into text data. For example, some data is stored on the server as PDF files, Office documents and so on; the text is extracted from the Office and PDF documents with tools such as Apache POI and Apache PDFBox.
Step S220: perform word segmentation processing on the text data, following the steps of word splitting, part-of-speech tagging and word filtering.
Specifically, the text data obtained in the first step (both the text data from the database and the converted text data) undergoes word segmentation processing, which consists of word splitting, part-of-speech tagging and word filtering. Word splitting cuts a sentence mainly on the basis of contextual relationships so as to avoid wrong splits, because different ways of splitting the same sentence often give different meanings. For example, "shoes and clothes" should be split as "shoes / and / clothes"; if it is split as "shoes / kimono / dress", the meaning is clearly quite different from what was intended. After word splitting, the words can be tagged with parts of speech by rule-based and statistical methods, such as a hidden Markov model; for example, part-of-speech tagging classifies "shoes" and "clothes" as nouns and "and" as a conjunction. Part-of-speech tagging is followed by word filtering, which removes unimportant words; this keeps the index library compact and improves retrieval efficiency. For example, the nouns "shoes" and "clothes" are kept and the conjunction "and" is filtered out.
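The word filtering step can be illustrated with a minimal sketch. It assumes word splitting and part-of-speech tagging have already produced (word, tag) pairs; the tag names and the `FILTERED_TAGS` set are hypothetical choices for illustration, since the patent does not list which parts of speech are removed beyond the conjunction example.

```python
# Assumed set of part-of-speech tags treated as unimportant.
FILTERED_TAGS = {"conj", "particle", "punct"}


def filter_words(tagged_words, filtered_tags=FILTERED_TAGS):
    """Keep only the words whose part-of-speech tag is not in the filtered set."""
    return [word for word, tag in tagged_words if tag not in filtered_tags]


# The "shoes and clothes" example after splitting and tagging.
tagged = [("shoes", "noun"), ("and", "conj"), ("clothes", "noun")]
print(filter_words(tagged))
```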
Step S230: generate the segmentation result, taking the filtered words as the final segmentation result.
Specifically, the filtered words include the segmented words of the text data and other types of data from the database, together with the synonyms and near-synonyms of those words; these words are passed on as the processing result to the next step so that the system can retrieve more effectively.
The execution order of the steps in the flow chart shown in Fig. 4 may be changed, and some steps may be omitted.
As shown in figure 5, being the flow diagram of the 3rd embodiment of data processing method of the present invention.In the present embodiment, this Based on lucene search engines to the text data or other kinds of data in invention data processing method steps S110 The method for establishing index includes the following steps:
Step S310 constructs index database, the position of index database is arranged, for being stored in index.
Specifically, index database directory is constructed, for being stored in index, the position of index database, namely index deposit are set Position.
Step S320 constructs index creation device, for creating index.
Specifically, construction index creation device IndexWriter.The file index that index creation device is created is stored in index The position in library, if do not indexed in index database, the mode of index creation is newly-built mode;It is otherwise provided as additional mode.
Step S330 establishes index, according to different texts for the text data of acquisition or other kinds of data Part type creates corresponding document description, and the content in respective attributes domain is arranged.
Specifically, index is established to the text data or other kinds of data for acquisition, according to different texts Part type creates corresponding document and describes Document, and the content of respective attributes domain Filed is arranged, such as filename, file road Diameter, file content.
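The three steps above can be illustrated with a toy in-memory index. This sketch deliberately does not use the Lucene API: `ToyIndex`, `add_document`, and `search` are hypothetical names, and the whitespace tokenization stands in for the word segmentation described earlier. It only shows the idea of storing a document description with file name, file path, and file content fields and indexing the content.

```python
from collections import defaultdict


class ToyIndex:
    """In-memory stand-in for an index database and index creator (not Lucene)."""

    def __init__(self):
        self.documents = []               # stored document descriptions
        self.postings = defaultdict(set)  # term -> ids of documents containing it

    def add_document(self, file_name, file_path, content):
        """Store one document description and index its content field."""
        doc_id = len(self.documents)
        self.documents.append(
            {"file_name": file_name, "file_path": file_path, "content": content}
        )
        for term in content.lower().split():
            self.postings[term].add(doc_id)
        return doc_id

    def search(self, term):
        """Return the descriptions of the documents whose content has the term."""
        doc_ids = sorted(self.postings.get(term.lower(), ()))
        return [self.documents[doc_id] for doc_id in doc_ids]


index = ToyIndex()
index.add_document("a.txt", "/data/a.txt", "shoes clothes")
index.add_document("b.pdf", "/data/b.pdf", "clothes hats")
print([doc["file_name"] for doc in index.search("clothes")])
```

Adding documents to an existing `ToyIndex` corresponds loosely to the append mode mentioned above, while constructing a fresh instance corresponds to the create mode.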
The execution order of the steps in the flowchart shown in Fig. 5 may be changed, and certain steps may be omitted.
Fig. 6 is a flow diagram of the fourth embodiment of the data processing method of the present invention. In this embodiment, the method of processing the query information to generate a query object in step S210 of the data processing method of the present invention includes the following steps:
Step S410, perform word segmentation processing on the query information.
Specifically, word segmentation processing is performed on the query information. The word segmentation processing includes word segmentation, part-of-speech tagging, and word filtering. Word segmentation splits a sentence mainly by using context, so as to avoid wrong splits, because different ways of splitting the same sentence often yield different meanings. For example, the Chinese phrase for "shoes and clothes" should be split as "shoes / and / clothes"; if it is instead split as "shoes / kimono / clothing", the meaning is clearly altogether different. After word segmentation, the resulting words can be part-of-speech tagged by a rule-based and statistical method, which can be a hidden Markov model. For example, part-of-speech tagging classifies "shoes" and "clothes" as nouns and "and" as a conjunction. Part-of-speech tagging is followed by word filtering, which removes unimportant words; this simplifies the index database and improves retrieval efficiency. For example, the nouns "shoes" and "clothes" are retained and the conjunction "and" is filtered out. The words that remain after word filtering form the segmented word set.
Step S420, perform synonym and near-synonym conversion on the words in the segmented word set to obtain the synonym and near-synonym sets of the segmented word set.
Specifically, synonym and near-synonym conversion is performed on the words in the segmented word set to obtain its synonym and near-synonym sets, and the words in the segmented word set and in the synonym and near-synonym sets are used as the query object. The words in the segmented word set are conceptually expanded into the corresponding synonyms, near-synonyms, or hypernyms and hyponyms. Some of the expansion words are extracted according to a similarity priority algorithm, or the expansion words selected by the user are accepted. Finally, the words in the segmented word set and the selected expansion words are passed together to the retrieval module as the query condition, that is, as the query object. For example, if the user enters "How is China's economic situation this year?", the system obtains the two query words "China" and "economy". The retrieval information processing module can then obtain expansion words for "China", such as "mainland", "inland", and "country", and expansion words for "economy", such as "GDP", "trade", "business", "finance and economics", and "finance".
Step S430, use the words in the segmented word set and in the synonym and near-synonym sets as the query object.
Specifically, the query information input by the user is segmented, and the segmented words are then converted to obtain their synonyms and near-synonyms. Querying the content of the index database with the word segmentation result together with its synonyms and near-synonyms is more comprehensive, accurate, and rapid, and better fits the definition of fuzzy search.
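Steps S420 and S430 can be sketched as a lookup in a synonym table followed by a merge. The `SYNONYMS` table and the `build_query_object` helper are assumptions made for illustration; the source does not say how the synonym sets are stored or how the similarity priority algorithm selects expansion words, so this sketch simply keeps every listed expansion word.

```python
# Hypothetical synonym table; the entries are taken from the example in the text.
SYNONYMS = {
    "China": ["country"],
    "economy": ["GDP", "trade", "finance"],
}


def build_query_object(segmented_words, synonyms=SYNONYMS):
    """Return the segmented words plus their expansion words, without duplicates."""
    query_terms = []
    for word in segmented_words:
        for term in [word, *synonyms.get(word, [])]:
            if term not in query_terms:
                query_terms.append(term)
    return query_terms


query_object = build_query_object(["China", "economy"])
print(query_object)
```

Each original word is kept ahead of its expansion words, so a later ranking stage could still tell which terms the user actually typed.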
The execution order of the steps in the flowchart shown in Fig. 6 may be changed, and certain steps may be omitted.
Fig. 7 is a flow diagram of the fifth embodiment of the data processing method of the present invention. In this embodiment, the method of evaluating and scoring the search results according to the search evaluation method in step S210 of the data processing method of the present invention includes the following steps:
Step S510, obtain the first score of this search according to the first scoring formula.
Specifically, the preset search evaluation method scores the search results by using a search scoring model that includes the first scoring formula and the minimum edit distance method, wherein the first scoring formula is:
Wherein q is the query statement, t is each term obtained by segmenting q, and d is the document to be matched.
Specifically, the functions in the first scoring formula act as follows:
tf(t in d): this function represents the frequency with which the term t occurs in the given field of the document d. In the example of the figure above, it is the frequency with which a term obtained by segmentation occurs in the record. The more often the term occurs, the larger the returned value, which reflects the importance of the document. To ensure the accuracy of the search results, the value of tf is set to 1, for the following reason. Suppose the search is for "China Ping An" and the matched results are (1) "Ping An Group", (2) "China Ping An", and (3) "China Ping An Nanjing Ping An Branch". Under the original scoring basis, the result with the highest matching degree is the third, because "Ping An" occurs twice in it. By ordinary logic, however, the highest matching degree should belong to the exact match, "China Ping An". To avoid this phenomenon, the value of tf is changed to 1, so that repeated occurrences of the same word do not affect the score. Because the content to be fuzzy matched is customer information, which is often a very short phrase, repeated word frequency should not be a basis for a higher score. In this way, the matching degree for short phrases is higher.
idf(t): this function appears twice, which corresponds to idf(t)^2 in the formula. It is called the inverse document frequency and reflects how often the term t occurs across all documents: the more documents the term t occurs in, the less important it is.
boost(t.field in d) is the boost factor, which is recorded when the index is created, whereas the value of lengthNorm(t.field in d) is calculated during the query. The value of boost(t.field in d)*lengthNorm(t.field in d) reflects the total number of terms contained in the given field of the search result: the larger that total, the lower the score. For example, if document A contains 1000 terms and the keyword occurs 10 times, while document B contains 20 terms and the same keyword occurs 8 times, document B should clearly score higher.
coord(q, d): a search may contain multiple search words, and a document may also contain several of those search words. This factor indicates that the more of the search words a document contains, the higher the score of that document.
QueryNorm(q): this calculates the variance sum of the query terms. The value does not affect the ranking; it only makes the scores of different query objects comparable.
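The formula itself is not reproduced in this text. Assuming it follows the classic Lucene practical scoring function, whose factor names match the functions described above, it would plausibly read as below. Only the factor names come from the description; the arrangement of the sum and the product is an assumption.

```latex
\mathrm{Score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot \mathrm{QueryNorm}(q)\cdot
\sum_{t \,\in\, q} \Big( \mathrm{tf}(t \text{ in } d)\cdot \mathrm{idf}(t)^{2}\cdot
\mathrm{boost}(t.\mathrm{field} \text{ in } d)\cdot
\mathrm{lengthNorm}(t.\mathrm{field} \text{ in } d) \Big)
```

Under this reading, fixing tf at 1 as described leaves each matched term contributing idf(t)^2 * boost * lengthNorm to the sum, so repeated occurrences of a term no longer raise the score.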
Step S520, obtain the second score of this search according to the minimum edit distance method.
Specifically, the edit distance, also known as the Levenshtein distance, is the minimum number of edit operations required to change one string into another. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. In general, the smaller the edit distance, the greater the similarity of the two strings. The second score of a search result can be obtained by calculating the edit distance between the search result and the query object, and the search result with the minimum edit distance is the one most similar to the query object.
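The edit distance described above can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch; the function names `edit_distance` and `closest_result` are chosen here for illustration and do not come from the source.

```python
def edit_distance(source, target):
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,         # delete source_char
                current[j - 1] + 1,      # insert target_char
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def closest_result(query, results):
    """Return the search result with the minimum edit distance to the query."""
    return min(results, key=lambda result: edit_distance(query, result))


print(edit_distance("kitten", "sitting"))
print(closest_result("kitten", ["sitting", "kitten", "mitten"]))
```

Only two rows of the distance table are kept at a time, so memory use grows with the length of the target string rather than with the product of both lengths.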
Specifically, the step of obtaining the second score of this search according to the minimum edit distance method includes:
calculating the edit distance between the query object and the search results;
obtaining the minimum edit distance; and
using the value of the minimum edit distance as the second score.
Obtain the average value of the first score and the second score, and use the average value as the final score of this search.
Specifically, different weight factors can be set for the first score and the second score. Each score is multiplied by its own weight factor, and the products are added to obtain the evaluation score of the search result. For example, the formula can be: score = weight factor A * first score + weight factor B * second score. The values of weight factor A and weight factor B are set as needed; for example, if the mean of the two scores is required, weight factor A and weight factor B can both be set to 0.5.
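The weighted combination can be written directly from the formula in the text. In this sketch the function name `final_score` and the parameter names are assumptions; the default weights of 0.5 reproduce the plain average.

```python
def final_score(first_score, second_score, weight_a=0.5, weight_b=0.5):
    """Weighted sum of the two scores; equal weights of 0.5 give their mean."""
    return weight_a * first_score + weight_b * second_score


print(final_score(2.0, 4.0))              # mean of the two scores
print(final_score(2.0, 4.0, 0.75, 0.25))  # first score weighted more heavily
```

Keeping the weights as parameters lets the balance between the two scores be tuned without changing the scoring code.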
The execution order of the steps in the flowchart shown in Fig. 7 may be changed, and certain steps may be omitted.
The above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and can of course also be implemented by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not limit the scope of the invention. Any equivalent structure or equivalent process transformation made using the content of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

1. A data processing method applied to a server, characterized in that the method comprises the steps of:
obtaining text data or other types of data in a database, and processing the text data or other types of data in the database;
based on the Lucene search engine, establishing an index for the processed text data or other types of data, generating an index file, and storing the index file in an index database;
receiving query information input by a user, processing the query information to generate a query object, searching the index file in the index database according to the query object, and evaluating and scoring the search results with a preset search evaluation model; and
sorting the search results in descending order of score, and outputting and displaying, in a preset manner, the search results whose score is higher than a preset threshold;
wherein the preset manner is to generate a bar chart from the scores and to output and display the scores as percentages, and the preset threshold is 40%.
2. The data processing method as claimed in claim 1, characterized in that the other types of data include pdf file data and office file data, and the step of processing the text data or other types of data in the database comprises:
converting the other types of data into text data;
performing word segmentation processing on the text data in the database and the converted text data according to the steps of word segmentation, part-of-speech tagging, and word filtering; and
generating a word segmentation result, using the filtered words as the final word segmentation result, and using the final word segmentation result as the processed text data or other types of data.
3. The data processing method as claimed in claim 2, characterized in that the step of "establishing an index for the processed text data or other types of data based on the Lucene search engine and generating an index file" comprises:
constructing the index database and setting the location of the index database, which is used to store the index;
constructing an index creator, which is used to create the index; and
establishing an index for the segmented text data or other types of data, creating a corresponding document description according to the file type, and setting the content of the corresponding attribute fields.
4. The data processing method as claimed in claims 1-3, characterized in that the step of processing the query information to generate a query object comprises:
performing word segmentation processing on the query information, the word segmentation processing including word segmentation, part-of-speech tagging, and word filtering;
performing synonym and near-synonym conversion on the words in the segmented word set to obtain the synonym and near-synonym sets of the segmented word set; and
using the words in the segmented word set and in the synonym and near-synonym sets as the query object.
5. The data processing method as claimed in claim 4, characterized in that the scoring of the search results by the search evaluation model comprises the following steps:
obtaining the first score of this search according to the first scoring formula;
obtaining the second score of this search according to the minimum edit distance method; and
obtaining the average value of the first score and the second score, and using the average value as the final score of this search.
6. The data processing method as claimed in claim 5, characterized in that the first scoring formula is:
,
wherein Score is the first score, q is the query information, t is each term obtained by segmenting the query information, d is the document to be matched, the function tf(t in d) indicates the frequency with which the term t occurs in the document, the function idf(t)^2 indicates the frequency with which the term t occurs in all documents, boost(t.field in d) is the boost factor, the value of boost(t.field in d)*lengthNorm(t.field in d) indicates the total number of terms contained in the given field of the search result, coord(q, d) indicates that the more search words a document contains, the higher the score of that document, and QueryNorm(q) calculates the variance sum of the query terms.
7. The data processing method as claimed in claim 6, characterized in that the value of the function tf(t in d) is set to 1, removing the influence of repeated words on the first score.
8. The data processing method as claimed in claim 7, characterized in that the step of "obtaining the second score of this search according to the minimum edit distance method" comprises:
calculating the edit distance between the query object and the search results;
obtaining the minimum edit distance; and
using the value of the minimum edit distance as the second score.
9. A server, characterized in that the server comprises a memory, a processor, and a data processing system that is stored on the memory and can run on the processor, wherein the data processing system, when executed by the processor, implements the steps of the data processing method as claimed in any one of claims 1-8.
10. A computer-readable storage medium storing a data processing system, wherein the data processing system can be executed by at least one processor, so that the at least one processor executes the steps of the data processing method as claimed in any one of claims 1-8.
CN201810198710.8A 2018-03-12 2018-03-12 Data processing method, server and computer storage media Pending CN108520002A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810198710.8A CN108520002A (en) 2018-03-12 2018-03-12 Data processing method, server and computer storage media
PCT/CN2018/089335 WO2019174132A1 (en) 2018-03-12 2018-05-31 Data processing method, server and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810198710.8A CN108520002A (en) 2018-03-12 2018-03-12 Data processing method, server and computer storage media

Publications (1)

Publication Number Publication Date
CN108520002A true CN108520002A (en) 2018-09-11

Family

ID=63433123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810198710.8A Pending CN108520002A (en) 2018-03-12 2018-03-12 Data processing method, server and computer storage media

Country Status (2)

Country Link
CN (1) CN108520002A (en)
WO (1) WO2019174132A1 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110310731A (en) * 2019-07-08 2019-10-08 苏州阿基米德网络科技有限公司 A kind of information matches inquiry system and its querying method
CN110377620A (en) * 2019-07-16 2019-10-25 四川康佳智能终端科技有限公司 A kind of material searching method, computer and storage medium based on BOM tool
CN111177532A (en) * 2019-12-02 2020-05-19 平安资产管理有限责任公司 Vertical search method, device, computer system and readable storage medium
CN111209462A (en) * 2020-01-02 2020-05-29 北京字节跳动网络技术有限公司 Data processing method, device and equipment
CN111859066A (en) * 2020-06-03 2020-10-30 广东电网有限责任公司 Query recommendation method and device for operation and maintenance work order
CN112507133A (en) * 2020-12-16 2021-03-16 国泰君安证券股份有限公司 Method, device, processor and storage medium for realizing association search based on financial product knowledge graph
WO2021077741A1 (en) * 2019-10-25 2021-04-29 浪潮(北京)电子信息产业有限公司 Gene data query method, system and device, and storage medium
CN113190649A (en) * 2021-04-16 2021-07-30 量子数聚(北京)科技有限公司 Enterprise name searching and matching method and device based on ElasticSearch

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909725B (en) * 2019-10-18 2023-09-19 平安科技(深圳)有限公司 Method, device, equipment and storage medium for recognizing text
CN111078731A (en) * 2019-11-25 2020-04-28 国网冀北电力有限公司 Hbase-based power grid operation data collaborative query method and device and storage medium
CN111143666A (en) * 2019-12-04 2020-05-12 深圳市智微智能软件开发有限公司 Steel mesh inventory query method and system
CN111159285B (en) * 2019-12-05 2023-04-21 北京机电工程研究所 Enterprise cross-system retrieval method based on distributed index service deployment
CN111062193B (en) * 2019-12-16 2023-04-25 医渡云(北京)技术有限公司 Medical data labeling method and device, storage medium and electronic equipment
CN111125417B (en) * 2019-12-30 2023-03-31 深圳云天励飞技术有限公司 Data searching method and device, electronic equipment and storage medium
CN111488736B (en) * 2020-03-31 2023-05-26 上海七印信息科技有限公司 Self-learning word segmentation method, device, computer equipment and storage medium
CN111814040B (en) * 2020-06-15 2024-06-21 深圳市明睿数据科技有限公司 Maintenance case searching method, device, terminal equipment and storage medium
CN111737607B (en) * 2020-06-22 2023-11-10 中国银行股份有限公司 Data processing method, device, electronic equipment and storage medium
CN111881309B (en) * 2020-07-30 2023-12-26 浪潮云信息技术股份公司 Electronic license retrieval method, device and computer readable medium
CN113065065B (en) * 2021-03-30 2024-06-14 广联达科技股份有限公司 Method, device and equipment for evaluating search performance and readable storage medium
CN113377896A (en) * 2021-05-19 2021-09-10 朗新科技集团股份有限公司 Full-text quick retrieval method and device, electronic equipment and storage medium
CN114996550B (en) * 2021-05-24 2024-03-19 中移互联网有限公司 Information retrieval method and device
CN114443728B (en) * 2022-01-04 2022-11-15 广州粤建三和软件股份有限公司 Detection report searching method and device based on Elasticissearch
CN114969310B (en) * 2022-06-07 2024-04-05 南京云问网络技术有限公司 Multi-dimensional data-oriented sectional search ordering system design method
CN115563356B (en) * 2022-09-30 2023-07-18 上海柯林布瑞信息技术有限公司 Method and device for dynamically collecting and inquiring system interaction information based on monitoring service

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20080076585A (en) * 2007-02-16 2008-08-20 강민수 Network research server providing research function and method thereof, image forming apparatus providing research function, network security system providing research function and computer-readable recording medium
CN101930438A (en) * 2009-06-19 2010-12-29 阿里巴巴集团控股有限公司 Search result generating method and information search system
CN105045852A (en) * 2015-07-06 2015-11-11 华东师范大学 Full-text search engine system for teaching resources
CN106528846A (en) * 2016-11-21 2017-03-22 广州华多网络科技有限公司 Retrieval method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102360367A (en) * 2011-09-29 2012-02-22 广州中浩控制技术有限公司 XBRL (Extensible Business Reporting Language) data search method and search engine
CN103838732A (en) * 2012-11-21 2014-06-04 大连灵动科技发展有限公司 Vertical search engine in life service field
CN103838785A (en) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Vertical search engine in patent field
CN105930490A (en) * 2016-05-03 2016-09-07 北京优宇通教育科技有限公司 Intelligent selecting system for teaching resources

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20080076585A (en) * 2007-02-16 2008-08-20 강민수 Network research server providing research function and method thereof, image forming apparatus providing research function, network security system providing research function and computer-readable recording medium
CN101930438A (en) * 2009-06-19 2010-12-29 阿里巴巴集团控股有限公司 Search result generating method and information search system
CN105045852A (en) * 2015-07-06 2015-11-11 华东师范大学 Full-text search engine system for teaching resources
CN106528846A (en) * 2016-11-21 2017-03-22 广州华多网络科技有限公司 Retrieval method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
苏勇等, 北京航空航天大学出版社 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110310731A (en) * 2019-07-08 2019-10-08 苏州阿基米德网络科技有限公司 A kind of information matches inquiry system and its querying method
CN110377620A (en) * 2019-07-16 2019-10-25 四川康佳智能终端科技有限公司 A kind of material searching method, computer and storage medium based on BOM tool
WO2021077741A1 (en) * 2019-10-25 2021-04-29 浪潮(北京)电子信息产业有限公司 Gene data query method, system and device, and storage medium
CN111177532A (en) * 2019-12-02 2020-05-19 平安资产管理有限责任公司 Vertical search method, device, computer system and readable storage medium
CN111209462A (en) * 2020-01-02 2020-05-29 北京字节跳动网络技术有限公司 Data processing method, device and equipment
CN111209462B (en) * 2020-01-02 2021-05-18 北京字节跳动网络技术有限公司 Data processing method, device and equipment
CN111859066A (en) * 2020-06-03 2020-10-30 广东电网有限责任公司 Query recommendation method and device for operation and maintenance work order
CN111859066B (en) * 2020-06-03 2023-01-20 广东电网有限责任公司 Query recommendation method and device for operation and maintenance work order
CN112507133A (en) * 2020-12-16 2021-03-16 国泰君安证券股份有限公司 Method, device, processor and storage medium for realizing association search based on financial product knowledge graph
CN112507133B (en) * 2020-12-16 2024-02-06 国泰君安证券股份有限公司 Method, device, processor and storage medium for realizing association search based on financial product knowledge graph
CN113190649A (en) * 2021-04-16 2021-07-30 量子数聚(北京)科技有限公司 Enterprise name searching and matching method and device based on ElasticSearch

Also Published As

Publication number Publication date
WO2019174132A1 (en) 2019-09-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180911