CN108520002A - Data processing method, server and computer storage media - Google Patents
Data processing method, server and computer storage media
- Publication number
- CN108520002A (application number CN201810198710.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- index
- score
- word
- search
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a data processing method. The method includes: building an index for text data or other types of data and generating an index file; storing the index file in an index database; searching the index file in the index database according to a query object; scoring the search results according to a search evaluation method; sorting the search results by score; and outputting and displaying, in a preset manner, the search results whose scores are higher than a preset threshold. The invention also provides a server and a computer-readable storage medium. The data processing method, server and computer-readable storage medium provided by the invention enable fast searches for fuzzy text and fast fuzzy matching.
Description
Technical field
The present invention relates to the technical field of data analysis, and in particular to a data processing method, a server and a computer storage medium.
Background
In the current era of information explosion, every organization and individual contributes to the rapid growth of information. The types of information also keep expanding, and more and more unstructured information keeps emerging, including enterprises' various reports, bills and electronic documents. This unstructured information is stored in databases and often needs to be retrieved from them; for searches involving fuzzy text, however, querying the database directly is very inefficient. How to improve retrieval efficiency for fuzzy-text searches is therefore a pressing problem.
Summary of the invention
In view of this, the present invention proposes a data processing method, a server and a computer storage medium to solve the problem of how to improve retrieval efficiency for fuzzy-text searches.
First, to achieve the above object, the present invention proposes a data processing method, which comprises the following steps:
Obtaining text data or other types of data from a database, and processing the text data or other types of data from the database;
Building, based on the Lucene search engine, an index for the processed text data or other types of data, generating an index file, and storing the index file in an index database;
Receiving query information input by a user, processing the query information to generate a query object, searching the index file in the index database according to the query object, and scoring the search results with a preset search evaluation model; and
Sorting the search results by score from high to low, and outputting and displaying, in a preset manner, the search results whose scores are higher than a preset threshold;
Wherein the preset manner is to generate a bar chart from the scores and to output and display the scores as percentages, and the preset threshold is 40%.
Preferably, the other types of data include PDF file data and Office file data, and the step of processing the text data or other types of data in the database includes:
Converting the other types of data into text data;
Performing word segmentation processing on the converted text data and the text data in the database, following the steps of word segmentation, part-of-speech tagging and word filtering; and
Generating a word segmentation result, taking the filtered words as the final word segmentation result, and taking the final word segmentation result as the processed text data or other types of data.
Preferably, the step of "building, based on the Lucene search engine, an index for the processed text data or other types of data and generating an index file" includes:
Constructing an index database and setting its location, which is where the index is stored;
Constructing an index creator, which is used to create the index; and
Building an index for the segmented text data or other types of data, creating a corresponding document description according to the file type, and setting the content of the corresponding attribute fields.
Preferably, the step of processing the query information to generate a query object includes:
Performing word segmentation processing on the query information, where the word segmentation processing includes word segmentation, part-of-speech tagging and word filtering;
Converting the words in the resulting segmented-word set into synonyms and near-synonyms, to obtain the synonym and near-synonym set of the segmented-word set; and
Taking the words in the segmented-word set and in the synonym and near-synonym set as the query object.
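The query-object step above can be sketched as follows. This is only a minimal illustration: the function name `build_query_object` and the synonym table are hypothetical, and the input is assumed to be already segmented and filtered.

```python
def build_query_object(segmented_terms, synonym_table):
    """Return the segmented terms followed by their synonyms, without duplicates."""
    query_terms = []
    for term in segmented_terms:
        # Keep the original term, then add any synonyms or near-synonyms.
        for candidate in [term] + synonym_table.get(term, []):
            if candidate not in query_terms:
                query_terms.append(candidate)
    return query_terms


# Hypothetical synonym table, used only for illustration.
SYNONYMS = {"shoes": ["footwear"], "clothes": ["clothing", "apparel"]}

print(build_query_object(["shoes", "clothes"], SYNONYMS))
```

Keeping the original term ahead of its synonyms preserves the user's wording in the query object; a real system would presumably load the synonym table from a thesaurus rather than hard-code it.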
Preferably, scoring the search results with the search evaluation model includes the following steps:
Obtaining a first score for the search according to a first scoring formula;
Obtaining a second score for the search according to the minimum edit distance method; and
Taking the average of the first score and the second score as the final score of the search.
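The averaging in the last step can be written as a one-line function. The sketch below assumes that both scores have already been brought onto a common scale on which larger means a better match; the text does not say how that is done, and the name `final_score` is illustrative.

```python
def final_score(first_score, second_score):
    """Average the first score and the second score, as in the last step above."""
    return (first_score + second_score) / 2.0


# Example call with made-up scores on a 0..1 scale.
print(final_score(0.75, 0.25))
```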
Preferably, the first scoring formula is:
[formula not reproduced in the source text]
where Score is the first score; q is the query information; t is each term obtained after segmenting the query information; d is the document being matched; the function tf(t in d) is the frequency with which term t occurs in the document; the function idf(t)² is the squared inverse document frequency of term t, which reflects how often t occurs across all documents; boost(t.field in d) is a boost factor; the value of boost(t.field in d) * lengthNorm(t.field in d) reflects the total number of terms contained in the given field of the search result; coord(q, d) expresses that the more of the search terms a document contains, the higher the document's score; and queryNorm(q) is computed from the sum of the squared weights of the query terms.
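The formula itself does not survive in this text. Judging only from the factors listed above, it presumably corresponds to the practical scoring function of Lucene's classic TF-IDF similarity, which can be sketched as follows; this is a reconstruction under that assumption, not the formula as printed in the patent.

```latex
\[
\mathrm{Score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
\sum_{t \in q}\Big(\mathrm{tf}(t \in d)\cdot\mathrm{idf}(t)^{2}\cdot
\mathrm{boost}(t.\mathrm{field} \in d)\cdot\mathrm{lengthNorm}(t.\mathrm{field} \in d)\Big)
\]
```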
Preferably, the value of the function tf(t in d) is set to 1, which removes the influence of repeated words on the first score.
Preferably, the step of "obtaining a second score for the search according to the minimum edit distance method" includes:
Calculating the edit distance between the query object and each search result;
Obtaining the minimum edit distance; and
Taking the value of the minimum edit distance as the second score.
In addition, to achieve the above object, the present invention also provides a server, which includes a memory, a processor, and a data processing system that is stored on the memory and can run on the processor; when executed by the processor, the data processing system implements the steps of the data processing method described above.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium, which stores a data processing system; the data processing system can be executed by at least one processor, so that the at least one processor performs the steps of the data processing method described above.
Compared with the prior art, the data processing method, server and computer-readable storage medium proposed by the present invention first obtain text data or other types of data from a database, process that data, build an index for it based on the Lucene search engine and generate an index file, writing weights into the index while it is being built, and store the index file in an index database. Next, they receive query information input by a user, process the query information to generate a query object, search the index file in the index database according to the query object, and score the search results according to a search evaluation method. Finally, they sort the search results by score in a preset manner, and output and display, in a preset manner, the search results whose scores are higher than a preset threshold. The proposed data processing method, server and computer-readable storage medium can search quickly for fuzzy text and achieve fast fuzzy matching; compared with the prior art, they are more convenient, faster and more accurate, and greatly improve retrieval efficiency.
Description of the drawings
Fig. 1 is a schematic diagram of an optional hardware architecture of the server of the present invention;
Fig. 2 is a schematic diagram of the program modules of the first embodiment of the data processing system of the present invention;
Fig. 3 is a flowchart of the first embodiment of the data processing method of the present invention;
Fig. 4 is a flowchart of the second embodiment of the data processing method of the present invention;
Fig. 5 is a flowchart of the third embodiment of the data processing method of the present invention;
Fig. 6 is a flowchart of the fourth embodiment of the data processing method of the present invention;
Fig. 7 is a flowchart of the fifth embodiment of the data processing method of the present invention.
The realization of the object, the functional features and the advantages of the present invention will be further described with reference to the accompanying drawings and in conjunction with the embodiments.
Detailed description
In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
It should be noted that descriptions involving "first", "second" and the like in the present invention are for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with one another, but only on the basis that the combination can be implemented by those of ordinary skill in the art; when a combination of technical solutions is contradictory or cannot be implemented, the combination shall be deemed not to exist and falls outside the protection scope claimed by the present invention.
Referring to Fig. 1, it is a schematic diagram of an optional hardware architecture of the server 1 of the present invention.
In this embodiment, the server 1 may include, but is not limited to, a memory 11, a processor 12 and a network interface 13 that can be communicatively connected to one another through a system bus. It should be pointed out that Fig. 1 shows only the server 1 with components 11-13; it should be understood that not all of the components shown are required, and more or fewer components may be implemented instead.
The server 1 may be a computing device such as a rack server, a blade server, a tower server or a cabinet server; it may be an independent server or a server cluster composed of multiple servers.
The memory 11 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc and so on. In some embodiments, the memory 11 may be an internal storage unit of the server 1, such as the hard disk or internal memory of the server 1. In other embodiments, the memory 11 may also be an external storage device of the server 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card equipped on the server 1. Of course, the memory 11 may also include both the internal storage unit of the server 1 and its external storage device. In this embodiment, the memory 11 is generally used to store the operating system and various application software installed on the server 1, such as the program code of the data processing system 2. In addition, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 12 is generally used to control the overall operation of the server 1. In this embodiment, the processor 12 is used to run the program code stored in the memory 11 or to process data, for example to run the data processing system 2.
The network interface 13 may include a wireless network interface or a wired network interface, and is generally used to establish communication connections between the server 1 and other electronic devices.
So far, the hardware structure and functions of the devices related to the present invention have been described in detail. The embodiments of the present invention are presented below on the basis of the above introduction.
First, the present invention proposes a data processing system 2.
Referring to Fig. 2, it is a program module diagram of the first embodiment of the data processing system 2 of the present invention.
In this embodiment, the data processing system 2 includes a series of computer program instructions stored on the memory 11; when these instructions are executed by the processor 12, the data processing operations of the embodiments of the present invention can be implemented. In some embodiments, the data processing system 2 may be divided into one or more modules according to the specific operations implemented by the respective parts of the computer program instructions. For example, in Fig. 2, the data processing system 2 may be divided into an index building module 21, a search scoring module 22 and a sorting output module 23, where:
The index building module 21 is used to obtain text data or other types of data from a database, process the text data or other types of data from the database, build an index for the text data or other types of data based on the Lucene search engine and generate an index file, write weights into the index while it is being built, and store the index file in an index database.
Specifically, Lucene is an open-source library for full-text indexing and search, supported and provided by the Apache Software Foundation. It provides a simple but powerful application programming interface for full-text indexing and search. Lucene is a high-performance, scalable information retrieval library.
Specifically, the database implementation differs from one company to another; the main database type is Oracle, and databases of other types such as PostgreSQL and MySQL may also be used.
Specifically, weights are written into the index at indexing time and read back at query time, where they are applied as multipliers to boost the scores of certain search results.
Specifically, the text data or other types of data in the database can be processed in a variety of ways; for example, documents that are not text data can be converted to another document type, so that an index can be built for them more smoothly.
Specifically, building the index includes the steps of constructing an index database, constructing an index creator, and building the index with the index creator.
Specifically, the index database Directory is constructed to hold the index, and the location of the index database, that is, the location where the index is stored, is set.
Specifically, the index creator IndexWriter is constructed. The file index created by the index creator is stored at the location of the index database; if the index database contains no index, the index is created in create mode, otherwise it is created in append mode.
Specifically, an index is built for the obtained text data or other types of data: a corresponding document description, Document, is created according to the file type, and the content of the corresponding attribute fields (Field), such as file name, file path and file content, is set.
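The roles of the index database, the index creator and the document description can be illustrated with a toy in-memory inverted index. The sketch below is a stand-in only: the class `TinyIndex` and its methods are hypothetical and are not Lucene's Directory, IndexWriter, Document or Field API.

```python
from collections import defaultdict


class TinyIndex:
    """In-memory stand-in for an index store plus index creator (not Lucene's API)."""

    def __init__(self):
        self.documents = []               # stored field values, one dict per document
        self.postings = defaultdict(set)  # term -> ids of documents containing the term

    def add_document(self, file_name, file_path, content_terms):
        """Store the attribute fields and index every term of the content."""
        doc_id = len(self.documents)
        self.documents.append({
            "file_name": file_name,
            "file_path": file_path,
            "content": list(content_terms),
        })
        for term in content_terms:
            self.postings[term].add(doc_id)
        return doc_id

    def search(self, term):
        """Return the ids of all documents whose content contains the term."""
        return sorted(self.postings.get(term, set()))


index = TinyIndex()
index.add_document("report.pdf", "/data/report.pdf", ["annual", "report"])
index.add_document("bill.docx", "/data/bill.docx", ["bill", "report"])
print(index.search("report"))
```

The point of the structure is that a lookup goes from a term to the documents containing it, so a query does not need to scan every stored record.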
The search scoring module 22 is used to receive query information input by a user, process the query information to generate a query object, search the index file in the index database according to the query object, and score the search results with a preset search evaluation model.
Specifically, the query information input by the user may be a character string, a number, a single word, or even a passage of text. Query information in such inconsistent formats can be preprocessed so that it meets the format requirements of the search scoring module 22; for example, the query information can be segmented, filtered, and converted into synonyms and near-synonyms, so that it is converted into a query object that meets the requirements.
Specifically, scoring the retrieved content is the focus of the present invention. The search evaluation model scores the search results in two ways: with the scoring formula of the Lucene engine and with the minimum edit distance. Each of the two scores the retrieved content separately, and the two scores are then combined using separately determined weighting factors to obtain the final score.
The scoring formula of the Lucene engine is:
[formula not reproduced in the source text]
where q is the query statement, t is each term obtained after segmenting q, and d is the document being matched.
Specifically, the functions in the scoring formula of the Lucene engine work as follows:
tf(t in d): this function is the frequency with which term t occurs in the field of the document; in the example in the figure above, it is the frequency with which a term produced by segmentation occurs in a given record. The more often the term occurs, the larger the returned value, which also reflects the importance of the document. To ensure the accuracy of the search results, the value of tf is set to 1, for the following reason. Suppose the search is for "China Ping An" and the matched results are (1) "Ping An Group", (2) "China Ping An" and (3) "China Ping An Nanjing Ping An Branch". Under the original scoring, the third result gets the highest matching degree, because "Ping An" occurs in it twice. By normal logic, however, the exact match, "China Ping An", should have the highest matching degree. To avoid this, the value of tf is changed to 1, so that repeated occurrences of the same word no longer affect the score. What is being fuzzy-matched here is customer information, which is usually a very short phrase, so word repetition should not raise the score. With this change, short phrases receive a somewhat higher matching degree.
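The effect of fixing tf at 1 can be seen with a deliberately simplified score that only sums term frequencies over the query terms. The sketch below is an assumption-laden toy: it ignores idf, boost, lengthNorm, coord and queryNorm, and the tokens are hypothetical stand-ins for a two-word query and three candidates, the third of which repeats one query word.

```python
def term_match_score(query_terms, doc_terms, binary_tf):
    """Sum tf over the query terms; with binary_tf, every matching term counts as 1."""
    score = 0
    for term in query_terms:
        count = doc_terms.count(term)
        if count > 0:
            score += 1 if binary_tf else count
    return score


query = ["china", "brand"]
candidates = {
    "partial": ["brand", "group"],
    "exact": ["china", "brand"],
    "repeated": ["china", "brand", "nanjing", "brand", "branch"],
}

for name, terms in candidates.items():
    raw = term_match_score(query, terms, binary_tf=False)
    fixed = term_match_score(query, terms, binary_tf=True)
    print(name, raw, fixed)
```

With raw frequencies the candidate that repeats a query word outscores the exact match; with tf fixed at 1 it no longer does, and the two tie in this toy score, which leaves the remaining factors and the edit-distance score to separate them.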
idf(t): this function appears twice, which corresponds to idf(t)^2 in the formula. It is called the inverse document frequency and reflects how often term t occurs across all documents: the more documents t occurs in, the less important t is.
boost(t.field in d) is a boost factor that is recorded when the index is created, while the value of lengthNorm(t.field in d) can be calculated during the query. The value of boost(t.field in d) * lengthNorm(t.field in d) reflects the total number of terms contained in the given field of the search result: the more terms the field contains, the lower the score. For example, if document A contains 1000 terms and the keyword occurs in it 10 times, while document B contains 20 terms and the same keyword occurs in it 8 times, document B should clearly score higher.
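The comparison of documents A and B can be worked through numerically. The sketch below assumes the defaults of Lucene's classic TF-IDF similarity, namely tf as the square root of the term frequency and lengthNorm as one over the square root of the number of terms in the field; the function names are illustrative, the boost is taken as 1, and tf is left frequency-dependent here only to mirror the example.

```python
import math


def classic_tf(freq):
    """Term-frequency factor: square root of the raw frequency (assumed default)."""
    return math.sqrt(freq)


def classic_length_norm(num_terms):
    """Length normalization: shorter fields get a larger factor (assumed default)."""
    return 1.0 / math.sqrt(num_terms)


def field_weight(freq, num_terms):
    """Combined tf * lengthNorm contribution of one term in one field."""
    return classic_tf(freq) * classic_length_norm(num_terms)


doc_a = field_weight(freq=10, num_terms=1000)  # long document, 10 occurrences
doc_b = field_weight(freq=8, num_terms=20)     # short document, 8 occurrences
print(doc_a < doc_b)
```

Under these assumptions document A's factor is about 0.1 and document B's about 0.63, which agrees with the statement that B should score higher.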
coord(q, d): a single search may include multiple search terms, and a document may also contain several of them; this factor expresses that the more of the search terms a document contains, the higher the document's score.
queryNorm(q): this is computed from the sum of the squared weights of the query terms; its value does not affect the ordering and only makes the scores of different query objects comparable.
Specifically, the edit distance, also known as the Levenshtein distance, is the minimum number of edit operations needed to change one string into another. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. In general, the smaller the edit distance, the more similar the two strings are. The second score of a search result is obtained by calculating the edit distance between the search result and the query object, and the search result with the minimum edit distance is the one most similar to the query object.
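The edit distance described above can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch; the helper `closest_result`, which picks the search result with the minimum distance to the query, is a hypothetical name added for illustration.

```python
def edit_distance(source, target):
    """Levenshtein distance: minimum number of substitutions, insertions and deletions."""
    # previous[j] holds the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep if the characters match
            ))
        previous = current
    return previous[-1]


def closest_result(query, results):
    """Return the search result with the smallest edit distance to the query."""
    return min(results, key=lambda result: edit_distance(query, result))


print(edit_distance("kitten", "sitting"))
print(closest_result("apple", ["apply", "maple", "orange"]))
```

Only two rows of the distance table are kept at a time, so memory grows with the length of the target string rather than with the product of both lengths.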
The sorting output module 23 is used to sort the search results by score in a preset manner, and to output and display, in a preset manner, the search results whose scores are higher than a preset threshold.
Specifically, the preset manner may be a percentage format: the final scores are output as percentages in descending order, which helps the user judge the degree of matching; a bar chart can also be generated, which is more intuitive.
Specifically, the purpose of the preset threshold is to filter out the most valuable group of search results; for example, the preset threshold can be set to 40%.
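The sorting, thresholding and display described above can be sketched as follows. The sketch assumes that the final scores are already normalized to the range 0 to 1; the function name `present_results`, the text-based bar chart and the sample data are all illustrative.

```python
def present_results(scored_results, threshold=0.40, bar_width=20):
    """Keep results scoring above the threshold, sort them high to low, and
    render each one as a percentage with a simple text bar."""
    kept = [(name, score) for name, score in scored_results if score > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    lines = []
    for name, score in kept:
        bar = "#" * round(score * bar_width)
        lines.append(f"{name:<12} {score:>4.0%} {bar}")
    return lines


# Made-up results: (name, normalized final score).
sample = [("result-a", 0.35), ("result-b", 0.92), ("result-c", 0.55)]
for line in present_results(sample):
    print(line)
```

A result scoring exactly 40% is dropped here, since the text asks for scores higher than the threshold.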
In addition, the present invention also proposes a data processing method.
Referring to Fig. 3, it is a flowchart of the first embodiment of the data processing method of the present invention. In this embodiment, the execution order of the steps in the flowchart shown in Fig. 3 may be changed according to different requirements, and some steps may be omitted.
Step S110: obtain text data or other types of data from a database, process the text data or other types of data from the database, build an index for the text data or other types of data based on the Lucene search engine and generate an index file, write weights into the index while it is being built, and store the index file in an index database.
Specifically, Lucene is an open-source library for full-text indexing and search, supported and provided by the Apache Software Foundation. It provides a simple but powerful application programming interface for full-text indexing and search. Lucene is a high-performance, scalable information retrieval library.
Specifically, the database implementation differs from one company to another; the main database type is Oracle, and databases of other types such as PostgreSQL and MySQL may also be used.
Step S120: receive query information input by a user, process the query information to generate a query object, search the index file in the index database according to the query object, and score the search results according to a search evaluation method.
Specifically, the query information input by the user may be a character string, a number, a single word, or even a passage of text. Query information in such inconsistent formats can be preprocessed so that it meets the requirements of the search scoring module 22 for searching; for example, the query information can be segmented, filtered, and converted into synonyms and near-synonyms, so that it is converted into a query object that meets the requirements.
Step S130: sort the search results by score in a preset manner, and output and display, in a preset manner, the search results whose scores are higher than a preset threshold.
Specifically, the preset manner may be a percentage format: the final scores are output as percentages in descending order, which helps the user judge the degree of matching; a bar chart can also be generated, which is more intuitive.
Specifically, the purpose of the preset threshold is to filter out the most valuable group of search results; for example, the preset threshold can be set to 40%.
As shown in Fig. 4, it is a flowchart of the second embodiment of the data processing method of the present invention. In this embodiment, the method in step S110 of the data processing method for processing the text data and other types of data in the database includes the following steps:
Step S210: convert the other types of data into text data.
Specifically, the other types of data are converted into text data. For example, some data is stored on the server as PDF files, Office documents and so on; text is extracted from the Office documents and PDF documents with tools such as Apache POI and Apache PDFBox.
Step S220: perform word segmentation processing on the text data, following the steps of word segmentation, part-of-speech tagging and word filtering.
Specifically, word segmentation processing is performed on the text data obtained in the first step (including the text data in the database and the converted text data). The word segmentation processing includes word segmentation, part-of-speech tagging and word filtering. Word segmentation mainly uses context to split a sentence and to avoid incorrect splits, because different ways of splitting the same character sequence often give different meanings. For example, the Chinese phrase for "shoes and clothes" should be split as "shoes / and / clothes"; if it is instead split as "shoes / kimono / dress", the meaning is clearly completely different. After word segmentation, the resulting words can be tagged with their parts of speech by a rule-based and statistical method, which may be a hidden Markov model; for example, part-of-speech tagging can classify "shoes" and "clothes" as nouns and "and" as a conjunction. Part-of-speech tagging is followed by word filtering, whose purpose is to remove unimportant words; this keeps the index database compact and improves retrieval efficiency. For example, the nouns "shoes" and "clothes" are kept and the conjunction "and" is filtered out.
Step S230: generate the word segmentation result, taking the filtered words as the final word segmentation result.
Specifically, the filtered words include the segmented words of the text data and the other types of data in the database, as well as the synonyms and near-synonyms of those segmented words; these words serve as the processing result for the next step, so that the system can retrieve more effectively.
The execution order of the steps in the flowchart shown in Fig. 4 may be changed, and some steps may be omitted.
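The word filtering in step S220 can be sketched as a filter over words that have already been segmented and tagged. The tag set ("n" for noun, "c" for conjunction), the choice to keep only nouns and the function name `filter_words` are assumptions made for illustration; segmentation and tagging themselves are not shown.

```python
# Hypothetical tag set: "n" = noun, "c" = conjunction. Only nouns are kept here.
KEPT_TAGS = {"n"}


def filter_words(tagged_words, kept_tags=KEPT_TAGS):
    """Drop every word whose part-of-speech tag is not in kept_tags."""
    return [word for word, tag in tagged_words if tag in kept_tags]


tagged = [("shoes", "n"), ("and", "c"), ("clothes", "n")]
print(filter_words(tagged))
```

Filtering before indexing means that function words such as conjunctions never enter the index, which keeps it smaller.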
Fig. 5 is a flow diagram of the third embodiment of the data processing method of the present invention. In this embodiment, the method in step S110 of establishing an index for the text data or other types of data based on the Lucene search engine comprises the following steps:
Step S310: construct the index library and set its location, which is where the index is stored.
Specifically, an index library directory is constructed for storing the index, and the location of the index library, that is, the location where the index is stored, is set.
Step S320: construct an index creator for creating the index.
Specifically, an index creator, IndexWriter, is constructed. The index files created by the index creator are stored at the location of the index library. If the index library contains no index, the index is created in create mode; otherwise, append mode is used.
Step S330: establish an index for the acquired text data or other types of data by creating a corresponding document description according to the file type and setting the content of the corresponding attribute fields.
Specifically, an index is established for the acquired text data or other types of data: a corresponding document description, Document, is created according to the file type, and the content of the corresponding attribute fields, Field, is set, such as file name, file path, and file content.
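Steps S310 to S330 can be illustrated with a toy in-memory index. This sketch deliberately does not use the Lucene API: `ToyIndexWriter`, its `force_create` parameter, and the dictionary layout of the index library are hypothetical stand-ins that only mirror the create-versus-append decision and the document description with file name, file path, and file content fields.

```python
from collections import defaultdict
from typing import Dict, List, Set


class ToyIndexWriter:
    """Illustrative stand-in for an index creator; not the Lucene IndexWriter API."""

    def __init__(self, index_library: Dict[str, object], force_create: bool = False) -> None:
        # Create mode for an empty library (or on request); append mode otherwise.
        if force_create or not index_library:
            index_library.clear()
            index_library["documents"] = []
            index_library["postings"] = defaultdict(set)
        self.library = index_library

    def add_document(self, fields: Dict[str, str]) -> int:
        """Store a document description and index the words of its content field."""
        documents: List[Dict[str, str]] = self.library["documents"]
        postings: Dict[str, Set[int]] = self.library["postings"]
        doc_id = len(documents)
        documents.append(fields)
        for word in fields["content"].lower().split():
            postings[word].add(doc_id)
        return doc_id


index_library: Dict[str, object] = {}            # step S310: where the index lives
writer = ToyIndexWriter(index_library)           # step S320: empty library, create mode
writer.add_document({"file_name": "a.txt", "file_path": "/data/a.txt",
                     "content": "shoes clothes"})  # step S330
appender = ToyIndexWriter(index_library)         # library not empty, append mode
appender.add_document({"file_name": "b.pdf", "file_path": "/data/b.pdf",
                       "content": "clothes"})
print(sorted(index_library["postings"]["clothes"]))  # [0, 1]
```

The second writer reuses the existing library, so the first document is still searchable after the second one is added, which is the behaviour append mode is meant to provide.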
The execution order of the steps in the flowchart shown in Fig. 5 may be changed, and certain steps may be omitted.
Fig. 6 is a flow diagram of the fourth embodiment of the data processing method of the present invention. In this embodiment, the method in step S210 of processing the query information to generate a query object comprises the following steps:
Step S410: perform word segmentation processing on the query information.
Specifically, word segmentation processing is performed on the query information; it comprises word segmentation, part-of-speech tagging, and word filtering. Word segmentation cuts a sentence into words, relying mainly on contextual relationships to avoid incorrect cuts, because different ways of cutting the same sentence often yield different meanings. For example, "shoes and clothes" should be cut as "shoes / and / clothes"; if it is cut as "shoes / kimono / dress" instead, the meaning is clearly far from what was intended. After word segmentation, the resulting words can be given part-of-speech tags by a rule-based and statistical method, which may be a hidden Markov model. For example, part-of-speech tagging classifies "shoes" and "clothes" as nouns and "and" as a conjunction. Part-of-speech tagging is followed by word filtering, whose purpose is to remove unimportant words; this simplifies the index library and improves retrieval efficiency. For example, the nouns "shoes" and "clothes" are kept and the conjunction "and" is filtered out. The words that remain after filtering form the segmentation set.
Step S420: perform synonym and near-synonym conversion on the words in the segmentation set to obtain the synonym and near-synonym set of the segmentation set.
Specifically, synonym and near-synonym conversion is performed on the words in the segmentation set to obtain its synonym and near-synonym set, and the words in the segmentation set and in the synonym and near-synonym set are taken as the query object. The words in the segmentation set are expanded conceptually into their corresponding synonyms, near-synonyms, or hypernyms and hyponyms. Some of the expansion words are then extracted by a similarity-priority algorithm, or the expansion words selected by the user are received. Finally, the words in the segmentation set and the selected expansion words are passed together to the retrieval module as the query condition, that is, as the query object. For example, if the user inputs "How is China's economic situation this year?", the system obtains the two query words "China" and "economy". The retrieval information processing module can then obtain expansion words for "China", such as "mainland", "inland", and "country", and expansion words for "economy", such as "GDP", "trade", "business", "finance and economics", and "finance".
Step S430: take the words in the segmentation set and in the synonym and near-synonym set as the query object.
Specifically, the query information input by the user is segmented and then converted to obtain the synonyms and near-synonyms of the segmented words. Querying the content of the index library with both the segmentation result and its synonyms and near-synonyms is more comprehensive, accurate, and rapid, and better fits the definition of fuzzy search.
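The construction of the query object in steps S420 and S430 can be sketched as below. The synonym table, the function name `build_query_object`, and the `max_expansions` limit (a crude stand-in for the similarity-priority selection, which the description does not specify) are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical synonym / near-synonym table; a real system would use a thesaurus.
SYNONYMS: Dict[str, List[str]] = {
    "China": ["country"],
    "economy": ["GDP", "trade", "finance"],
}


def build_query_object(segmented: List[str], max_expansions: int = 2) -> List[str]:
    """Return the segmented words followed by a limited number of expansion words."""
    query_object = list(segmented)
    for word in segmented:
        for expansion in SYNONYMS.get(word, [])[:max_expansions]:
            if expansion not in query_object:
                query_object.append(expansion)
    return query_object


print(build_query_object(["China", "economy"]))
# ['China', 'economy', 'country', 'GDP', 'trade']
```

Keeping the original segmented words at the front of the list makes it easy for a later scoring stage to treat exact query words differently from expansion words, if that is wanted.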
The execution order of the steps in the flowchart shown in Fig. 6 may be changed, and certain steps may be omitted.
Fig. 7 is a flow diagram of the fifth embodiment of the data processing method of the present invention. In this embodiment, the method in step S210 of scoring the search results according to the search evaluation method comprises the following steps:
Step S510: obtain the first score of this search according to the first scoring formula.
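Specifically, the preset search evaluation method scores the search results with a search scoring model based on the first scoring formula and the minimum edit distance method, wherein the first scoring formula is:
$$\mathrm{Score}(q,d) = \mathrm{coord}(q,d)\cdot\mathrm{QueryNorm}(q)\cdot\sum_{t \in q}\Big(\mathrm{tf}(t \text{ in } d)\cdot\mathrm{idf}(t)^{2}\cdot\mathrm{boost}(t.\mathrm{field} \text{ in } d)\cdot\mathrm{lengthNorm}(t.\mathrm{field} \text{ in } d)\Big)$$
where q is the query statement, t is each term obtained after q is segmented, and d is the document to be matched.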
Specifically, the functions in the first scoring formula act as follows:
tf(t in d): this function represents the frequency with which term t occurs in the field of the document. In the example in the figure above, it is the frequency with which a segmented term occurs in a record. Naturally, the more often the term occurs, the larger the returned value, which reflects the importance of the document. To guarantee the accuracy of the search results, the value of tf is set to 1, for the following reason. Suppose the search is "China Ping An" and the matching results are (1) "Ping An Group", (2) "China Ping An", and (3) "China Ping An Nanjing Ping An Branch". Under the original scoring basis, the best-matching result would be the third, because "Ping An" occurs twice. By ordinary logic, however, the best match must be the exact match, "China Ping An". To avoid this phenomenon, the value of tf is changed to 1, so that repeated occurrences of the same word do not affect the score. Because what is being fuzzy-matched is customer information, which is often a very short phrase, word repetition should not be a basis for a higher score. This gives a better match for short phrases.
idf(t): this function appears twice, which corresponds to idf(t)^2 in the formula. It is the inverse document frequency and reflects how often term t occurs across all documents: the more documents the term occurs in, the less important term t is.
boost(t.field in d) is the boost factor, which is recorded when the index is created, while the value of lengthNorm(t.field in d) can be calculated during the query. The value of boost(t.field in d) * lengthNorm(t.field in d) reflects the total number of terms contained in the given field of the search result: the more terms the field contains, the lower the score. For example, if document A contains 1000 terms and the keyword occurs 10 times, while document B contains 20 terms and the same keyword occurs 8 times, then document B should clearly score higher.
coord(q, d): a single search may contain multiple search terms, and a document may contain several of those search terms. This factor means that the more of the search terms a document contains, the higher the document scores.
QueryNorm(q): this is computed from the sum of the squared weights of the query terms. The value does not affect the ranking; it only makes the scores of different query objects comparable.
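The way these factors combine can be sketched in a few lines. The description names the factors but does not give their formulas, so the definitions used here (idf as 1 + ln(N / (df + 1)), lengthNorm as 1 / sqrt(number of terms), coord as the fraction of query terms matched, QueryNorm as 1 / sqrt(sum of squared idf values), and sqrt(frequency) for the non-fixed tf) are assumptions borrowed from common TF-IDF practice. The function name `first_score` and the three candidate records are hypothetical.

```python
import math
from typing import List


def first_score(query: List[str], doc: List[str], corpus: List[List[str]],
                fixed_tf: bool = True, boost: float = 1.0) -> float:
    """TF-IDF style score of one document (a list of terms) for a query."""
    if not query or not doc:
        return 0.0
    num_docs = len(corpus)

    def idf(term: str) -> float:
        doc_freq = sum(1 for d in corpus if term in d)
        return 1.0 + math.log(num_docs / (doc_freq + 1))

    # QueryNorm: 1 / sqrt(sum of squared term weights); does not change ranking.
    query_norm = 1.0 / math.sqrt(sum(idf(t) ** 2 for t in query))
    # lengthNorm: fields with fewer terms score higher.
    length_norm = 1.0 / math.sqrt(len(doc))
    matched = [t for t in query if t in doc]
    # coord: fraction of the query terms that the document contains.
    coord = len(matched) / len(query)

    total = 0.0
    for t in matched:
        tf = 1.0 if fixed_tf else math.sqrt(doc.count(t))
        total += tf * idf(t) ** 2 * boost * length_norm
    return coord * query_norm * total


# Hypothetical candidate records: a short name, an exact match, and a long name
# that repeats one of the query terms.
corpus = [["ping_an", "group"],
          ["china", "ping_an"],
          ["china", "ping_an", "nanjing", "ping_an", "branch"]]
query = ["china", "ping_an"]
ranking = sorted(range(len(corpus)),
                 key=lambda i: first_score(query, corpus[i], corpus), reverse=True)
print(ranking)  # [1, 2, 0]
```

With tf fixed at 1, the repeated term in the long record no longer raises its score, and the exact match ranks first. Passing `fixed_tf=False` restores the frequency-dependent behaviour for comparison.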
Step S520: obtain the second score of this search according to the minimum edit distance method.
Specifically, the edit distance, also known as the Levenshtein distance, is the minimum number of edit operations needed to transform one string into another. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. In general, the smaller the edit distance, the more similar the two strings. By calculating the edit distance between a search result and the query object, the second score of the search result can be obtained, and the search result with the minimum edit distance is the one most similar to the query object.
Specifically, the step of "obtaining the second score of this search according to the minimum edit distance method" comprises:
calculating the edit distance between the query object and the search result;
obtaining the minimum edit distance; and
taking the value of the minimum edit distance as the second score.
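The edit distance used for the second score can be computed with the standard dynamic-programming recurrence. The sketch below is one conventional way to do it; the helper `closest_result`, which picks the search result with the minimum distance, is a hypothetical name introduced for illustration.

```python
from typing import Iterable, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and replace."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # replace or keep
        previous = current
    return previous[-1]


def closest_result(query: str, results: Iterable[str]) -> Tuple[str, int]:
    """Return the search result with the minimum edit distance to the query."""
    return min(((r, edit_distance(query, r)) for r in results), key=lambda p: p[1])


print(edit_distance("kitten", "sitting"))      # 3
print(closest_result("abc", ["abd", "xyz"]))   # ('abd', 1)
```

Only two rows of the table are kept at a time, so memory use grows with the length of the shorter of the two roles played by `b`, not with the product of both lengths.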
The average of the first score and the second score is obtained and taken as the final score of this search.
Specifically, different weight factors may be set for the first score and the second score. Each score is multiplied by its weight factor and the products are added to give the evaluation score of the search result. For example, the formula may be: score = weight factor A * first score + weight factor B * second score, where the values of weight factor A and weight factor B are set as required. For instance, if the mean of the two scores is wanted, weight factor A and weight factor B can both be set to 0.5.
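The weighted combination, followed by the ranking and threshold step recited in claim 1, might look like the sketch below. It assumes that both scores have already been brought onto a common 0 to 1 scale on which larger is better; the description does not say how the raw edit distance is scaled, so that normalization is left out. The function names and sample values are hypothetical.

```python
from typing import List, Tuple


def final_score(first: float, second: float,
                weight_a: float = 0.5, weight_b: float = 0.5) -> float:
    """Weight factor A * first score + weight factor B * second score."""
    return weight_a * first + weight_b * second


def rank_results(scored: List[Tuple[str, float]], threshold: float = 0.40) -> List[str]:
    """Sort by score, keep results above the threshold, and format as percentages."""
    kept = [(name, score) for name, score in scored if score > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [f"{name}: {score:.0%}" for name, score in kept]


# Assumed, already-normalized (first score, second score) pairs per result.
scored = [("result_a", final_score(0.25, 0.25)),
          ("result_b", final_score(1.0, 0.5)),
          ("result_c", final_score(0.75, 0.25))]
print(rank_results(scored))  # ['result_b: 75%', 'result_c: 50%']
```

With the default weights of 0.5 the combined value is simply the mean of the two scores; other weightings only require passing different values for `weight_a` and `weight_b`.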
The execution order of the steps in the flowchart shown in Fig. 7 may be changed, and certain steps may be omitted.
The above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and do not limit the patent scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.
Claims (10)
1. A data processing method, applied to a server, characterized in that the method comprises the steps of:
acquiring text data or other types of data in a database, and processing the text data or other types of data in the database;
establishing, based on the Lucene search engine, an index for the processed text data or other types of data and generating an index file, and storing the index file in an index library;
receiving query information input by a user, processing the query information to generate a query object, searching the index file in the index library according to the query object, and scoring the search results with a preset search evaluation model; and
sorting the search results by score from high to low, and outputting and displaying, in a preset manner, the search results whose score is higher than a preset threshold;
wherein the preset manner is to generate a bar chart of the scores and to output and display the scores as percentages, and the preset threshold is 40%.
2. The data processing method according to claim 1, characterized in that the other types of data include pdf file data and office file data, and the step of processing the text data or other types of data in the database comprises:
converting the other types of data into text data;
performing word segmentation processing on the text data in the database and the converted text data through the steps of word segmentation, part-of-speech tagging, and word filtering; and
generating a word segmentation result, taking the filtered words as the final word segmentation result, and taking the final word segmentation result as the processed text data or other types of data.
3. The data processing method according to claim 2, characterized in that the step of "establishing, based on the Lucene search engine, an index for the processed text data or other types of data and generating an index file" comprises:
constructing an index library and setting the location of the index library for storing the index;
constructing an index creator for creating the index; and
establishing an index for the segmented text data or other types of data, creating a corresponding document description according to the file type, and setting the content of the corresponding attribute fields.
4. The data processing method according to claims 1-3, characterized in that the step of processing the query information to generate a query object comprises:
performing word segmentation processing on the query information, the word segmentation processing comprising word segmentation, part-of-speech tagging, and word filtering;
performing synonym and near-synonym conversion on the words in the segmentation set to obtain the synonym and near-synonym set of the segmentation set; and
taking the words in the segmentation set and in the synonym and near-synonym set as the query object.
5. The data processing method according to claim 4, characterized in that the scoring of the search results by the search evaluation model comprises the following steps:
obtaining the first score of this search according to the first scoring formula;
obtaining the second score of this search according to the minimum edit distance method; and
obtaining the average of the first score and the second score, and taking the average as the final score of this search.
6. The data processing method according to claim 5, characterized in that the first scoring formula is:
$$\mathrm{Score}(q,d) = \mathrm{coord}(q,d)\cdot\mathrm{QueryNorm}(q)\cdot\sum_{t \in q}\Big(\mathrm{tf}(t \text{ in } d)\cdot\mathrm{idf}(t)^{2}\cdot\mathrm{boost}(t.\mathrm{field} \text{ in } d)\cdot\mathrm{lengthNorm}(t.\mathrm{field} \text{ in } d)\Big)$$
wherein Score is the first score, q is the query information, t is each term obtained after the query information is segmented, d is the document to be matched, the function tf(t in d) represents the frequency with which term t occurs in the document, the function idf(t)^2 reflects the frequency with which term t occurs in all documents, boost(t.field in d) is the boost factor, the value of boost(t.field in d) * lengthNorm(t.field in d) reflects the total number of terms contained in the given field of the search result, coord(q, d) means that the more search terms a document contains, the higher the document scores, and QueryNorm(q) is computed from the sum of the squared weights of the query terms.
7. The data processing method according to claim 6, characterized in that the value of the function tf(t in d) is set to 1 to remove the influence of repeated words on the first score.
8. The data processing method according to claim 7, characterized in that the step of "obtaining the second score of this search according to the minimum edit distance method" comprises:
calculating the edit distance between the query object and the search result;
obtaining the minimum edit distance; and
taking the value of the minimum edit distance as the second score.
9. A server, characterized in that the server comprises a memory, a processor, and a data processing system stored in the memory and executable on the processor, wherein the data processing system, when executed by the processor, implements the steps of the data processing method according to any one of claims 1-8.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores a data processing system, and the data processing system is executable by at least one processor to cause the at least one processor to execute the steps of the data processing method according to any one of claims 1-8.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810198710.8A CN108520002A (en) | 2018-03-12 | 2018-03-12 | Data processing method, server and computer storage media |
PCT/CN2018/089335 WO2019174132A1 (en) | 2018-03-12 | 2018-05-31 | Data processing method, server and computer storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810198710.8A CN108520002A (en) | 2018-03-12 | 2018-03-12 | Data processing method, server and computer storage media |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108520002A true CN108520002A (en) | 2018-09-11 |
Family
ID=63433123
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810198710.8A Pending CN108520002A (en) | 2018-03-12 | 2018-03-12 | Data processing method, server and computer storage media |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108520002A (en) |
WO (1) | WO2019174132A1 (en) |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110909725B (en) * | 2019-10-18 | 2023-09-19 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for recognizing text |
CN111078731A (en) * | 2019-11-25 | 2020-04-28 | 国网冀北电力有限公司 | Hbase-based power grid operation data collaborative query method and device and storage medium |
CN111143666A (en) * | 2019-12-04 | 2020-05-12 | 深圳市智微智能软件开发有限公司 | Steel mesh inventory query method and system |
CN111159285B (en) * | 2019-12-05 | 2023-04-21 | 北京机电工程研究所 | Enterprise cross-system retrieval method based on distributed index service deployment |
CN111062193B (en) * | 2019-12-16 | 2023-04-25 | 医渡云(北京)技术有限公司 | Medical data labeling method and device, storage medium and electronic equipment |
CN111125417B (en) * | 2019-12-30 | 2023-03-31 | 深圳云天励飞技术有限公司 | Data searching method and device, electronic equipment and storage medium |
CN111488736B (en) * | 2020-03-31 | 2023-05-26 | 上海七印信息科技有限公司 | Self-learning word segmentation method, device, computer equipment and storage medium |
CN111814040B (en) * | 2020-06-15 | 2024-06-21 | 深圳市明睿数据科技有限公司 | Maintenance case searching method, device, terminal equipment and storage medium |
CN111737607B (en) * | 2020-06-22 | 2023-11-10 | 中国银行股份有限公司 | Data processing method, device, electronic equipment and storage medium |
CN111881309B (en) * | 2020-07-30 | 2023-12-26 | 浪潮云信息技术股份公司 | Electronic license retrieval method, device and computer readable medium |
CN113065065B (en) * | 2021-03-30 | 2024-06-14 | 广联达科技股份有限公司 | Method, device and equipment for evaluating search performance and readable storage medium |
CN113377896A (en) * | 2021-05-19 | 2021-09-10 | 朗新科技集团股份有限公司 | Full-text quick retrieval method and device, electronic equipment and storage medium |
CN114996550B (en) * | 2021-05-24 | 2024-03-19 | 中移互联网有限公司 | Information retrieval method and device |
CN114443728B (en) * | 2022-01-04 | 2022-11-15 | 广州粤建三和软件股份有限公司 | Detection report searching method and device based on Elasticissearch |
CN114969310B (en) * | 2022-06-07 | 2024-04-05 | 南京云问网络技术有限公司 | Multi-dimensional data-oriented sectional search ordering system design method |
CN115563356B (en) * | 2022-09-30 | 2023-07-18 | 上海柯林布瑞信息技术有限公司 | Method and device for dynamically collecting and inquiring system interaction information based on monitoring service |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20080076585A (en) * | 2007-02-16 | 2008-08-20 | 강민수 | Network research server providing research function and method thereof, image forming apparatus providing research function, network security system providing research function and computer-readable recording medium |
CN101930438A (en) * | 2009-06-19 | 2010-12-29 | 阿里巴巴集团控股有限公司 | Search result generating method and information search system |
CN105045852A (en) * | 2015-07-06 | 2015-11-11 | 华东师范大学 | Full-text search engine system for teaching resources |
CN106528846A (en) * | 2016-11-21 | 2017-03-22 | 广州华多网络科技有限公司 | Retrieval method and device |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102360367A (en) * | 2011-09-29 | 2012-02-22 | 广州中浩控制技术有限公司 | XBRL (Extensible Business Reporting Language) data search method and search engine |
CN103838732A (en) * | 2012-11-21 | 2014-06-04 | 大连灵动科技发展有限公司 | Vertical search engine in life service field |
CN103838785A (en) * | 2012-11-27 | 2014-06-04 | 大连灵动科技发展有限公司 | Vertical search engine in patent field |
CN105930490A (en) * | 2016-05-03 | 2016-09-07 | 北京优宇通教育科技有限公司 | Intelligent selecting system for teaching resources |
-
2018
- 2018-03-12 CN CN201810198710.8A patent/CN108520002A/en active Pending
- 2018-05-31 WO PCT/CN2018/089335 patent/WO2019174132A1/en active Application Filing
Non-Patent Citations (1)
Title |
---|
苏勇等, 北京航空航天大学出版社 * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110310731A (en) * | 2019-07-08 | 2019-10-08 | 苏州阿基米德网络科技有限公司 | A kind of information matches inquiry system and its querying method |
CN110377620A (en) * | 2019-07-16 | 2019-10-25 | 四川康佳智能终端科技有限公司 | A kind of material searching method, computer and storage medium based on BOM tool |
WO2021077741A1 (en) * | 2019-10-25 | 2021-04-29 | 浪潮(北京)电子信息产业有限公司 | Gene data query method, system and device, and storage medium |
CN111177532A (en) * | 2019-12-02 | 2020-05-19 | 平安资产管理有限责任公司 | Vertical search method, device, computer system and readable storage medium |
CN111209462A (en) * | 2020-01-02 | 2020-05-29 | 北京字节跳动网络技术有限公司 | Data processing method, device and equipment |
CN111209462B (en) * | 2020-01-02 | 2021-05-18 | 北京字节跳动网络技术有限公司 | Data processing method, device and equipment |
CN111859066A (en) * | 2020-06-03 | 2020-10-30 | 广东电网有限责任公司 | Query recommendation method and device for operation and maintenance work order |
CN111859066B (en) * | 2020-06-03 | 2023-01-20 | 广东电网有限责任公司 | Query recommendation method and device for operation and maintenance work order |
CN112507133A (en) * | 2020-12-16 | 2021-03-16 | 国泰君安证券股份有限公司 | Method, device, processor and storage medium for realizing association search based on financial product knowledge graph |
CN112507133B (en) * | 2020-12-16 | 2024-02-06 | 国泰君安证券股份有限公司 | Method, device, processor and storage medium for realizing association search based on financial product knowledge graph |
CN113190649A (en) * | 2021-04-16 | 2021-07-30 | 量子数聚(北京)科技有限公司 | Enterprise name searching and matching method and device based on ElasticSearch |
Also Published As
Publication number | Publication date |
---|---|
WO2019174132A1 (en) | 2019-09-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108520002A (en) | Data processing method, server and computer storage media | |
CN108038096A (en) | Knowledge database documents method for quickly retrieving, application server computer readable storage medium storing program for executing | |
Ding et al. | Entity discovery and assignment for opinion mining applications | |
US9183288B2 (en) | System and method of structuring data for search using latent semantic analysis techniques | |
US20170161375A1 (en) | Clustering documents based on textual content | |
CN111104794A (en) | Text similarity matching method based on subject words | |
JP5092165B2 (en) | Data construction method and system | |
US20090119281A1 (en) | Granular knowledge based search engine | |
CN111213140A (en) | Method and system for semantic search in large database | |
CN108647322B (en) | Method for identifying similarity of mass Web text information based on word network | |
CN103678576A (en) | Full-text retrieval system based on dynamic semantic analysis | |
CN102999625A (en) | Method for realizing semantic extension on retrieval request | |
CN107844493B (en) | File association method and system | |
CN112000773B (en) | Search engine technology-based data association relation mining method and application | |
CN111767716A (en) | Method and device for determining enterprise multilevel industry information and computer equipment | |
CN113190687B (en) | Knowledge graph determining method and device, computer equipment and storage medium | |
CN113407785B (en) | Data processing method and system based on distributed storage system | |
CN112115227A (en) | Data query method and device, electronic equipment and storage medium | |
CN106844482B (en) | Search engine-based retrieval information matching method and device | |
CN115563313A (en) | Knowledge graph-based document book semantic retrieval system | |
CN114722137A (en) | Security policy configuration method and device based on sensitive data identification and electronic equipment | |
CN113065070A (en) | Intelligent sorting method, system, equipment and computer storage medium for mobile internet information search and retrieval | |
CN112328805A (en) | Entity mapping method of vulnerability description information and database table based on NLP | |
Wu et al. | Searching online book documents and analyzing book citations | |
CN110008407B (en) | Information retrieval method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180911 |