CN106528846A - Retrieval method and device - Google Patents


Publication number
CN106528846A
Authority
CN
China
Prior art keywords
retrieval
similarity
term
retrieval object
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611048869.9A
Other languages
Chinese (zh)
Other versions
CN106528846B (en)
Inventor
夏集球
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huaduo Network Technology Co Ltd
Original Assignee
Guangzhou Huaduo Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huaduo Network Technology Co Ltd filed Critical Guangzhou Huaduo Network Technology Co Ltd
Priority to CN201611048869.9A priority Critical patent/CN106528846B/en
Publication of CN106528846A publication Critical patent/CN106528846A/en
Application granted granted Critical
Publication of CN106528846B publication Critical patent/CN106528846B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a retrieval method and device. The method is applied to a distributed retrieval platform established on the basis of a plurality of retrieval servers. The method comprises the following steps: receiving retrieval terms sent by a user through a retrieval client; traversing the retrieval terms, and performing adjacent term segmentation on the retrieval terms to generate multiple double-term groups; based on the multiple double-term groups, searching retrieval objects corresponding to the multiple double-term groups in a preset retrieval database, and generating a retrieval object list; and calculating the similarity between the retrieval terms and the retrieval objects in the retrieval object list, and returning multiple retrieval objects with similarity higher than a preset threshold to the retrieval client to serve as retrieval results, and displaying the retrieval objects to the user. By adoption of the retrieval method provided by the invention, the text retrieval accuracy in ancient Chinese can be effectively improved.

Description

Retrieval method and device
Technical field
The present application relates to the field of computer communications, and in particular to a retrieval method and device.
Background art
Retrieving classical Chinese texts is now a common retrieval task: by entering a retrieval term, a user can retrieve the classical Chinese texts that contain that term.
In practice, however, when the term entered by the user is incorrect, for example because it contains a wrong character, it is difficult to find the classical Chinese text the user needs. How to improve the accuracy of classical Chinese text retrieval has therefore become a problem that urgently needs to be solved.
Summary of the invention
In view of this, the present application provides a retrieval method and device for improving the accuracy of classical Chinese text retrieval.
Specifically, the present application is implemented through the following technical solutions:
According to a first aspect of the embodiments of the present application, a retrieval method is provided. The method is applied to a distributed retrieval platform built on a plurality of retrieval servers, and comprises:
receiving a retrieval term sent by a user through a retrieval client;
traversing the retrieval term and performing adjacent-character segmentation on it to generate several two-character groups;
based on the several two-character groups, searching a preset retrieval database for the retrieval objects corresponding to each of the two-character groups, and generating a retrieval object list; wherein the retrieval database stores in advance the two-character groups generated by performing adjacent-character segmentation on each of the retrieval objects it contains, together with the correspondence between each two-character group and the retrieval objects in the database that contain it;
calculating the similarity between the retrieval term and each retrieval object in the retrieval object list, and returning the retrieval objects whose similarity is higher than a preset threshold to the retrieval client as retrieval results, for display to the user.
According to a second aspect of the embodiments of the present application, a retrieval device is provided. The device is applied to a distributed retrieval platform built on a plurality of retrieval servers, and comprises:
a receiving unit, configured to receive a retrieval term sent by a user through a retrieval client;
a segmentation unit, configured to traverse the retrieval term and perform adjacent-character segmentation on it to generate several two-character groups;
a searching unit, configured to search, based on the several two-character groups, a preset retrieval database for the retrieval objects corresponding to each of the two-character groups, and to generate a retrieval object list; wherein the retrieval database stores in advance the two-character groups generated by performing adjacent-character segmentation on each of the retrieval objects it contains, together with the correspondence between each two-character group and the retrieval objects in the database that contain it;
a computing unit, configured to calculate the similarity between the retrieval term and each retrieval object in the retrieval object list, and to return the retrieval objects whose similarity is higher than a preset threshold to the retrieval client as retrieval results, for display to the user.
The embodiments of the present application provide a retrieval method applied to a distributed retrieval platform built on a plurality of retrieval servers. The distributed retrieval platform receives the retrieval term sent by a user through a retrieval client, traverses the term, and performs adjacent-character segmentation on it to generate several two-character groups. Based on each of these two-character groups, the retrieval platform searches a preset retrieval database for the corresponding retrieval objects and generates a retrieval object list. It then calculates the similarity between the term and each retrieval object in the list, and returns the several retrieval objects with the highest similarity to the retrieval client as retrieval results, for display to the user.
Because the retrieval method provided by the present application no longer uses modern Chinese word segmentation, but instead performs adjacent-character segmentation on the retrieval term to generate two-character groups, it can effectively segment a term that the user enters for a classical Chinese text. The segmentation result is more reasonable, and when the retrieval platform searches with these segments, the retrieval results are more relevant to the term. The retrieval method provided by the present application can therefore effectively improve the accuracy of classical Chinese text retrieval.
Brief description of the drawings
Fig. 1 is a network architecture diagram for a retrieval method according to an exemplary embodiment of the present application;
Fig. 2 is a schematic diagram of the interactive interface of a client according to an exemplary embodiment of the present application;
Fig. 3 is a flow chart of a retrieval method according to an exemplary embodiment of the present application;
Fig. 4 is a hardware structure diagram of the equipment on which a retrieval device resides, according to an exemplary embodiment of the present application;
Fig. 5 is a block diagram of a retrieval device according to an exemplary embodiment of the present application.
Detailed description
Exemplary embodiments are described in detail here, and examples of them are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application. Rather, they are merely examples of devices and methods consistent with some aspects of the present application, as detailed in the appended claims.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to limit the application. The singular forms "a", "said" and "the" used in the present application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third and so on may be used in the present application to describe various information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used here may be interpreted as "when", "upon" or "in response to determining".
Retrieving classical Chinese texts is now a common retrieval task: by entering a retrieval term, a user can retrieve the classical Chinese texts they need that contain that term.
In related classical Chinese text retrieval mechanisms, classical Chinese texts are typically retrieved using either a database fuzzy-matching scheme or a full-text retrieval scheme based on Lucene (a full-text search engine) and modern Chinese word segmentation.
However, both the database fuzzy-matching scheme and the full-text retrieval scheme based on Lucene and modern Chinese word segmentation usually suffer from the defects described below when used to retrieve classical Chinese texts.
When the database fuzzy-matching scheme is used to retrieve classical Chinese texts, on the one hand, if the term entered by the user contains a wrong character, the results returned have little relevance to the term and may not be the results the user needs at all.
For example, suppose the user wants to find "Quiet Night Thoughts" (《静夜思》) and enters "bright moonlight before the window" (窗前明月光), having mistyped "before the bed" (床前) as "before the window" (窗前). Because database fuzzy matching retrieves by exact keyword matching, the line "bright moonlight before the window" cannot be found in the database, and so "Quiet Night Thoughts" cannot be found.
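The failure described above can be reproduced with a minimal sketch. It assumes, as the text characterizes it, that fuzzy matching reduces to an exact substring test; the function name `exact_match` and the in-memory list of lines are hypothetical, and the strings are the Chinese originals of the poem quoted above.

```python
# Lines of the poem "Quiet Night Thoughts" in the original Chinese.
POEM_LINES = ["床前明月光", "疑是地上霜", "举头望明月", "低头思故乡"]


def exact_match(term: str, lines: list[str]) -> bool:
    """Return True if the term occurs verbatim in any stored line."""
    return any(term in line for line in lines)


# The correct line is found, the line with one wrong character is not.
found_correct = exact_match("床前明月光", POEM_LINES)
found_typo = exact_match("窗前明月光", POEM_LINES)
print(found_correct, found_typo)
```

A single wrong character is enough to make the whole term miss, which is the defect the two-character approach is meant to address.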
On the other hand, the database fuzzy-matching scheme is slow. Under highly concurrent access the database often comes under excessive load, to the point where no result is returned even after tens of seconds.
When the retrieval scheme based on Lucene and modern Chinese word segmentation is used to retrieve classical Chinese texts, on the one hand, modern Chinese word segmentation does not segment classical Chinese text well, so it is difficult to return the retrieval results most relevant to the term.
For example, when modern Chinese word segmentation is applied to "bright moonlight before the bed" (床前明月光), the result is usually "bright moon" (明月), "moonlight" (月光) and so on. During retrieval, the retrieval platform uses "bright moon" and "moonlight" as retrieval keys, but these match a large number of results containing them, so the results often have low relevance to the term.
In addition, when the user enters a verse such as "wind determines cloud mass colour in a moment" or "five is neat both old, and octave is in county", modern Chinese word segmentation can hardly split the verse into modern Chinese words, so retrieval cannot be performed. The scheme based on Lucene and modern Chinese word segmentation therefore gives poor retrieval results.
On the other hand, Lucene is an open-source full-text search engine toolkit. Developers can use Lucene to implement full-text retrieval in a target system, or build a complete full-text search engine on top of it. However, Lucene does not support the development of a distributed retrieval platform, so the platform cannot be scaled horizontally, and the scheme based on Lucene and modern Chinese word segmentation greatly reduces retrieval performance.
To solve the above problems, the embodiments of the present application provide a retrieval method applied to a distributed retrieval platform built on a plurality of retrieval servers. The distributed retrieval platform receives the retrieval term sent by a user through a retrieval client, traverses the term, and performs adjacent-character segmentation on it to generate several two-character groups. Based on each of these two-character groups, the retrieval platform searches a preset retrieval database for the corresponding retrieval objects and generates a retrieval object list. It then calculates the similarity between the term and each retrieval object in the list, and returns the several retrieval objects with the highest similarity to the retrieval client as retrieval results, for display to the user.
Because the retrieval method provided by the present application no longer uses modern Chinese word segmentation, but instead performs adjacent-character segmentation on the retrieval term to generate two-character groups, it can effectively segment a term that the user enters for a classical Chinese text. The segmentation result is more reasonable, and when the retrieval platform searches with these segments, the retrieval results are more relevant to the term. The retrieval method provided by the present application can therefore effectively improve the accuracy of classical Chinese text retrieval.
Referring to Fig. 1, Fig. 1 is a network architecture diagram for a retrieval method according to an exemplary embodiment of the present application. The network architecture shown includes a client and a distributed retrieval platform built on a plurality of retrieval servers.
The client may include user-facing client software for performing retrieval. As shown in Fig. 2, Fig. 2 is a schematic diagram of the interactive interface of a client according to an exemplary embodiment of the present application. Typically, the client provides the user with an interactive interface through which the user enters the term to be retrieved. After the client sends the term to the retrieval platform for retrieval, the user can read the returned retrieval results on the client's interactive interface. For example, the client may be an app such as "Ancient Poetry Dictionary".
The hardware environment hosting the client may be, for example, a PC or a mobile terminal, and is not specifically limited in this embodiment.
The retrieval platform is a distributed retrieval platform with a search function, built on a plurality of retrieval servers. It can perform distributed lookup of the retrieval objects corresponding to the term entered by the user, distributed similarity calculation, and so on.
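The text does not say how the lookup is distributed across servers. One possible reading, sketched below purely as an assumption, is that each retrieval server holds a shard of the key-to-object mapping, the query is fanned out to every shard, and the partial results are merged. The shards are simulated in-process with dictionaries and threads; `lookup_on_shard` and `distributed_lookup` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def lookup_on_shard(shard: dict[str, set[str]], bigrams: list[str]) -> set[str]:
    """Collect the titles that one shard stores under any of the given keys."""
    found: set[str] = set()
    for bigram in bigrams:
        found |= shard.get(bigram, set())
    return found


def distributed_lookup(shards: list[dict[str, set[str]]], bigrams: list[str]) -> set[str]:
    """Query every shard concurrently and merge the partial results.

    Assumes at least one shard; each thread stands in for one retrieval server.
    """
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: lookup_on_shard(shard, bigrams), shards)
        merged: set[str] = set()
        for partial in partials:
            merged |= partial
    return merged


# Two simulated servers, each holding part of the mapping.
SHARDS = [
    {"床前": {"静夜思"}},
    {"明月": {"水调歌头", "月下独酌"}},
]
result = distributed_lookup(SHARDS, ["床前", "明月", "月光"])
```

In a real deployment the per-shard call would be a network request, and the same fan-out pattern could be reused for the similarity calculation.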
Referring to Fig. 3, Fig. 3 is a flow chart of a retrieval method according to an exemplary embodiment of the present application. The method is applied to the above distributed retrieval platform built on a plurality of retrieval servers, and specifically includes the following steps:
Step 301: receive a retrieval term sent by a user through a retrieval client;
Step 302: traverse the retrieval term and perform adjacent-character segmentation on it to generate several two-character groups;
Step 303: based on each of the two-character groups, search a preset retrieval database for the retrieval objects corresponding to that two-character group, and generate a retrieval object list; wherein the retrieval database stores in advance the two-character groups generated by performing adjacent-character segmentation on each of the retrieval objects it contains, together with the correspondence between each two-character group and the retrieval objects in the database that contain it;
Step 304: calculate the similarity between the retrieval term and each retrieval object in the retrieval object list, and return the retrieval objects whose similarity is higher than a preset threshold to the retrieval client as retrieval results, for display to the user.
The retrieval term is the search keyword entered by the user through the client, by which the retrieval platform retrieves the retrieval objects that contain it. The term may be a verse entered by the user, such as "bright moonlight before the bed" (床前明月光), a character or word within a verse, or information such as the title or author of a classical Chinese text. This is only an example of the content of a term and is not a specific limitation.
A retrieval object is the set of all information about an item retrieved by the term. For example, when the user enters 床前明月光, the retrieval object is the set of all information about the poem "Quiet Night Thoughts" (《静夜思》), which may include the title of the poem, the author Li Bai, the content of the poem, and so on.
Adjacent-character segmentation is a segmentation method. Following a given order, for example from left to right, it cuts out each pair of adjacent characters in the term entered by the user in turn, generating several two-character groups.
Take the term 床前明月光 as an example, proceeding from left to right. Starting with the leftmost character as the first character, the first and second characters are combined and cut out as a two-character group, giving 床前 ("before the bed"). The second and third characters are then combined and cut out, giving 前明, and so on, until the last character 光 has no following adjacent character to combine with. Adjacent-character segmentation thus generates the two-character groups 床前, 前明, 明月 ("bright moon") and 月光 ("moonlight").
This is only an example of the order of adjacent-character segmentation and is not a specific limitation.
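The left-to-right cutting described above can be sketched in a few lines. The function name `adjacent_bigrams` is an illustrative choice, not a name from the application, and the input is the original Chinese line 床前明月光 (rendered in the translation as "bright moon light before bed").

```python
def adjacent_bigrams(term: str) -> list[str]:
    """Cut a term into overlapping two-character groups, left to right.

    A term of n characters yields n - 1 groups; a term shorter than
    two characters yields none.
    """
    return [term[i:i + 2] for i in range(len(term) - 1)]


print(adjacent_bigrams("床前明月光"))  # ['床前', '前明', '明月', '月光']
```

Because every pair of neighbouring characters becomes a key, no dictionary of modern Chinese words is needed, which is why the approach also works on classical phrasing.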
Classical Chinese text refers to Chinese text written in classical Chinese. It may include classical poetry (shi and ci), classical prose, songs, fu, qu, and so on.
The embodiments of the present application propose a retrieval method. On the one hand, because the retrieval platform segments the term by adjacent-character segmentation, the accuracy of classical Chinese text retrieval can be improved. On the other hand, because retrieval is performed on a distributed retrieval platform, the speed of classical Chinese text retrieval can be improved.
Before classical Chinese texts can be retrieved, the retrieval objects in the retrieval database need to be configured accordingly. This configuration is described in detail below.
Configuring the retrieval objects in the retrieval database means processing the retrieval objects in advance and storing them in the retrieval database. The purpose is that, when the user enters a term relating to some retrieval object, the retrieval platform can return the matching retrieval object stored in the database to the client.
In the embodiments of the present application, the retrieval platform may split a preset number of retrieval objects in the database into several substrings, based on a regular expression for non-Chinese characters. Using adjacent-character segmentation, the retrieval platform may then segment each substring to generate several two-character groups. Taking each two-character group generated from the substrings as a retrieval key, it establishes a mapping between that two-character group and the retrieval objects corresponding to it.
In implementation, the retrieval platform may receive retrieval objects entered by developers or imported from other databases, process them accordingly, and then store them in the retrieval database.
The segmentation applied to a single retrieval object is described in detail below.
The retrieval platform may split the content of the retrieval object into several substrings based on the regular expression for non-Chinese characters. It may select the substrings in turn as the string to be segmented and perform adjacent-character segmentation on that string to generate several two-character groups. The retrieval platform may then establish a mapping between each of these two-character groups and the retrieval object that contains it.
The segmentation is described in detail below, taking the poem titled "Quiet Night Thoughts" (《静夜思》) as the retrieval object, in the scenario of configuring a retrieval database for classical poetry.
The retrieval platform may split the "Quiet Night Thoughts" retrieval object based on the regular expression for non-Chinese characters, producing several character strings, for example 静夜思 ("Quiet Night Thoughts"), 李白 (Li Bai), 床前明月光, 疑是地上霜, 举头望明月 and 低头思故乡.
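A minimal sketch of this splitting step follows. The text does not give the regular expression itself, so the pattern here is an assumption: it treats anything outside the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) as a separator. The flat record string and the function name `split_into_substrings` are likewise hypothetical.

```python
import re

# Assumed definition of "non-Chinese character": anything outside the
# basic CJK Unified Ideographs block. Runs of such characters act as one separator.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def split_into_substrings(text: str) -> list[str]:
    """Split a retrieval object's text on non-Chinese characters, dropping empties."""
    return [part for part in NON_CHINESE.split(text) if part]


# Hypothetical flat record for the poem: title, author, then the four lines.
RECORD = "《静夜思》李白：床前明月光，疑是地上霜。举头望明月，低头思故乡。"
substrings = split_into_substrings(RECORD)
print(substrings)
```

Splitting on punctuation first keeps two-character groups from spanning a line break, so a key such as 光疑 is never generated.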
The retrieval platform may select the four strings of the poem's content in turn as the string to be segmented. Suppose it selects 床前明月光: the retrieval platform performs adjacent-character segmentation on it, generating in order the two-character groups 床前, 前明, 明月 and 月光. In the same way, it may select each of the remaining three strings as the string to be segmented and perform adjacent-character segmentation on them in turn, generating further two-character groups.
The retrieval platform may establish a mapping between each two-character group cut out above and the retrieval object, for example a mapping between 床前 and the retrieval object titled "Quiet Night Thoughts", a mapping between 前明 and the retrieval object titled "Quiet Night Thoughts", and so on.
The retrieval platform may apply the same segmentation to the remaining retrieval objects, obtaining the two-character groups corresponding to each retrieval object.
After the above segmentation has been completed for all retrieval objects, the retrieval platform may, based on the above mappings, store each two-character group as a retrieval key and the retrieval objects containing that two-character group as the corresponding value in the database. During retrieval, the platform can use a key to find the value information associated with it.
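The key-to-value store described above is, in effect, an inverted index. The sketch below builds one in memory as an illustration only: the application stores the mapping in a retrieval database whose schema is not given, so the dictionary layout, the use of titles as object identifiers, and the names `build_index` and `POEMS` are assumptions. The sample lines are taken from the three poems by their well-known opening or early lines.

```python
from collections import defaultdict


def adjacent_bigrams(text: str) -> list[str]:
    """Cut a string into overlapping two-character groups, left to right."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_index(objects: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each two-character group (key) to the titles of the objects containing it (value)."""
    index: dict[str, set[str]] = defaultdict(set)
    for title, lines in objects.items():
        for line in lines:
            for bigram in adjacent_bigrams(line):
                index[bigram].add(title)
    return index


# Hypothetical sample data: title -> content strings.
POEMS = {
    "静夜思": ["床前明月光", "疑是地上霜", "举头望明月", "低头思故乡"],
    "水调歌头": ["明月几时有", "把酒问青天"],
    "月下独酌": ["举杯邀明月", "对影成三人"],
}
index = build_index(POEMS)
print(sorted(index["明月"]))
```

A common key such as 明月 maps to all three poems, while 床前 maps only to "Quiet Night Thoughts", which is what lets the later similarity step separate close matches from incidental ones.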
Note that each two-character group cut out may correspond to several retrieval objects, and each retrieval object may contain several two-character groups. Each retrieval object includes related information such as its content, title and author. This information may be stored as character strings. For example, for the retrieval object titled "Quiet Night Thoughts", the content may be stored as the four strings 床前明月光, 疑是地上霜, 举头望明月 and 低头思故乡. This is only an example of how a retrieval object may be stored and is not a specific limitation.
The storage form of retrieval objects in the retrieval database is shown in Table 1, which gives an example of that storage form.
For example, in classical poetry retrieval, with 明月 ("bright moon") as the key, the value may include the retrieval object titled "Quiet Night Thoughts" (《静夜思》), the retrieval object titled "Prelude to Water Melody" (《水调歌头》), the retrieval object titled "Drinking Alone under the Moon" (《月下独酌》), and so on. During retrieval, the retrieval platform can use the key 明月 to retrieve from the database all retrieval objects that contain it.
Table 1
In addition, in the embodiments of the present application, when the retrieval object is a classical Chinese text, before segmenting the substrings by adjacent-character segmentation to generate two-character groups, the retrieval platform deletes the non-Chinese characters and interjection phrases from the strings into which the retrieval object has been split.
In implementation, before segmenting the substrings by adjacent-character segmentation, the retrieval platform may first delete the non-Chinese characters and interjection phrases from the substrings produced by splitting each retrieval object.
A non-Chinese character is a character that is not Chinese, and may include punctuation marks and the like. An interjection phrase is a phrase made up of interjections, for example " ", " ", " " in classical prose or classical poetry. These are only examples of non-Chinese characters and interjection phrases and are not a specific limitation.
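The cleaning step can be sketched as follows. The interjection examples in the text are blank, so the list used here (兮, 矣, 哉) is purely a hypothetical stand-in, as are the name `clean_substring` and the definition of a Chinese character as a code point in the basic CJK Unified Ideographs block.

```python
import re

# Assumed definition of "non-Chinese character" (single characters this time).
NON_CHINESE_CHAR = re.compile(r"[^\u4e00-\u9fff]")

# Hypothetical interjection list; the application's own examples are not legible.
ASSUMED_INTERJECTIONS = ("兮", "矣", "哉")


def clean_substring(text: str) -> str:
    """Delete non-Chinese characters, then the assumed interjections."""
    text = NON_CHINESE_CHAR.sub("", text)
    for word in ASSUMED_INTERJECTIONS:
        text = text.replace(word, "")
    return text


cleaned = clean_substring("大风起兮云飞扬!")
print(cleaned)  # 大风起云飞扬
```

Removing the filler before segmentation means a term entered without the interjection still produces keys that match the stored line.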
The above describes the configuration of the retrieval objects in the retrieval database. The retrieval method is described in detail below.
In the embodiments of the present application, a retrieval term sent by a user through a retrieval client is received; the term is traversed and adjacent-character segmentation is performed on it to generate several two-character groups; based on each of these two-character groups, a preset retrieval database is searched for the corresponding retrieval objects and a retrieval object list is generated, wherein the retrieval database stores in advance the two-character groups generated by performing adjacent-character segmentation on each of the retrieval objects it contains, together with the correspondence between each two-character group and the retrieval objects in the database that contain it; the similarity between the term and each retrieval object in the list is calculated, and the retrieval objects whose similarity is higher than a preset threshold are returned to the retrieval client as retrieval results, for display to the user.
In implementation, the user may enter a term at the designated position of the interactive interface provided by the client. After receiving the term entered by the user, the client may send it to the retrieval platform.
The retrieval platform may traverse the term received from the client and, following a given order, perform the above adjacent-character segmentation on it, cutting the term into several two-character groups.
It should be noted that in related search method, generally using the method for Modern Chinese participle in archaic Chinese Text enters line retrieval, as phrase and the Modern Chinese phrase of archaic Chinese literature have larger difference, so it is difficult with The method of Modern Chinese participle carries out cutting to the term based on archaic Chinese Chinese text, so as to be difficult according to the word being syncopated as Group finds the higher retrieval object of the degree of association.For example, by taking " bright moon light before bed " as an example, using the method for Modern Chinese participle, Generate phrase and be generally " bright moon ", " moonlight ", examining iff " bright moon ", " moonlight " the two Modern Chinese phrases is based only on Corresponding retrieval object is matched in rope data base, the quantity of the retrieval object for matching is more, and the degree of association is not high, because This, greatly reduces the accuracy of archaic Chinese Chinese text retrieval.
The embodiments of the present application instead use adjacent-character segmentation: following a certain order, the term entered by the user is segmented into two-character phrases formed from adjacent characters. The resulting phrases are no longer limited to those that conform to modern Chinese grammar and can also include classical Chinese phrases. Because this segmentation suits classical text, it can effectively improve the accuracy of classical Chinese text retrieval.
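As an illustration of adjacent-character segmentation, the following minimal Python sketch slides a two-character window over the term from left to right. The function name `adjacent_bigrams` is hypothetical, and left-to-right order is an assumption, since the text only says "a certain order".

```python
def adjacent_bigrams(term: str) -> list[str]:
    """Return the overlapping two-character phrases of `term`, left to right."""
    # A term of length n yields n - 1 phrases; a single character yields none.
    return [term[i:i + 2] for i in range(len(term) - 1)]


print(adjacent_bigrams("床前明月光"))
```

A term of five characters produces four two-character phrases, including pairs such as "前明" that a modern Chinese word segmenter would not emit.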
After the adjacent-character segmentation of the term is complete, the search platform can select one of the two-character phrases and look up, in the retrieval database configured as described above, the retrieval objects corresponding to it (that is, the retrieval objects that contain this two-character phrase). It then looks up, in the same way, the retrieval objects corresponding to the remaining two-character phrases.
After these lookups are complete, the search platform can merge the retrieval objects corresponding to all the two-character phrases cut from the term and generate the retrieval object list.
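The lookup and merge steps can be sketched as follows, assuming the retrieval database is exposed as a mapping from two-character phrase to the identifiers of the objects that contain it. The mapping contents, the identifiers `poem_a`, `poem_b`, `poem_c`, and the function name `lookup_objects` are all made up for illustration.

```python
def lookup_objects(bigrams, index):
    """Merge the objects indexed under each phrase, keeping first-seen order."""
    found = {}  # dict keys act as an ordered set
    for bigram in bigrams:
        for obj_id in index.get(bigram, ()):  # phrases with no entry are skipped
            found.setdefault(obj_id, None)
    return list(found)


# Hypothetical index contents, for illustration only.
index = {
    "床前": ["poem_a"],
    "明月": ["poem_a", "poem_b"],
    "月光": ["poem_a", "poem_c"],
}
object_list = lookup_objects(["床前", "前明", "明月", "月光"], index)
print(object_list)
```

Each object appears once in the merged list even when several phrases point to it.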
For example, Table 2 gives an exemplary retrieval object list. When the term entered by the user is "床前明月光" ("bright moonlight before the bed"), the search platform can perform adjacent-character segmentation on it in order, generating the two-character phrases "床前", "前明", "明月", and "月光". It then looks up the retrieval objects corresponding to each of these phrases in the retrieval database and generates the retrieval object list.
Table 2
The search platform can calculate the similarity between each retrieval object in the generated list and the term entered by the user, sort the resulting similarity values, and return the retrieval objects whose similarity is higher than a preset threshold to the client as the retrieval result, to be displayed to the user.
The calculation of the similarity between the term entered by the user and one retrieval object in the retrieval object list is described in detail below. The similarity between the term and each of the remaining retrieval objects in the list is calculated in the same way.
The similarity between the term and a retrieval object in the list can be calculated with an edit distance algorithm, or with another algorithm defined by the developer.
In this embodiment, the search platform calculates the similarity between the term and a retrieval object in the list based on an edit distance algorithm, which includes the following steps. The search platform splits the term into several substrings using a regular expression that matches non-Chinese characters. Each of these substrings is taken in turn as the target substring. For each target substring, the edit distance to every substring obtained by splitting the retrieval object is calculated, and the minimum of these edit distances is taken. The minimum edit distances of all the substrings of the term are summed and then averaged, giving the similarity between the term and the retrieval object. The similarity is expressed as an edit distance: the smaller the edit distance, the higher the similarity.
Here, the edit distance between two character strings is the minimum number of edit operations required to transform one string into the other. The permitted edit operations are substituting one character for another, inserting a character, and deleting a character. In general, the smaller the edit distance, the more similar the two strings.
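The definition above matches the standard Levenshtein distance. A minimal dynamic-programming sketch is given below; the text does not say how the distance is computed, so the two-row table used here is simply one common way to do it, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: substitution, insertion and deletion each cost 1."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (free if equal)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("窗前明月光", "床前明月光"))
```

A term with one wrong character, such as "窗前明月光" for "床前明月光", is only one substitution away from the intended line.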
It should be noted that related retrieval methods generally use database fuzzy queries based on exact matching of the term. Such queries require the term entered by the user to be accurate; otherwise it is difficult to return the result the user needs.
For example, if the term entered by the user is "窗前明月光" ("bright moonlight before the window"), a database fuzzy query based on exact matching of the term can hardly find 《静夜思》 ("Quiet Night Thoughts"), which contains "床前明月光" ("bright moonlight before the bed").
The retrieval method provided by the embodiments of the present application instead uses an edit distance algorithm to calculate the similarity between the term and a retrieval object. On the one hand, expressing the similarity as an edit distance quantifies it: the size of the edit distance directly shows how similar the two are, which improves the accuracy of the returned results. On the other hand, even if the term entered by the user contains a few wrong characters, this similarity calculation can still return highly similar retrieval objects to the user.
The process of calculating, based on the edit distance algorithm, the edit distance between the term and one retrieval object in the retrieval object list is described in detail below.
The search platform can split the term entered by the user into several substrings using a regular expression that matches non-Chinese characters, and can then take each of these substrings in turn as the target substring.
Taking one target substring as an example, the search platform can calculate the edit distance between the target substring and each substring obtained by splitting the retrieval object in the list, and take the minimum of the calculated edit distances.
The search platform can apply the same procedure to each of the remaining target substrings, obtaining the minimum edit distance for every target substring.
The search platform can then sum the minimum edit distances of all the target substrings obtained by splitting the term and divide the sum by the number of target substrings, giving the similarity between the term and the retrieval object.
The formula for the edit-distance-based similarity between the term and the retrieval object is as follows:
$$S = \frac{\min(Ld_{11}, Ld_{12}, \ldots, Ld_{1N}) + \min(Ld_{21}, Ld_{22}, \ldots, Ld_{2N}) + \cdots + \min(Ld_{M1}, Ld_{M2}, \ldots, Ld_{MN})}{M}$$
Here, $S$ denotes the similarity between the term and one retrieval object in the retrieval object list, $N$ denotes the number of substrings obtained by splitting the retrieval object, and $M$ denotes the number of substrings obtained by splitting the term. $Ld_{rp}$ denotes the edit distance between the $r$-th substring of the term and the $p$-th substring of the retrieval object.
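The formula can be sketched in Python as below. Several details are assumptions rather than part of the text: the regular expression treats "Chinese character" as the CJK Unified Ideographs range U+4E00 to U+9FFF, an empty input is scored as infinitely distant, and the names `edit_distance`, `split_substrings`, and `distance_score` are hypothetical.

```python
import re


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost substitute, insert and delete."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def split_substrings(text: str) -> list[str]:
    # Assumption: split on any run of characters outside U+4E00..U+9FFF.
    return [s for s in re.split(r"[^\u4e00-\u9fff]+", text) if s]


def distance_score(term: str, field: str) -> float:
    """S = (sum over the M term substrings of the minimum Ld to any field substring) / M."""
    term_parts = split_substrings(term)    # M substrings
    field_parts = split_substrings(field)  # N substrings
    if not term_parts or not field_parts:
        return float("inf")  # assumption: nothing to compare counts as dissimilar
    total = sum(min(edit_distance(t, f) for f in field_parts) for t in term_parts)
    return total / len(term_parts)


poem = "床前明月光,疑是地上霜。举头望明月,低头思故乡。"
print(distance_score("床前明月光,疑是地上霜", poem))
print(distance_score("窗前明月光", poem))
```

Because the score is an averaged edit distance, an exact quotation scores 0 and a smaller value means a closer match.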
In the embodiments of the present application, when a retrieval object includes the content, the title, and the author of the retrieval object, calculating the similarity between the term and a retrieval object in the list based on the edit distance algorithm includes: calculating, based on the edit distance algorithm, the similarity between the term and the content of the retrieval object to obtain a first similarity; calculating the similarity between the term and the title of the retrieval object to obtain a second similarity; calculating the similarity between the term and the author of the retrieval object to obtain a third similarity; and calculating the similarity between the term and the retrieval object based on the first, second, and third similarities.
In implementation, the search platform can use the edit distance algorithm described above to calculate the similarity between the term and the content, the title, and the author of the retrieval object, obtaining the first, second, and third similarities respectively. It can then calculate the similarity between the term and the retrieval object from these three similarities.
The calculation of the first similarity is described in detail below, taking the term "床前明月光,疑是地上霜" ("Bright moonlight before the bed, I take it for frost on the ground") as an example.
When the term entered by the user is "床前明月光,疑是地上霜", the search platform can split it using a regular expression that matches non-Chinese characters, generating two substrings: "床前明月光" and "疑是地上霜".
Assume the retrieval object is the entry titled 《静夜思》 ("Quiet Night Thoughts") in the retrieval object list. The search platform can split the content of 《静夜思》 using the same regular expression, obtaining four substrings: "床前明月光", "疑是地上霜", "举头望明月", and "低头思故乡".
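The two splits just described can be reproduced with Python's `re` module. The text does not give the regular expression itself; the pattern below, which treats everything outside U+4E00 to U+9FFF as a separator, is an assumption, and `NON_CHINESE` and `split_substrings` are hypothetical names.

```python
import re

# Assumption: a "non-Chinese character" is anything outside the CJK Unified
# Ideographs range, which covers punctuation such as "," and "。".
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def split_substrings(text: str) -> list[str]:
    """Split on runs of non-Chinese characters and drop empty pieces."""
    return [part for part in NON_CHINESE.split(text) if part]


term = "床前明月光,疑是地上霜"
content = "床前明月光,疑是地上霜。举头望明月,低头思故乡。"
print(split_substrings(term))
print(split_substrings(content))
```

Filtering out empty pieces matters because the trailing "。" would otherwise leave an empty string at the end of the list.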
The search platform can first calculate the edit distance between the term's substring "床前明月光" and each of the four substrings of the retrieval object's content, "床前明月光", "疑是地上霜", "举头望明月", and "低头思故乡", obtaining $Ld_{11}$, $Ld_{12}$, $Ld_{13}$, and $Ld_{14}$, and take the minimum of these four values.
The search platform can then calculate the edit distance between the term's substring "疑是地上霜" and each of the same four substrings of the content, obtaining $Ld_{21}$, $Ld_{22}$, $Ld_{23}$, and $Ld_{24}$, and take the minimum of these four values.
The first similarity, between the term and the content of the retrieval object, is then:
$$S_c = \frac{\min(Ld_{11}, Ld_{12}, Ld_{13}, Ld_{14}) + \min(Ld_{21}, Ld_{22}, Ld_{23}, Ld_{24})}{2}$$
Using the same calculation as for the first similarity, the search platform can calculate the similarity between the term and the title and the author of the retrieval object, obtaining the second similarity and the third similarity respectively.
When calculating the similarity between the term and the retrieval object from the first, second, and third similarities, the search platform can assign a weight to each of the three similarities, or it can use a calculation defined by the developer, for example averaging the three similarities.
In the embodiments of the present application, the search platform can assign a corresponding weight to each of the first, second, and third similarities. It can then add together the product of the first similarity and its weight, the product of the second similarity and its weight, and the product of the third similarity and its weight, obtaining the similarity between the term and the retrieval object.
In implementation, the search platform can assign corresponding weights to the first, second, and third similarities according to a preset strategy, and then calculate the similarity from the three similarities and their respective weights.
The similarity is calculated as follows:
$$S_P = S_c \cdot W_c + S_t \cdot W_t + S_a \cdot W_a$$
Here, $S_c$ is the first similarity, between the term and the content of the retrieval object, calculated with the edit distance algorithm described above; $S_t$ is the second similarity, between the term and the title of the retrieval object; and $S_a$ is the third similarity, between the term and the author of the retrieval object. $W_c$, $W_t$, and $W_a$ are the weights corresponding to the first, second, and third similarities respectively.
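The weighted sum is simple enough to state directly in code. In the sketch below the function name `combined_similarity` is hypothetical and the numeric inputs are invented purely to show the arithmetic; the text leaves the actual weights to a preset strategy.

```python
def combined_similarity(s_c: float, s_t: float, s_a: float,
                        w_c: float, w_t: float, w_a: float) -> float:
    """S_P = S_c * W_c + S_t * W_t + S_a * W_a."""
    return s_c * w_c + s_t * w_t + s_a * w_a


# Illustrative values only: content matches exactly (distance 0), while the
# title and author fields are 5 and 2 edits away from the term.
score = combined_similarity(0.0, 5.0, 2.0, w_c=0.6, w_t=0.3, w_a=0.1)
print(round(score, 3))
```

Since each component is an averaged edit distance, the combined value is also a distance-like quantity.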
It should be noted that the search platform can use the similarity calculation described above to calculate the similarity between the term and each retrieval object in the retrieval object list, sort the resulting similarity values, and return the retrieval objects whose similarity is higher than a preset threshold to the client as the retrieval result, to be displayed to the user.
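The ranking and threshold step can be sketched as follows. Because the similarity here is expressed as an edit distance, with smaller meaning more similar, this sketch reads "similarity higher than a preset threshold" as "distance below a preset maximum"; that reading, the function name `select_results`, and the sample scores are all assumptions.

```python
def select_results(scores: dict[str, float], max_distance: float) -> list[tuple[str, float]]:
    """Keep objects whose distance-based score is below the cutoff, best first."""
    kept = [(obj_id, s) for obj_id, s in scores.items() if s < max_distance]
    # Ascending order: the smallest distance is the most similar object.
    return sorted(kept, key=lambda item: item[1])


# Hypothetical scores for three retrieval objects.
results = select_results({"poem_a": 0.0, "poem_b": 3.5, "poem_c": 1.0}, max_distance=2.0)
print(results)
```

If the score were instead converted to a value where larger means more similar, the comparison and the sort order would both flip.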
In this embodiment, the search platform can also adjust the weights corresponding to the first, second, and third similarities based on a preset weight adjustment strategy.
In implementation, the search platform can adjust the three weights using a weight adjustment strategy based on a prediction model, or using a method based on analysis of the term's length or of two-character surnames.
For the prediction-model-based weight adjustment strategy, in the embodiments of the present application the search platform uses a preset prediction model to calculate the probabilities that the term refers to the content, the title, or the author of a retrieval object. Based on these probabilities, it can adjust the weights of the first, second, and third similarities, which correspond to the content, the title, and the author respectively.
In implementation, the search platform can pass the term to the preset prediction model, which can use a preset statistical analysis algorithm to predict the probabilities that the term refers to the content, the title, or the author of a retrieval object.
The prediction model can be a big-data model created by applying the preset statistical analysis algorithm to existing data on users' searches for the content, titles, and authors of classical poems.
The preset statistical algorithm can be, for example, a regression algorithm or a neural network. These are given only as examples and are not intended as a specific limitation.
For the weight adjustment strategies based on term length analysis and two-character surname analysis, in the embodiments of the present application the search platform can determine the length of the term entered by the user and, when that length falls within a preset surname length range, raise the weight corresponding to the third similarity. Alternatively, it can match each of the two-character phrases cut from the term against a preset two-character surname list and, if any of them hits the list, raise the weight corresponding to the third similarity. For the analysis based on term length, in implementation the search platform first determines the length of the term and then checks it against the preset surname length range.
It should be noted that the developer can set the surname length range according to the specific circumstances of the application. For example, because Chinese names are generally shorter than four characters, the preset surname length range can be set to fewer than 4 characters. When the term entered by the user is shorter than 4 characters, it is quite likely to be the name of an author of a classical Chinese text, and the weight of the similarity between the term and the author of each retrieval object in the list, that is, the third similarity, can be raised. This is only an example of a preset surname length range and is not a specific limitation; in practice, the developer can adjust the range as needed.
For the analysis based on two-character surnames, in implementation the search platform can match each two-character phrase generated by adjacent-character segmentation of the term against a locally preset two-character surname list. If any of the two-character phrases hits the list, the weight corresponding to the third similarity can be raised.
For example, when the term entered by the user is "欧阳修" (Ouyang Xiu), adjacent-character segmentation yields "欧阳" and "阳修". The search platform can match "欧阳" and "阳修" in turn against the preset two-character surname list. If either of them hits the list, the weight of the third similarity can be raised.
It should be noted that the advantage of combining two-character surname matching with term length analysis is that it still works when the term entered by the user contains several kinds of information, for example a term that combines the word for "poem" with "Ouyang Xiu". Adjacent-character segmentation followed by two-character surname matching finds that "欧阳" hits the preset two-character surname list, so the term can still be judged more likely to refer to the author of a retrieval object. The weight corresponding to the third similarity is then raised, making the retrieval result more accurate.
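The two heuristics above can be sketched together as follows. The surname set, the boost factor of 2.0, the dictionary keys, the function name `adjust_weights`, and the sample term "欧阳修的诗" are all assumptions made for illustration; the text only says that the author weight is "raised".

```python
# Hypothetical sample of a two-character surname list.
TWO_CHAR_SURNAMES = {"欧阳", "司马", "诸葛"}


def adjust_weights(term: str, weights: dict[str, float],
                   max_name_len: int = 4, boost: float = 2.0) -> dict[str, float]:
    """Return a copy of `weights` with the author weight raised when the term looks like a name."""
    bigrams = [term[i:i + 2] for i in range(len(term) - 1)]
    short_enough = len(term) < max_name_len  # "fewer than 4 characters"
    has_surname = any(b in TWO_CHAR_SURNAMES for b in bigrams)
    adjusted = dict(weights)  # leave the caller's weights untouched
    if short_enough or has_surname:
        adjusted["author"] *= boost
    return adjusted


base = {"content": 1.0, "title": 1.0, "author": 1.0}
print(adjust_weights("李白", base))
print(adjust_weights("欧阳修的诗", base))
print(adjust_weights("床前明月光疑是地上霜", base))
```

The second call shows the case the text highlights: the term is too long for the length rule, but the phrase "欧阳" still triggers the surname rule.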
It should also be noted that in related retrieval methods, when the user does not qualify the term, for example does not specify that the term refers to an author, the search platform generally assumes by default that the term entered by the user targets the content of retrieval objects, that is, that the content of a retrieval object contains the term. When the user enters a term aimed at the author or the title of a retrieval object, such methods still search mainly by content, so the accuracy of the returned results is low.
In the embodiments of the present application, even when the user does not qualify the term, the search platform can still use the preset weight adjustment strategy to analyze automatically what the term entered by the user is most likely to refer to, for example an author's name or a title. The search platform can then raise the weight of the corresponding one of the first, second, and third similarities. Because the retrieval method provided by the present application analyzes this tendency automatically, it can effectively improve the accuracy of the retrieval results and the efficiency of retrieval.
For example, taking classical poetry retrieval, suppose the term entered by the user is "李白" (Li Bai) and the user has not specified a search by author. In related retrieval methods, the search platform returns classical poems whose content contains "李白". In the retrieval method provided by the present application, even if the user does not specify a search by author, the search platform can use the preset weight adjustment strategy to analyze the tendency of the term "李白". For instance, based on term length analysis, the search platform can determine that the length of "李白" falls within the preset surname length range and can therefore raise the weight corresponding to the similarity between the term and the author of the retrieval object. The accuracy of the retrieval results can thus be effectively improved.
In addition, in the embodiments of the present application, the retrieval database that stores the retrieval objects is kept in the physical memory of the device.
It should be noted that, for reasons such as limited memory capacity, the database in related retrieval methods is often stored on a hard disk. Because reading data from a hard disk is slower than reading it directly from memory, a retrieval method whose database is stored on a hard disk may be slowed down.
In the embodiments of the present application, because the amount of classical Chinese text is limited and the memory and processing power of modern computers are large, the retrieval database can be stored in the memory of the search platform, which greatly increases retrieval speed and improves retrieval efficiency.
The embodiments of the present application provide a retrieval method applied to a distributed search platform built from several retrieval servers. The distributed search platform can receive a term sent by a user through a retrieval client, traverse the term, and perform adjacent-character segmentation on it to generate several two-character phrases. Based on each of these phrases, the search platform can look up the corresponding retrieval objects in a preset retrieval database and generate a retrieval object list. It can then calculate, based on edit distance, the similarity between the term and each retrieval object in the list, and return the retrieval objects with the highest similarity to the retrieval client as the retrieval result, to be displayed to the user.
On the one hand, the method provided by the present application no longer segments the term with modern Chinese word segmentation; instead it performs adjacent-character segmentation on the term to generate two-character phrases. A term entered by the user for classical Chinese text can therefore be segmented effectively, the segmentation result is more reasonable, and the results retrieved with these phrases are more relevant to the term. The retrieval method provided by the present application can thus effectively improve the accuracy of classical Chinese text retrieval.
On the other hand, because the similarity is calculated from edit distance, the similarity calculation becomes quantitative, which greatly improves the accuracy of the retrieval results.
In addition, in the retrieval method provided by the present application, when calculating the similarity the search platform can use a preset weight adjustment strategy to adjust, according to the term entered by the user, the weights corresponding to the content, the author, and the title of the retrieval object, which can effectively improve the accuracy of the retrieval results.
Finally, in the retrieval method provided by the present application, the retrieval database is stored in local memory, which greatly increases retrieval speed and effectively improves retrieval efficiency.
The retrieval method described above is explained in detail below, taking classical poetry retrieval as an example scenario.
Assume the term entered by the user is "床前明月光,疑是地上霜".
After receiving this term, entered by the user through the client, the search platform can traverse it and perform adjacent-character segmentation, generating the two-character phrases "床前", "前明", "明月", "月光", "疑是", "是地", "地上", and "上霜".
The search platform can select the two-character phrase "床前" and look up the retrieval objects corresponding to it in the retrieval database (that is, the retrieval objects that contain "床前").
The search platform can then look up, in the same way, the retrieval objects corresponding to each of the other two-character phrases.
After these lookups are complete, the search platform can merge the retrieval objects corresponding to all the two-character phrases cut from the term and generate the retrieval object list, as shown in Table 2. Table 2 shows only part of the retrieval object list; it can also include other entries, such as the retrieval objects corresponding to the two-character phrase "疑是".
After the retrieval object list is generated, the search platform can calculate the similarity between each retrieval object in the list and the term. When calculating the similarity, the search platform can work along three dimensions, the content, the title, and the author of the retrieval object, obtaining the first, second, and third similarities respectively. It then calculates the similarity between the retrieval object and the term from these three similarities.
The calculation of the first similarity, between the term and the content of a retrieval object, is shown below, taking the retrieval object titled 《静夜思》 ("Quiet Night Thoughts") as an example.
The search platform can split the content of 《静夜思》 using a regular expression that matches non-Chinese characters, obtaining four substrings: "床前明月光", "疑是地上霜", "举头望明月", and "低头思故乡".
The search platform can first calculate the edit distance between the term's substring "床前明月光" and each of these four substrings, obtaining $Ld_{11}$, $Ld_{12}$, $Ld_{13}$, and $Ld_{14}$, and take the minimum of these four values.
The search platform can then calculate the edit distance between the term's substring "疑是地上霜" and each of the same four substrings, obtaining $Ld_{21}$, $Ld_{22}$, $Ld_{23}$, and $Ld_{24}$, and take the minimum of these four values.
The similarity between the term and the content of the retrieval object is then:
$$S_c = \frac{\min(Ld_{11}, Ld_{12}, Ld_{13}, Ld_{14}) + \min(Ld_{21}, Ld_{22}, Ld_{23}, Ld_{24})}{2}$$
Using the same calculation as for the first similarity, the search platform can calculate the similarity between the term and the title and the author of the retrieval object, obtaining the second similarity $S_t$ and the third similarity $S_a$ respectively.
After the first, second, and third similarities have been calculated, the search platform can assign a corresponding weight to each of them according to a preset strategy; the weights are $W_c$, $W_t$, and $W_a$ respectively.
Using the similarity formula, the search platform can calculate the similarity between the retrieval object titled 《静夜思》 and the term "床前明月光,疑是地上霜":
$$S_P = S_c \cdot W_c + S_t \cdot W_t + S_a \cdot W_a$$
In the same way as for the retrieval object titled 《静夜思》, the search platform can calculate the similarity between the term and each of the other retrieval objects in the retrieval object list. It then sorts the resulting similarity values and returns the retrieval objects whose similarity is higher than a preset threshold to the client as the retrieval result, to be displayed to the user.
Corresponding to the embodiments of the retrieval method above, the present application also provides embodiments of a retrieval device.
The embodiments of the retrieval device of the present application can be applied on a search platform. The device embodiments can be implemented in software, in hardware, or in a combination of the two. Taking a software implementation as an example, the device, as a logical entity, is formed when the processor of the search platform on which it resides reads the corresponding computer program instructions from non-volatile storage into memory and runs them. At the hardware level, Fig. 4 shows a hardware structure diagram of the search platform on which the retrieval device of the present application resides. In addition to the processor, memory, network interface, and non-volatile storage shown in Fig. 4, the search platform on which the device resides can include other hardware according to the actual function of the retrieval, which is not described further here.
Referring to Fig. 5, which is a block diagram of a retrieval device according to an exemplary embodiment of the present application, the device is applied to a distributed search platform built from several retrieval servers and includes:
Receiving unit 510, for receive user by retrieving the term that client sends;
Cutting unit 520, for traveling through the term, performs adjacent character segmentation to the term, generates several Double word phrase;
Searching unit 530, for based on described several double word phrases, if searching in default searching database and being somebody's turn to do Dry two-character word group distinguishes corresponding retrieval object, generates retrieval list object;Wherein, the searching database is stored in advance Several double words that several retrieval objects for including in the searching database are generated after carrying out adjacent character segmentation respectively Corresponding relation in phrase, with the searching database between the retrieval object comprising each double word phrase;
Computing unit 540, for calculating the similarity of the term and the retrieval object retrieved in list, by phase Retrieval client is back to as retrieval result like several retrieval objects of degree higher than predetermined threshold value, with aobvious to the user Show.
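The segmentation and lookup performed by units 520 and 530 can be illustrated with a minimal Python sketch. The function names `adjacent_bigrams` and `lookup_candidates`, the object identifiers and the hand-written `index` mapping are all assumptions made for illustration; the application does not prescribe any particular data structure.

```python
def adjacent_bigrams(text):
    """Return every pair of adjacent characters in text, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def lookup_candidates(query, index):
    """Collect, without duplicates, the objects mapped to any bigram of query."""
    seen = []
    for gram in adjacent_bigrams(query):
        for obj_id in index.get(gram, ()):
            if obj_id not in seen:
                seen.append(obj_id)
    return seen


# Hypothetical pre-built mapping from two-character phrase to object ids.
index = {
    "明月": ["poem-1", "poem-2"],
    "月光": ["poem-1"],
    "几时": ["poem-2"],
}

candidates = lookup_candidates("明月光", index)
```

Here the search term "明月光" yields the phrases "明月" and "月光", and every object mapped to either phrase enters the candidate list once.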
In another optional implementation, the device further includes:
Splitting unit 550, configured to split a preset number of retrieval objects in the database into several substrings, based on a regular expression that matches non-Chinese characters;
The segmenting unit 520 is further configured to segment each of the substrings using the adjacent-character segmentation method, generating several two-character phrases;
Storage unit 560, configured to establish mapping relations between the two-character phrases generated from the substrings and the retrieval objects corresponding to those two-character phrases, and to store them in the retrieval database with each two-character phrase as a retrieval primary key.
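As a rough sketch of how units 550, 520 and 560 could cooperate to build the retrieval database, the following Python splits each object on runs of non-Chinese characters and keys an in-memory mapping by two-character phrase. The character range `\u4e00-\u9fff`, the names `split_substrings` and `build_index`, and the sample objects are assumptions; the application only states that a regular expression for non-Chinese characters is used.

```python
import re
from collections import defaultdict

# Assumption: "Chinese character" is approximated by the CJK Unified
# Ideographs block; anything else (punctuation, digits, spaces) is a separator.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def split_substrings(text):
    """Split text on runs of non-Chinese characters, dropping empty pieces."""
    return [s for s in NON_CHINESE.split(text) if s]


def build_index(objects):
    """Map each two-character phrase to the set of object ids containing it."""
    index = defaultdict(set)
    for obj_id, text in objects.items():
        for sub in split_substrings(text):
            for i in range(len(sub) - 1):
                index[sub[i:i + 2]].add(obj_id)
    return index


# Hypothetical sample data.
objects = {
    "poem-1": "床前明月光，疑是地上霜。",
    "poem-2": "明月几时有？把酒问青天。",
}
index = build_index(objects)
```

Because segmentation runs inside each substring, no phrase spans a punctuation mark: "光疑" is never indexed for the first object.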
In another optional implementation, the computing unit 540 is specifically configured to calculate the similarity between the search term and the retrieval objects in the retrieval object list based on an edit distance algorithm;
The edit distance algorithm includes:
Splitting the search term into several substrings, based on the regular expression that matches non-Chinese characters;
Taking each of the substrings into which the search term is split as the target substring in turn;
Calculating the edit distance between the target substring and each of the substrings generated by splitting the retrieval object in the retrieval object list, and obtaining the minimum of the calculated edit distances;
Summing the minimum edit distances corresponding to all target substrings generated by splitting the search term, and then averaging the sum to obtain the similarity between the search term and the retrieval object; wherein the similarity is characterized by the edit distance: the smaller the edit distance, the higher the similarity.
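The steps above can be sketched in Python as follows. This is a minimal illustration that assumes the standard Levenshtein distance (unit-cost insertions, deletions and substitutions) as the edit distance and approximates Chinese characters by the range `\u4e00-\u9fff`; the names `edit_distance` and `average_min_distance` are hypothetical. The function returns the averaged distance itself, so a smaller value means a closer match.

```python
import re

# Assumption: separators are runs of characters outside the CJK block.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def split_substrings(text):
    """Split text on runs of non-Chinese characters, dropping empty pieces."""
    return [s for s in NON_CHINESE.split(text) if s]


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def average_min_distance(query, obj_text):
    """Average, over query substrings, of the minimum distance to any object substring."""
    targets = split_substrings(query)
    candidates = split_substrings(obj_text)
    if not targets or not candidates:
        return None  # nothing to compare
    total = sum(min(edit_distance(t, c) for c in candidates) for t in targets)
    return total / len(targets)
```

For example, a query of two lines in which one line matches exactly and the other differs by a single character averages to a distance of 0.5.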
In another optional implementation, the retrieval object includes the content of the retrieval object, the title of the retrieval object and the author of the retrieval object;
The computing unit 540 is further configured to: calculate, based on the edit distance algorithm, the similarity between the search term and the content of the retrieval object in the retrieval object list, obtaining a first similarity; calculate, based on the edit distance algorithm, the similarity between the search term and the title of the retrieval object in the retrieval object list, obtaining a second similarity; calculate, based on the edit distance algorithm, the similarity between the search term and the author of the retrieval object in the retrieval object list, obtaining a third similarity; and calculate the similarity between the search term and the retrieval object in the retrieval object list based on the first similarity, the second similarity and the third similarity.
In another optional implementation, the computing unit 540 is further configured to assign a corresponding weight to each of the first similarity, the second similarity and the third similarity, and to sum the product of the first similarity and its weight, the product of the second similarity and its weight, and the product of the third similarity and its weight, obtaining the similarity between the search term and the retrieval object.
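The weighted combination amounts to a three-term weighted sum. A minimal Python sketch follows; the function name `combined_similarity` and the default weights `(0.5, 0.3, 0.2)` are purely illustrative assumptions, since the application does not disclose concrete weight values.

```python
def combined_similarity(content_sim, title_sim, author_sim,
                        weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the content, title and author similarities.

    The default weights are hypothetical placeholders.
    """
    w_content, w_title, w_author = weights
    return (w_content * content_sim
            + w_title * title_sim
            + w_author * author_sim)


score = combined_similarity(0.8, 0.5, 1.0)
```

Keeping the weights as a parameter lets a separate adjustment step change them per query without touching the scoring function.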
In another optional implementation, the device further includes an adjustment unit 570, configured to adjust the weights corresponding to the first similarity, the second similarity and the third similarity based on a preset weight adjustment strategy.
In another optional implementation, the adjustment unit 570 is specifically configured to calculate, based on a preset prediction model, the probabilities that the search term is respectively the content, the title and the author of a retrieval object; and, based on the calculated probabilities, to adjust the weights of the first similarity, the second similarity and the third similarity, which correspond respectively to the content, title and author of the retrieval object.
In another optional implementation, the adjustment unit 570 is specifically configured to determine the length of the search term entered by the user and, when the length of the search term falls within a preset surname length range, to increase the weight corresponding to the third similarity; or to match the several two-character phrases into which the search term is segmented against a preset list of two-character surnames and, if any of those two-character phrases hits the list, to increase the weight corresponding to the third similarity.
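The length-based and surname-based adjustment can be sketched as below. Everything concrete here is an assumption for illustration: the sample surname list, the length range of 2 to 4 characters, the boost of 0.2, and the renormalization of the weights so that they sum to 1 are not taken from the application, which only says that the weight of the third (author) similarity is increased.

```python
# Hypothetical sample of two-character (compound) surnames.
COMPOUND_SURNAMES = {"欧阳", "司马", "诸葛"}
# Hypothetical preset length range: 2 to 4 characters.
NAME_LENGTH_RANGE = range(2, 5)


def adjust_weights(query, weights, boost=0.2):
    """Raise the author weight when the query looks like a person's name."""
    w_content, w_title, w_author = weights
    bigrams = [query[i:i + 2] for i in range(len(query) - 1)]
    looks_like_name = (len(query) in NAME_LENGTH_RANGE
                       or any(g in COMPOUND_SURNAMES for g in bigrams))
    if looks_like_name:
        w_author += boost
    # Assumption: renormalize so the three weights still sum to 1.
    total = w_content + w_title + w_author
    return (w_content / total, w_title / total, w_author / total)
```

With base weights `(0.5, 0.3, 0.2)`, a query such as "欧阳修" raises the author weight to one third, while a ten-character line of verse leaves the weights unchanged.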
In another optional implementation, the retrieval object is a classical Chinese poem;
The device further includes:
Deletion unit 580, configured to delete the non-Chinese characters and interjection words from the substrings into which the retrieval object has been split, before the substrings are segmented by the adjacent-character segmentation method to generate the two-character phrases.
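A minimal Python sketch of this cleaning step is shown below. The interjection list `INTERJECTIONS` is a hypothetical sample, the character range `\u4e00-\u9fff` is an approximation of "Chinese character", and the name `clean_substring` is an assumption; the application does not enumerate which interjection words are removed.

```python
import re

# Assumption: any single character outside the CJK block counts as non-Chinese.
NON_CHINESE_CHAR = re.compile(r"[^\u4e00-\u9fff]")
# Hypothetical sample of interjection words common in classical verse.
INTERJECTIONS = ("兮", "哉", "乎")


def clean_substring(sub):
    """Remove non-Chinese characters, then the listed interjection words."""
    sub = NON_CHINESE_CHAR.sub("", sub)
    for word in INTERJECTIONS:
        sub = sub.replace(word, "")
    return sub


cleaned = clean_substring("大风起兮云飞扬!")
```

Cleaning before segmentation means the two-character phrases are built from the remaining characters only, so a phrase such as "起云" can be indexed for this line.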
For details of how the functions and effects of each unit in the above device are realized, refer to the realization of the corresponding steps in the above method; they are not repeated here.
Since the device embodiments essentially correspond to the method embodiments, refer to the description of the method embodiments for the relevant parts. The device embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this application. A person of ordinary skill in the art can understand and implement it without creative effort.
The above are merely preferred embodiments of this application and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of this application shall fall within the scope of protection of this application.

Claims (19)

1. A retrieval method, characterized in that the method is applied to a distributed retrieval platform built on several retrieval servers, the method comprising:
Receiving a search term sent by a user through a retrieval client;
Traversing the search term and performing adjacent-character segmentation on the search term to generate several two-character phrases;
Looking up, in a preset retrieval database and based on the several two-character phrases, the retrieval objects corresponding to each of the two-character phrases, and generating a retrieval object list; wherein the retrieval database stores in advance the correspondence between the two-character phrases generated by performing adjacent-character segmentation on each of the retrieval objects contained in the retrieval database and the retrieval objects in the retrieval database that contain each two-character phrase;
Calculating the similarity between the search term and the retrieval objects in the retrieval object list, and returning the retrieval objects whose similarity is higher than a preset threshold to the retrieval client as the retrieval result, for display to the user.
2. The method according to claim 1, characterized in that the method further comprises:
Splitting a preset number of retrieval objects in the database into several substrings, based on a regular expression that matches non-Chinese characters;
Segmenting each of the substrings using the adjacent-character segmentation method to generate several two-character phrases;
Establishing mapping relations between the two-character phrases generated from the substrings and the retrieval objects corresponding to those two-character phrases, and storing them in the retrieval database with each two-character phrase as a retrieval primary key.
3. The method according to claim 2, characterized in that calculating the similarity between the search term and the retrieval objects in the retrieval object list comprises:
Calculating the similarity between the search term and the retrieval objects in the retrieval object list based on an edit distance algorithm;
The edit distance algorithm includes:
Splitting the search term into several substrings, based on the regular expression that matches non-Chinese characters;
Taking each of the substrings into which the search term is split as the target substring in turn;
Calculating the edit distance between the target substring and each of the substrings generated by splitting the retrieval object in the retrieval object list, and obtaining the minimum of the calculated edit distances;
Summing the minimum edit distances corresponding to all target substrings generated by splitting the search term, and then averaging the sum to obtain the similarity between the search term and the retrieval object; wherein the similarity is characterized by the edit distance: the smaller the edit distance, the higher the similarity.
4. The method according to claim 3, characterized in that the retrieval object includes the content of the retrieval object, the title of the retrieval object and the author of the retrieval object;
Calculating the similarity between the search term and the retrieval objects in the retrieval object list based on the edit distance algorithm comprises:
Calculating, based on the edit distance algorithm, the similarity between the search term and the content of the retrieval object in the retrieval object list, obtaining a first similarity;
Calculating, based on the edit distance algorithm, the similarity between the search term and the title of the retrieval object in the retrieval object list, obtaining a second similarity;
Calculating, based on the edit distance algorithm, the similarity between the search term and the author of the retrieval object in the retrieval object list, obtaining a third similarity;
Calculating the similarity between the search term and the retrieval object in the retrieval object list based on the first similarity, the second similarity and the third similarity.
5. The method according to claim 4, characterized in that calculating the similarity between the search term and the retrieval object in the retrieval object list based on the first similarity, the second similarity and the third similarity comprises:
Assigning a corresponding weight to each of the first similarity, the second similarity and the third similarity;
Summing the product of the first similarity and its corresponding weight, the product of the second similarity and its corresponding weight, and the product of the third similarity and its corresponding weight, obtaining the similarity between the search term and the retrieval object.
6. The method according to claim 5, characterized in that the method further comprises:
Adjusting the weights corresponding to the first similarity, the second similarity and the third similarity based on a preset weight adjustment strategy.
7. The method according to claim 6, characterized in that adjusting the weights corresponding to the first similarity, the second similarity and the third similarity based on the preset weight adjustment strategy comprises:
Calculating, based on a preset prediction model, the probabilities that the search term is respectively the content, the title and the author of a retrieval object;
Adjusting, based on the calculated probabilities, the weights of the first similarity, the second similarity and the third similarity, which correspond respectively to the content, title and author of the retrieval object.
8. The method according to claim 6, characterized in that adjusting the weights corresponding to the first similarity, the second similarity and the third similarity based on the preset weight adjustment strategy comprises:
Determining the length of the search term entered by the user and, when the length of the search term falls within a preset surname length range, increasing the weight corresponding to the third similarity; or,
Matching the several two-character phrases into which the search term is segmented against a preset list of two-character surnames and, if any of those two-character phrases hits the list, increasing the weight corresponding to the third similarity.
9. The method according to claim 2, characterized in that the retrieval object is a classical Chinese poem;
The method further comprises:
Before the substrings are segmented by the adjacent-character segmentation method to generate the two-character phrases, deleting the non-Chinese characters and interjection words from the substrings into which the retrieval object has been split.
10. The method according to claim 1, characterized in that the retrieval database storing the retrieval objects is held in the physical memory of the local device.
11. A retrieval device, characterized in that the device is applied to a distributed retrieval platform built on several retrieval servers, the device comprising:
Receiving unit, configured to receive a search term sent by a user through a retrieval client;
Segmenting unit, configured to traverse the search term and perform adjacent-character segmentation on it, generating several two-character phrases;
Lookup unit, configured to look up, in a preset retrieval database and based on the several two-character phrases, the retrieval objects corresponding to each of those two-character phrases, and to generate a retrieval object list; wherein the retrieval database stores in advance the correspondence between the two-character phrases generated by performing adjacent-character segmentation on each of the retrieval objects contained in the retrieval database and the retrieval objects in the retrieval database that contain each two-character phrase;
Computing unit, configured to calculate the similarity between the search term and the retrieval objects in the retrieval object list, and to return the retrieval objects whose similarity is higher than a preset threshold to the retrieval client as the retrieval result, for display to the user.
12. The device according to claim 11, characterized in that the device further comprises:
Splitting unit, configured to split a preset number of retrieval objects in the database into several substrings, based on a regular expression that matches non-Chinese characters;
The segmenting unit is further configured to segment each of the substrings using the adjacent-character segmentation method, generating several two-character phrases;
Storage unit, configured to establish mapping relations between the two-character phrases generated from the substrings and the retrieval objects corresponding to those two-character phrases, and to store them in the retrieval database with each two-character phrase as a retrieval primary key.
13. The device according to claim 12, characterized in that the computing unit is specifically configured to calculate the similarity between the search term and the retrieval objects in the retrieval object list based on an edit distance algorithm;
The edit distance algorithm includes:
Splitting the search term into several substrings, based on the regular expression that matches non-Chinese characters;
Taking each of the substrings into which the search term is split as the target substring in turn;
Calculating the edit distance between the target substring and each of the substrings generated by splitting the retrieval object in the retrieval object list, and obtaining the minimum of the calculated edit distances;
Summing the minimum edit distances corresponding to all target substrings generated by splitting the search term, and then averaging the sum to obtain the similarity between the search term and the retrieval object; wherein the similarity is characterized by the edit distance: the smaller the edit distance, the higher the similarity.
14. The device according to claim 13, characterized in that the retrieval object includes the content of the retrieval object, the title of the retrieval object and the author of the retrieval object;
The computing unit is further configured to: calculate, based on the edit distance algorithm, the similarity between the search term and the content of the retrieval object in the retrieval object list, obtaining a first similarity; calculate, based on the edit distance algorithm, the similarity between the search term and the title of the retrieval object in the retrieval object list, obtaining a second similarity; calculate, based on the edit distance algorithm, the similarity between the search term and the author of the retrieval object in the retrieval object list, obtaining a third similarity; and calculate the similarity between the search term and the retrieval object in the retrieval object list based on the first similarity, the second similarity and the third similarity.
15. The device according to claim 14, characterized in that the computing unit is further configured to assign a corresponding weight to each of the first similarity, the second similarity and the third similarity, and to sum the product of the first similarity and its corresponding weight, the product of the second similarity and its corresponding weight, and the product of the third similarity and its corresponding weight, obtaining the similarity between the search term and the retrieval object.
16. The device according to claim 15, characterized in that the device further comprises:
Adjustment unit, configured to adjust the weights corresponding to the first similarity, the second similarity and the third similarity based on a preset weight adjustment strategy.
17. The device according to claim 16, characterized in that the adjustment unit is specifically configured to calculate, based on a preset prediction model, the probabilities that the search term is respectively the content, the title and the author of a retrieval object; and, based on the calculated probabilities, to adjust the weights of the first similarity, the second similarity and the third similarity, which correspond respectively to the content, title and author of the retrieval object.
18. The device according to claim 16, characterized in that the adjustment unit is specifically configured to determine the length of the search term entered by the user and, when the length of the search term falls within a preset surname length range, to increase the weight corresponding to the third similarity; or to match the several two-character phrases into which the search term is segmented against a preset list of two-character surnames and, if any of those two-character phrases hits the list, to increase the weight corresponding to the third similarity.
19. The device according to claim 12, characterized in that the retrieval object is a classical Chinese poem;
The device further comprises:
Deletion unit, configured to delete the non-Chinese characters and interjection words from the substrings into which the retrieval object has been split, before the substrings are segmented by the adjacent-character segmentation method to generate the two-character phrases.
CN201611048869.9A 2016-11-21 2016-11-21 Retrieval method and device Active CN106528846B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611048869.9A CN106528846B (en) 2016-11-21 2016-11-21 Retrieval method and device

Publications (2)

Publication Number Publication Date
CN106528846A true CN106528846A (en) 2017-03-22
CN106528846B CN106528846B (en) 2019-09-17

Family

ID=58356998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611048869.9A Active CN106528846B (en) 2016-11-21 2016-11-21 Retrieval method and device

Country Status (1)

Country Link
CN (1) CN106528846B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101131325A (en) * 2006-08-25 2008-02-27 高德软件有限公司 Electronic navigation system information searching method and device thereof
CN102646124A (en) * 2012-02-27 2012-08-22 杨志远 Method for automatically identifying address information
CN106326233A (en) * 2015-06-18 2017-01-11 阿里巴巴集团控股有限公司 Address prompting method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FANGMEI LIU et al.: "Research on Chineses word segmentation based on matrix restraint", 《IBERIAN JOURNAL OF INFORMATION SYSTEMS AND TECHNOLOGIES》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108376170A (en) * 2018-02-27 2018-08-07 广州酷狗计算机科技有限公司 The method and apparatus for determining retrieval result
CN108520002A (en) * 2018-03-12 2018-09-11 平安科技(深圳)有限公司 Data processing method, server and computer storage media
CN108564086A (en) * 2018-03-17 2018-09-21 深圳市极客思索科技有限公司 A kind of the identification method of calibration and device of character string
CN108564086B (en) * 2018-03-17 2024-05-10 上海柯渡医学科技股份有限公司 Character string identification and verification method and device
CN111194457A (en) * 2018-07-31 2020-05-22 株式会社艾飒木兰 Patent evaluation determination method, patent evaluation determination device, and patent evaluation determination program
CN111179935B (en) * 2018-11-12 2022-06-28 中移(杭州)信息技术有限公司 Voice quality inspection method and device
CN111179935A (en) * 2018-11-12 2020-05-19 中移(杭州)信息技术有限公司 Voice quality inspection method and device
CN112100355A (en) * 2020-09-17 2020-12-18 中国建设银行股份有限公司 Intelligent interaction method, device and equipment
CN112434137A (en) * 2020-12-11 2021-03-02 乐山师范学院 Poetry retrieval method and system based on artificial intelligence
CN112434137B (en) * 2020-12-11 2023-04-11 乐山师范学院 Poetry retrieval method and system based on artificial intelligence
JP2022185581A (en) * 2021-06-02 2022-12-14 ネイバー コーポレーション Method for providing individual data retrieval service, computer device and computer program
JP7377915B2 (en) 2021-06-02 2023-11-10 ネイバー コーポレーション Method, computer device, and computer program for providing personalized data retrieval service
CN113793611A (en) * 2021-08-27 2021-12-14 上海浦东发展银行股份有限公司 Scoring method, scoring device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN106528846B (en) 2019-09-17

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant