CN102890690B - Target information search method and device - Google Patents

Info

Publication number
CN102890690B
Authority
CN
China
Prior art keywords
character
segmenter
weights
character string
current class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110207333.8A
Other languages
Chinese (zh)
Other versions
CN102890690A (en)
Inventor
王�琦
左杨眉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Application filed by ZTE Corp
Priority to CN201110207333.8A
Publication of CN102890690A
Application granted
Publication of CN102890690B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a target information search method and a target information search device. The method comprises the following steps of: receiving a word segmentation device selected by a user and a character string input by the user, wherein the word segmentation device is matched with the character string input by the user; performing word segmentation on the character string by using the word segmentation device to acquire a search word; and inputting the acquired search word into a search engine, and searching to acquire target information. By the method and the device, the problem of an inaccurate search result of the conventional search engine is solved, convenience is brought to a user, and retrieval quality is improved.

Description

Target information search method and device
Technical field
The present invention relates to the field of information search, and in particular to a target information search method and device.
Background
Search engine technology is being applied in more and more IT systems, and the data in search engine index libraries is growing exponentially as a result. As Chinese documents keep accumulating in an index library, more and more Chinese words enter it. Once new words and special vocabulary (such as personal names or terms of a specific field) enter the segmentation dictionary, they greatly reduce the accuracy of the word segmenter, so that many Chinese sentences can no longer be decomposed correctly according to their meaning. Take the Chinese sentence "ion cloud integrated distribution": if the technical term "ion cloud" receives no extra handling, the segmenter will split the sentence without keeping "ion cloud" together as one word, and with such a segmentation result the search engine cannot find the material the user wants.
It can be seen that current search methods still cannot perform word segmentation according to the user's search target, so the segmentation result does not match the user's retrieval purpose. In addition, the segmentation result is not comprehensive enough, so some key search conditions cannot be extracted from the character string input by the user.
No effective solution has yet been proposed for the problem that the search results of search engines in the related art are inaccurate.
Summary of the invention
The main purpose of the present invention is to provide a target information search method and device, so as to at least solve the above problem that the search results of search engines are inaccurate.
According to one aspect of the present invention, a target information search method is provided, comprising the following steps: receiving a segmenter selected by a user and a character string input by the user, wherein the segmenter is one that matches the character string input by the user; performing word segmentation on the character string with the segmenter to obtain search terms; and inputting the obtained search terms into a search engine to search and obtain target information.
Before receiving the segmenter selected by the user and the character string input by the user, the method further comprises: building the segmenter corresponding to a technical field from the classified documents corresponding to that technical field.
Building the segmenter corresponding to a technical field from the corresponding classified documents comprises: classifying the technical fields and determining the classified documents corresponding to the current category; calculating the weight of each character in the current category according to the frequency with which each character occurs in the classified documents; determining the weight in the current category of each character of a designated character string in the current category; calculating the weight of the designated character string in the current category from the weights of its characters; and binding the designated character string to its weight in the current category to obtain the segmenter of the current category.
Calculating the weight of each character in the current category according to the frequency with which each character occurs in the classified documents comprises: deleting the stop words in the classified documents; counting the frequency with which each character occurs in the classified documents after the stop words are deleted; counting the document frequency of each character, that is, the number of classified documents that contain it; and calculating the weight of each character in the current category from the character's frequency, its document frequency, and the total number of classified documents.
Determining the weight in the current category of each character of a designated character string comprises: when a designated character string in the current category contains a character that does not appear in the classified documents, setting the weight of that character to a default weight.
The character is one of the following: a Chinese character, a Korean character, or a Japanese character.
According to another aspect of the present invention, a target information search device is provided, comprising the following modules: a receiving module, configured to receive a segmenter selected by a user and a character string input by the user, wherein the segmenter is one that matches the character string input by the user; a word segmentation module, configured to segment the character string with the segmenter received by the receiving module to obtain search terms; and a search module, configured to input the search terms obtained by the word segmentation module into a search engine to search and obtain target information.
The device further comprises: a segmenter building module, configured to build the segmenter corresponding to a technical field from the classified documents corresponding to that technical field.
The segmenter building module comprises: a document determining unit, configured to classify the technical fields and determine the classified documents corresponding to the current category; a character weight calculation unit, configured to calculate the weight of each character in the current category according to the frequency with which each character occurs in the classified documents determined by the document determining unit; a weight determining unit, configured to determine the weight in the current category of each character of a designated character string in the current category; a character string weight calculation unit, configured to calculate the weight of the designated character string in the current category from the weights of its characters; and a segmenter building unit, configured to bind the designated character string to its weight in the current category to obtain the segmenter of the current category.
The character weight calculation unit comprises: a deletion subunit, configured to delete the stop words in the classified documents; a statistics subunit, configured to count the frequency with which each character occurs in the classified documents after the deletion subunit has deleted the stop words, and to count the document frequency of each character in the classified documents; and a character string computation subunit, configured to calculate the weight of each character in the current category from the character's frequency, its document frequency, and the total number of classified documents.
With the present invention, word segmentation is performed by a segmenter that matches the character string input by the user, so each word can be extracted accurately from that string. Searching with the segmented words yields target information that meets the user's expectations, which solves the problem of inaccurate search results in existing search engines, makes searching more convenient for the user, and improves retrieval quality.
Description of the drawings
The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The exemplary embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flow chart of the target information search method according to Embodiment 1 of the present invention;
Fig. 2 is a structural block diagram of the target information search device according to Embodiment 2 of the present invention;
Fig. 3 is a detailed structural block diagram of the target information search device according to Embodiment 2 of the present invention;
Fig. 4 is a detailed structural block diagram of the target information search device according to Embodiment 2 of the present invention;
Fig. 5 is a structural block diagram of the weight generation module according to Embodiment 2 of the present invention;
Fig. 6 is a flow chart of a target information search method that uses the device shown in Fig. 4, according to Embodiment 2 of the present invention;
Fig. 7 is a flow chart of a target information search method that uses the device shown in Fig. 4, according to Embodiment 2 of the present invention;
Fig. 8 is a flow chart of a target information search method that uses the device shown in Fig. 4, according to Embodiment 2 of the present invention;
Fig. 9 is a schematic diagram of the target information search system according to Embodiment 2 of the present invention.
Detailed description of the embodiments
The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments. It should be noted that, where no conflict arises, the embodiments of this application and the features in them may be combined with one another.
The embodiments of the present invention take into account that current search engines do not retrieve information according to technical field, which makes search results inaccurate, and provide a target information search method and device. The method and device let a search engine use different segmentation models for different categories in different fields, which can improve segmentation accuracy. They are applicable to fields such as search engines, word segmentation, and WEB application systems.
Embodiment 1
This embodiment provides a target information search method. Referring to Fig. 1, the method comprises the following steps:
Step S102: receive the segmenter selected by the user and the character string input by the user, wherein the segmenter is one that matches the character string input by the user;
Here, matching means that the technical field of the segmenter is consistent with the technical field of the character string input by the user;
Step S104: perform word segmentation on the character string with the segmenter to obtain search terms;
Step S106: input the obtained search terms into the search engine to search and obtain target information.
In this embodiment, word segmentation is performed by a segmenter that matches the character string input by the user, so each word can be extracted accurately from that string. Searching with the segmented words yields target information that meets the user's expectations, which solves the problem of inaccurate search results in existing search engines, makes searching more convenient for the user, and improves retrieval quality.
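Steps S102 to S106 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the type aliases, the function name `search_target_information`, and the toy segmenter and search engine are hypothetical and do not come from the patent text.

```python
from typing import Callable, Dict, List

# Hypothetical types: a segmenter maps a string to search terms,
# and a search engine maps search terms to matching documents.
Segmenter = Callable[[str], List[str]]
SearchEngine = Callable[[List[str]], List[str]]


def search_target_information(
    segmenters: Dict[str, Segmenter],
    selected_field: str,
    query: str,
    search_engine: SearchEngine,
) -> List[str]:
    """Receive the selection and the string, segment, then search."""
    segmenter = segmenters[selected_field]  # S102: segmenter chosen by the user
    search_terms = segmenter(query)         # S104: segment the input string
    return search_engine(search_terms)      # S106: search with the terms


# Toy stand-ins, for illustration only.
def physics_segmenter(text: str) -> List[str]:
    return text.split()


def toy_engine(terms: List[str]) -> List[str]:
    corpus = ["ion cloud distribution in plasma", "cloud computing basics"]
    return [doc for doc in corpus if all(term in doc for term in terms)]


results = search_target_information(
    {"physics": physics_segmenter}, "physics", "ion cloud", toy_engine
)
```

The whitespace split only stands in for a real field-specific segmenter; the point of the sketch is the order of the three steps and the fact that the segmenter is chosen by the user before segmentation.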
To improve segmentation accuracy, before receiving the segmenter selected by the user and the character string input by the user, the method further comprises: building the segmenter corresponding to a technical field from the classified documents corresponding to that technical field.
Building the segmenter corresponding to a technical field from the corresponding classified documents comprises the following steps:
1) classify the technical fields and determine the classified documents corresponding to the current category;
2) calculate the weight of each character in the current category according to the frequency with which each character occurs in the classified documents;
3) determine the weight in the current category of each character of a designated character string in the current category;
4) calculate the weight of the designated character string in the current category from the weights of its characters;
5) bind the designated character string to its weight in the current category to obtain the segmenter of the current category.
The weight of each character in the current category can be calculated as follows: delete the stop words in the classified documents; count the frequency with which each character occurs in the classified documents after the stop words are deleted; count the document frequency of each character, that is, the number of classified documents that contain it; and calculate the weight of each character in the current category from the character's frequency, its document frequency, and the total number of classified documents. Of course, in practice the stop words may also be left in place and the frequency of each character in the classified documents counted directly. The stop words can be set in advance, for example articles, conjunctions, or auxiliary words.
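The text names the three inputs to the character weight (character frequency, document frequency, and total number of documents) but does not give the formula that combines them. The sketch below assumes a TF-IDF-style combination purely for illustration; the function name `character_weights` and the formula itself are assumptions, not the patent's method.

```python
import math
from typing import Dict


def character_weights(
    char_freq: Dict[str, float],
    doc_freq: Dict[str, int],
    total_docs: int,
) -> Dict[str, float]:
    """Combine frequency, document frequency and document count per character.

    Assumed formula (not given in the source text):
        weight = frequency * log(1 + total_docs / document_frequency)
    """
    weights: Dict[str, float] = {}
    for ch, freq in char_freq.items():
        df = doc_freq.get(ch, 0)
        if df <= 0:
            continue  # no document contains the character, so no weight here
        weights[ch] = freq * math.log(1 + total_docs / df)
    return weights


weights = character_weights(
    char_freq={"云": 0.5, "子": 0.25},
    doc_freq={"云": 2, "子": 4},
    total_docs=4,
)
```

Any other monotone combination of the three quantities would fit the description equally well; the choice affects only the relative ranking of characters within a category.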
When a designated character string in the current category contains a character that does not appear in the classified documents, the weight of that character is set to a default weight.
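Steps 3) to 5) and the default weight rule can be sketched as follows. The text does not say how the character weights are combined into a string weight, so the arithmetic mean used here is an assumption, as are the names `string_weight` and `build_category_dictionary` and the value of `DEFAULT_WEIGHT`. The Chinese strings are illustrative data only.

```python
from typing import Dict, Iterable

DEFAULT_WEIGHT = 0.01  # assumed value for characters absent from the category


def string_weight(
    word: str, char_weights: Dict[str, float], default: float = DEFAULT_WEIGHT
) -> float:
    """Weight of a designated string (assumption: mean of its character weights)."""
    total = sum(char_weights.get(ch, default) for ch in word)
    return total / len(word)


def build_category_dictionary(
    words: Iterable[str], char_weights: Dict[str, float]
) -> Dict[str, float]:
    """Bind each designated string to its weight in the current category."""
    return {word: string_weight(word, char_weights) for word in words if word}


# "集" is missing from the category, so it falls back to DEFAULT_WEIGHT.
char_weights = {"离": 0.8, "子": 0.6, "云": 0.7}
weighted = build_category_dictionary(["离子云", "云集"], char_weights)
```

The resulting mapping from string to weight is what the text calls the binding; a segmenter for the category would consult it when choosing between competing splits.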
The character is one of the following: a Chinese character, a Korean character, or a Japanese character.
After the segmenter for each technical field has been built, a set of relatively specialized segmenters is obtained. These segmenters can be displayed on the search engine's interface for the user to choose from. Taking Chinese characters as an example, the target information search method comprises the following steps:
Step 1: perform Chinese character frequency analysis on the documents contained in the category.
Step 2: apply probability distribution processing to the frequencies of the Chinese characters contained in the category, and calculate the weight of each of those characters in the category.
Step 3: from the weights in the category of the Chinese characters contained in the category, calculate the weight in the category of each word in the segmenter dictionary.
Step 4: input the weights in the category of the words in the segmenter dictionary into the segmenter, turning it into the special segmenter of the category.
Step 5: offer the special segmenters built for the categories to the user; the user selects from them the one that best suits the retrieval purpose, and that special segmenter provides the word segmentation service for the search engine.
Step 6: the user inputs a search condition; the special segmenter segments it and outputs the segmentation result; the search engine performs a full-text search on the result and returns the retrieval result to the user.
On a WEB page on the Internet, the user selects the segmenter that best matches the search target and inputs a Chinese character string. The system segments the string with the segmenter the user specified, outputs the Chinese words that best fit the user's search purpose, and hands those words to the search engine for processing.
This embodiment can provide a special segmenter for each category in the document library. Taking Chinese characters as an example, it gathers probability statistics on the number of occurrences of each Chinese character in the classified documents, calculates the weight of each Chinese character in the category, and from those character weights calculates the weight in the category of each Chinese word in the segmenter dictionary, and then builds a special segmenter for each category. In the segmenter selection interface, the user selects the special segmenter that best suits the search purpose and uses it to obtain the best segmentation result for that purpose. This improves the search accuracy of the search engine and the user's satisfaction with it.
Embodiment 2
This embodiment also provides a target information search device. Referring to Fig. 2, the device comprises the following modules:
a receiving module 22, configured to receive the segmenter selected by the user and the character string input by the user, wherein the segmenter is one that matches the character string input by the user;
a word segmentation module 24, connected to the receiving module 22 and configured to segment the character string with the segmenter received by the receiving module 22 to obtain search terms;
a search module 26, connected to the word segmentation module 24 and configured to input the search terms obtained by the word segmentation module 24 into the search engine to search and obtain target information.
In this embodiment, word segmentation is performed by a segmenter that matches the character string input by the user, so each word can be extracted accurately from that string. Searching with the segmented words yields target information that meets the user's expectations, which solves the problem of inaccurate search results in existing search engines, makes searching more convenient for the user, and improves retrieval quality.
To improve segmentation accuracy, referring to Fig. 3, the device further comprises a segmenter building module 32, connected to the receiving module 22 and configured to build the segmenter corresponding to a technical field from the classified documents corresponding to that technical field.
The segmenter building module 32 comprises: a document determining unit, configured to classify the technical fields and determine the classified documents corresponding to the current category; a character weight calculation unit, configured to calculate the weight of each character in the current category according to the frequency with which each character occurs in the classified documents determined by the document determining unit; a weight determining unit, configured to determine the weight in the current category of each character of a designated character string in the current category; a character string weight calculation unit, configured to calculate the weight of the designated character string in the current category from the weights of its characters; and a segmenter building unit, configured to bind the designated character string to its weight in the current category to obtain the segmenter of the current category.
Preferably, the character weight calculation unit comprises: a deletion subunit, configured to delete the stop words in the classified documents; a statistics subunit, configured to count the frequency with which each character occurs in the classified documents after the deletion subunit has deleted the stop words, and to count the document frequency of each character in the classified documents; and a character string computation subunit, configured to calculate the weight of each character in the current category from the character's frequency, its document frequency, and the total number of classified documents.
The device provided by this embodiment can build a special segmenter for each category in the classified document library. From the many category-specific segmenters, the user can select the one that best suits the query target, and that segmenter supplies the search engine with the segmentation result that best suits the query target, which improves the search precision of the search engine.
Taking Chinese characters as the characters above, this embodiment also provides another target information search device, comprising the following modules:
(1) a Chinese character frequency collection module, (2) a Chinese character weight calculation module, (3) a Chinese word weight generation module, (4) a special segmenter, (5) a segmenter selection module, and (6) a retrieval request preprocessing module. The functions of the modules are as follows:
The Chinese character frequency collection module calculates, for each category, the frequency of occurrence of each Chinese character in that category.
The Chinese character weight calculation module takes the frequency of occurrence of each Chinese character in the category as its basis, calculates the probability of occurrence of each Chinese character in the category, and normalizes the frequencies to obtain the weight of each Chinese character in the category.
From the frequencies of occurrence of all Chinese characters contained in the category, this module can calculate the weights of all those characters in the category.
The Chinese character frequency collection module and the Chinese character weight calculation module together correspond to the character weight calculation unit described above. The Chinese character frequency collection module can collect the frequencies of occurrence of all Chinese characters contained in the category.
The Chinese word weight generation module takes the Chinese character weights in the category as its basis and calculates, for each word in the segmenter dictionary, its weight in the category.
From the weights in the category of all Chinese characters contained in the category, this module can calculate the weight in the category of each Chinese word in the segmentation dictionary.
The special segmenter: a general segmenter is created for the category, and the weights of all Chinese words of the category are passed into it, turning the general segmenter into the special segmenter of the category. The special segmenter uses the segmenter dictionary and the weights of all Chinese words of the category as the basis for segmentation.
It can be seen that the special segmenter of this embodiment is built on top of a general segmenter: inputting the weights of all Chinese words of the category into the general segmenter turns it into the special segmenter of the category.
The segmenter selection module presents the special segmenters that have been built for the categories to the user, and the user selects one of them to provide the Chinese word segmentation service for the search engine.
Through the segmenter selection module, the user can select the special segmenter that best matches the search purpose.
The retrieval request preprocessing module receives the Chinese character string input by the user, feeds it into the special segmenter selected by the user, obtains the segmentation result from that segmenter, assembles the result into a query condition, and inputs the query condition into the search engine.
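The last module's job of turning a segmentation result into a query condition might look like the sketch below. The text does not define the query syntax, so the quoted terms joined by `AND` are an assumed, hypothetical format, and `build_query` and `preprocess_request` are illustrative names.

```python
from typing import Callable, List


def build_query(terms: List[str], operator: str = "AND") -> str:
    """Assemble search terms into one query string (assumed syntax)."""
    quoted = ['"' + term.replace('"', '\\"') + '"' for term in terms if term]
    return f" {operator} ".join(quoted)


def preprocess_request(text: str, segmenter: Callable[[str], List[str]]) -> str:
    """Segment the user's string with the selected segmenter, then build the query."""
    return build_query(segmenter(text))


query = build_query(["ion cloud", "distribution"])
```

Quoting each term keeps a multi-character word together when the query reaches the search engine, which is the reason for segmenting with a field-specific dictionary in the first place.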
Taking Chinese character input as an example, this embodiment provides a target information search device that can be installed in a search engine server 40. Referring to Fig. 4, the device consists of the following modules:
(1) a weight generation module 41;
(2) a special segmenter 42, connected to the weight generation module 41;
(3) a segmenter selection module 43, connected to the special segmenter 42;
(4) a retrieval request preprocessing module 44, connected to the segmenter selection module 43 and the network;
(5) a search engine 45, connected to the retrieval request preprocessing module 44.
The weight generation module 41 is responsible for generating the weights, within a category, of the words contained in that category. Referring to Fig. 5, the module comprises three submodules:
1. Chinese character frequency collection module 411: this module first removes the stop words from the documents, then counts the frequency of occurrence of each Chinese character contained in the classified document library (character frequency = number of occurrences of a single Chinese character in the category / total number of Chinese characters in the category), and at the same time counts the number of documents in the category that contain the character (hereinafter, the document frequency).
2. Chinese character weight calculation module 412: this module first calculates the weight of each Chinese character in the category from the character frequency computed by the Chinese character frequency collection module 411, the document frequency, and the total number of documents in the category; it then assigns a default weight to Chinese characters that exist in the segmentation dictionary but not in the category.
3. Chinese word weight generation module 413: this module takes the Chinese words out of the segmenter dictionary one by one, obtains for each word the weights in the category of the characters it contains, as calculated by the Chinese character weight calculation module 412, then calculates the weight of the word in the category from the weights of its characters, and finally writes the weights of the Chinese words in the category to the hard disk.
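The statistics gathered by the Chinese character frequency collection module 411 (the first submodule above) can be sketched as follows. The frequency formula is the one stated in the text; the function name `collect_statistics`, the plain string replacement used to drop stop words, and the sample documents are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def collect_statistics(
    documents: List[str], stop_words: Iterable[str]
) -> Tuple[Dict[str, float], Dict[str, int]]:
    """Return (character frequency, document frequency) for one category."""
    stop_list = list(stop_words)
    cleaned = []
    for doc in documents:
        for stop_word in stop_list:
            doc = doc.replace(stop_word, "")  # remove stop words first
        cleaned.append(doc)

    # character frequency = occurrences of the character / total characters
    counts = Counter(ch for doc in cleaned for ch in doc if not ch.isspace())
    total = sum(counts.values())
    char_freq = {ch: n / total for ch, n in counts.items()} if total else {}

    # document frequency = number of documents that contain the character
    doc_freq: Counter = Counter()
    for doc in cleaned:
        doc_freq.update({ch for ch in doc if not ch.isspace()})
    return char_freq, dict(doc_freq)


char_freq, doc_freq = collect_statistics(["云的云", "子云"], stop_words=["的"])
```

With the stop word removed, the two sample documents contain four characters in total, three of which are "云", so its frequency is 0.75 and it appears in both documents.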
The special segmenter 42 is responsible for providing the user with a specialized Chinese word segmentation service; it can decompose the search condition input by the user into the Chinese words that best meet the user's expectations. The module is implemented as follows: first a general segmenter is created; then the Chinese word weights for the category, as calculated by the Chinese word weight generation module 413, are read from the hard disk and bound to the Chinese words in the segmentation dictionary; finally the special segmenter is registered in the segmenter selection module 43. During segmentation, the combination of Chinese words that best fits the category is calculated from the Chinese word weights.
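The text does not say how the best combination of words is computed from the word weights. One plausible reading, shown here only as a sketch, is a dynamic program that picks the split with the highest total weight. The function name `segment`, the `unknown_weight` fallback, and the sample dictionary (including the Chinese rendering of the "ion cloud" example) are assumptions.

```python
from typing import Dict, List, Tuple


def segment(
    text: str, word_weights: Dict[str, float], unknown_weight: float = 0.0
) -> List[str]:
    """Split `text` into the word sequence with the highest total weight."""
    n = len(text)
    max_len = max((len(word) for word in word_weights), default=1)
    # best[i] = (best score for text[:i], start index of the last word)
    best: List[Tuple[float, int]] = [(0.0, 0)] + [(float("-inf"), 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in word_weights:
                score = word_weights[piece]
            elif end - start == 1:
                score = unknown_weight  # unknown single character
            else:
                continue  # multi-character pieces must be dictionary words
            candidate = best[start][0] + score
            if candidate > best[end][0]:
                best[end] = (candidate, start)
    # Walk back from the end to recover the chosen words.
    words: List[str] = []
    pos = n
    while pos > 0:
        start = best[pos][1]
        words.append(text[start:pos])
        pos = start
    return words[::-1]


physics_weights = {
    "离子云": 0.9, "离子": 0.5, "云集": 0.6, "集中": 0.7, "分布": 0.6, "中": 0.1,
}
words = segment("离子云集中分布", physics_weights)
```

Because the category dictionary gives the technical term a high weight, the split that keeps it whole scores higher than the alternatives that break it apart.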
The segmenter selection module 43 is responsible for presenting the special segmenters 42 that have been built to the user in a visual form and for letting the user select, through this module, the special segmenter 42 that best suits the retrieval purpose. The module is implemented as follows: the special segmenters 42 that have been built are first saved in a linked list; the segmenter selection module 43 then provides a user interface that displays the special segmenters 42 in the linked list for the user to choose from. The user can select only one special segmenter 42 from the linked list. Once the user has made the selection, the segmenter selection module 43 passes the selected special segmenter 42 to the retrieval request preprocessing module 44.
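A minimal sketch of the selection module's bookkeeping is given below. The class name `SegmenterSelector` and its methods are hypothetical, and an insertion-ordered dict stands in for the linked list mentioned in the text.

```python
from typing import Callable, Dict, List, Optional

Segmenter = Callable[[str], List[str]]


class SegmenterSelector:
    """Keeps registered segmenters in order and tracks the user's single choice."""

    def __init__(self) -> None:
        # A dict preserves insertion order; it stands in for the linked list.
        self._segmenters: Dict[str, Segmenter] = {}
        self._selected: Optional[str] = None

    def register(self, category: str, segmenter: Segmenter) -> None:
        self._segmenters[category] = segmenter

    def available(self) -> List[str]:
        """Category names to show in the user interface."""
        return list(self._segmenters)

    def select(self, category: str) -> None:
        if category not in self._segmenters:
            raise KeyError(f"no segmenter registered for {category!r}")
        self._selected = category  # a new choice replaces the previous one

    def selected(self) -> Optional[Segmenter]:
        """Segmenter to hand to the request preprocessing module, if any."""
        if self._selected is None:
            return None
        return self._segmenters[self._selected]


selector = SegmenterSelector()
selector.register("physics", lambda text: text.split())
selector.register("finance", lambda text: [text])
selector.select("physics")
```

Storing only one selected category enforces the rule that the user can pick a single special segmenter at a time.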
The retrieval request preprocessing module 44 is responsible for receiving the search condition input by the user, calling the special segmenter 42 selected by the user to segment it, and passing the segmentation result to the search engine. The module first receives the user's retrieval request; it then delivers the request to the special segmenter 42 that the user selected through the segmenter selection module 43 for segmentation and fetches the segmentation result from that segmenter; finally it passes the segmentation result to the search engine.
Based on the device shown in Figs. 4 and 5, this embodiment also provides a target information search method. Referring to the flow chart in Fig. 6, the method comprises the following steps:
Step S601: scan the classified documents;
Step S602: count the frequency of occurrence of each Chinese character in the category;
Step S603: calculate the weight of each Chinese character in the category;
Step S604: calculate the weight of each Chinese word in the category;
Step S605: generate the special segmenter;
Step S606: register the special segmenter in the segmenter selection module;
Step S607: determine whether the user has selected a segmenter; if so, go to step S608; if not, go to step S609;
Step S608: deliver the segmenter selected by the user to the retrieval request preprocessing module;
Step S609: wait for the user to select a segmenter;
Step S610: determine whether the user has input a search condition (also called a retrieval request, equivalent to the character string above); if so, go to step S611; if not, go to step S612;
Step S611: call the segmenter selected by the user to segment the retrieval request, pass the result to the search engine as the query condition, and then go to step S613;
Step S612: wait for the user to input a retrieval request;
Step S613: return the retrieval result to the client.
Referring to the flow chart of the target information search method shown in Fig. 7, the method comprises the following steps:
Step S700: the Chinese character frequency collection module 411 scans the classified documents;
Step S701: the Chinese character frequency collection module 411 removes the stop-words in the documents;
Step S702: the Chinese character frequency collection module 411 counts the frequency of occurrence of each Chinese character contained in the classified document library (Chinese character frequency = number of occurrences of the individual Chinese character in the classification / total number of Chinese characters in the classification);
Step S703: the Chinese character frequency collection module 411 counts the number of documents in the classification that contain the Chinese character (hereinafter referred to as the document frequency);
Step S704: the Chinese character weight computing module 412 calculates the weight of each Chinese character in the classification according to the Chinese character frequency and the document frequency obtained by the Chinese character frequency collection module 411 and the total number of documents in the classification;
Step S705: the Chinese character weight computing module 412 assigns a default weight to each Chinese character that is present in the word segmentation library but absent from the classification;
Step S706: the Chinese-character words weights generation module 413 assigns weights to the Chinese words in the segmenter dictionary that contain the Chinese characters, according to the Chinese character weights calculated by the Chinese character weight computing module 412;
Step S707: build a common segmenter;
Step S708: read from the hard disk the Chinese word weights of the classification calculated by the Chinese-character words weights generation module 413, and bind the Chinese word weights to the Chinese words in the word segmentation library;
Step S709: inject the word segmentation library carrying the Chinese character weights into the common segmenter to produce the special segmenter 42;
Step S710: register the special segmenter in the segmenter selecting module;
Step S711: judge whether a special segmenter has been established for every classification; if not, repeat steps S700 to S710 until the special segmenters 42 of all classification libraries have been established; if so, end.
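Steps S700 to S708 can be sketched in code. The patent states the frequency formula of step S702 but not how frequency, document frequency and total document count are combined in step S704, nor how a word weight is derived from character weights in step S706. The sketch below therefore assumes a TF-IDF-style product for the character weight and the mean of character weights for the word weight. The function names, `DEFAULT_WEIGHT`, and the modelling of stop-words as single stop characters are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Set

DEFAULT_WEIGHT = 0.0  # assumed default for characters absent from the class (S705)


def character_weights(docs: List[str], stop_chars: Set[str]) -> Dict[str, float]:
    """S700 to S704: per-character weights for one classification."""
    # S701: remove stop-words (modelled here as single stop characters).
    cleaned = [[ch for ch in doc if ch not in stop_chars and not ch.isspace()]
               for doc in docs]
    total_chars = sum(len(chars) for chars in cleaned)
    occurrences = Counter(ch for chars in cleaned for ch in chars)
    # S703: number of documents that contain each character.
    doc_freq = Counter(ch for chars in cleaned for ch in set(chars))
    weights = {}
    for ch, count in occurrences.items():
        frequency = count / total_chars  # S702: formula given in the text
        # S704: assumed TF-IDF-style combination; the patent gives no formula.
        weights[ch] = frequency * math.log(1 + len(docs) / doc_freq[ch])
    return weights


def word_weights(dictionary: Iterable[str],
                 char_w: Dict[str, float]) -> Dict[str, float]:
    """S705 to S708: weight each dictionary word; the returned mapping is the binding."""
    bound = {}
    for word in dictionary:
        if not word:
            continue  # skip empty entries to avoid dividing by zero
        per_char = [char_w.get(ch, DEFAULT_WEIGHT) for ch in word]
        bound[word] = sum(per_char) / len(per_char)  # assumed: mean of characters
    return bound
```

Running `character_weights` once per classification and feeding the result to `word_weights` gives one weighted dictionary per classification, which corresponds to the loop closed by step S711.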
Referring to the flow chart of the target information search method shown in Fig. 8, the method comprises the following steps:
Step S800: the segmenter selecting module 43 displays the special segmenters 42 in the user interface.
Step S801: the segmenter selecting module 43 waits for the user to select a segmenter 42.
Step S802: the segmenter selecting module 43 receives the special segmenter 42 selected by the user and records it.
Step S803: the segmenter selecting module 43 sends the segmenter selected by the user to the retrieval request pretreatment module 44.
Step S804: the retrieval request pretreatment module 44 receives the user's retrieval request, calls the segmenter selected by the user to perform word segmentation on the retrieval request, and passes the result to the search engine 45 as the query condition.
Step S805: the search engine 45 performs retrieval according to the search condition after word segmentation and returns the retrieval result.
Step S806: judge whether the user reselects a special segmenter 42; if so, repeat step S802; if not, execute step S807.
Step S807: judge whether the user re-enters a retrieval request; if so, re-execute steps S804 and S805; if not, end. That is, if the user has no new activity, the business processing flow terminates automatically.
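The registration, display, selection and re-selection behaviour of steps S800 to S803 and S806 can be sketched as a registry. The class `SegmenterSelector` and its methods are hypothetical names chosen for illustration; the patent describes the behaviour of the segmenter selecting module 43 but not its interface.

```python
from typing import Callable, Dict, List, Optional

Segmenter = Callable[[str], List[str]]


class SegmenterSelector:
    """Hypothetical stand-in for the segmenter selecting module 43."""

    def __init__(self) -> None:
        self._registry: Dict[str, Segmenter] = {}
        self._selected: Optional[str] = None

    def register(self, name: str, segmenter: Segmenter) -> None:
        # Registration of a special segmenter under its classification name.
        self._registry[name] = segmenter

    def available(self) -> List[str]:
        # S800: names that would be displayed in the user interface.
        return sorted(self._registry)

    def select(self, name: str) -> None:
        # S802: record the user's choice; a later call models re-selection (S806).
        if name not in self._registry:
            raise KeyError(f"unknown segmenter: {name}")
        self._selected = name

    def current(self) -> Segmenter:
        # S803: hand the selected segmenter to the pretreatment module.
        if self._selected is None:
            raise RuntimeError("no segmenter selected yet")  # S801: still waiting
        return self._registry[self._selected]
```

Keeping only the name of the current selection means that re-selection in step S806 is a plain overwrite, and the next retrieval request automatically uses the new segmenter.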
The present embodiment can divide documents into multiple classifications according to technical field. As shown in the schematic diagram of the target information search system in Figure 9, each classification corresponds to its own set of the above device, wherein the segmenter selecting module, the retrieval request pretreatment module and the search engine in the device are shared modules.
The present embodiment can provide a special segmenter for each classification in the document library. Taking Chinese characters as an example, probability statistics are performed on the occurrences of Chinese characters in the classified documents to calculate the weight of each Chinese character in the classification; the weight of each Chinese word in the segmenter dictionary within the classification is then calculated from the Chinese character weights, and a special segmenter is established for each classification. In the segmenter selection interface, the user selects the special segmenter that best suits the search purpose and uses it to obtain the optimal word segmentation result for that purpose, which improves the search accuracy of the search engine and the user's satisfaction with it.
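To illustrate why class-specific word weights can change a segmentation result, the sketch below picks the segmentation with the largest summed word weight by dynamic programming. The patent does not state how a special segmenter uses the bound weights, so this scoring rule, the function `segment`, the `unknown_weight` parameter and the example weights are all assumptions made for illustration.

```python
from typing import Dict, List


def segment(text: str, word_weight: Dict[str, float],
            unknown_weight: float = 0.0) -> List[str]:
    """Split text into dictionary words so that the summed weight is maximal."""
    n = len(text)
    max_len = max((len(w) for w in word_weight), default=1)
    best = [float("-inf")] * (n + 1)  # best[i]: top score for text[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last token
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in word_weight:
                score = word_weight[piece]
            elif end - start == 1:
                score = unknown_weight  # fall back to a single unknown character
            else:
                continue
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens, pos = [], n
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    return tokens[::-1]


# Hypothetical weights for the same dictionary in two classifications.
BIOLOGY = {"研究": 1.0, "生命": 2.0, "研究生": 1.5, "命": 0.1}
EDUCATION = {"研究": 1.0, "生命": 0.2, "研究生": 2.0, "命": 0.1}
```

With these example weights, `segment("研究生命", BIOLOGY)` yields `["研究", "生命"]`, while `segment("研究生命", EDUCATION)` yields `["研究生", "命"]`: the same string is split differently depending on which classification's weights are bound to the dictionary.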
As can be seen from the above description, the present invention achieves the following technical effects:
1. A variety of special segmenters are provided to the user. By using the special segmenter that best suits the search purpose, the user can effectively improve the accuracy of word segmentation and, on this basis, the retrieval accuracy of the search engine.
2. The user can select multiple segmenters to perform word segmentation several times on the same search condition, and each word segmentation result is submitted to the search engine separately for retrieval, so that the documents the user wants are retrieved accurately.
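The second effect, applying several segmenters to the same search condition and submitting each result separately, can be sketched as follows. The function `search_with_each` and the callable types are hypothetical; the search engine is represented only as a function from search terms to results.

```python
from typing import Callable, Dict, List

Segmenter = Callable[[str], List[str]]
SearchEngine = Callable[[List[str]], List[str]]


def search_with_each(request: str, segmenters: Dict[str, Segmenter],
                     search_engine: SearchEngine) -> Dict[str, List[str]]:
    """Run one independent search per selected segmenter."""
    results = {}
    for name, segmenter in segmenters.items():
        terms = segmenter(request)            # segmentation differs per segmenter
        results[name] = search_engine(terms)  # each result is submitted separately
    return results
```

The results are kept per segmenter rather than merged, matching the statement that each word segmentation result is submitted to the search engine individually.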
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases, the steps shown or described can be performed in an order different from that given here. Alternatively, they can each be fabricated as an individual integrated circuit module, or multiple modules or steps among them can be fabricated as a single integrated circuit module. Thus, the present invention is not restricted to any specific combination of hardware and software.
The foregoing describes only the preferred embodiments of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (6)

1. A target information search method, characterized by comprising the following steps:
receiving a segmenter selected by a user and a character string input by the user, wherein the segmenter is a segmenter matching the character string input by the user;
performing word segmentation on the character string using the segmenter to obtain search terms;
inputting the obtained search terms into a search engine to perform a search and obtain target information;
wherein, before receiving the segmenter selected by the user and the character string input by the user, the method further comprises: establishing, using classified documents corresponding to a technical field, the segmenter corresponding to the technical field, including: classifying technical fields and determining the classified documents corresponding to a current classification; calculating the weight of each character in the current classification according to the frequency with which each character appears in the classified documents; determining the weights, in the current classification, of the characters in a designated character string of the current classification; calculating the weight of the designated character string in the current classification according to the weights of the characters in the designated character string; and binding the designated character string with the weight of the designated character string in the current classification to obtain the segmenter of the current classification.
2. The method according to claim 1, characterized in that calculating the weight of each character in the current classification according to the frequency with which each character appears in the classified documents comprises:
deleting the stop-words in the classified documents;
counting the frequency with which each character appears in the classified documents after the stop-words are deleted;
counting the document frequency of the classified documents that contain the character;
calculating the weight of each character in the current classification according to the frequency of the character, the document frequency of the character and the total number of the classified documents.
3. The method according to claim 1, characterized in that determining the weights, in the current classification, of the characters in the designated character string of the current classification comprises:
when the designated character string of the current classification contains a character that is not included in the classified documents, setting the weight of the character not included in the classified documents to a default weight.
4. The method according to any one of claims 1 to 3, characterized in that the character includes one of the following: a character in Chinese form, a character in Korean form, or a character in Japanese form.
5. A target information search device, characterized by comprising the following modules:
a receiving module, configured to receive a segmenter selected by a user and a character string input by the user, wherein the segmenter is a segmenter matching the character string input by the user;
a word segmentation module, configured to perform word segmentation on the character string using the segmenter received by the receiving module to obtain search terms;
a search module, configured to input the search terms obtained by the word segmentation module into a search engine to perform a search and obtain target information;
wherein the device further comprises: a segmenter establishing module, configured to establish, using classified documents corresponding to a technical field, the segmenter corresponding to the technical field, the segmenter establishing module comprising: a document determining unit, configured to classify technical fields and determine the classified documents corresponding to a current classification; a character weight calculation unit, configured to calculate the weight of each character in the current classification according to the frequency with which each character appears in the classified documents determined by the document determining unit; a weight determining unit, configured to determine the weights, in the current classification, of the characters in a designated character string of the current classification; a character string weight calculation unit, configured to calculate the weight of the designated character string in the current classification according to the weights of the characters in the designated character string; and a segmenter establishing unit, configured to bind the designated character string with the weight of the designated character string in the current classification to obtain the segmenter of the current classification.
6. The device according to claim 5, characterized in that the character weight calculation unit comprises:
a deletion subunit, configured to delete the stop-words in the classified documents;
a statistics subunit, configured to count the frequency with which each character appears in the classified documents after the deletion subunit deletes the stop-words, and to count the document frequency of the classified documents that contain the character;
a character string computation subunit, configured to calculate the weight of each character in the current classification according to the frequency of the character, the document frequency of the character and the total number of the classified documents.
CN201110207333.8A 2011-07-22 2011-07-22 Target information search method and device Active CN102890690B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110207333.8A CN102890690B (en) 2011-07-22 2011-07-22 Target information search method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110207333.8A CN102890690B (en) 2011-07-22 2011-07-22 Target information search method and device

Publications (2)

Publication Number Publication Date
CN102890690A CN102890690A (en) 2013-01-23
CN102890690B true CN102890690B (en) 2017-04-12

Family

ID=47534196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110207333.8A Active CN102890690B (en) 2011-07-22 2011-07-22 Target information search method and device

Country Status (1)

Country Link
CN (1) CN102890690B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106708798B (en) * 2015-11-16 2020-03-31 阿里巴巴集团控股有限公司 Character string segmentation method and device
CN106021625A (en) * 2016-07-26 2016-10-12 浪潮软件集团有限公司 Mixed application method of two word segmenters based on SOLR search engine
CN107798004B (en) * 2016-08-29 2022-09-30 中兴通讯股份有限公司 Keyword searching method and device and terminal
CN109063046A (en) * 2018-07-17 2018-12-21 广州资宝科技有限公司 searching method, device and intelligent terminal
CN109800326B (en) * 2019-01-24 2021-07-02 广州虎牙信息科技有限公司 Video processing method, device, equipment and storage medium
CN111090668B (en) * 2019-12-09 2023-09-26 京东科技信息技术有限公司 Data retrieval method and device, electronic equipment and computer readable storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101561818A (en) * 2009-05-13 2009-10-21 北京用友移动商务科技有限公司 Method for word segmentation processing and method for full-text retrieval

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
The Research of Chinese Automatic Word Segmentation In Hierarchical Model Dictionary Binary Tree;Luo XianGang等;《2009 First International Workshop on Database Technology and Applications》;20090426;第321-324页 *
适用于化工专业搜索引擎的中文分词系统的研究与实现;王硕;《中国优秀硕士学位论文全文数据库信息科技辑》;20081115(第11期);第1-52页 *

Also Published As

Publication number Publication date
CN102890690A (en) 2013-01-23

Similar Documents

Publication Publication Date Title
CN102890690B (en) Target information search method and device
CN100545847C (en) A kind of method and system that blog articles is sorted
CN102043833B (en) Search method and device based on query word
CN103577416B (en) Expanding query method and system
JP4006239B2 (en) Document search method and search system
CN108304444B (en) Information query method and device
CN103838754B (en) Information retrieval device and method
CN104199965B (en) Semantic information retrieval method
CN104077407B (en) A kind of intelligent data search system and method
US20080201297A1 (en) Method and System for Determining Relation Between Search Terms in the Internet Search System
CN107729336A (en) Data processing method, equipment and system
CN102004782A (en) Search result sequencing method and search result sequencer
CN105653562A (en) Calculation method and apparatus for correlation between text content and query request
CN103902597A (en) Method and device for determining search relevant categories corresponding to target keywords
CN106202313B (en) Search result towards academic Meta Search Engine synthesizes sort method
CN103942198B (en) For excavating the method and apparatus being intended to
CN112035599A (en) Query method and device based on vertical search, computer equipment and storage medium
CN106095738A (en) Recommendation tables single slice
CN107085568A (en) A kind of text similarity method of discrimination and device
CN104462347B (en) The sorting technique and device of keyword
CN103226601B (en) A kind of method and apparatus of picture searching
CN106021423B (en) META Search Engine personalization results recommended method based on group division
JP2013054606A (en) Document retrieval device, method and program
CN105975508B (en) Personalized meta search engine search result synthesizes sort method
CN109918420B (en) Competitor recommendation method and server

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant