Summary of the invention
The objective of the invention is to provide a mobile Internet intelligent information search engine based on keyword search. Implemented for portable terminals, it can quickly search by keyword, within the Internet sites of a specified target scope and according to specified information columns, and at the same time converts the information into a form of presentation that suits both the characteristics of the terminal and the way it is operated on the move.
The invention is realized as follows. In a mobile Internet intelligent information search engine based on keyword search, column classifications and target websites are set as specified; the machine automatically samples and analyzes the target websites, generates search rules, and collects from the target websites according to those search rules. Then, through an information process, the network elements collected from the target websites are organized into a specific full-text index structure and cached, constituting a full-text index information bank. A search task processing module processes the search commands sent by a mobile device; an equipment and channel recognition module and a mobile Internet access module determine how the mobile device accesses the Internet and identify both the device and the channel used, and the processing result is returned to the mobile device.
The search rule is formed as follows. The system automatically analyzes the structure of the target website, collects the HTML pages that share a similar layout, and automatically generates the link acquisition expression for the content pages of the target website. As required, it then generates the content matching expression that locates the target network elements. The target network elements obtained through the content matching expression, together with the mapping between target network elements and column classifications, form a network element mapping graph, from which a content acquisition expression is generated. Together these form the search rule.
The information process of the search engine is as follows. Driven by the search rule and in combination with the column classification, the HTTP protocol data collected from the target website pass through web page decomposition, match filtering, information formatting, information encoding and intelligent sentence deduplication. In combination with a feature code table, the information is then processed for display and the feature code words to be filtered are deleted. The final output is plain text from which spaces and markup have been removed and which contains no illegal characters and no other non-text information.
The information in the full-text index information bank of the search engine consists of the processed text content of the target network elements collected under the search rule. An incremental full-text index is compiled for newly entered information, and the index is built incrementally according to time sequence and column classification.
The channel and equipment recognition module identifies the device type of the portable terminal from the access channel and protocol header of the communication, and thereby obtains the configuration information for that device type. Depending on the portable terminal, the search results are processed to suit the characteristics of the terminal and the operating habits of mobile users, and are then output to the user's portable terminal as mobile protocol data to display the results of the search.
By adopting the above technical scheme, the invention sets column classifications and target websites interactively; the machine automatically samples and analyzes the target websites, generates search rules, and collects from the target websites according to these rules. Then, through an information processing flow, the network elements collected from the target websites are organized into a specific full-text index structure and cached. A search task processing module processes the search commands sent by a mobile device: it determines how the mobile device accesses the Internet, identifies the device and the channel used, and returns the processing result to the mobile device after the corresponding presentation-layer processing. Given that mobile devices currently have relatively small screens, weak computing power and limited network bandwidth, the invention fills a gap in this kind of service in the mobile field and satisfies well the needs of the many mobile users for obtaining information on the move.
Embodiment
The invention is described in further detail below with reference to the accompanying drawings:
As shown in Fig. 1, in general, column classification 4 and target website 1 are set interactively; the machine automatically analyzes 2 the target website and forms search rule 3, and acquisition engine 5 collects from target website 1 according to these rules. Then, after an information process 6, the network elements collected from target website 1 are organized into a specific full-text index structure and cached, constituting full-text index information bank 7. Search task processing module 8 processes the search commands sent by the mobile device; equipment and channel recognition module 9 and mobile Internet access module 10 determine how the mobile device accesses the Internet and identify both the device and the channel used, and the processing result is returned to the mobile device.
As shown in Fig. 2, the system automatically analyzes the structure of the target website, collects the HTML pages that share a similar layout, and automatically generates content-page link acquisition expression 3.1. Based on a manual decision, it generates content matching expression 3.2, which locates the target network elements. From the target network elements obtained through the content matching expression, and the mapping between target network elements and column classifications, a network element mapping graph is formed and content acquisition expression 3.3 is generated, which completes the search rule.
In Fig. 2, after the system automatically performs target website structure analysis 3.11, target web page tag syntax structure analysis 3.12 and target web page content structure analysis 3.13, the pages collected from each column and each directory of the target website are classified, on the basis of their tag syntax, by identical layout and identical directory, and link acquisition expression 3.1 for the relevant content pages of the corresponding target website is generated automatically.
From the similarities and differences in tag syntax structure and in content structure among the similarly laid-out pages of each target website directory, the positions of all target network elements on the target web page are determined, and content matching expression 3.2 of the target web page is generated.
According to the information type of each target network element, content matching expression 3.2 determines the target network element corresponding to each information analysis item on the page, as well as the mapping between target network elements and column classification 4. In other words, a manual decision mechanism is provided for deciding the position of each target network element on the target web page and the column to which it belongs; this forms network element mapping graph 3.31 and generates content acquisition expression 3.3 for the target network elements.
The above steps form the complete search rule 3 of the search engine.
As shown in Figs. 1 and 3, driven by search rule 3 and in combination with column classification 4, the HTTP protocol data 5.1 collected from target website 1 pass through an information process 6: web page decomposition 6.1, match filtering 6.2, information formatting 6.3, information encoding 6.4 and intelligent sentence deduplication 6.5. In combination with feature code table 6.7, the information is processed for display 6.6 and the feature code words to be filtered are deleted. The target network element 6.8 finally output is plain text from which spaces and markup have been removed and which contains no illegal characters and no other non-text information. After processing, this text constitutes full-text index information bank 7; an incremental full-text index is compiled for newly entered information, and the index is built according to time sequence and column classification.
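By way of illustration only, the clean-up stages of information process 6 can be sketched in Python as follows. The function name `to_plain_text`, the sample contents of the feature code table, and the treatment of control characters as "illegal characters" are assumptions made for this sketch and are not taken from the description; removing every whitespace character follows the statement that spaces are removed and is best suited to text that does not rely on spaces between words.

```python
import html
import re

# Hypothetical contents of feature code table 6.7: words to delete from the text.
FEATURE_CODE_TABLE = ["[Advertisement]", "Click to enlarge"]


def to_plain_text(fragment, feature_words=FEATURE_CODE_TABLE):
    """Reduce an HTML fragment to plain text without markup or whitespace."""
    # Drop script and style bodies first, since they are non-text information.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", fragment)
    # Remove the remaining markup tags.
    text = re.sub(r"(?s)<[^>]+>", "", text)
    # Decode entities such as &nbsp; and &amp;.
    text = html.unescape(text)
    # Delete the feature code words that are to be filtered.
    for word in feature_words:
        text = text.replace(word, "")
    # Assumption: control characters are what the text calls illegal characters.
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    # Remove all whitespace, as the output is described as having spaces removed.
    return re.sub(r"\s+", "", text)
```

For example, `to_plain_text("<b>Year-end</b> mood&nbsp;report")` yields the tag-free, space-free string `Year-endmoodreport`.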
The intelligent sentence deduplication mentioned above is a method of eliminating repeated information at the sentence level. The specific steps are: a) split the information into complete sentences at punctuation marks; b) extract feature codes from the information: for each item, N feature codes are extracted from N natural sentences, any remaining sentences are ignored, and missing codes are padded with zeros; c) sort, insert, look up and compare the feature codes, comparing the feature codes of each new item with those of the m closest items; d) eliminate as repeats the items whose difference falls within the set threshold range.
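Steps a), b) and d) above can be sketched as follows, purely as an illustration. The description does not name the function used to compute a feature code; CRC32 is substituted here as an assumption, and the names `feature_codes` and `is_duplicate`, the value of N and the overlap threshold are likewise hypothetical.

```python
import re
import zlib

N = 8  # assumed number of feature codes kept per item


def feature_codes(text, n=N):
    """Split text into sentences at punctuation and hash the first n of them."""
    parts = re.split(r"[.!?;。！？；]+", text)
    sentences = [s.strip() for s in parts if s.strip()]
    # Assumption: CRC32 of the UTF-8 sentence serves as its feature code.
    codes = [zlib.crc32(s.encode("utf-8")) for s in sentences[:n]]
    # Pad with zeros when the item has fewer than n sentences.
    return codes + [0] * (n - len(codes))


def is_duplicate(a, b, threshold=0.8):
    """Treat two items as repeats when enough of their feature codes coincide."""
    real_a = {c for c in a if c != 0}
    real_b = {c for c in b if c != 0}
    if not real_a or not real_b:
        return False
    overlap = len(real_a & real_b) / min(len(real_a), len(real_b))
    return overlap >= threshold
```

Because each code is computed per sentence, two items that differ only in spacing or in the punctuation mark that ends a sentence produce the same codes and are reported as repeats.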
As shown in Fig. 4, search task processing module 8 works on full-text index information bank 7. After receiving a search command from the portable terminal, it processes the task. It first carries out user command processing 8.1: it combines the search conditions and obtains the corresponding result set from full-text index information bank 7 according to the column and time range specified in the user command. It then carries out query result set processing 8.2, which packages the result set. Using the access channel and protocol header of the communication, access channel recognition 9.1 and device recognition 9.2 are performed for the portable terminal connected through mobile Internet access 10, and the information for the corresponding device is obtained. Depending on the portable terminal, the search results are processed to suit the characteristics of the terminal and the operating habits of mobile users, and are then output to the user's portable terminal as mobile protocol data to display the results of the search.
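User command processing 8.1 and query result set processing 8.2 can be pictured with the following minimal sketch, which filters an in-memory list rather than a real full-text index. The `Entry` record, the `search` function, the prefix test on column codes and the page size are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Entry:
    column: str          # column code, for example "001001"
    published: datetime  # time at which the item was collected
    text: str            # plain text produced by the information process


def search(entries, keyword, column, start, end, page_size=5):
    """Combine keyword, column and time range; return the newest hits first."""
    hits = [
        e for e in entries
        # Assumption: a column also covers its sub-columns (shared code prefix).
        if e.column.startswith(column)
        and start <= e.published <= end
        and keyword in e.text
    ]
    hits.sort(key=lambda e: e.published, reverse=True)
    # "Packaging" is reduced here to truncating the result set to one page.
    return hits[:page_size]
```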
In Fig. 5, search task processing module 8 may also include a timer 8.3 and a customizer 8.4, which periodically check the search tasks customized by portable terminal users. A search task generally comprises a keyword combination and the customized columns. The system determines whether the information index bank contains new information that satisfies the user's subscription conditions. If so, the information is automatically pushed to the portable terminal; if not, the system continues to wait for the timer to trigger the next round of processing.
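One round of the timer-driven check can be sketched as follows. The `Subscription` record, the `check_subscriptions` function and the `push` callback are hypothetical names; the description does not specify how subscriptions are stored or how the push is delivered, so delivery is left to the caller.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Subscription:
    user: str
    keywords: list   # keyword combination; every keyword must occur
    column: str      # customized column code
    last_checked: datetime = field(default_factory=lambda: datetime.min)


def check_subscriptions(subscriptions, entries, now, push):
    """Run one timer tick: push matching items newer than the last check."""
    pushed = 0
    for sub in subscriptions:
        for column, published, text in entries:
            if (published > sub.last_checked
                    and column.startswith(sub.column)
                    and all(k in text for k in sub.keywords)):
                push(sub.user, text)  # delivery mechanism is supplied by the caller
                pushed += 1
        # Remember the tick time so the same item is not pushed twice.
        sub.last_checked = now
    return pushed
```

A scheduler of any kind would call `check_subscriptions` at each interval; when it returns 0, nothing was pushed and the system simply waits for the next tick.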
Suppose a user wishes to search, on a mobile phone via WAP, for information related to the keyword "sentiment undertone in the end of the year" in the "analysis expert" sub-column of the "financial column". The concrete implementation is as follows:
1. Generating the search rule
This part is carried out through human-machine interaction and mainly comprises the following two steps:
A. Web analysis: by automatically analyzing the target website, generate the content-page link acquisition expression, the content matching expression and the content acquisition expression, and finally the complete information collection rule.
The matching expression of this example is:
sTitle>{.+?}<.+?<br>{.+?}<br><br></td></tr><
The acquisition expression of this example is:
ef=([\″′]|\b)*{[^<\″′]+?}(([\″′]|\b)[^>]*?>)|(>){[^<]+?}<1
B. Across diverse websites, the region of any network element (text content) on a target web page can be designated as the retrieval target, which improves the accuracy of the search.
The columns of this example are set as follows:
" financial column " coding 001
" analysis expert " coding 001001
These two steps produce the search rule expressions needed to drive the acquisition engine. The main approach is to compare the differences between two content pages generated from the same template on the target website; to analyze the page structure and content and the page tag structure; and to determine the position of each network element in the source file and the tag structure in which it sits. The mapping between each network element and the network elements defined in the database is then analyzed. All page links are obtained, the content pages are identified, and the content-page links are determined. The link acquisition expression, the content matching expression and the content acquisition expression are generated and then verified. Together with the remaining parameters, they form the complete search rule, which is verified in turn.
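As an illustration of how the matching expression of this example might be applied, the following sketch converts it to a Python regular expression. It rests on the assumption that each `{...}` pair in the expression marks a field to be captured; the description does not define the expression syntax, and the helper names and the sample page are hypothetical. The acquisition expression is not converted here because its syntax cannot be inferred with confidence.

```python
import re

# Matching expression of this example, copied unchanged.
SOURCE_EXPRESSION = "sTitle>{.+?}<.+?<br>{.+?}<br><br></td></tr><"


def to_python_regex(expression):
    """Convert the expression, assuming "{" and "}" delimit a captured field."""
    pattern = expression.replace("{", "(").replace("}", ")")
    # DOTALL lets a field span line breaks in the page source.
    return re.compile(pattern, re.DOTALL)


def extract_fields(page):
    """Return the captured fields of the first match, or None if none is found."""
    match = to_python_regex(SOURCE_EXPRESSION).search(page)
    return match.groups() if match else None


# Hypothetical content page fragment laid out the way the expression expects.
SAMPLE_PAGE = (
    "<td class=sTitle>Year-end sentiment</td></tr><tr><td>"
    "<br>Body text of the analysis.<br><br></td></tr></table>"
)
```

Under that assumption, `extract_fields(SAMPLE_PAGE)` returns the title and the body as two separate network elements, and a page that does not follow the template yields `None`, which is one simple way of verifying an expression against sample pages.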
A unified external service column scheme is defined and classified from larger to smaller columns. The coding uses units of 3 characters per level. For example, 001 is a first-level node; 001001 and 001002 are child nodes of node 001; 002 is a first-level node at the same level as 001; and so on. According to these settings, the search engine maps the information of the target websites into the corresponding service columns and provides accurate content service.
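The 3-character-per-level coding lends itself to simple prefix arithmetic, sketched below. The function names `levels`, `parent` and `is_descendant` are introduced here for illustration only.

```python
UNIT = 3  # characters per level, as stated in the coding scheme


def levels(code):
    """Return the chain of column codes from the first level down to code."""
    if not code or len(code) % UNIT != 0:
        raise ValueError(f"invalid column code: {code!r}")
    return [code[:i] for i in range(UNIT, len(code) + 1, UNIT)]


def parent(code):
    """Return the parent column code, or None for a first-level node."""
    chain = levels(code)
    return chain[-2] if len(chain) > 1 else None


def is_descendant(code, ancestor):
    """Tell whether code lies somewhere below ancestor in the column tree."""
    return code != ancestor and ancestor in levels(code)
```

With the codes of this example, `parent("001001")` is `"001"`, while `parent("002")` is `None` because 002 is a first-level node.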
2. Information collection and classification
This part is completed automatically under the drive of the configured search rule, in the following steps.
A. Driven by the search rule generated in the previous part, the search rule is executed in a loop to collect from the large set of configured target websites and target columns.
B. During collection, only the newest information appearing on the target websites is collected, in an iterative manner.
C. After information collection is finished, the raw web page HTTP protocol data stream is output.
The search task execution module decomposes the task, first dividing the search rule into subtasks.
The column home page is obtained first; the column classification is defined in the search rule. After the link acquisition expression is executed, the links to the content pages are obtained. Each content page is fetched and submitted to information processing, then the next content page is fetched, and the results are mapped to the "analysis expert" sub-column of the "financial column".
The collected information is mapped to the corresponding columns according to the rules, so that users can be offered a manageable, unified information portal and the search result set becomes more accurate. The engine collects only newly appearing information and outputs it in a near-real-time update mode.
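The collection loop described in this part can be sketched as follows. To keep the sketch self-contained, page retrieval is passed in as a `fetch` callable instead of performing real HTTP requests, and the link acquisition expression is represented by an ordinary Python regular expression with one capture group; these choices, the function name `collect_new` and the `seen` set are assumptions made for illustration.

```python
import re


def collect_new(column_url, link_pattern, fetch, seen):
    """Fetch only those content pages whose links were not collected before."""
    index_page = fetch(column_url)  # the column home page
    collected = []
    for link in re.findall(link_pattern, index_page):
        if link in seen:
            continue  # already collected in an earlier pass
        seen.add(link)
        # The raw page is handed on unchanged for later information processing.
        collected.append((link, fetch(link)))
    return collected
```

Calling `collect_new` repeatedly with the same `seen` set returns only the pages that have appeared since the previous pass, which corresponds to collecting only newly appearing information.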
3. Information processing and caching
A. The HTTP protocol data obtained during collection are processed by the information processing module of the information acquisition engine.
That is, using the search rule generated by the web analysis module, the content acquisition expression is applied to extract network elements, and the content matching expression is applied to separate the network elements and extract those required. To reduce the complexity of the content acquisition expression, extraction uses a two-stage acquisition expression, that is, a second-level match. If the content acquisition expression fails, an entry is written to the error log and an error code is returned. After network element decomposition, match filtering, information formatting, information encoding, deduplication and the other stages, the final output is plain text with spaces and markup removed, no illegal characters and no other non-text information.
Repeated articles are deduplicated during this process. The specific steps are as follows. The information is split into complete sentences at punctuation marks and feature codes are extracted; N feature codes are extracted from each item, that is, N natural sentences are taken, surplus sentences are ignored and missing ones are padded with 0. Whether two articles are similar depends on how many feature codes they share. The feature code sum is the sum of all N feature codes of an item. Similar items have feature code sums that lie close together, whereas the sums of different items differ more widely; a hash table is used to sort, insert and look up feature code sums. Comparing the feature codes of each new item with those of the M closest items and the M most recent items is enough to eliminate repeats. Feature codes are extracted from the information and similar content is looked up in the cache; if any is found, the duplicate is eliminated. Feature codes are generally extracted per natural sentence, the aim being to speed up full-text comparison.
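The lookup by feature code sum can be sketched as follows. The text speaks of a hash table; this sketch instead keeps a list sorted by sum and uses binary search, which is a substitution made here to keep "closest sums" easy to find. The class name `FeatureSumIndex`, the neighbourhood size and the number of shared codes required are assumptions, and the comparison with the M most recent items is omitted.

```python
import bisect


class FeatureSumIndex:
    """Keep (feature code sum, codes) pairs sorted by sum."""

    def __init__(self, m=3, min_shared=2):
        self.m = m                    # neighbours inspected on each side
        self.min_shared = min_shared  # shared codes needed to call a repeat
        self._items = []              # sorted list of (sum, tuple of codes)

    def add_if_new(self, codes):
        """Insert codes unless a neighbour by sum shares enough codes."""
        key = (sum(codes), tuple(codes))
        pos = bisect.bisect_left(self._items, key)
        lo, hi = max(0, pos - self.m), pos + self.m
        for _, other in self._items[lo:hi]:
            # Zero is padding and must not count as a shared feature code.
            shared = (set(codes) & set(other)) - {0}
            if len(shared) >= self.min_shared:
                return False
        self._items.insert(pos, key)
        return True

    def __len__(self):
        return len(self._items)
```

Only the few items whose sums lie nearest to that of the new item are compared code by code, so the cost of checking a new item does not grow with a full comparison against every stored item.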
B. After the above processing, the information for the "analysis expert" sub-column of the "financial column" is cached to the full-text index information bank through the full-text index module, and an incremental full-text index is compiled for newly entered information. This full-text index uses time sequence and column as its primary key and is sorted in descending order, so that by default the newest information comes first. Different columns can be stored in different physical tables to improve concurrent access speed and thereby provide more efficient retrieval.
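An incremental index of this kind can be pictured with the in-memory sketch below. It stands in for the full-text index information bank only loosely: the class name `IncrementalIndex`, whitespace tokenization and the per-column dictionaries (which mimic separate physical tables) are assumptions, and text without spaces between words would need a proper word segmenter.

```python
from collections import defaultdict


class IncrementalIndex:
    """In-memory stand-in for the full-text index information bank."""

    def __init__(self):
        self._docs = {}  # document id -> (column, published, text)
        # One posting table per column, mimicking separate physical tables.
        self._postings = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, column, published, text):
        """Index one new item without rebuilding the existing postings."""
        self._docs[doc_id] = (column, published, text)
        for term in set(text.lower().split()):
            self._postings[column][term].add(doc_id)

    def lookup(self, column, term):
        """Return the matching document ids of one column, newest first."""
        ids = self._postings.get(column, {}).get(term.lower(), set())
        return sorted(ids, key=lambda i: self._docs[i][1], reverse=True)
```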
4. Processing keyword-based information searches from mobile access
With the full-text index information bank described above in place, the search task processing module processes the task after receiving a search command from the portable terminal. The device type of the portable terminal is identified from the access channel and protocol header of the communication, and the information for the corresponding device is obtained from a management repository. On the WAP page, information related to the keyword "sentiment undertone in the end of the year" is searched in the "analysis expert" sub-column of the "financial column". The search results are then packaged as WAP protocol data according to the characteristics of the portable terminal, so that they are displayed appropriately on the terminal.
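The recognition and packaging steps can be sketched as follows. The description states only that the access channel and protocol header are used; reading the `Accept` and `User-Agent` request headers is an assumption of this sketch, as are the device profiles, their fields and the function names. The WML output is deliberately simplified and omits the XML prolog and document type declaration that a complete deck would carry.

```python
from xml.sax.saxutils import escape

# Hypothetical device profiles keyed by a substring of the User-Agent header.
DEVICE_PROFILES = {
    "Nokia": {"max_items": 5, "title_chars": 20},
    "default": {"max_items": 3, "title_chars": 12},
}


def recognise(headers):
    """Derive the channel and the device profile from the request headers."""
    accept = headers.get("Accept", "")
    # text/vnd.wap.wml is the media type of WML decks.
    channel = "wap" if "text/vnd.wap.wml" in accept else "web"
    agent = headers.get("User-Agent", "")
    for name, profile in DEVICE_PROFILES.items():
        if name != "default" and name in agent:
            return channel, profile
    return channel, DEVICE_PROFILES["default"]


def render_wml(titles, profile):
    """Package result titles as a small WML card sized for the device."""
    shown = titles[: profile["max_items"]]
    lines = [escape(t[: profile["title_chars"]]) for t in shown]
    body = "<br/>".join(lines)
    return f'<wml><card id="r" title="Results"><p>{body}</p></card></wml>'
```

Truncating titles and limiting the number of items per card is one simple way of adapting the same result set to terminals with different screen sizes.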