CN107301253A - Method and device for improving multi-site search keyword accuracy

Publication number
CN107301253A
Authority
CN
China
Prior art keywords
information
site
search
targeted website
search key
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710732432.5A
Other languages
Chinese (zh)
Other versions
CN107301253B (en)
Inventor
李成
范渊
黄进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DBAPPSecurity Co Ltd
Original Assignee
DBAPPSecurity Co Ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and device for improving multi-site search keyword accuracy, and relates to the field of Internet information. The method comprises: obtaining association information between website information of a target website and preset search terms, where the website information is the latest website information of the target website at the current time and includes website content and website addresses; performing word segmentation on the search information entered by a user to obtain search keywords, where the search information is the information used to search the data of the target website; and looking up, according to the association information, the website content that matches the search keywords and pushing that content to the user. The invention alleviates the technical problem of poor accuracy that arises when prior-art search methods are used to find web page content matching a search keyword.

Description

Method and device for improving multi-site search keyword accuracy
Technical field
The present invention relates to the technical field of Internet information, and in particular to a method and device for improving multi-site search keyword accuracy.
Background
Since its commercialization in the mid-1990s, the Internet has developed rapidly around the world. With this rapid development, the Internet has penetrated every area of daily life. It lets us follow current news promptly, obtain the latest knowledge and information, broaden our horizons, and enrich our leisure time.
However, along with this convenience comes the sheer volume and complexity of Internet content. The content is not only wide-ranging but also updated very quickly: it changes constantly, through modification, addition, and deletion. With content changing daily, a large amount of duplicated content inevitably appears on the Internet.
Given this situation, with existing search technology, after a user enters a search keyword in the search box of a web page, one of the following typically occurs: the desired content cannot be found, the content found is unrelated to the search keyword, or multiple duplicates of the same content are returned. Searching for web page content that matches a search keyword with prior-art methods therefore often suffers from poor accuracy.
Summary of the invention
In view of this, an object of the present invention is to provide a method and device for improving multi-site search keyword accuracy, so as to alleviate the technical problem of poor accuracy that arises when prior-art search methods are used to find web page content matching a search keyword.
In a first aspect, an embodiment of the present invention provides a method for improving multi-site search keyword accuracy, comprising:
obtaining association information between website information of a target website and preset search terms, where the website information is the latest website information of the target website at the current time and includes website content and website addresses;
performing word segmentation on the search information entered by a user to obtain search keywords, where the search information is the information used to search the data of the target website;
looking up, according to the association information, the website content that matches the search keywords, and pushing the website content to the user.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation of the first aspect, in which, before the association information between the website information of the target website and the preset search terms is obtained, the method further comprises:
obtaining a target crawl time;
at the target crawl time, controlling a crawler to execute the current crawl task so as to crawl the target website and obtain first website information;
determining the preset search terms from the website content included in the first website information, and establishing the association information between the preset search terms and the first website information;
storing the association information in a data server.
With reference to the first possible implementation of the first aspect, an embodiment of the present invention provides a second possible implementation of the first aspect, in which controlling the crawler at the target crawl time to execute the current crawl task so as to crawl the target website and obtain the first website information comprises:
when the current crawl task is executed, crawling the home page of the target website to obtain the home page content of the target website and the hyperlink interfaces contained in the home page information of the target website;
analyzing each hyperlink interface to determine whether it is a target hyperlink interface, where a target hyperlink interface is one that has not yet been crawled, is a valid hyperlink interface, and whose corresponding web page content includes preset web page content;
when a hyperlink interface is determined to be a target hyperlink interface, traversing the web page corresponding to that hyperlink interface to obtain the website content of the target hyperlink interface;
taking the website content and website address of each target hyperlink interface as the first website information.
With reference to the second possible implementation of the first aspect, an embodiment of the present invention provides a third possible implementation of the first aspect, in which crawling the home page of the target website comprises:
determining whether this is the first time a crawl task is executed for the target website;
if not, analyzing second website information to determine whether the web page indicated by a target address can be accessed through the target website, or whether web page content exists in the web page indicated by the target address, where the second website information is the information crawled when the crawler executed a first crawl task, the first crawl task is the crawl task immediately preceding the current crawl task, and the target address is any website address in the second website information,
where, if the determination is affirmative, the home page of the target website is crawled to obtain the home page content of the target website and the hyperlink interfaces contained in the home page information;
and if the determination is negative, the association information associated with the target address is deleted from the data server.
With reference to the first possible implementation of the first aspect, an embodiment of the present invention further provides a fourth possible implementation of the first aspect, in which the method further comprises:
determining whether the current crawl task is the first crawl task executed for the target website;
if not, analyzing second website information to determine whether the web page indicated by a target address can be accessed through the target website, or whether web page content exists in the web page indicated by the target address, where the second website information is the information crawled when the crawler executed a first crawl task, the first crawl task is the crawl task immediately preceding the current crawl task, and the target address is any website address in the second website information,
where, if the determination is affirmative, the step of crawling the home page of the target website is performed;
and if the determination is negative, the association information associated with the target address is deleted from the data server.
With reference to the second possible implementation of the first aspect, an embodiment of the present invention provides a fifth possible implementation of the first aspect, in which obtaining the target crawl time comprises:
configuring the Java scheduler Quartz in advance to set the crawl times of the crawler, where Quartz is used to trigger the crawler on a schedule to execute crawl tasks;
extracting the target crawl time from the crawl times.
With reference to the first aspect, an embodiment of the present invention provides a sixth possible implementation of the first aspect, in which performing word segmentation on the search information entered by the user to obtain the search keywords comprises:
performing word segmentation on the search information entered by the user with the IKAnalyzer tokenizer to obtain the search keywords.
With reference to the first aspect, an embodiment of the present invention provides a seventh possible implementation of the first aspect, in which looking up, according to the association information, the website content that matches the search keywords comprises: looking up the search keywords among the search terms of the association information; and determining, according to the degree of matching between the search terms and the search keywords, the website information associated with the search terms;
and pushing the website content to the user comprises: pushing the website content in the website information to the user according to the degree of matching.
In a second aspect, an embodiment of the present invention further provides a device for improving multi-site search keyword accuracy, comprising:
a first acquisition module, configured to obtain association information between website information of a target website and preset search terms, where the website information is the latest website information of the target website at the current time and includes website content and website addresses;
a word segmentation module, configured to perform word segmentation on the search information entered by a user to obtain search keywords, where the search information is the information used to search the data of the target website;
a push module, configured to look up, according to the association information, the website content that matches the search keywords, and to push the website content to the user.
In a third aspect, an embodiment of the present invention further provides a computer-readable medium storing non-volatile program code executable by a processor, the program code causing the processor to perform the method for improving multi-site search keyword accuracy described in the first aspect.
The embodiments of the present invention bring the following beneficial effects. Association information between website information of a target website and preset search terms is obtained, where the website information is the latest website information of the target website at the current time and includes website content and website addresses; word segmentation is performed on the search information entered by a user to obtain search keywords, where the search information is the information used to search the data of the target website; and the website content matching the search keywords is looked up according to the association information and pushed to the user. In the embodiments of the present invention, the website information is the latest website information of the target website at the current time, so the association information is likewise the latest association information at the current time. The real-time nature of the association information ensures its accuracy, which alleviates the technical problem of poor accuracy that arises when prior-art search methods are used to find web page content matching a search keyword.
Other features and advantages of the present invention will be set forth in the description that follows, and in part will become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention are realized and attained by the structure particularly pointed out in the description, the claims, and the accompanying drawings.
To make the above objects, features, and advantages of the present invention clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
To illustrate the specific embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the specific embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for improving multi-site search keyword accuracy provided by Embodiment 1 of the present invention;
Fig. 2 is a flowchart of the method for crawling the home page of the target website provided by Embodiment 1 of the present invention;
Fig. 3 is a schematic diagram of a device for improving multi-site search keyword accuracy provided by Embodiment 2 of the present invention;
Fig. 4 is a schematic diagram of another device for improving multi-site search keyword accuracy provided by Embodiment 2 of the present invention.
Reference numerals: 100 - first acquisition module; 200 - word segmentation module; 300 - push module; 400 - second acquisition module; 500 - crawl module; 600 - establishing module; 700 - storage module.
Detailed description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Internet content is not only wide-ranging but also updated very quickly. As a result, after a user enters a search keyword, the desired content may not be found, the content found may be unrelated to the search keyword, or multiple duplicates may be returned, so finding web page content that matches the search keyword often suffers from poor accuracy. On this basis, the embodiments of the present invention provide a method and device for improving multi-site search keyword accuracy, which can alleviate the technical problem of poor accuracy that arises when prior-art search methods are used to find web page content matching a search keyword.
Embodiment 1
Embodiment 1 of the present invention provides a method for improving multi-site search keyword accuracy. As shown in Fig. 1, the method comprises the following steps:
Step S102: obtain association information between website information of a target website and preset search terms, where the website information is the latest website information of the target website at the current time and includes website content and website addresses.
Specifically, the target website may comprise a single website or multiple websites.
Further, because the information on many websites is updated in real time, and the website information is the latest website information of the target website at the current time, the website information is real-time website information.
Step S104: perform word segmentation on the search information entered by the user to obtain search keywords, where the search information is the information used to search the data of the target website.
Specifically, the search information entered by the user is usually a character string, and the search keywords are obtained by performing word segmentation on that string.
Step S106: look up, according to the association information, the website content that matches the search keywords, and push the website content to the user.
It should be noted that steps S102 to S106 may be carried out by an execution device. The execution device may be located between the company intranet and the target website (the target website is on the external network relative to the company). The execution device communicates with the external network to obtain the association information between the website information of the target website and the preset search terms, and saves that association information. In addition, word segmentation rules for segmenting the search information entered by the user are preset in the execution device. When a user on the company intranet wants to search the data of the target website, the execution device communicates with the client on the intranet to obtain the search information entered by the user, retrieves the saved association information, looks up the website content matching the search keywords according to the association information, and pushes the website content to the user.
It should be emphasized that the website information is the latest website information of the target website at the current time, so the association information is likewise the latest association information at the current time. The real-time nature of the association information ensures its accuracy, which alleviates the technical problem of poor accuracy that arises when prior-art search methods are used to find web page content matching a search keyword.
An optional implementation of this embodiment describes in detail how the execution device communicates with the external network to obtain the association information between the website information of the target website and the preset search terms. It comprises the following steps:
before the association information between the website information of the target website and the preset search terms is obtained, obtaining a target crawl time;
at the target crawl time, controlling the crawler to execute the current crawl task so as to crawl the target website and obtain first website information;
determining the preset search terms from the website content included in the first website information, and establishing the association information between the preset search terms and the first website information;
storing the association information in a data server.
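The crawl-then-associate steps above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the patented implementation: the `Page` record, the `extract_terms` helper (a naive lowercase word split standing in for whatever rule actually determines the preset search terms), and the in-memory dictionary standing in for the data server are all hypothetical names introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass
import re


@dataclass(frozen=True)
class Page:
    """One entry of the first website information (hypothetical structure)."""
    address: str
    content: str


def extract_terms(content: str) -> set[str]:
    # Assumption: the preset search terms are the lowercase words of the content.
    return set(re.findall(r"[a-z0-9]+", content.lower()))


def build_association(pages: list[Page]) -> dict[str, set[Page]]:
    """Map each preset search term to the pages whose content produced it."""
    association: dict[str, set[Page]] = defaultdict(set)
    for page in pages:
        for term in extract_terms(page.content):
            association[term].add(page)
    return dict(association)


pages = [
    Page("http://example.com/a", "Web security news"),
    Page("http://example.com/b", "Security patch released"),
]
data_server = build_association(pages)  # in-memory stand-in for the data server
```

Here `data_server["security"]` holds both pages, while `data_server["patch"]` holds only the second one, so a later keyword lookup can go straight from a search term to the associated website information.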
Specifically, web page entity information, such as the address depth of each website and the information used to judge whether an address belongs to the target website, may also be stored in the data server, so that subsequent crawl tasks can crawl more efficiently.
It should be noted that the association information stored in the data server covers two cases. In the first case, it is the association between website content and preset search terms; in the second case, it is the association between website addresses and preset search terms. In the first case, after the user enters search information, the website content matching the search keywords is looked up directly in the association information and pushed to the user. In the second case, after the user enters search information, the website address matching the search keywords is looked up in the association information, the website content of the web page indicated by that address is then retrieved, and the website content is pushed to the user.
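The difference between the two storage cases can be sketched as below. The function name `find_content`, the `stores_content` flag, and the `fetch` callback are assumptions made for illustration; the source does not specify how the second lookup step is performed.

```python
from typing import Callable, Optional


def find_content(
    keyword: str,
    association: dict[str, str],
    stores_content: bool,
    fetch: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Return the website content associated with a keyword, or None."""
    value = association.get(keyword)
    if value is None:
        return None
    if stores_content:
        # Case 1: the association maps search terms directly to content.
        return value
    # Case 2: the association maps search terms to addresses, so the
    # content of the indicated page is retrieved in a second step.
    if fetch is None:
        raise ValueError("an address-based association needs a fetch function")
    return fetch(value)


content_by_term = {"security": "Security patch released"}
address_by_term = {"security": "http://example.com/b"}
page_store = {"http://example.com/b": "Security patch released"}  # fake pages

direct = find_content("security", content_by_term, stores_content=True)
via_address = find_content(
    "security", address_by_term, stores_content=False, fetch=page_store.get
)
```

Both calls return the same content; the second case simply trades a smaller stored association for an extra retrieval at query time.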
The way the crawler is controlled at the target crawl time to execute the current crawl task, and the way the target crawl time is obtained, are described in another optional implementation of this embodiment, as follows:
The Java scheduler Quartz is configured in advance to set the crawl times of the crawler, where Quartz is used to trigger the crawler on a schedule to execute crawl tasks; the target crawl time is then extracted from the crawl times.
It should be noted that Quartz holds the preset times at which the crawler is triggered to execute crawl tasks, and the target crawl time of the current crawl task is the preset time that is closest to, and not later than, the current time.
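The selection rule for the target crawl time can be illustrated in isolation. The sketch below is plain Python and does not use or imitate the Quartz API; the function name `target_crawl_time` and the example schedule (four arbitrary times on one day) are assumptions.

```python
from datetime import datetime
from typing import Optional


def target_crawl_time(
    preset_times: list[datetime], now: datetime
) -> Optional[datetime]:
    """Return the latest preset time that is not after `now`, if any."""
    past = [t for t in preset_times if t <= now]
    return max(past) if past else None


# Hypothetical schedule: crawl at 00:00, 06:00, 12:00 and 18:00.
preset = [datetime(2017, 8, 24, hour) for hour in (0, 6, 12, 18)]
picked = target_crawl_time(preset, datetime(2017, 8, 24, 13, 30))
```

With the current time at 13:30, the 12:00 slot is selected. If the current time precedes every preset time, the function returns `None`, meaning no crawl is due yet.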
In another optional implementation of this embodiment, controlling the crawler at the target crawl time to execute the current crawl task so as to crawl the target website and obtain the first website information comprises:
When the current crawl task is executed, the home page of the target website is crawled to obtain the home page content of the target website and the hyperlink interfaces contained in the home page information, for example href interfaces and src interfaces.
Each hyperlink interface is analyzed to determine whether it is a target hyperlink interface, where a target hyperlink interface is one that has not yet been crawled, is a valid hyperlink interface, and whose corresponding web page content includes preset web page content. Here, the preset web page content is the web page content that is wanted in advance; the hyperlink interface of a web page whose content is not of interest is not a target hyperlink interface.
When a hyperlink interface is determined to be a target hyperlink interface, the web page corresponding to that hyperlink interface is traversed to obtain the website content of the target hyperlink interface;
The website content and website address of each target hyperlink interface are taken as the first website information.
It should be noted that this embodiment describes how the crawler crawls the home page of a website. For websites of different depths, the crawler can crawl deeper pages in the same way.
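The hyperlink extraction and filtering described above can be sketched with the Python standard library. Several points are assumptions rather than details from the source: "valid" is interpreted as an http(s) URL on the same host as the home page, the check that the linked page contains the preset web page content is omitted because it requires fetching the page, and the names `LinkCollector` and `target_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the values of href and src attributes from a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def target_links(base_url: str, html: str, crawled: set[str]) -> list[str]:
    """Return absolute same-site http(s) links that have not been crawled."""
    collector = LinkCollector()
    collector.feed(html)
    site = urlparse(base_url).netloc
    result: list[str] = []
    for raw in collector.links:
        url = urljoin(base_url, raw)
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            continue  # not a web link (for example mailto: or javascript:)
        if parsed.netloc != site or url in crawled or url in result:
            continue  # off-site, already crawled, or a duplicate
        result.append(url)
    return result


home = (
    '<a href="/news">News</a><img src="logo.png">'
    '<a href="mailto:x@y.z">Mail</a><a href="/old">Old</a>'
)
links = target_links(
    "http://example.com/", home, crawled={"http://example.com/old"}
)
```

In this example only `/news` and `logo.png` survive the filter: the `mailto:` link is not a web link and `/old` was crawled earlier. Applying the same function to each returned page would extend the crawl to deeper levels.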
In another optional implementation of this embodiment, as shown in Fig. 2, crawling the home page of the target website comprises the following steps:
Step S201: determine whether this is the first time a crawl task is executed for the target website; if not, perform step S202; if so, perform step S203;
Step S202: analyze the second website information to determine whether the web page indicated by the target address can be accessed through the target website (that is, whether the web page at the target address exists), or whether web page content exists in the web page indicated by the target address (that is, whether the web page information at the target address exists). The second website information is the information crawled when the crawler executed the first crawl task, the first crawl task is the crawl task immediately preceding the current crawl task, and the target address is any website address in the second website information. If the determination is affirmative, perform step S203; if it is negative, perform step S204;
Step S203: crawl the home page of the target website to obtain the home page content of the target website and the hyperlink interfaces contained in the home page information;
Step S204: delete the association information associated with the target address from the data server, then return to and continue with step S203.
It should be noted that, in this embodiment, the target address is any website address in the second website information, and the second website information is the information crawled when the crawler executed the first crawl task. The above steps verify the website information obtained in the previous crawl. This avoids the situation in which the information for a target address appears in the search results but the related website content cannot be retrieved after the link is clicked, and thereby avoids the resulting search errors.
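The verification in steps S201, S202, and S204 can be sketched as follows. The association is assumed here to map search terms to sets of addresses, and `page_has_content` is a hypothetical callback that would, in practice, request the page and check for content; neither detail comes from the source.

```python
from typing import Callable


def prune_stale(
    first_run: bool,
    previous_addresses: list[str],
    association: dict[str, set[str]],
    page_has_content: Callable[[str], bool],
) -> list[str]:
    """Remove association entries whose address no longer serves content."""
    removed: list[str] = []
    if first_run:
        return removed  # S201: nothing to verify on the first crawl
    for address in previous_addresses:  # S202: check each earlier address
        if page_has_content(address):
            continue
        for addresses in association.values():  # S204: delete stale entries
            addresses.discard(address)
        removed.append(address)
    # Drop search terms that no longer point at any address.
    for term in [t for t, a in association.items() if not a]:
        del association[term]
    return removed  # the caller then crawls the home page (S203)


association = {
    "news": {"http://example.com/news", "http://example.com/gone"},
    "old": {"http://example.com/gone"},
}
live = {"http://example.com/news"}
removed = prune_stale(
    False,
    ["http://example.com/news", "http://example.com/gone"],
    association,
    lambda address: address in live,
)
```

After the call, the address that no longer resolves has disappeared from the association, so it cannot surface as a dead link in later search results.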
In another optional implementation of this embodiment, before the association information is stored in the data server, the method for improving multi-site search keyword accuracy further comprises:
determining whether this is the first time the data server stores association information;
if so, clearing the data stored in the data server, so as to avoid leaving in the data server association information obtained from earlier crawls of websites other than the target website, or other dirty data.
In another optional implementation of this embodiment, performing word segmentation on the search information entered by the user to obtain the search keywords comprises:
performing word segmentation on the search information entered by the user with the IKAnalyzer tokenizer to obtain the search keywords.
Specifically, the common segmentation methods are first managed through a general segmentation interface. The result returned by word segmentation needs to cover several cases, including: first, the result is returned as a Map of key-value pairs; second, the result is returned as a Set of String values.
In addition, when segmenting with the IKAnalyzer tokenizer, two segmentation modes are available: smart segmentation and fine-grained segmentation. The string can thus be segmented in different ways as needed.
Furthermore, the IKAnalyzer dictionary can be updated promptly so that segmentation achieves better results.
This embodiment uses the IKAnalyzer tokenizer, which is based on text matching. It does not require a large investment of manual effort in training and annotation, it supports custom dictionaries so that domain-specific words can be added easily, and it can produce results at multiple granularities.
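To make the idea of dictionary-based segmentation at two granularities concrete, the sketch below implements a simple matcher in plain Python. It is not IKAnalyzer and does not reproduce its algorithm or API: the coarse mode here is ordinary forward maximum matching, the fine-grained mode emits every dictionary word found at each position, and the function name `segment` and the sample dictionary are assumptions.

```python
def segment(
    text: str, dictionary: set[str], fine_grained: bool = False
) -> list[str]:
    """Split text into dictionary words by plain text matching."""
    max_len = max(map(len, dictionary), default=1)
    tokens: list[str] = []
    i = 0
    while i < len(text):
        # Candidate words starting at position i, longest first.
        matches = [
            text[i:j]
            for j in range(min(len(text), i + max_len), i, -1)
            if text[i:j] in dictionary
        ]
        if fine_grained:
            tokens.extend(matches)  # keep every dictionary word found here
            i += 1
        elif matches:
            tokens.append(matches[0])  # keep only the longest word
            i += len(matches[0])
        else:
            i += 1  # skip a character that starts no dictionary word
    return tokens


# Sample dictionary: "website", "search", "keyword", "key", "word".
dictionary = {"网站", "搜索", "关键词", "关键", "词"}
coarse = segment("网站搜索关键词", dictionary)
fine = segment("网站搜索关键词", dictionary, fine_grained=True)

dictionary.add("多站点")  # a custom, domain-specific word ("multi-site")
custom = segment("多站点搜索", dictionary)
```

The coarse mode yields three tokens, while the fine-grained mode additionally yields the shorter words nested inside "关键词". Adding "多站点" to the dictionary is enough for it to be recognized as a single token, with no training step. Wrapping a result in `set(...)` or `collections.Counter(...)` would give Set-style and Map-style return values like those mentioned above.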
In another optional implementation of this embodiment, looking up, according to the association information, the website content that matches the search keywords comprises: looking up the search keywords among the search terms of the association information; and determining, according to the degree of matching between the search terms and the search keywords, the website information associated with the search terms;
Pushing the website content to the user comprises: pushing the website content in the website information to the user according to the degree of matching.
Specifically, this can be implemented with a search server such as Solr. Solr is a standalone data server that, based on the search information entered by the user, uses an index to find and return results quickly. The user can also submit a search request containing the search information through an HTTP GET request.
When the search keywords consist of multiple terms, they can be joined with OR, and the client then matches them against the preset search terms to obtain the website content. When the website content in the website information is pushed to the user, the selected website content may be sorted by the number of search keywords contained in the preset search terms corresponding to each item of website content, and the search results displayed on a page in that order.
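The OR matching and the ranking by number of matched keywords can be sketched without a search server. This is an in-memory illustration only; it does not call Solr, and the names `rank_results` and `terms_by_content` as well as the tie-breaking rule are assumptions.

```python
def rank_results(
    keywords: set[str], terms_by_content: dict[str, set[str]]
) -> list[tuple[str, int]]:
    """OR-match keywords against each item's preset terms and rank by hits."""
    scored = [
        (content, len(keywords & terms))
        for content, terms in terms_by_content.items()
    ]
    hits = [item for item in scored if item[1] > 0]  # OR: at least one keyword
    # Most matched keywords first; ties are broken by content for a stable order.
    return sorted(hits, key=lambda item: (-item[1], item[0]))


terms_by_content = {
    "Page A": {"web", "security", "news"},
    "Page B": {"security", "patch"},
    "Page C": {"weather"},
}
keywords = {"security", "news"}
query = " OR ".join(sorted(keywords))  # the kind of string a client might send
results = rank_results(keywords, terms_by_content)
```

"Page A" matches both keywords and is listed first, "Page B" matches one, and "Page C" matches none and is dropped. A real deployment would leave scoring to the search server; the point here is only the ordering rule.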
Embodiment 2
Embodiment 2 of the present invention provides a device for improving multi-site search keyword accuracy. As shown in Fig. 3, the device comprises:
a first acquisition module 100, configured to obtain association information between website information of a target website and preset search terms, where the website information is the latest website information of the target website at the current time and includes website content and website addresses;
a word segmentation module 200, configured to perform word segmentation on the search information entered by a user to obtain search keywords, where the search information is the information used to search the data of the target website;
a push module 300, configured to look up, according to the association information, the website content that matches the search keywords, and to push the website content to the user.
In this embodiment, the first acquisition module 100 obtains the association information between the website information of the target website and the preset search terms, where the website information is the latest website information of the target website at the current time and includes website content and website addresses; the word segmentation module 200 performs word segmentation on the search information entered by the user to obtain search keywords, where the search information is the information used to search the data of the target website; and the push module 300 looks up the website content matching the search keywords according to the association information and pushes the website content to the user. Because the website information is the latest website information of the target website at the current time, the association information is likewise the latest association information at the current time. The real-time nature of the association information ensures its accuracy, which alleviates the technical problem of poor accuracy that arises when prior-art search methods are used to find web page content matching a search keyword.
In another optional implementation of this embodiment, as shown in Figure 4, the device for improving multi-site search keyword accuracy further includes:
A second acquisition module 400, configured to obtain a target crawl time;
A crawling module 500, configured to control the crawler to perform the current crawl task at the target crawl time, so as to crawl the target website and obtain first site information;
An establishing module 600, configured to determine preset search terms according to the website content included in the first site information, and to establish association information between the preset search terms and the first site information;
A storage module 700, configured to store the association information in a data server.
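What the establishing module produces can be pictured as an inverted index from preset search terms to the pages they came from. The sketch below is only an illustration under stated assumptions: the text does not say how preset search terms are derived from the website content, so lowercase whitespace tokens are used as a stand-in, and `build_associations` and the `{"url", "content"}` page layout are hypothetical names.

```python
from collections import defaultdict


def build_associations(first_site_info):
    """Map each preset search term to the URLs whose content contains it.

    first_site_info is a hypothetical list of {"url": ..., "content": ...}
    dictionaries representing the crawled first site information.
    """
    index = defaultdict(list)
    for page in first_site_info:
        # Stand-in term extraction: unique lowercase whitespace tokens.
        for term in sorted(set(page["content"].lower().split())):
            index[term].append(page["url"])
    return dict(index)
```

The resulting dictionary is the kind of structure a storage module could then persist in the data server.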
In another optional implementation of this embodiment, the crawling module includes:
A crawling unit, configured to crawl the homepage of the target website when performing the current crawl task, and to obtain the homepage content of the target website and the hyperlink interfaces contained in the homepage information of the target website;
A determining unit, configured to analyze each hyperlink interface and determine whether it is a target hyperlink interface, where a target hyperlink interface is an interface that has not yet been crawled, is a valid hyperlink interface, and whose corresponding web page content contains preset web page content;
A traversal unit, configured to traverse the web page corresponding to the hyperlink interface when it is determined to be a target hyperlink interface, and to obtain the website content of the target hyperlink interface;
A determining unit, configured to take the website content and website address of each target hyperlink interface as the first site information.
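The three conditions on a target hyperlink interface (not yet crawled, valid, and containing preset content) can be sketched with the Python standard library as below. This is a one-level illustration under assumptions: `fetch` is a hypothetical callable that returns a page's HTML or `None` for a broken link, and `required_text` stands in for the preset web page content.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl_site(home_url, fetch, required_text):
    """Crawl the homepage, then keep only pages behind target hyperlinks."""
    visited = {home_url}
    site_info = []
    collector = LinkCollector()
    collector.feed(fetch(home_url))
    for href in collector.links:
        url = urljoin(home_url, href)
        if url in visited:  # condition 1: not crawled before
            continue
        visited.add(url)
        html = fetch(url)
        if html is None:  # condition 2: the hyperlink must be valid
            continue
        if required_text not in html:  # condition 3: preset content present
            continue
        site_info.append({"url": url, "content": html})
    return site_info
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client; a real crawler would also need to follow links deeper than one level.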
In another optional implementation of this embodiment, the crawling unit is further configured to:
Determine whether a crawl task is being performed on the target website for the first time;
If not, analyze second site information to determine whether the web page indicated by a target URL can be accessed through the target website, or whether web page content exists in the web page indicated by the target URL. The second site information is the information crawled when the crawler performed a first crawl task, the first crawl task is the crawl task immediately preceding the current crawl task, and the target URL is any website address in the second site information;
If the determination is yes, crawl the homepage of the target website to obtain the homepage content of the target website and the hyperlink interfaces contained in the homepage information;
If the determination is no, delete the association information associated with the target URL from the data server.
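The re-validation step can be sketched as follows: each address from the previous crawl is checked again, and the association information of any address that is unreachable or empty is deleted. The names are assumptions for illustration: `fetch` returns HTML or `None`, and `store` is a plain dictionary keyed by URL standing in for the data server.

```python
def prune_stale_associations(second_site_info, fetch, store):
    """Delete stored associations whose page is unreachable or empty.

    Returns the list of URLs whose associations were removed.
    """
    removed = []
    for page in second_site_info:
        url = page["url"]
        html = fetch(url)
        # A page counts as stale if it cannot be accessed or has no content.
        if html is None or not html.strip():
            store.pop(url, None)
            removed.append(url)
    return removed
```

Pruning before the new crawl keeps dead links from the previous crawl out of later search results.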
In another optional implementation of this embodiment, the device for improving multi-site search keyword accuracy further includes:
A judging module, configured to determine whether the data server is storing the association information for the first time;
A clearing module, configured to clear the data stored in the data server when it is determined that the data server is storing the association information for the first time.
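A minimal sketch of the clear-on-first-store behavior is shown below. The class `AssociationStore` and its in-memory dictionary are hypothetical stand-ins for the data server; how the real server detects a first store is not specified in the text.

```python
class AssociationStore:
    """In-memory stand-in for the data server."""

    def __init__(self, existing=None):
        self.data = dict(existing or {})
        self._has_stored = False

    def store(self, associations):
        if not self._has_stored:
            # First store: clear whatever the server held beforehand.
            self.data.clear()
            self._has_stored = True
        self.data.update(associations)
```

Later calls to `store` add to the existing data rather than clearing it, so only leftover data from before the first store is discarded.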
In another optional implementation of this embodiment, the second acquisition module is configured to:
Configure the Java scheduler Quartz in advance to set the crawl times of the crawler, where Quartz is used to trigger the crawler on a schedule to perform crawl tasks;
Extract the target crawl time from the crawl times.
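Extracting a target crawl time from a set of configured crawl times can be illustrated as below. This sketch does not use Quartz and does not reproduce its API; it assumes, purely for illustration, that the crawl times are daily clock times and that the target crawl time is the next one due after `now`.

```python
from datetime import datetime, time, timedelta


def next_crawl_time(crawl_times, now):
    """Return the earliest configured daily crawl time strictly after now."""
    candidates = []
    for clock_time in crawl_times:
        candidate = datetime.combine(now.date(), clock_time)
        if candidate <= now:
            # Today's slot has already passed; use the same slot tomorrow.
            candidate += timedelta(days=1)
        candidates.append(candidate)
    return min(candidates)
```

For example, with crawl times of 02:00 and 14:00, a call made at 09:30 selects 14:00 on the same day, and a call made at 15:00 selects 02:00 on the following day.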
In another optional implementation of this embodiment, the word segmentation module is configured to:
Perform word segmentation on the search information input by the user using the IKAnalyzer segmenter, to obtain the search keywords.
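To show what word segmentation does to a query, the sketch below uses simple dictionary-based forward maximum matching. It is a stand-in only and is not IKAnalyzer's actual algorithm or API; the function `segment`, the `dictionary` set, and the `max_len` parameter are assumptions introduced here.

```python
def segment(text, dictionary, max_len=4):
    """Split text into tokens by forward maximum matching.

    At each position the longest dictionary word (up to max_len characters)
    is taken; if none matches, the single character becomes a token.
    """
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        match = text[i]
        for size in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + size] in dictionary:
                match = text[i:i + size]
                break
        tokens.append(match)
        i += len(match)
    return tokens
```

With a dictionary containing 网站, 搜索, and 关键词, the query 网站搜索关键词 is split into those three terms, which can then be used as search keywords.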
In another optional implementation of this embodiment, the push module is configured to:
Look up the search keywords among the search terms of the association information, and determine the site information associated with a search term according to the degree of matching between that search term and the search keywords;
Push the website content in the site information to the user according to the degree of matching.
Embodiment three
An embodiment of the present invention provides a computer-readable medium having non-volatile program code executable by a processor, where the program code causes the processor to perform the method for improving multi-site search keyword accuracy of the foregoing method embodiment. Because the site information is the latest website information of the target website at the current time, the association information is likewise the latest as of the current time. This timeliness ensures the accuracy of the association information, which alleviates the technical problem of poor accuracy that arises when prior-art search methods are used to find web page content matching a search keyword.
The computer program product of the method and device for improving multi-site search keyword accuracy provided by the embodiments of the present invention includes a computer-readable storage medium storing program code. The instructions included in the program code can be used to perform the methods described in the foregoing method embodiments; for the specific implementation, refer to the method embodiments, which are not repeated here.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the system and device described above may be understood by reference to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In addition, in the description of the embodiments of the present invention, unless otherwise expressly specified and limited, the terms "mounted", "coupled", and "connected" should be interpreted broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be a direct connection, an indirect connection through an intermediary, or internal communication between two elements. Those of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific circumstances.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random-access memory (RAM), a magnetic disk, or an optical disc.
In the description of the present invention, it should be noted that orientation or positional terms such as "center", "up", "down", "left", "right", "vertical", "horizontal", "inner", and "outer" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they should therefore not be construed as limiting the present invention.
In addition, the terms "first", "second", and "third" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.
Finally, it should be noted that the embodiments described above are only specific implementations of the present invention, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited to them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that anyone skilled in the art may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for improving multi-site search keyword accuracy, characterized by comprising:
Obtaining association information between the site information of a target website and preset search terms, wherein the site information is the latest website information of the target website at the current time, and the site information includes website content and website addresses;
Performing word segmentation on the search information input by a user to obtain search keywords, wherein the search information is information used to search the data of the target website;
Looking up the website content matching the search keywords according to the association information, and pushing the website content to the user.
2. The method according to claim 1, characterized in that, before obtaining the association information between the site information of the target website and the preset search terms, the method further comprises:
Obtaining a target crawl time;
Controlling a crawler to perform a current crawl task at the target crawl time, so as to crawl the target website and obtain first site information;
Determining preset search terms according to the website content included in the first site information, and establishing association information between the preset search terms and the first site information;
Storing the association information in a data server.
3. The method according to claim 2, characterized in that controlling the crawler to perform the current crawl task at the target crawl time, so as to crawl the target website and obtain the first site information, comprises:
When performing the current crawl task, crawling the homepage of the target website to obtain the homepage content of the target website and the hyperlink interfaces contained in the homepage information of the target website;
Analyzing each hyperlink interface to determine whether the hyperlink interface is a target hyperlink interface, wherein a target hyperlink interface is an interface that has not yet been crawled, is a valid hyperlink interface, and whose corresponding web page content contains preset web page content;
When the hyperlink interface is determined to be a target hyperlink interface, traversing the web page corresponding to the hyperlink interface to obtain the website content of the target hyperlink interface;
Taking the website content and website address of each target hyperlink interface as the first site information.
4. The method according to claim 3, characterized in that crawling the homepage of the target website comprises:
Determining whether a crawl task is being performed on the target website for the first time;
If not, analyzing second site information to determine whether the web page indicated by a target URL can be accessed through the target website, or whether web page content exists in the web page indicated by the target URL, wherein the second site information is the information crawled when the crawler performed a first crawl task, the first crawl task is the crawl task immediately preceding the current crawl task, and the target URL is any website address in the second site information;
Wherein, if the determination is yes, crawling the homepage of the target website to obtain the homepage content of the target website and the hyperlink interfaces contained in the homepage information;
If the determination is no, deleting the association information associated with the target URL from the data server.
5. The method according to claim 2, characterized in that, before storing the association information in the data server, the method further comprises:
Determining whether the data server is storing the association information for the first time;
When it is determined that the data server is storing the association information for the first time, clearing the data stored in the data server.
6. The method according to claim 2, characterized in that obtaining the target crawl time comprises:
Configuring the Java scheduler Quartz in advance to set the crawl times of the crawler, wherein Quartz is used to trigger the crawler on a schedule to perform crawl tasks;
Extracting the target crawl time from the crawl times.
7. The method according to claim 1, characterized in that performing word segmentation on the search information input by the user to obtain the search keywords comprises:
Performing word segmentation on the search information input by the user using the IKAnalyzer segmenter, to obtain the search keywords.
8. The method according to claim 1, characterized in that:
Looking up the website content matching the search keywords according to the association information comprises: looking up the search keywords among the search terms of the association information, and determining the site information associated with a search term according to the degree of matching between the search term and the search keywords;
Pushing the website content to the user comprises: pushing the website content in the site information to the user according to the degree of matching.
9. A device for improving multi-site search keyword accuracy, characterized by comprising:
A first acquisition module, configured to obtain association information between the site information of a target website and preset search terms, wherein the site information is the latest website information of the target website at the current time, and the site information includes website content and website addresses;
A word segmentation module, configured to perform word segmentation on the search information input by a user to obtain search keywords, wherein the search information is information used to search the data of the target website;
A push module, configured to look up the website content matching the search keywords according to the association information, and to push the website content to the user.
10. A computer-readable medium having non-volatile program code executable by a processor, characterized in that the program code causes the processor to perform the method for improving multi-site search keyword accuracy according to any one of claims 1-8.
CN201710732432.5A 2017-08-23 2017-08-23 Method and device for improving accuracy of multi-site search keywords Active CN107301253B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710732432.5A CN107301253B (en) 2017-08-23 2017-08-23 Method and device for improving accuracy of multi-site search keywords

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710732432.5A CN107301253B (en) 2017-08-23 2017-08-23 Method and device for improving accuracy of multi-site search keywords

Publications (2)

Publication Number Publication Date
CN107301253A true CN107301253A (en) 2017-10-27
CN107301253B CN107301253B (en) 2020-02-04

Family

ID=60132524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710732432.5A Active CN107301253B (en) 2017-08-23 2017-08-23 Method and device for improving accuracy of multi-site search keywords

Country Status (1)

Country Link
CN (1) CN107301253B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108519984A (en) * 2018-02-07 2018-09-11 平安科技(深圳)有限公司 weather data processing method, server and computer readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102135967A (en) * 2010-01-27 2011-07-27 华为技术有限公司 Webpage keywords extracting method, device and system
CN102446225A (en) * 2012-01-11 2012-05-09 深圳市爱咕科技有限公司 Real-time search method, device and system
CN102456057A (en) * 2010-11-01 2012-05-16 阿里巴巴集团控股有限公司 Retrieval method, device and server based on online trading platform
CN102591992A (en) * 2012-02-15 2012-07-18 苏州亚新丰信息技术有限公司 Webpage classification identifying system and method based on vertical search and focused crawler technology
CN103778122A (en) * 2012-10-17 2014-05-07 腾讯科技(深圳)有限公司 Searching method and system

Also Published As

Publication number Publication date
CN107301253B (en) 2020-02-04

Similar Documents

Publication Publication Date Title
US10146785B2 (en) Operator-guided application crawling architecture
US8751466B1 (en) Customizable answer engine implemented by user-defined plug-ins
Kumar et al. Keyword query based focused Web crawler
US8386495B1 (en) Augmented resource graph for scoring resources
CN103365924B (en) A kind of method of internet information search, device and terminal
JP4848388B2 (en) How to calculate a score for a search query
US20130290319A1 (en) Performing application searches
CN102760151B (en) Implementation method of open source software acquisition and searching system
CN107145496A (en) The method for being matched image with content item based on keyword
CN108052632B (en) Network information acquisition method and system and enterprise information search system
CN102930059A (en) Method for designing focused crawler
CN103049542A (en) Domain-oriented network information search method
CN105159930A (en) Search keyword pushing method and apparatus
US20230229714A1 (en) Identifying Information Using Referenced Text
CN102385613A (en) Web page positioning method and system
CN107679226B (en) Tourism body constructing method based on theme
CN102270331A (en) Network shopping navigating method based on visual search
CN103530339A (en) Mobile application information push method and device
CN112948547B (en) Logging knowledge graph construction query method, device, equipment and storage medium
CN108572971B (en) Method and device for mining keywords related to search terms
CN102760150A (en) Webpage extraction method based on attribute reproduction and labeled path
CN102222098A (en) Method and system for pre-fetching webpage
CN105022775A (en) Apparatus and method for structuring web page access history
US20160103913A1 (en) Method and system for calculating a degree of linkage for webpages
CN103942211B (en) A kind of recognition methods of text page and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 310051 No. 188 Lianhui Street, Xixing Street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant after: Hangzhou Annan information technology Limited by Share Ltd

Address before: Zhejiang Zhongcai Building No. 68 Binjiang District road Hangzhou City, Zhejiang Province, the 310051 and 15 layer

Applicant before: Dbappsecurity Co.,ltd.

GR01 Patent grant