CN107301253B - Method and device for improving accuracy of multi-site search keywords - Google Patents
Method and device for improving accuracy of multi-site search keywords
- Publication number
- CN107301253B (CN201710732432.5A)
- Authority
- CN
- China
- Prior art keywords
- website
- information
- target
- search
- crawling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Abstract
The invention provides a method and a device for improving accuracy of multi-site search keywords, which relate to the field of internet information, and the method comprises the following steps: acquiring associated information between website information of a target website and a preset search word, wherein the website information is latest website information of the target website at the current moment, and the website information comprises website content and a website address; performing word segmentation processing on search information input by a user to obtain search keywords, wherein the search information is information for searching target website data; and searching the website content matched with the search keyword according to the associated information, and pushing the website content to the user. The invention alleviates the technical problem of poor accuracy when searching the webpage content matched with the search keyword by the search method in the prior art.
Description
Technical Field
The invention relates to the technical field of internet information, in particular to a method and a device for improving accuracy of multi-site search keywords.
Background
The Internet has developed rapidly worldwide since its commercial operation began in the mid-1990s. It has penetrated every field of daily life: it helps people follow current affairs in time, acquire the latest knowledge and information, broaden their horizons, and enrich their everyday entertainment.
However, while enjoying the convenience of the internet, people also find its content complicated. Internet content not only covers a wide range of subjects but is also updated quickly: content is modified, added, and deleted at every moment. Moreover, as content changes every day, the internet accumulates much duplicate content.
Given this situation, with existing search technology, after a user enters a search keyword in the search box of a web page, the following may occur: the desired content cannot be found, the content found is unrelated to the search keyword, or many duplicate results are returned. Prior-art search methods therefore often suffer from poor accuracy when searching for web page content that matches the search keyword.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for improving accuracy of a multi-site search keyword, so as to alleviate the technical problem of poor accuracy when searching for web content matched with the search keyword by using a search method in the prior art.
In a first aspect, an embodiment of the present invention provides a method for improving accuracy of a multi-site search keyword, including:
acquiring associated information between website information of a target website and a preset search word, wherein the website information is the latest website information of the target website at the current moment, and the website information comprises website content and a website address;
performing word segmentation processing on search information input by a user to obtain search keywords, wherein the search information is information for searching the target website data;
and searching the website content matched with the search keyword according to the associated information, and pushing the website content to the user.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation manner of the first aspect, where before acquiring association information between website information of a target website and a preset search term, the method further includes:
acquiring target crawling time;
controlling a crawler to execute a current crawling task at the target crawling time so as to crawl a target website to obtain first website information;
determining a preset search word according to website content included in the first website information, and establishing associated information between the preset search word and the first website information;
and storing the associated information in a data server.
With reference to the first possible implementation manner of the first aspect, an embodiment of the present invention provides a second possible implementation manner of the first aspect, where controlling, at the target crawling time, a crawler to execute a current crawling task to crawl the target website to obtain first website information includes:
when the current crawling task is executed, crawling the home page of the target website to obtain the home page content of the target website and a hyperlink interface contained in the home page information of the target website;
analyzing the hyperlink interface to determine whether the hyperlink interface is a target hyperlink interface, wherein the target hyperlink interface is an interface which is not crawled, the target hyperlink interface is a correct hyperlink interface, and the webpage content corresponding to the target hyperlink interface contains preset webpage content;
traversing the webpage corresponding to the hyperlink interface under the condition that the target hyperlink interface is determined to obtain the website content of the target hyperlink interface;
and taking the website content and the website address of each target hyperlink interface as the first website information.
With reference to the second possible implementation manner of the first aspect, an embodiment of the present invention provides a third possible implementation manner of the first aspect, where crawling the top page of the target website includes:
judging whether the crawling task is being executed on the target website for the first time;
if not, analyzing second website information to determine whether the webpage indicated by a target website address can be accessed through that address, or whether webpage content exists in the webpage indicated by the target website address, wherein the second website information is the information crawled by the crawler when executing a first crawling task, the first crawling task is the crawling task immediately preceding the current crawling task, and the target website address is any website address in the second website information;
if so, crawling the home page of the target website to obtain the home page content of the target website and a hyperlink interface contained in the home page information;
and if not, deleting the associated information associated with the target website address from the data server.
With reference to the first possible implementation manner of the first aspect, an embodiment of the present invention further provides a fourth possible implementation manner of the first aspect, where the method further includes:
judging whether the current crawling task is executed for the first time on the target website;
if not, analyzing second website information to determine whether the webpage indicated by a target website address can be accessed through that address, or whether webpage content exists in the webpage indicated by the target website address, wherein the second website information is the information crawled by the crawler when executing a first crawling task, the first crawling task is the crawling task immediately preceding the current crawling task, and the target website address is any website address in the second website information;
if so, crawling the home page of the target website;
and if not, deleting the associated information associated with the target website address from the data server.
With reference to the second possible implementation manner of the first aspect, an embodiment of the present invention provides a fifth possible implementation manner of the first aspect, where the obtaining a target crawling time includes:
configuring Java's Quartz timer in advance to set the crawling times of the crawler, wherein the Quartz timer is used for triggering the crawler to execute crawling tasks at scheduled times;
and extracting the target crawling time from the crawling times.
With reference to the first aspect, an embodiment of the present invention provides a sixth possible implementation manner of the first aspect, where performing word segmentation processing on search information input by a user to obtain search keywords includes:
performing word segmentation processing on the search information input by the user through an IKAnalyzer word segmenter to obtain the search keywords.
With reference to the first aspect, an embodiment of the present invention provides a seventh possible implementation manner of the first aspect, where finding, according to the association information, website content that matches the search keyword includes: looking up the search keyword among the preset search words of the associated information; determining the website information associated with the preset search words according to the matching degree between the preset search words and the search keyword;
pushing the website content to the user comprises: and pushing the website content in the website information to the user according to the matching degree.
In a second aspect, an embodiment of the present invention further provides an apparatus for improving accuracy of a multi-site search keyword, where the apparatus includes:
a first acquisition module, used for acquiring associated information between website information of a target website and a preset search word, wherein the website information is the latest website information of the target website at the current moment, and the website information comprises website content and a website address;
the word segmentation module is used for carrying out word segmentation on search information input by a user to obtain search keywords, wherein the search information is information for searching the target website data;
and the pushing module is used for searching the website content matched with the search keyword according to the associated information and pushing the website content to the user.
In a third aspect, an embodiment of the present invention further provides a computer-readable medium having a non-volatile program code executable by a processor, where the program code causes the processor to execute the method for improving accuracy of a multi-site search keyword according to the first aspect.
The embodiment of the invention has the following beneficial effects: acquiring associated information between website information of a target website and a preset search word, wherein the website information is latest website information of the target website at the current moment, and the website information comprises website content and a website address; performing word segmentation processing on search information input by a user to obtain search keywords, wherein the search information is information for searching target website data; and searching the website content matched with the search keyword according to the associated information, and pushing the website content to the user. In the embodiment of the invention, the website information is the latest website information of the target website at the current moment, the associated information is also the latest associated information at the current moment, and the accuracy of the associated information is ensured by the real-time property of the associated information, so that the technical problem of poor accuracy when the webpage content matched with the search keyword is searched by the search method in the prior art is solved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of a method for improving accuracy of a multi-site search keyword according to an embodiment of the present invention;
Fig. 2 is a flowchart of a method for crawling a home page of a target website according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of an apparatus for improving accuracy of a multi-site search keyword according to a second embodiment of the present invention;
Fig. 4 is a schematic diagram of another apparatus for improving accuracy of a multi-site search keyword according to a second embodiment of the present invention.
Reference numerals: 100-first acquisition module; 200-word segmentation module; 300-push module; 400-second acquisition module; 500-crawling module; 600-establishing module; 700-storage module.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Internet content not only covers a wide range of subjects but is also updated quickly. As a result, after a user inputs a search keyword, the desired content may not be found, the content found may be unrelated to the search keyword, or many duplicate results may be returned, so searching for webpage content that matches the search keyword often suffers from poor accuracy. Based on this, the method and the device for improving the accuracy of multi-site search keywords provided by the embodiment of the invention can alleviate the technical problem of poor accuracy when a prior-art search method is used to search for webpage content matching the search keyword.
Example one
The method for improving the accuracy of the multi-site search keyword provided by the embodiment of the invention, as shown in Fig. 1, comprises the following steps:
Step S102: acquiring the associated information between the website information of the target website and the preset search word, wherein the website information is the latest website information of the target website at the current moment, and the website information comprises website content and a website address.
Specifically, the target website includes a single website site or a plurality of website sites.
In addition, since the information of some websites is updated in real time, the website information is the latest website information of the target website at the current time, and thus the website information is real-time website information.
Step S104: performing word segmentation processing on the search information input by the user to obtain search keywords, wherein the search information is information for searching target website data.
Specifically, most of the search information input by the user is character strings, and the search keywords are obtained by performing word segmentation processing on the character strings.
Step S106: searching for the website content matched with the search keyword according to the associated information, and pushing the website content to the user.
It should be noted that the steps described in the above steps S102 to S106 may be executed by an execution device, the execution device may be disposed between an intranet of a company and a target website (the target website is an extranet of the company), and the execution device acquires, through communication with the extranet, association information between website information of the target website and a preset search word, and stores the association information. Furthermore, the execution means sets in advance a word segmentation rule for performing word segmentation processing on the search information input by the user. When a user in an intranet of a company searches data of a target website, an execution device acquires search information input by the user through communication with a client in the intranet, acquires pre-stored associated information, searches website content matched with a search keyword according to the associated information, and pushes the website content to the user.
It should be emphasized that the website information is the latest website information of the target website at the current time, the associated information is also the latest associated information at the current time, and the accuracy of the associated information is ensured by the real-time property of the associated information, so that the technical problem of poor accuracy existing when the web page content matched with the search keyword is searched by the search method in the prior art is solved.
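The flow of steps S102 to S106 can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the in-memory `ASSOCIATIONS` dictionary stands in for the stored associated information, the addresses are placeholders, and `segment` is a naive splitter standing in for a real word segmenter.

```python
import re

# Hypothetical stand-in for the stored associated information (step S102):
# preset search word -> list of (website address, website content).
ASSOCIATIONS = {
    "crawler": [("http://example.com/a", "How a crawler works")],
    "search": [("http://example.com/b", "Search basics"),
               ("http://example.com/a", "How a crawler works")],
}


def segment(search_info):
    """Step S104 stand-in: split on non-word characters (not a real segmenter)."""
    return [w for w in re.split(r"\W+", search_info.lower()) if w]


def find_content(keywords, associations):
    """Step S106: collect content whose preset search word equals a keyword."""
    results = []
    for kw in keywords:
        for address, content in associations.get(kw, []):
            if (address, content) not in results:  # skip repeated content
                results.append((address, content))
    return results


def handle_search(search_info, associations=ASSOCIATIONS):
    """Run segmentation and lookup for one user query."""
    return find_content(segment(search_info), associations)
```

Calling `handle_search("Search, crawler!")` returns each matching page once, even when several keywords point to the same page.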
An optional implementation of the embodiment of the present invention details how the execution device obtains, through communication with the external network, the associated information between the website information of the target website and the preset search word. Before the associated information is acquired, the following steps are performed:
acquiring a target crawling time;
controlling a crawler to execute a current crawling task at the target crawling time so as to crawl the target website to obtain first website information;
determining a preset search word according to website content included in the first website information, and establishing associated information between the preset search word and the first website information;
storing the associated information in a data server.
Specifically, the address depth of each website address, together with the entity webpage information used to judge whether an address belongs to the target website, can also be stored in the data server, so that the website can be crawled more efficiently in subsequent crawling tasks.
It should be noted that the association information stored in the data server includes the following two cases: the first case is the association information between the contents of the website and the preset search words, and the second case is the association information between the address of the website and the preset search words. For the first case, after the user inputs the search information, directly searching the website content matched with the search keyword from the associated information, and pushing the website content to the user; for the second case, after the user inputs the search information, the website address matched with the search keyword is searched from the associated information, then the website content of the webpage indicated by the website address is searched, and the website content is pushed to the user.
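The two storage cases can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how preset search words are derived from website content, so the sketch assumes a given list of candidate words and a plain substring test, and `fetch_content` is a hypothetical callable that returns the content of the webpage at an address.

```python
def build_associations(first_site_info, preset_words, store_content=True):
    """Associate each preset search word with the pages that mention it.

    first_site_info: dict of website address -> website content.
    store_content=True yields the first case (word -> website content);
    store_content=False yields the second case (word -> website address).
    """
    associations = {}
    for address, content in first_site_info.items():
        for word in preset_words:
            if word in content:  # assumed matching rule, for illustration only
                value = content if store_content else address
                associations.setdefault(word, []).append(value)
    return associations


def lookup(keyword, associations, stores_content, fetch_content=None):
    """Return website content for a keyword under either storage case."""
    values = associations.get(keyword, [])
    if stores_content:
        return list(values)  # first case: content is stored directly
    # second case: resolve each stored address to its current content
    return [fetch_content(address) for address in values]
```

Both cases return the same content for the same keyword; the second case costs an extra lookup per address at search time.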
In another optional implementation of the embodiment of the present invention, the target crawling time, at which the crawler is controlled to execute the current crawling task, is acquired as follows:
Java's Quartz timer is configured in advance to set the crawling times of the crawler, wherein the Quartz timer triggers the crawler to execute crawling tasks at scheduled times; the target crawling time is then extracted from these crawling times.
It should be noted that the Quartz timer holds preset times for triggering the crawler to execute crawling tasks, and the target crawling time for the current crawling task is the preset time that most recently precedes the current time.
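The selection rule for the target crawling time can be sketched in a few lines of Python. This does not use Quartz; the preset times are passed in as a plain list, and treating a preset time equal to the current time as eligible is an assumption of the sketch.

```python
from datetime import datetime


def target_crawl_time(preset_times, now):
    """Return the preset trigger time closest to, and not later than, `now`.

    Returns None when no preset time has been reached yet.
    """
    past = [t for t in preset_times if t <= now]
    return max(past) if past else None


# Hypothetical schedule: three preset trigger times.
SCHEDULE = [
    datetime(2017, 8, 23, 2, 0),
    datetime(2017, 8, 23, 14, 0),
    datetime(2017, 8, 24, 2, 0),
]
```

With this schedule, a current time of 15:30 on 23 August 2017 selects the 14:00 trigger of the same day.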
In another optional implementation manner of the embodiment of the present invention, controlling a crawler to execute a current crawling task at a target crawling time to crawl a target website to obtain first website information includes:
When the current crawling task is executed, the home page of the target website is crawled to obtain the home page content of the target website and the hyperlink interfaces contained in the home page information, such as href and src interfaces.
Each hyperlink interface is analyzed to determine whether it is a target hyperlink interface, where a target hyperlink interface is one that has not yet been crawled, is a correct hyperlink interface, and whose corresponding webpage content contains preset webpage content. The preset webpage content is the webpage content designated in advance as the content to be obtained; if a webpage contains only content that is not of interest, its hyperlink interface is not a target hyperlink interface.
When a hyperlink interface is determined to be a target hyperlink interface, the webpage corresponding to it is traversed to obtain the website content of the target hyperlink interface.
The website content and the website address of each target hyperlink interface are taken as the first website information.
It should be noted that the embodiment of the present invention describes how the crawler crawls the home page of a website; for websites of different depths, the same approach can be applied when the crawler crawls deeper webpages.
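The home-page crawl described above can be sketched with Python's standard library. Several points are assumptions of the sketch rather than details from the text: `fetch` is a hypothetical callable returning a page's text (or None), a "correct" hyperlink is taken to mean an http or https URL with a host, and "contains preset webpage content" is taken to mean that the page text contains one of the `wanted_terms`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from href and src attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(urljoin(self.base_url, value))


def is_well_formed(url):
    """Assumed reading of a 'correct' hyperlink: http(s) scheme plus a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def crawl_home_page(home_url, fetch, wanted_terms):
    """Return (home page content, first website information as address -> content)."""
    home_html = fetch(home_url)
    extractor = LinkExtractor(home_url)
    extractor.feed(home_html)
    visited = {home_url}
    first_site_info = {}
    for url in extractor.links:
        if url in visited or not is_well_formed(url):
            continue  # already crawled, or not a correct hyperlink
        visited.add(url)
        text = fetch(url)
        if text is not None and any(term in text for term in wanted_terms):
            first_site_info[url] = text  # a target hyperlink interface
    return home_html, first_site_info


# Hypothetical pages used in place of real network access.
PAGES = {
    "http://site.test/": '<a href="/news">n</a><img src="/logo.png">'
                         '<a href="mailto:x@y.z">m</a><a href="/news">dup</a>'
                         '<a href="/about">a</a>',
    "http://site.test/news": "latest policy news",
    "http://site.test/logo.png": "binarydata",
    "http://site.test/about": "about us",
}
```

Calling `crawl_home_page("http://site.test/", PAGES.get, ["policy"])` keeps only the news page: the duplicate link, the mailto link, and the pages without the wanted term are all rejected.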
In another optional implementation manner of the embodiment of the present invention, as shown in fig. 2, crawling the home page of the target website includes the following steps:
Step S201: judging whether the crawling task is being executed on the target website for the first time; if not, step S202 is executed, and if so, step S203 is executed;
Step S202: analyzing second website information to determine whether the webpage indicated by a target website address can be accessed through that address (that is, whether the webpage exists) or whether webpage content exists in the webpage indicated by the target website address (that is, whether the webpage information exists), wherein the second website information is the information crawled by the crawler when executing a first crawling task, the first crawling task is the crawling task immediately preceding the current crawling task, and the target website address is any website address in the second website information; if so, step S203 is executed, and if not, step S204 is executed;
Step S203: crawling the home page of the target website to obtain the home page content of the target website and a hyperlink interface contained in the home page information;
Step S204: deleting the associated information associated with the target website address from the data server, and then returning to continue with step S203.
It should be noted that, in the embodiment of the present invention, the target website address is any website address in the second website information, and the second website information is the information crawled by the crawler when executing the first crawling task. The above steps therefore verify the website information obtained in the previous crawl, which avoids the situation in which a target website address appears in a search result but no related website content is obtained after its link is clicked, and thus avoids subsequent search errors.
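The verification in steps S201 to S204 can be sketched as a function that prunes stale entries before the home page is crawled again. The data server is modelled here as a plain dictionary keyed by website address, and `fetch` is a hypothetical callable that returns a page's text, or None when the page cannot be accessed; both are assumptions of the sketch.

```python
def verify_previous_addresses(second_site_info, data_server, fetch, first_run):
    """Remove associations for addresses that no longer resolve or are empty.

    second_site_info: addresses gathered by the previous crawling task.
    data_server: dict of website address -> associated preset search words.
    Returns the list of addresses whose associations were deleted.
    """
    if first_run:
        return []  # S201: nothing to verify on the first crawling task
    removed = []
    for address in second_site_info:
        page = fetch(address)  # S202: is the page reachable and non-empty?
        if page is None or not page.strip():
            data_server.pop(address, None)  # S204: delete stale association
            removed.append(address)
    return removed  # the caller then proceeds to crawl the home page (S203)
```

Pruning before each crawl means a search result never points at an address that was already known to be dead at the time of the most recent crawl.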
In another optional implementation manner of the embodiment of the present invention, before storing the association information in the data server, the method for improving accuracy of the multi-site search keyword further includes:
judging whether the data server stores the associated information for the first time;
and if it is judged that the data server is storing the associated information for the first time, emptying the data stored in the data server, so that neither associated information previously obtained by crawling websites other than the target website nor other dirty data remains in the data server.
In another optional implementation manner of the embodiment of the present invention, performing word segmentation processing on search information input by a user to obtain a search keyword includes:
performing word segmentation processing on the search information input by the user through an IKAnalyzer word segmenter to obtain search keywords.
Specifically, the common word segmentation methods are first managed through a common word segmentation interface. The returned result of word segmentation must cover several cases, including: first, the result is a Map of key-value pairs; second, the result is a Set of String values.
In addition, the IKAnalyzer word segmenter supports two segmentation modes, intelligent segmentation and fine-grained segmentation, so that character strings can be segmented in different ways as needed.
In addition, the IKAnalyzer lexicon can be updated in a timely manner, so that word segmentation achieves better results.
The embodiment of the present invention adopts the IKAnalyzer word segmenter, which is based on text matching, does not require a large investment of manpower for training and labeling, supports custom dictionaries so that domain-specific words are easy to add, and can produce segmentation results at multiple granularities.
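To make the difference between the two granularities concrete, the following Python sketch implements two simple dictionary-matching strategies. It is not IKAnalyzer's algorithm: the coarse mode here is plain forward maximum matching, the fine-grained mode simply emits every dictionary word found in the text, and the dictionary itself is a hypothetical custom lexicon.

```python
def segment_smart(text, dictionary):
    """Coarse mode: forward maximum matching, longest dictionary word first.

    Characters not covered by any dictionary word are emitted one by one.
    """
    max_len = max(map(len, dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


def segment_fine(text, dictionary):
    """Fine-grained mode: emit every dictionary word found at any position."""
    max_len = max(map(len, dictionary), default=1)
    tokens = []
    for i in range(len(text)):
        for j in range(i + 1, min(len(text), i + max_len) + 1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
    return tokens


# Hypothetical custom dictionary with domain-specific words.
LEXICON = {"站点", "多站点", "搜索", "关键", "关键词"}
```

For the input `多站点搜索关键词`, the coarse mode yields three tokens (`多站点`, `搜索`, `关键词`), while the fine-grained mode also yields the shorter overlapping words `站点` and `关键`.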
In another optional implementation manner of the embodiment of the present invention, searching for website content matching a search keyword according to associated information includes: looking up the search keywords among the preset search words of the associated information; determining the website information associated with the preset search words according to the matching degree between the preset search words and the search keywords;
pushing website content to the user includes: and pushing the website content in the website information to the user according to the matching degree.
Specifically, this can be implemented by a search server (for example, Solr, an independent data server that builds an index so that results can be searched for and returned quickly according to the search information input by the user); the user can also issue a search request containing the search information through HTTP GET.
When the search information yields multiple search keywords, the keywords may be joined with OR, and the preset search words may then be matched through the Client to obtain the website content. When the website content in the website information is pushed to the user, the screened website content can be displayed on the search results page ordered by the number of search keywords contained in the preset search words corresponding to that content.
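The OR-joined query and the ordering by number of matched keywords can be sketched as follows. The field name `search_word`, the core name `sites`, and the host are hypothetical; only the `q` and `wt` request parameters and the `/select` handler are standard Solr. Keywords are assumed not to contain double quotes, since no escaping is done here.

```python
from urllib.parse import urlencode


def build_or_query(keywords, field="search_word"):
    """Join the search keywords with OR against one (hypothetical) field."""
    return " OR ".join(f'{field}:"{kw}"' for kw in keywords)


def build_request_url(base_url, keywords):
    """Build an HTTP GET URL for a Solr-style select handler."""
    return base_url + "?" + urlencode({"q": build_or_query(keywords), "wt": "json"})


def rank_by_matches(candidates, keywords):
    """Order content by how many search keywords its preset words contain.

    candidates: list of (website content, preset search words) pairs.
    Content matching no keyword is dropped; ties keep their input order.
    """
    scored = []
    for content, words in candidates:
        hits = sum(1 for kw in keywords if kw in words)
        if hits:
            scored.append((hits, content))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [content for _, content in scored]
```

For example, `build_request_url("http://localhost:8983/solr/sites/select", ["crawler", "search"])` produces a URL whose `q` parameter is the URL-encoded form of `search_word:"crawler" OR search_word:"search"`.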
Example two
The apparatus for improving accuracy of a multi-site search keyword provided by an embodiment of the present invention, as shown in fig. 3, includes:
the first obtaining module 100 is configured to obtain associated information between website information of a target website and a preset search term, where the website information is latest website information of the target website at a current time, and the website information includes website content and a website address;
the word segmentation module 200 is configured to perform word segmentation processing on search information input by a user to obtain search keywords, where the search information is information for searching target website data;
the pushing module 300 is configured to search for website content matched with the search keyword according to the association information, and push the website content to the user.
In the embodiment of the present invention, the first obtaining module 100 obtains associated information between website information of a target website and a preset search term, where the website information is latest website information of the target website at a current time, and the website information includes website content and a website address; the word segmentation module 200 performs word segmentation processing on search information input by a user to obtain search keywords, wherein the search information is information for searching target website data; the pushing module 300 searches for the website content matched with the search keyword according to the association information, and pushes the website content to the user. In the embodiment of the invention, the website information is the latest website information of the target website at the current moment, the associated information is also the latest associated information at the current moment, and the accuracy of the associated information is ensured by the real-time property of the associated information, so that the technical problem of poor accuracy when the webpage content matched with the search keyword is searched by the search method in the prior art is solved.
In another optional implementation manner of the embodiment of the present invention, as shown in fig. 4, the apparatus for improving accuracy of a multi-site search keyword further includes:
a second obtaining module 400, configured to obtain a target crawling time;
the crawling module 500 is used for controlling a crawler to execute a current crawling task at a target crawling time so as to crawl a target website to obtain first website information;
an establishing module 600, configured to determine a preset search term according to website content included in the first website information, and establish associated information between the preset search term and the first website information;
a storage module 700 for storing the association information in the data server.
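As a non-limiting illustration of the establishing module 600 and the storage module 700, the following minimal Python sketch may be considered. It assumes that every distinct word of the crawled website content serves as a preset search term and that an in-memory SQLite database stands in for the data server; the function names, the table layout, and the example addresses are hypothetical and are not specified by the embodiment.

```python
import sqlite3


def establish_association(first_website_information):
    """Yield (preset search term, website address, website content) rows.

    Assumption: every distinct lowercase word of the content is a preset search term.
    """
    for address, content in first_website_information.items():
        for term in sorted(set(content.lower().split())):
            yield term, address, content


def store_association(connection, rows):
    """Storage module (700): persist the association rows in the stand-in data server."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS association (term TEXT, address TEXT, content TEXT)"
    )
    connection.executemany("INSERT INTO association VALUES (?, ?, ?)", rows)
    connection.commit()


# Hypothetical first website information: website address -> website content.
first_website_information = {
    "http://site-a.example/": "patch notes",
    "http://site-a.example/faq": "patch schedule",
}
data_server = sqlite3.connect(":memory:")  # stand-in for the data server
store_association(data_server, establish_association(first_website_information))

# Look up every website address associated with the preset search term "patch".
addresses = [
    row[0]
    for row in data_server.execute(
        "SELECT address FROM association WHERE term = ? ORDER BY address", ("patch",)
    )
]
```

Storing one row per (search term, website address) pair keeps the later lookup by search keyword a single indexed query, which is one possible design among many.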
In another optional implementation of the embodiment of the present invention, the crawling module includes:
the crawling unit is used for crawling the home page of the target website when executing the current crawling task to obtain the home page content of the target website and a hyperlink interface contained in the home page information of the target website;
the determining unit is used for analyzing the hyperlink interface and determining whether the hyperlink interface is a target hyperlink interface or not, wherein the target hyperlink interface is an interface which is not crawled, the target hyperlink interface is a correct hyperlink interface, and the webpage content corresponding to the target hyperlink interface comprises preset webpage content;
the traversing unit is used for traversing the webpage corresponding to the hyperlink interface when the hyperlink interface is determined to be a target hyperlink interface, so as to obtain the website content of the target hyperlink interface;
and the determining unit is further used for taking the website content and the website address of each target hyperlink interface as the first website information.
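The cooperation of the crawling unit, the determining unit, and the traversing unit may be sketched as follows. This is only an illustrative Python sketch under stated assumptions: the dictionary `SITE` is a hypothetical in-memory stand-in for real HTTP requests, a "correct" hyperlink interface is assumed to mean a well-formed and resolvable http(s) address, and "preset webpage content" is simplified to "non-empty content".

```python
from collections import deque

# Hypothetical in-memory site standing in for HTTP fetches: address -> (content, hyperlinks).
SITE = {
    "http://site.example/": ("home", ["http://site.example/a", "http://site.example/b", "mailto:x"]),
    "http://site.example/a": ("page a", ["http://site.example/", "http://site.example/b"]),
    "http://site.example/b": ("", []),  # reachable, but without any content
}


def is_target_link(link, crawled):
    """Determining unit: not yet crawled, well formed and resolvable, and has content."""
    if link in crawled:
        return False
    if not link.startswith(("http://", "https://")) or link not in SITE:
        return False
    return bool(SITE[link][0].strip())


def crawl(home_page):
    """Crawling and traversing units: breadth-first walk starting at the home page."""
    content, links = SITE[home_page]
    first_website_information = {home_page: content}
    crawled = {home_page}
    queue = deque(links)
    while queue:
        link = queue.popleft()
        if not is_target_link(link, crawled):
            continue
        crawled.add(link)
        page_content, page_links = SITE[link]
        first_website_information[link] = page_content
        queue.extend(page_links)
    return first_website_information


first_website_information = crawl("http://site.example/")
```

In this sketch the already-crawled home page, the malformed `mailto:` link, and the empty page are all rejected by the determining step, so only the home page and `/a` end up in the first website information.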
In another optional implementation manner of the embodiment of the present invention, the crawling unit is further configured to:
judging whether the crawling task is being executed on the target website for the first time;
if the judgment is negative, analyzing second website information to determine whether the webpage indicated by a target web address can be accessed through the target web address, or whether webpage content exists in the webpage indicated by the target web address, wherein the second website information is information crawled by the crawler when executing a first crawling task, the first crawling task is the crawling task immediately preceding the current crawling task, and the target web address is any website address in the second website information;
if the determination is affirmative, crawling the home page of the target website to obtain the home page content of the target website and the hyperlink interface contained in the home page information;
if the determination is negative, deleting the associated information associated with the target web address from the data server.
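The check performed on a repeated crawl may be illustrated by the following minimal Python sketch. It assumes that the data server can be modeled as a dictionary mapping each search term to a set of website addresses, and that a caller-supplied `fetch` function returns the page content or `None` when the address cannot be accessed; both are hypothetical stand-ins. The sketch also treats an address as valid only when it is both accessible and non-empty, which is one possible reading of the determination described above.

```python
def prune_stale_associations(second_website_information, data_server, fetch):
    """Drop association entries whose address is unreachable or has no content.

    second_website_information: addresses saved by the previous crawling task.
    data_server: dict mapping search term -> set of addresses (stand-in).
    fetch: callable returning page content, or None when the address cannot be accessed.
    """
    still_valid = []
    for address in second_website_information:
        content = fetch(address)
        if content is not None and content.strip():
            still_valid.append(address)
            continue
        # Iterate over a copy of the keys because emptied terms are deleted in the loop.
        for term in list(data_server):
            data_server[term].discard(address)
            if not data_server[term]:
                del data_server[term]
    return still_valid


# Hypothetical state: "/old" is now empty and "/gone" can no longer be accessed.
live_pages = {"http://site.example/": "home", "http://site.example/old": ""}
data_server = {
    "home": {"http://site.example/"},
    "archive": {"http://site.example/old", "http://site.example/gone"},
}
previous = ["http://site.example/", "http://site.example/old", "http://site.example/gone"]
valid = prune_stale_associations(previous, data_server, live_pages.get)
```

After the call, only the association for the still-valid home page remains, so a later search cannot push a dead or empty page to the user.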
In another optional implementation manner of the embodiment of the present invention, the apparatus for improving accuracy of a multi-site search keyword further includes:
the judging module is used for judging whether the data server stores the associated information for the first time;
and the clearing module is used for clearing the data stored in the data server under the condition that the data server is judged to store the associated information for the first time.
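The judging module and the clearing module may be sketched as below. This is a hedged illustration only: the `DataServer` class, its `has_stored_association` flag, and the example records are hypothetical, and an in-memory dictionary stands in for the data server.

```python
class DataServer:
    """Hypothetical in-memory stand-in for the data server."""

    def __init__(self, leftover=None):
        self.records = dict(leftover or {})
        self.has_stored_association = False

    def store(self, association):
        # Judging module: is this the first time association information is stored?
        if not self.has_stored_association:
            # Clearing module: empty whatever the server held before the first store.
            self.records.clear()
            self.has_stored_association = True
        self.records.update(association)


server = DataServer(leftover={"stale": ["http://old.example/"]})
server.store({"patch": ["http://site.example/a"]})    # first store: leftovers are cleared
server.store({"faq": ["http://site.example/faq"]})    # later store: existing data is kept
```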
In another optional implementation manner of the embodiment of the present invention, the second obtaining module is configured to:
setting a Java Quartz timer in advance to set the crawling time of the crawler, wherein the Quartz timer is used for triggering the crawler to execute a crawling task at scheduled times;
and extracting target crawling time from the crawling time.
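Extracting the target crawling time from the configured crawling times may be illustrated as follows. The sketch is written in Python with the standard library only and does not use or reproduce the Quartz API; it assumes, purely for illustration, that the crawling times are configured as daily times of day, and the function name `next_target_crawl_time` is hypothetical.

```python
from datetime import datetime, time, timedelta


def next_target_crawl_time(crawl_times, now):
    """Return the earliest scheduled crawl moment strictly after `now`.

    crawl_times: daily times of day at which the crawler should run (assumed configuration).
    """
    candidates = []
    for crawl_time in crawl_times:
        moment = datetime.combine(now.date(), crawl_time)
        if moment <= now:
            moment += timedelta(days=1)  # already passed today, so use tomorrow
        candidates.append(moment)
    return min(candidates)


# Hypothetical configuration: crawl every day at 02:00 and at 14:00.
crawl_times = [time(2, 0), time(14, 0)]
target = next_target_crawl_time(crawl_times, datetime(2017, 8, 23, 15, 30))
```

With the current moment at 15:30 on 2017-08-23, both of that day's crawl times have passed, so the target crawling time falls on 02:00 of the following day.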
In another optional implementation manner of the embodiment of the present invention, the word segmentation module is configured to:
and performing word segmentation processing on the search information input by the user through an IKAnalyzer word segmenter to obtain search keywords.
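To illustrate what word segmentation of the search information produces, the following Python sketch uses forward maximum matching against a small dictionary. It is a swapped-in technique chosen for brevity and is not IKAnalyzer's algorithm or API; the dictionary contents, the `max_word_length` parameter, and the example query are assumptions.

```python
def segment(search_information, dictionary, max_word_length=4):
    """Forward maximum matching: at each position take the longest dictionary word."""
    keywords = []
    position = 0
    while position < len(search_information):
        remaining = len(search_information) - position
        for length in range(min(max_word_length, remaining), 0, -1):
            candidate = search_information[position:position + length]
            # Fall back to a single character when no dictionary word matches.
            if length == 1 or candidate in dictionary:
                keywords.append(candidate)
                position += length
                break
    return keywords


# Hypothetical dictionary: "website", "security", "vulnerability", "website security".
dictionary = {"网站", "安全", "漏洞", "网站安全"}
# Query meaning "website security vulnerability", written without spaces.
keywords = segment("网站安全漏洞", dictionary)
```

Because the longest match is preferred, the query is split into "网站安全" and "漏洞" rather than into three shorter words; a production segmenter would typically offer finer-grained modes as well.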
In another optional implementation manner of the embodiment of the present invention, the pushing module is configured to:
searching for the search keywords among the search terms of the association information; determining the website information associated with the search terms according to the matching degree between the search terms and the search keywords;
and pushing the website content in the website information to the user according to the matching degree.
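The matching and pushing steps may be sketched as below. The embodiment does not define how the matching degree is computed, so the measure used here (1.0 for an exact match, a length ratio when one string contains the other, and 0 otherwise) is purely an assumption, as are the function names and the example association information.

```python
def matching_degree(search_term, keyword):
    """Assumed measure: 1.0 for an exact match, a length ratio for containment, else 0."""
    if search_term == keyword:
        return 1.0
    if keyword in search_term or search_term in keyword:
        shorter, longer = sorted((len(search_term), len(keyword)))
        return shorter / longer
    return 0.0


def push(association, keywords):
    """Return (address, content) pairs ordered by their best matching degree, highest first."""
    best = {}
    for search_term, pages in association.items():
        degree = max((matching_degree(search_term, k) for k in keywords), default=0.0)
        if degree == 0.0:
            continue
        for page in pages:
            best[page] = max(best.get(page, 0.0), degree)
    # Sort by descending degree; ties are broken by address for a stable order.
    return [page for page, _ in sorted(best.items(), key=lambda item: (-item[1], item[0]))]


# Hypothetical association information: search term -> list of (address, content).
association = {
    "patch": [("http://site.example/a", "patch notes")],
    "patches": [("http://site.example/b", "all patches")],
    "faq": [("http://site.example/faq", "questions")],
}
ranked = push(association, ["patch"])
```

Here the page associated with the exactly matching term is pushed first, the page whose term merely contains the keyword second, and the unrelated page is not pushed at all.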
EXAMPLE III
The embodiment of the invention provides a computer-readable medium having non-volatile program code executable by a processor, wherein the program code causes the processor to execute the method for improving the accuracy of multi-site search keywords provided by the foregoing embodiment. In that method, the website information is the latest website information of the target website at the current moment, and the associated information is likewise the latest associated information at the current moment. The real-time nature of the associated information ensures its accuracy, which alleviates the technical problem of poor accuracy when searching for webpage content matched with the search keyword using prior-art search methods.
The computer program product of the method and the apparatus for improving accuracy of multi-site search keywords provided by the embodiments of the present invention includes a computer-readable storage medium storing a program code, where instructions included in the program code may be used to execute the method described in the foregoing method embodiments, and specific implementation may refer to the method embodiments, and will not be described herein again.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In addition, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "coupled," and "connected" are to be construed broadly: a connection may be, for example, a fixed connection, a removable connection, or an integral connection; it may be mechanical or electrical; it may be direct, or indirect through intervening media; or it may be an internal communication between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific circumstances.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on those shown in the drawings. They are used only for convenience and simplicity of description, do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation, and thus should not be construed as limiting the present invention.
Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above-mentioned embodiments are only specific embodiments of the present invention, used to illustrate rather than limit its technical solutions, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art may, within the technical scope of the present disclosure, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present invention and shall be construed as falling within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (7)
1. A method for improving accuracy of a multi-site search keyword, comprising:
acquiring associated information between website information of a target website and a preset search word, wherein the website information is the latest website information of the target website at the current moment, and the website information comprises website content and a website address;
performing word segmentation processing on search information input by a user to obtain search keywords, wherein the search information is information for searching the target website data;
searching website content matched with the search keywords according to the associated information, and pushing the website content to the user;
before acquiring the associated information between the website information of the target website and the preset search word, the method further comprises the following steps:
acquiring target crawling time;
controlling a crawler to execute a current crawling task at the target crawling time so as to crawl a target website to obtain first website information;
determining a preset search word according to website content included in the first website information, and establishing associated information between the preset search word and the first website information;
storing the associated information in a data server;
wherein controlling the crawler to execute the current crawling task at the target crawling time so as to crawl the target website to obtain the first website information comprises:
when the current crawling task is executed, crawling the home page of the target website to obtain the home page content of the target website and a hyperlink interface contained in the home page information of the target website;
analyzing the hyperlink interface to determine whether the hyperlink interface is a target hyperlink interface, wherein the target hyperlink interface is an interface which is not crawled, the target hyperlink interface is a correct hyperlink interface, and the webpage content corresponding to the target hyperlink interface contains preset webpage content;
when the hyperlink interface is determined to be the target hyperlink interface, traversing the webpage corresponding to the hyperlink interface to obtain the website content of the target hyperlink interface;
taking the website content and the website address of each target hyperlink interface as the first website information;
wherein, crawling the home page of the target website comprises:
judging whether the crawling task is being executed on the target website for the first time;
if the judgment is negative, analyzing second website information to determine whether the webpage indicated by a target web address can be accessed through the target web address, or whether webpage content exists in the webpage indicated by the target web address, wherein the second website information is information crawled by the crawler when executing a first crawling task, the first crawling task is the crawling task immediately preceding the current crawling task, and the target web address is any website address in the second website information;
if the determination is affirmative, crawling the home page of the target website to obtain the home page content of the target website and the hyperlink interface contained in the home page information;
and if the determination is negative, deleting the associated information associated with the target web address from the data server.
2. The method of claim 1, wherein prior to storing the association information in a data server, the method further comprises:
judging whether the data server stores the associated information for the first time;
and under the condition that the data server is judged to store the associated information for the first time, emptying the data stored in the data server.
3. The method of claim 1, wherein obtaining a target crawl time comprises:
setting a Java Quartz timer in advance to set the crawling time of the crawler, wherein the Quartz timer is used for triggering the crawler to execute a crawling task at scheduled times;
and extracting target crawling time from the crawling time.
4. The method of claim 1, wherein performing a word segmentation process on the search information input by the user to obtain a search keyword comprises:
and performing word segmentation processing on the search information input by the user through an IKAnalyzer word segmenter to obtain search keywords.
5. The method of claim 1,
wherein searching the website content matched with the search keyword according to the associated information comprises: searching for the search keyword among the search terms of the associated information; and determining the website information associated with the search terms according to the matching degree between the search terms and the search keyword;
and pushing the website content to the user comprises: pushing the website content in the website information to the user according to the matching degree.
6. An apparatus for improving accuracy of a multi-site search keyword, comprising:
a first acquisition module, used for acquiring associated information between website information of a target website and a preset search word, wherein the website information is the latest website information of the target website at the current moment, and the website information comprises website content and a website address;
the word segmentation module is used for carrying out word segmentation on search information input by a user to obtain search keywords, wherein the search information is information for searching the target website data;
the pushing module is used for searching the website content matched with the search keyword according to the associated information and pushing the website content to the user;
wherein the apparatus further comprises: the second acquisition module is used for acquiring target crawling time;
the crawling module is used for controlling a crawler to execute a current crawling task at a target crawling time so as to crawl a target website to obtain first website information;
the establishing module is used for determining a preset search word according to website content included in the first website information and establishing associated information between the preset search word and the first website information;
the storage module is used for storing the associated information in the data server;
wherein, the crawling module comprises:
the crawling unit is used for crawling the home page of the target website when executing the current crawling task to obtain the home page content of the target website and a hyperlink interface contained in the home page information of the target website;
the determining unit is used for analyzing the hyperlink interface and determining whether the hyperlink interface is a target hyperlink interface or not, wherein the target hyperlink interface is an interface which is not crawled, the target hyperlink interface is a correct hyperlink interface, and the webpage content corresponding to the target hyperlink interface comprises preset webpage content;
the traversing unit is used for traversing the webpage corresponding to the hyperlink interface when the hyperlink interface is determined to be a target hyperlink interface, so as to obtain the website content of the target hyperlink interface;
the determining unit is further used for taking the website content and the website address of each target hyperlink interface as the first website information;
wherein the crawling unit is further configured to:
judging whether the crawling task is being executed on the target website for the first time;
if the judgment is negative, analyzing second website information to determine whether the webpage indicated by a target web address can be accessed through the target web address, or whether webpage content exists in the webpage indicated by the target web address, wherein the second website information is information crawled by the crawler when executing a first crawling task, the first crawling task is the crawling task immediately preceding the current crawling task, and the target web address is any website address in the second website information;
if the determination is affirmative, crawling the home page of the target website to obtain the home page content of the target website and the hyperlink interface contained in the home page information;
and if the determination is negative, deleting the associated information associated with the target web address from the data server.
7. A computer-readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the method of improving accuracy of a multi-site search keyword of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710732432.5A CN107301253B (en) | 2017-08-23 | 2017-08-23 | Method and device for improving accuracy of multi-site search keywords |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710732432.5A CN107301253B (en) | 2017-08-23 | 2017-08-23 | Method and device for improving accuracy of multi-site search keywords |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107301253A (en) | 2017-10-27 |
CN107301253B (en) | 2020-02-04 |
Family
ID=60132524
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710732432.5A Active CN107301253B (en) | 2017-08-23 | 2017-08-23 | Method and device for improving accuracy of multi-site search keywords |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107301253B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108519984B (en) * | 2018-02-07 | 2022-11-04 | 平安科技(深圳)有限公司 | Weather data processing method, server and computer readable storage medium |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102135967B (en) * | 2010-01-27 | 2013-06-05 | 华为技术有限公司 | Webpage keywords extracting method, device and system |
CN102456057B (en) * | 2010-11-01 | 2016-08-17 | 阿里巴巴集团控股有限公司 | Search method based on online trade platform, device and server |
CN102446225A (en) * | 2012-01-11 | 2012-05-09 | 深圳市爱咕科技有限公司 | Real-time search method, device and system |
CN102591992A (en) * | 2012-02-15 | 2012-07-18 | 苏州亚新丰信息技术有限公司 | Webpage classification identifying system and method based on vertical search and focused crawler technology |
CN103778122B (en) * | 2012-10-17 | 2018-01-23 | 腾讯科技(深圳)有限公司 | Searching method and system |
- 2017-08-23: CN application CN201710732432.5A (patent CN107301253B), status: Active
Also Published As
Publication number | Publication date |
---|---|
CN107301253A (en) | 2017-10-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8812520B1 (en) | Augmented resource graph for scoring resources | |
US8751466B1 (en) | Customizable answer engine implemented by user-defined plug-ins | |
US10250526B2 (en) | Method and apparatus for increasing subresource loading speed | |
US9304979B2 (en) | Authorized syndicated descriptions of linked web content displayed with links in user-generated content | |
US9734149B2 (en) | Clustering repetitive structure of asynchronous web application content | |
US20060095430A1 (en) | Web page ranking with hierarchical considerations | |
US20110173177A1 (en) | Sightful cache: efficient invalidation for search engine caching | |
US10216716B2 (en) | Method and system for electronic resource annotation including proposing tags | |
US8631097B1 (en) | Methods and systems for finding a mobile and non-mobile page pair | |
CN108566399B (en) | Phishing website identification method and system | |
WO2018095411A1 (en) | Web page clustering method and device | |
US10810181B2 (en) | Refining structured data indexes | |
US20080281827A1 (en) | Using structured database for webpage information extraction | |
CN103294732A (en) | Web page crawling method and spider | |
CN102222098A (en) | Method and system for pre-fetching webpage | |
US20220405312A1 (en) | Methods and systems for modifying a search result | |
US20150302093A1 (en) | Method and system for filtering of a website | |
CN110889023A (en) | Distributed multifunctional search engine of elastic search | |
CN103399872A (en) | Method and device for optimizing webpage capture | |
CN105550359A (en) | Webpage sorting method and device based on vertical search and server | |
CN108845925B (en) | Web page testing method and device, electronic equipment and computer readable medium | |
WO2018053024A1 (en) | Organizing datasets for adaptive responses to queries | |
CN105528357A (en) | Webpage content extraction method based on similarity of URLs and similarity of webpage document structures | |
WO2022179128A1 (en) | Crawler-based data crawling method and apparatus, computer device, and storage medium | |
CN107430614B (en) | Application local deep linking to corresponding resources |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 310051 No. 188 Lianhui Street, Xixing Street, Binjiang District, Hangzhou City, Zhejiang Province. Applicant after: Hangzhou Annan information technology Limited by Share Ltd. Address before: Zhejiang Zhongcai Building No. 68 Binjiang District road Hangzhou City, Zhejiang Province, the 310051 and 15 layer. Applicant before: Dbappsecurity Co.,ltd. |
GR01 | Patent grant | |