CN110851747B

CN110851747B - Information matching method and device

Info

Publication number: CN110851747B
Application number: CN201810861161.8A
Authority: CN
Inventors: 梁洪波
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2018-08-01
Filing date: 2018-08-01
Publication date: 2022-08-02
Anticipated expiration: 2038-08-01
Also published as: CN110851747A

Abstract

The invention discloses a method and a device for matching information, which are characterized in that a phrase input by a user, first URL information and second URL information related to the phrase are obtained; unifying the symbolic formats of the first URL information and the second URL information; removing the protocol, the user name and the password contained in the first URL information and the second URL information; aligning connection characters between port information and path information in the residual information of the first URL information and the second URL information, and dividing the residual information into two parts by taking the connection characters as boundaries; and if the two parts of the first URL information and the second URL information both meet the preset condition, determining that the first URL information is matched with the second URL information. By the method, redundant information in the URL information is removed, and the obtained URL information is adjusted and then matched. Not only is the matching of the URL optimized, but also the matching accuracy is improved.

Description

Information matching method and device

Technical Field

The present invention relates to the field of network technologies, and in particular, to an information matching method and apparatus.

Background

With the continuous development of society, the internet becomes an indispensable part of people's lives, and users have increasingly high requirements on the accuracy of information obtained by using search engines on the internet.

When performing keyword ranking analysis for Search Engine Optimization (SEO), a crawler program obtains the link information for a phrase in a designated search engine according to the phrase and the Uniform Resource Locator (URL) input by a user, and the link information is then matched against the URL input by the user.

In the prior art, the general matching process is as follows: first, the URL input by the user and the crawled URL are preprocessed; second, the two URLs are compared; finally, the match succeeds if the two URLs are equal. Preprocessing mainly checks the legality of a URL, generally by using a regular expression to match each part of the URL; after preprocessing, the crawled URL is matched directly against the URL input by the user. Because the prior art requires a URL to begin with a specific protocol and forbids double-byte characters and non-link special characters in the address, the regular expression may parse a URL ambiguously or incorrectly. In addition, a search engine ranks web pages according to the relevance of characters, words and phrases, so keyword ranking gives rise to a variety of matching rules. Matching the URL input by the user directly against the crawled result set is therefore inflexible and makes it difficult to meet business requirements.

Disclosure of Invention

In view of this, embodiments of the present invention provide an information matching method and apparatus to achieve the purpose of optimizing URL matching and improving URL matching accuracy, and the embodiments of the present invention provide the following technical solutions:

a method of information matching, the method comprising:

acquiring a phrase input by a user, first Uniform Resource Locator (URL) information and second URL information which is acquired in a search engine through a crawler technology and is related to the phrase;

unifying symbol formats of the first URL information and the second URL information;

removing protocols, user names and passwords contained in the first URL information and the second URL information;

aligning connection characters between port information and path information in the residual information of the first URL information and the second URL information, and dividing the residual information into two parts by taking the connection characters as boundaries, wherein the left side of each connection character is a first part, and the right side of each connection character is a second part;

and matching the first part of the first URL information with the first part of the second URL information, and matching the second part of the first URL information with the second part of the second URL information, wherein the matching conditions are met, and the first URL information is determined to be matched with the second URL information.

Preferably, the unifying the symbolic formats of the first URL information and the second URL information includes:

and uniformly adjusting the symbol formats in the first URL information and the second URL information into a lower case format or an upper case format.

Preferably, the matching the first part of the first URL information and the first part of the second URL information, and the matching the second part of the first URL information and the second part of the second URL information, both satisfying a preset matching condition, and determining that the first URL information and the second URL information match includes:

matching a first portion of the first URL information and a first portion of the second URL information, and matching a second portion of the first URL information and a second portion of the second URL information;

and if the first part of the second URL information begins with the first part of the first URL information and the second part of the second URL information ends with the second part of the first URL information, determining that the first URL information and the second URL information are matched.

Preferably, if the first part of the second URL information does not begin with the first part of the first URL information or the second part of the second URL information does not end with the second part of the first URL information, it is determined that the first URL information and the second URL information do not match.

Preferably, the information matching method further includes:

in the process of removing the protocol, the user name and the password included in the first URL information and the second URL information, if it is detected that the first URL information and the second URL information include port 80, removing port 80 from the first URL information and the second URL information.

An information matching apparatus, the apparatus comprising:

the acquisition unit is used for acquiring a phrase input by a user, first Uniform Resource Locator (URL) information and second URL information which is acquired in a search engine through a crawler technology and is related to the phrase;

a uniform format unit for unifying symbol formats of the first URL information and the second URL information;

a removing unit, configured to remove a protocol, a user name, and a password included in the first URL information and the second URL information;

the adjusting unit is used for aligning connection characters between port information and path information in the residual information of the first URL information and the second URL information, dividing the residual information into two parts by taking the connection characters as boundaries, wherein the left side of each connection character is a first part, and the right side of each connection character is a second part;

and the matching unit is used for matching the first part of the first URL information with the first part of the second URL information and matching the second part of the first URL information with the second part of the second URL information, and the matching unit both meet preset matching conditions and determine that the first URL information is matched with the second URL information.

Preferably, the uniform format unit is configured to uniformly adjust symbol formats in the first URL information and the second URL information into a lower case format or an upper case format.

Preferably, the matching unit is configured to match a first part of the first URL information and a first part of the second URL information, and match a second part of the first URL information and a second part of the second URL information; and if the first part of the second URL information begins with the first part of the first URL information and the second part of the second URL information ends with the second part of the first URL information, determining that the first URL information is matched with the second URL information.

Preferably, the matching unit is further configured to determine that the first URL information does not match the second URL information if the first portion of the second URL information does not begin with the first portion of the first URL information or the second portion of the second URL information does not end with the second portion of the first URL information.

Preferably, the removing unit is further configured to, in the process of removing the protocol, the user name, and the password included in the first URL information and the second URL information, remove port 80 from the first URL information and the second URL information if it is detected that the first URL information and the second URL information include port 80.

A storage medium on which a program is stored, the program implementing the above-described information matching method when executed by a processor.

A processor for running a program, wherein the program runs to execute the information matching method.

The embodiment of the invention obtains a phrase and first URL information input by a user, and obtains second URL information related to the phrase in a search engine through a crawler technology; unifies the symbol formats of the first URL information and the second URL information; removes the protocol, the user name and the password contained in the first URL information and the second URL information, together with port 80 if it is detected in the first URL information and the second URL information during this process; aligns the connection characters between port information and path information in the remaining information of the first URL information and the second URL information, and divides the remaining information into two parts with the connection character as the boundary, wherein the left side of the connection character is the first part and the right side is the second part; and matches the first part of the first URL information with the first part of the second URL information and the second part of the first URL information with the second part of the second URL information, wherein the first URL information and the second URL information are determined to match when both satisfy the preset matching condition. With this information matching method, redundant information in the first URL information and the second URL information is removed, and the resulting URL information is adjusted and then matched. This both optimizes URL matching and improves matching accuracy.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a schematic flow chart of an information matching method according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of an information matching apparatus according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

As can be seen from the background art, search engine optimization involves obtaining, through a crawler technology, the link information of a phrase in a designated search engine according to the phrase and the uniform resource locator input by a user, and then matching the link information with the uniform resource locator input by the user. Search engine optimization is a way of improving the ranking of a website in the relevant search engine by using the search rules of that search engine. Before matching, the URL input by the user and the crawled URL need to be preprocessed, and URL preprocessing mainly uses a regular expression to match each part of the URL, that is, to judge the legality of the URL. A regular expression is used to retrieve and replace text that conforms to a certain pattern. In the prior art, the regular expression may parse a URL ambiguously or incorrectly during preprocessing, and keyword ranking gives rise to a variety of matching rules, so the matching is inflexible and the service requirement is difficult to meet. Therefore, the invention discloses an information matching method and device, with the aim of optimizing URL matching and improving matching accuracy.

Fig. 1 is a schematic flow chart of an information matching method according to an embodiment of the present invention. The information matching method includes the following steps.

Step S101, obtaining a phrase and first URL information input by a user, and obtaining, through a crawler technology, second URL information related to the phrase in a search engine.

Step S102, unifying symbol formats in the first URL information and the second URL information.

It should be noted that unifying the symbol formats of the first URL information and the second URL information includes uniformly adjusting the symbol formats in the first URL information and the second URL information into a lower case format or an upper case format.
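Step S102 can be illustrated with a minimal Python sketch. The function name `unify_case` and its `upper` parameter are hypothetical and do not appear in the patent; the sketch only assumes that "unifying the symbol format" means converting every letter in both pieces of URL information to the same case.

```python
def unify_case(url: str, upper: bool = False) -> str:
    """Return the URL with every letter in one case (lower case by default)."""
    return url.upper() if upper else url.lower()


# Both pieces of URL information are converted with the same setting,
# so that later comparisons are not affected by letter case.
first_url = unify_case("Gridsum.com:8080/News/SZNews/")
second_url = unify_case("WWW.Gridsum.COM:8080/news/sznews/news.html?date=xx#top=10")
print(first_url)
print(second_url)
```

Either case can be used, provided the same choice is applied to both the first URL information and the second URL information.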

Step S103, removing the protocol, the user name and the password contained in the first URL information and the second URL information.

In a specific implementation, a complete piece of URL information includes a protocol, a user name, a password, a domain name, a port, a path, query terms, a fragment, and so on. For example, consider the following URL information:

http://lianghonbo:123456@www.gridsum.com:8080/news/sznews/news.html?date=xx#top=10

where http is the protocol; lianghonbo:123456 is the user name and password; www.gridsum.com is the domain name; 8080 is the port; /news/sznews/news.html is the path; ?date=xx is the query condition; and #top=10 is the fragment.

In the process of executing step S103, the protocol, the user name, and the password included in the URL information are removed to obtain: www.gridsum.com:8080/news/sznews/news.html?date=xx#top=10.

Optionally, in the process of executing step S103, if it is detected that the first URL information and the second URL information include port 80, port 80 is removed from the first URL information and the second URL information.
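The removal described in step S103 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the function name `strip_redundant` is hypothetical, the protocol is assumed to be whatever precedes "://", and plain string operations are used so that URL information without a protocol (as a user might type it) is handled the same way.

```python
import re


def strip_redundant(url: str) -> str:
    """Remove the protocol, the user name and password, and an explicit port 80."""
    # Drop a leading protocol such as "http://" if one is present.
    rest = re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)
    # Everything before the first "/" is the host section (with optional
    # credentials and port); everything after it is path, query and fragment.
    authority, slash, tail = rest.partition("/")
    # Drop "user:password@" by keeping only the text after the last "@".
    authority = authority.rpartition("@")[2]
    # Drop port 80; other ports such as 8080 are kept.
    if authority.endswith(":80"):
        authority = authority[:-3]
    return authority + slash + tail


example = "http://lianghonbo:123456@www.gridsum.com:8080/news/sznews/news.html?date=xx#top=10"
print(strip_redundant(example))
# www.gridsum.com:8080/news/sznews/news.html?date=xx#top=10
```

With this sketch, a URL such as http://www.gridsum.com:80/news/ reduces to www.gridsum.com/news/, while port 8080 in the example above is left in place.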

Step S104, aligning connection characters between the port information and the path information in the residual information of the first URL information and the second URL information, dividing the residual information into two parts by taking the connection characters as boundaries, wherein the left side of the connection characters is a first part, and the right side of the connection characters is a second part.

The above-described specific process of executing step S104 is exemplified.

Suppose the first URL information, after the protocol, the user name and the password are removed, is:

gridsum.com:8080/news/sznews/

and the second URL information, after the protocol, the user name and the password are removed, is:

www.gridsum.com:8080/news/sznews/news.html?date=xx#top=10

Step S104 is then performed to align the connection character "/" between the port and the path in the first URL information and the second URL information, and to divide the remaining information of each into two parts with "/" as the boundary. The left side of the "/" is the first part and the right side of the "/" is the second part. The first part of the first URL information and the first part of the second URL information may be further aligned to the right, and the second part of the first URL information and the second part of the second URL information may be further aligned to the left, giving:

www.gridsum.com:8080/news/sznews/news.html?date=xx#top=10
    gridsum.com:8080/news/sznews/
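The split and alignment of step S104 can be sketched as below. The helper names `split_at_connector` and `show_aligned` are hypothetical, and the sketch assumes that the connection character is the first "/" in the remaining information, which is the case once the protocol has been removed.

```python
from typing import List, Tuple


def split_at_connector(url: str) -> Tuple[str, str]:
    """Split the remaining URL information at the first "/" into (first part, second part)."""
    first_part, _, second_part = url.partition("/")
    return first_part, second_part


def show_aligned(url_a: str, url_b: str) -> List[str]:
    """Return both URLs padded so that their connection characters line up."""
    a_first, a_second = split_at_connector(url_a)
    b_first, b_second = split_at_connector(url_b)
    width = max(len(a_first), len(b_first))
    # Right-align the first parts; the second parts then start in the same column.
    return [
        a_first.rjust(width) + "/" + a_second,
        b_first.rjust(width) + "/" + b_second,
    ]


for line in show_aligned(
    "www.gridsum.com:8080/news/sznews/news.html?date=xx#top=10",
    "gridsum.com:8080/news/sznews/",
):
    print(line)
```

The padding is only for display; the matching steps that follow operate on the two parts returned by `split_at_connector`.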

Step S105, matching the first part of the first URL information and the first part of the second URL information.

Step S106, matching the second part of the first URL information with the second part of the second URL information.

Step S105 and step S106 may be performed simultaneously or may not be performed simultaneously.

Step S107, if the results of the steps S105 and S106 both satisfy a preset matching condition, determining that the first URL information and the second URL information are matched.

In step S107, the preset matching condition is: the first portion of the second URL information begins with the first portion of the first URL information and the second portion of the second URL information ends with the second portion of the first URL information.

Therefore, in the process of executing step S105 to step S107, if the first part of the second URL information starts with the first part of the first URL information and the second part of the second URL information ends with the second part of the first URL information, it is determined that the first URL information and the second URL information match.

Determining that the first URL information and the second URL information do not match if the first portion of the second URL information does not begin with the first portion of the first URL information or the second portion of the second URL information does not end with the second portion of the first URL information.

For example, the first part of the second URL information is www.gridsum.com:8080 and its second part is news/sznews/news.html?date=xx#top=10. The first part of the first URL information is gridsum.com:8080 and its second part is news/sznews/. In the matching process, the following alignment is obtained:

www.gridsum.com:8080/news/sznews/news.html?date=xx#top=10
    gridsum.com:8080/news/sznews/

As can be seen from the above example, the first portion of the second URL information begins with the first portion of the first URL information and the second portion of the second URL information ends with the second portion of the first URL information, so it is confirmed that the first URL information matches the second URL information.

If instead the first part of the second URL information is www.gridsum.com:8080 and its second part is news/sznews/news.html?date=xx#top=10, while the first part of the first URL information is news.gridsum.com:8080 and its second part is news/sznews/, the matching process gives:

 www.gridsum.com:8080/news/sznews/news.html?date=xx#top=10
news.gridsum.com:8080/news/sznews/

As can be seen from the above example, the preset matching condition is not satisfied, because the first portion of the second URL information does not begin with the first portion of the first URL information or the second portion of the second URL information does not end with the second portion of the first URL information; it is therefore determined that the first URL information and the second URL information do not match.
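The two worked examples can be checked with a short Python sketch. One interpretive assumption is made and should be read as such: the comparison direction is taken from the alignment in step S104 and from the worked examples, so the first parts are compared from the connection character leftward (the first part of the second URL information must end with the first part of the first URL information) and the second parts are compared from the connection character rightward. A literal prefix test on the first parts would reject the pair www.gridsum.com:8080 and gridsum.com:8080, which the first example treats as matching. The function name `is_match` is hypothetical.

```python
def is_match(first_url: str, second_url: str) -> bool:
    """Check the preset matching condition on already normalised URL information.

    Assumption: both inputs have been through steps S102 and S103 (case unified;
    protocol, user name, password and port 80 removed).
    """
    first_left, _, first_right = first_url.partition("/")
    second_left, _, second_right = second_url.partition("/")
    # First parts are right-aligned at the connection character,
    # so they are compared from their right-hand end.
    left_ok = second_left.endswith(first_left)
    # Second parts are left-aligned at the connection character,
    # so they are compared from their left-hand end.
    right_ok = second_right.startswith(first_right)
    return left_ok and right_ok


crawled = "www.gridsum.com:8080/news/sznews/news.html?date=xx#top=10"
print(is_match("gridsum.com:8080/news/sznews/", crawled))       # True
print(is_match("news.gridsum.com:8080/news/sznews/", crawled))  # False
```

Under this reading, a user who enters only a domain and a directory matches any crawled page beneath that directory on any subdomain. A plain suffix test would also accept an unrelated host whose name happens to end with the same characters, so a stricter check on the "." boundary may be worth adding in practice.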

The embodiment of the invention obtains a phrase and first URL information input by a user, and obtains second URL information related to the phrase in a search engine through a crawler technology; unifies the symbol formats of the first URL information and the second URL information; removes the protocol, the user name and the password contained in the first URL information and the second URL information, together with port 80 if it is detected in the first URL information and the second URL information during this process; aligns the connection characters between port information and path information in the remaining information of the first URL information and the second URL information, and divides the remaining information into two parts with the connection character as the boundary, wherein the left side of the connection character is the first part and the right side is the second part; and matches the first part of the first URL information with the first part of the second URL information and the second part of the first URL information with the second part of the second URL information, wherein the first URL information and the second URL information are determined to match when both satisfy the preset matching condition. With this information matching method, redundant information in the first URL information and the second URL information is removed, and the resulting URL information is adjusted and then matched. This both optimizes URL matching and improves matching accuracy.
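Putting the steps together, one plausible use in keyword ranking analysis is to find the position of the first crawled result that matches the URL input by the user. The sketch below is illustrative only: the names `normalise`, `urls_match` and `find_rank` are hypothetical, the crawled results are supplied as a plain list rather than fetched by a crawler, and the comparison direction follows the alignment shown in the worked examples (first parts compared from the right, second parts from the left), which is an interpretation rather than something the text states outright.

```python
import re
from typing import Iterable, Optional


def normalise(url: str) -> str:
    """Steps S102 and S103: unify case, then drop protocol, credentials and port 80."""
    url = url.strip().lower()
    url = re.sub(r"^[a-z][a-z0-9+.-]*://", "", url)
    authority, slash, tail = url.partition("/")
    authority = authority.rpartition("@")[2]
    if authority.endswith(":80"):
        authority = authority[:-3]
    return authority + slash + tail


def urls_match(user_url: str, crawled_url: str) -> bool:
    """Steps S104 to S107: split at the connection character and compare both parts."""
    user_first, _, user_second = normalise(user_url).partition("/")
    crawled_first, _, crawled_second = normalise(crawled_url).partition("/")
    return crawled_first.endswith(user_first) and crawled_second.startswith(user_second)


def find_rank(user_url: str, crawled_urls: Iterable[str]) -> Optional[int]:
    """Return the 1-based position of the first matching crawled URL, or None."""
    for position, crawled_url in enumerate(crawled_urls, start=1):
        if urls_match(user_url, crawled_url):
            return position
    return None


# Made-up search results for illustration; they are not data from the patent.
results = [
    "https://example.org/a",
    "HTTP://user:pw@WWW.Gridsum.com:80/news/sznews/news.html?date=xx",
    "http://www.gridsum.com/about",
]
print(find_rank("gridsum.com/news/sznews/", results))
```

In this example the second result matches even though it carries a protocol, credentials, mixed case and an explicit port 80, none of which the user typed. Direct string equality, as in the prior art described in the Background, would not find it.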

Based on the information matching method disclosed in the embodiment of the present invention, the embodiment of the present invention also correspondingly discloses an information matching device, as shown in fig. 2, the information matching device 200 mainly includes:

the obtaining unit 201 is configured to obtain a phrase and first URL information input by a user, and second URL information related to the phrase, which is obtained in a search engine through a crawler technology.

A unifying unit 202, configured to unify the symbol formats of the first URL information and the second URL information.

A removing unit 203, configured to remove the protocol, the user name, and the password included in the first URL information and the second URL information.

An adjusting unit 204, configured to align a connection character between port information and path information in remaining information of the first URL information and the second URL information, divide the remaining information into two parts by using the connection character as a boundary, where a left side of the connection character is a first part, and a right side of the connection character is a second part;

the matching unit 205 is configured to match the first part of the first URL information and the first part of the second URL information, and match the second part of the first URL information and the second part of the second URL information, which both satisfy a preset matching condition, and determine that the first URL information and the second URL information match.

Further, the unifying unit 202 is configured to uniformly adjust the symbol formats of the first URL information and the second URL information into a lower case format or an upper case format.

Further, while the removing unit 203 performs the removal, if it is detected that the first URL information and the second URL information include port 80, port 80 is removed from the first URL information and the second URL information.

Further, the preset matching condition in the matching unit 205 is: the first portion of the second URL information begins with the first portion of the first URL information and the second portion of the second URL information ends with the second portion of the first URL information. If the first portion of the second URL information begins with the first portion of the first URL information and the second portion of the second URL information ends with the second portion of the first URL information, it is determined that the first URL information matches the second URL information; if the first portion of the second URL information does not begin with the first portion of the first URL information or the second portion of the second URL information does not end with the second portion of the first URL information, it is determined that the first URL information and the second URL information do not match.

The specific principle and the implementation process of each module and unit in the information matching device disclosed in the embodiment of the present invention are the same as those of the information matching method disclosed in the embodiment of the present invention, and reference may be made to corresponding parts in the information matching method disclosed in the embodiment of the present invention, which are not described herein again.

The information matching device comprises a processor and a memory, wherein the acquisition unit, the unification unit, the removal unit, the adjustment unit, the matching unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.

In the information matching device provided by the embodiment of the invention, the acquisition unit first acquires a phrase input by a user, first URL information, and second URL information related to the phrase; the unifying unit then unifies the symbol formats of the first URL information and the second URL information; the removing unit removes the protocol, the user name and the password from the first URL information and the second URL information, together with port 80 if it is detected during this process; the adjusting unit then aligns the first part of the first URL information with the first part of the second URL information, and the second part of the first URL information with the second part of the second URL information; and finally the matching unit matches the first parts against each other and the second parts against each other. Redundant information in the first URL information and the second URL information is removed by the removing unit, and the resulting URL information is adjusted by the adjusting unit and then matched by the matching unit. This both optimizes URL matching and improves matching accuracy.

An embodiment of the present invention provides a storage medium on which a program is stored, the program implementing the information matching method when executed by a processor.

The embodiment of the invention provides a processor, which is used for running a program, wherein the information matching method is executed when the program runs.

The embodiment of the invention provides equipment, which comprises a processor, a memory and a program which is stored on the memory and can run on the processor, wherein the processor executes the program and realizes the following steps:

acquiring a phrase input by a user, first Uniform Resource Locator (URL) information and second URL information which is acquired in a search engine through a crawler technology and is related to the phrase; unifying symbol formats of the first URL information and the second URL information; removing protocols, user names and passwords contained in the first URL information and the second URL information; aligning connection characters between port information and path information in the residual information of the first URL information and the second URL information, and dividing the residual information into two parts by taking the connection characters as boundaries, wherein the left side of each connection character is a first part, and the right side of each connection character is a second part; and matching the first part of the first URL information with the first part of the second URL information, and matching the second part of the first URL information with the second part of the second URL information, wherein the matching conditions are met, and the first URL information is determined to be matched with the second URL information.

Preferably, the unifying the symbolic formats of the first URL information and the second URL information includes: and uniformly adjusting the symbol formats in the first URL information and the second URL information into a lower case format or an upper case format.

Preferably, the matching the first part of the first URL information and the first part of the second URL information, and the matching the second part of the first URL information and the second part of the second URL information, both satisfying a preset matching condition, and determining that the first URL information and the second URL information match includes: matching a first portion of the first URL information and a first portion of the second URL information, and matching a second portion of the first URL information and a second portion of the second URL information; and if the first part of the second URL information begins with the first part of the first URL information and the second part of the second URL information ends with the second part of the first URL information, determining that the first URL information and the second URL information are matched.

Preferably, the method further comprises the following step: in the process of removing the protocol, the user name and the password included in the first URL information and the second URL information, if it is detected that the first URL information and the second URL information include port 80, port 80 is removed from the first URL information and the second URL information.

The device herein may be a server, a PC, a PAD, a mobile phone, etc.

All the embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, the system or system embodiments, which are substantially similar to the method embodiments, are described in a relatively simple manner, and reference may be made to some descriptions of the method embodiments for relevant points. The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Those of skill in the art will further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An information matching method, characterized in that the method comprises:

and matching a first portion of the first URL information with a first portion of the second URL information, and matching a second portion of the first URL information with a second portion of the second URL information, wherein if the first portion of the second URL information begins with the first portion of the first URL information and the second portion of the second URL information ends with the second portion of the first URL information, it is determined that the first URL information and the second URL information match.

2. The method of claim 1, wherein unifying the symbolic format of the first URL information and the second URL information comprises:

3. The method of claim 1,

4. The method according to any one of claims 1-3, further comprising:

5. An information matching apparatus, characterized in that the apparatus comprises:

a matching unit, configured to match a first portion of the first URL information with a first portion of the second URL information, and match a second portion of the first URL information with a second portion of the second URL information, and determine that the first URL information matches the second URL information if the first portion of the second URL information begins with the first portion of the first URL information and the second portion of the second URL information ends with the second portion of the first URL information.

6. The apparatus of claim 5,

the uniform format unit is used for uniformly adjusting the symbol formats in the first URL information and the second URL information into a lower case format or an upper case format.

7. A storage medium, characterized in that a program is stored thereon, which when executed by a processor implements the information matching method according to any one of claims 1 to 4.

8. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the information matching method according to any one of claims 1-4.