WO2020258669A1

WO2020258669A1 - Website identification method and apparatus, and computer device and storage medium

Info

Publication number: WO2020258669A1
Application number: PCT/CN2019/118243
Authority: WO
Inventors: 王建华; 何四燕; 金志敏
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-06-26
Filing date: 2019-11-14
Publication date: 2020-12-30
Also published as: CN110414518A

Abstract

A website identification method, comprising: identifying a website in a picture to be identified with an OCR tool; if the directly identified website is incomplete, identifying feature information in said picture; obtaining an associated website related to the feature information by accessing a third-party internet search engine; and matching the directly identified website with the associated website to obtain a complete website. The website carried by said picture can be efficiently and accurately identified.

Description

Website identification method, device, computer equipment and storage medium

Cross references to related applications

This application claims priority to Chinese patent application No. 2019105613705, filed with the Chinese Patent Office on June 26, 2019 and entitled "URL identification method, device, computer equipment and storage medium", the entire content of which is incorporated herein by reference.

Technical field

This application relates to a method, device, computer equipment and storage medium for identifying a website address.

Background technique

With the development of science and technology, the Internet has penetrated into people's daily lives. People can query data, buy goods, socialize, etc. through the Internet, which brings great convenience to people.

Generally, when users browse web pages through a browser, they enter the URL manually. When the screen of a mobile phone is small and input is inconvenient, or when the URL is long, it is easy to mistype or omit characters, which makes the process time-consuming and laborious. In response, there are technologies that obtain a URL directly from a picture. In one application scenario, user A receives a picture shared by user B, and the picture carries the URL of a news item that user B recommends to user A. After receiving the shared picture, user A can recognize the URL carried in the picture using this recognition technology, extract it, and enter it into the browser of user A's terminal to browse the news.

However, although the above technology can identify and extract a URL from a picture, it cannot obtain the accurate URL when the URL carried in the picture is abnormal (for example, part of the URL is covered or the URL is misprinted) or incomplete.

Summary of the invention

According to various embodiments disclosed in the present application, a method, device, computer equipment, and storage medium for identifying a website address are provided.

A method for identifying URLs, including:

Obtain the picture to be recognized, and recognize the URL carried in the picture to be recognized through an OCR (Optical Character Recognition) tool;

When the identified URL is an incomplete URL, use the OCR tool to extract the characteristic information carried in the image to be identified;

Obtain the associated URL of the feature information fed back by a third-party Internet search engine; and

Match the associated URL with the incomplete URL to obtain a target URL.

A web address recognition device includes:

The first recognition module is configured to obtain the picture to be recognized, and recognize the URL carried in the picture to be recognized through the OCR tool;

The second recognition module is used to extract the characteristic information carried in the picture to be recognized through the OCR tool when the recognized website is an incomplete website;

The search module is used to obtain the associated URL of the feature information fed back by a third-party Internet search engine; and

The URL matching module is used to match the associated URL with the incomplete URL to obtain a target URL.

A computer device, including a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to execute the following steps:

Obtain the picture to be recognized, and use the OCR tool to recognize the URL carried in the picture to be recognized;

Match the associated URL with the incomplete URL to obtain a target URL.

One or more non-volatile computer-readable storage media storing computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the one or more processors execute the following steps:

Match the associated URL with the incomplete URL to obtain a target URL.

The details of one or more embodiments of the application are set forth in the following drawings and description. Other features and advantages of this application will become apparent from the description, drawings and claims.

Description of the drawings

In order to more clearly describe the technical solutions in the embodiments of the present application, the following will briefly introduce the drawings needed in the embodiments. Obviously, the drawings in the following description are only some embodiments of the present application. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative work.

Fig. 1 is an application scenario diagram of a method for identifying a website address according to one or more embodiments.

Fig. 2 is a schematic flowchart of a method for identifying a website address according to one or more embodiments.

Fig. 3 is a schematic flowchart of a method for identifying a website address in another embodiment.

Fig. 4 is a block diagram of a website identification device according to one or more embodiments.

Figure 5 is a block diagram of a computer device according to one or more embodiments.

Detailed description

In order to make the technical solutions and advantages of the present application clearer, the following further describes the present application in detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the application, and not used to limit the application.

The URL identification method provided in this application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 through a network. The terminal 102 sends the picture to be recognized to the server 104 via the network. The server 104 receives the picture and uses the OCR tool to recognize the URL it contains. When the recognized URL is complete, the server 104 feeds the complete URL back to the terminal 102, or directly accesses the URL link and returns the access feedback data to the terminal 102, so that the user can browse the information corresponding to the URL through the terminal. When the recognized URL is incomplete, the server 104 extracts the feature information carried in the picture through the OCR tool, obtains the associated URLs of the feature information fed back by a third-party Internet search engine, and matches the associated URLs with the incomplete URL to obtain the target URL. The server 104 can then send the target URL to the terminal 102, or directly access the URL link and return the access feedback data to the terminal 102. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server 104 may be implemented as an independent server or a server cluster composed of multiple servers.

In one embodiment, as shown in FIG. 2, a method for identifying a website address is provided. Taking the method applied to the server in FIG. 1 as an example for description, the method includes the following steps:

S200: Obtain a picture to be recognized, and use an OCR tool to recognize the URL carried in the picture to be recognized.

OCR is a process in which an electronic device (such as a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then uses character recognition methods to translate the shapes into computer text. The picture to be recognized is the picture from which a URL is to be recognized. It can be a screenshot taken on the terminal, a picture downloaded from the Internet, or a picture received while chatting in a social application; the URL carried in the picture is recognized through the OCR tool. It should be pointed out that when the picture does not carry a URL, the recognition result is blank. Take the chat scenario as an example: user A wants to share a food article with friend B, so user A takes a screenshot that contains the URL of the food article and sends it to friend B. In this scenario, the screenshot containing the URL of the food article is the picture to be recognized. Take the Internet download scenario as another example: user A comes across a product introduction picture on the Internet that carries the URL of the product introduction webpage, downloads the picture, and recognizes the URL it carries through the URL identification method of this application. In this scenario, the product introduction picture is the picture to be recognized.
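As an illustration of the last part of step S200, the following minimal sketch pulls a URL-like token out of text that an OCR tool has already produced. The OCR call itself is not shown, and the pattern `URL_PATTERN` and the function name `extract_url` are assumptions made for this example, not part of the application. An empty string stands for the "blank" result described above.

```python
import re

# Hypothetical pattern: a token that starts with http://, https:// or www.
# followed by characters that are legal in a URL.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")


def extract_url(ocr_text: str) -> str:
    """Return the first URL-like token in the OCR output, or "" if there is none."""
    match = URL_PATTERN.search(ocr_text)
    if match is None:
        return ""  # the picture carries no recognizable URL
    # OCR output often leaves sentence punctuation attached to the token.
    return match.group(0).rstrip(".,;:!?)")
```

The result may still be an incomplete URL; deciding that is a separate step.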

S400: When the recognized web address is an incomplete web address, extract the characteristic information carried in the picture to be recognized through an OCR tool.

Generally speaking, URLs are generated according to industry rules, so analyzing a recognized URL can determine accurately whether it is complete. When it is incomplete, the OCR tool is used again to identify feature information in the picture to be recognized. The feature information mainly includes text feature information and graphic logo information. For text feature information, the text data in the picture is collected into a text data set, the text data is cleaned, and keywords are extracted. Keyword recognition can draw on a set of website-related keywords built from historical experience data, such as Ping An XX, Phoenix XX, and Sina XX. The graphic logo information can be, for example, a brand's trademark or the shape of an article.

S600: Obtain the associated URL of the characteristic information fed back by the third-party Internet search engine.

The query for URLs associated with the feature information can be completed by a third-party Internet search engine. The server pushes the feature information to the third-party Internet search engine, which uses Internet big data query technology to find the URLs of content related to the feature information. For example, when the extracted feature information is "Ping An Technology X", the server communicates with the search engine server via the Internet, sends "Ping An Technology X" to it, and obtains the URLs related to "Ping An Technology X". When the extracted feature information is the trademark of a brand, the trademark is sent to the search engine server and the websites related to the brand can be queried, including the brand's official website and the URLs of related advertisements, product introductions, and news. There can be one or more associated URLs. The third-party search engine can be a commonly used web search engine, such as Baidu or Google. The server accesses the search engine, sends the feature information to it, and receives the returned data to obtain the associated URLs of the feature information.

S800: Match the associated URL with the incomplete URL to obtain the target URL.

Since the incomplete URL is part of the complete URL (the target URL), it can be used as a matching condition: the incomplete URL is matched against the URLs returned by the query. A successful match means that the target URL has been found among the returned URLs, and the browser is called to open the target URL, so that the URL is identified efficiently and accurately.
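One simple reading of step S800 is a containment test: keep the associated URLs that contain the incomplete URL as a fragment. The sketch below assumes case-insensitive comparison, and the function name `match_by_fragment` is hypothetical; the application does not prescribe a particular matching rule at this point.

```python
from typing import Iterable, List


def match_by_fragment(incomplete_url: str, associated_urls: Iterable[str]) -> List[str]:
    """Return the associated URLs that contain the incomplete URL as a substring."""
    fragment = incomplete_url.strip().lower()
    if not fragment:
        return []  # nothing was recognized, so there is nothing to match on
    return [url for url in associated_urls if fragment in url.lower()]
```

If more than one candidate survives, a ranking step such as similarity matching is still needed to pick a single target URL.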

In the above URL identification method, the OCR tool identifies the URL in the picture to be recognized. If the directly identified URL is incomplete, the feature information in the picture is identified, a third-party Internet search engine is accessed to obtain the associated URLs related to the feature information, and the directly identified URL is matched with the associated URLs to obtain the complete URL. The URL carried in the picture can therefore be identified accurately.
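The control flow of steps S200 to S800 can be sketched as a single function whose collaborators are passed in as callables. All parameter names here (`recognize_url`, `extract_features`, `search`, `is_complete`, `match`) are hypothetical placeholders for the OCR tool, the search engine client, and the matching rule; the sketch shows only the order of the steps, not how any of them is implemented.

```python
from typing import Callable, Iterable, List, Optional


def identify_url(
    picture: bytes,
    recognize_url: Callable[[bytes], str],
    extract_features: Callable[[bytes], Iterable[str]],
    search: Callable[[str], Iterable[str]],
    is_complete: Callable[[str], bool],
    match: Callable[[str, List[str]], Optional[str]],
) -> Optional[str]:
    """Run the four steps in order and return the target URL, or None."""
    url = recognize_url(picture)                 # S200: OCR the URL in the picture
    if is_complete(url):
        return url                               # a complete URL needs no further work
    associated: List[str] = []
    for feature in extract_features(picture):    # S400: text or graphic features
        associated.extend(search(feature))       # S600: ask the search engine
    return match(url, associated)                # S800: pick the target URL
```

Passing the collaborators in keeps the flow testable without a real OCR engine or network access.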

As shown in Figure 3, in one of the embodiments, step S200 includes:

S210: Perform image grayscale processing and edge detection on the picture to be recognized, and perform straight line detection based on Hough transform.

S220: Perform Radon transformation on the straight line detection result, calculate the projection area in each direction, find the angle when the projection area has the smallest width, and use the searched angle as a tilt correction angle for tilt correction processing.

S230: Binarize the grayscale image after the tilt correction, and determine the area carrying the website information based on the horizontal projection and the vertical projection obtained after the binarization process.

S240: Cut the area carrying the URL information, and perform zoom processing on the cut image according to a preset size.

S250: Identify the URL carried in the zoomed image through the OCR tool.
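Step S230 can be sketched in plain Python as follows. The threshold is chosen with the maximum between-class variance (Otsu) method, and the text region is the span of rows and columns whose projection is non-empty. The sketch assumes dark text on a light background and an image given as a list of rows of 0-255 integers; the function names and the `min_ink` parameter are assumptions made for this example.

```python
def otsu_threshold(gray):
    """Return the threshold t (0-255) that maximizes the between-class variance.

    Pixels with intensity <= t form one class, the rest form the other.
    """
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with intensity <= t
    sum0 = 0  # sum of the intensities of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance is undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(gray, t):
    """Mark dark pixels (intensity <= t) as ink (1) and all others as background (0)."""
    return [[1 if value <= t else 0 for value in row] for row in gray]


def text_region(binary, min_ink=1):
    """Return (top, bottom, left, right), inclusive, of the rows and columns with ink.

    min_ink is a hypothetical stand-in for an empirically chosen projection threshold.
    """
    rows = [sum(row) for row in binary]        # horizontal projection
    cols = [sum(col) for col in zip(*binary)]  # vertical projection
    ys = [y for y, count in enumerate(rows) if count >= min_ink]
    xs = [x for x, count in enumerate(cols) if count >= min_ink]
    if not ys or not xs:
        return None
    return ys[0], ys[-1], xs[0], xs[-1]
```

The returned bounds can then be used to cut the region out before scaling (step S240).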

The tilt correction specifically includes grayscale processing of the image, Canny edge detection, straight line detection based on the Hough transform, and a Radon transform of the line detection result. The projection area in each direction is calculated, the angle at which the projection area is narrowest is taken as the tilt direction, and the original input image (a business card in this example) is rotated by this angle to correct it. Cutting includes binarizing the grayscale image after tilt correction, with the threshold determined by the maximum between-class variance method; the card area is then located from the horizontal and vertical projections, with the projection threshold determined empirically, and cut out. Scaling includes resizing the cut-out area to the preset size, using bilinear interpolation. In this embodiment, the picture to be recognized is preprocessed first so that OCR recognition can be performed more efficiently and accurately.
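The scaling step can be illustrated with a small bilinear interpolation routine. This is a minimal sketch, assuming a grayscale image stored as a list of rows and an "align corners" mapping in which the corner pixels of the source and the output coincide; the function name `resize_bilinear` is hypothetical, and a production system would normally rely on an image library instead.

```python
def resize_bilinear(img, new_w, new_h):
    """Resize a grayscale image (list of rows) to new_w x new_h by bilinear interpolation."""
    h, w = len(img), len(img[0])
    out = []
    for j in range(new_h):
        # Map the output row to a fractional source row.
        y = j * (h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        fy = y - y0
        row = []
        for i in range(new_w):
            # Map the output column to a fractional source column.
            x = i * (w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            fx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out
```

For example, `resize_bilinear([[0, 100], [100, 200]], 3, 3)` fills the new middle row and column with the averages of their neighbours.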

In one of the embodiments, before the URL carried in the zoomed image is recognized by the OCR tool, the method further includes: using mathematical morphology processing and connected area analysis to extract independent characters from the zoomed image, and using the extracted independent characters as sub-images. Recognizing the URL carried in the zoomed image by the OCR tool then includes: recognizing the URL carried in the sub-images by the OCR tool.

Mathematical morphology processing is performed on the binarization result image to preserve the real character areas. It includes image dilation, image erosion, opening, closing, connected area analysis, noise removal, and abnormal area removal. Connected area analysis is then performed on the binarization result image in which the real characters have been retained, each connected area is dilated horizontally, and the connected areas are analyzed again to obtain the circumscribed rectangle of each new connected area. Finally, each block area is extracted as a sub-image based on its circumscribed rectangle. There may be multiple sub-images; in that case, the URL text carried in each sub-image is obtained separately, and the results are combined in order to obtain the URL carried in the picture to be recognized.
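The connected area analysis and circumscribed rectangles can be sketched with a breadth-first search over a binary image. The sketch assumes 8-connectivity and an image given as rows of 0/1 values with 1 meaning ink. It omits the morphological clean-up and the horizontal dilation described above, and the function name `character_boxes` is an assumption made for this example.

```python
from collections import deque


def character_boxes(binary):
    """Return bounding boxes (left, top, right, bottom) of 8-connected ink regions.

    Boxes are sorted by their left edge so that sub-images can be read in order.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one connected region and track its extent.
            queue = deque([(y, x)])
            seen[y][x] = True
            left = right = x
            top = bottom = y
            while queue:
                cy, cx = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return sorted(boxes)
```

Each box can then be cropped out as a sub-image and passed to the OCR tool, with the per-box results joined from left to right.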

As shown in FIG. 3, in one of the embodiments, before step S400, the method further includes:

S300: Analyze the recognized URL and determine whether its last identification character is a preset URL end identification character.

The preset URL end identification characters are standard characters set according to industry standards, such as the conventional .cn, .com, or .html. For example, https://baike does not end with such a character and is therefore an incomplete URL. When the recognized URL is incomplete, the method proceeds to step S400.
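The check in step S300 can be sketched as a suffix test. The tuple `END_IDENTIFIERS` below contains only the three examples given in the text; a real preset list would be longer, and both names are assumptions made for this example.

```python
# Example end identifiers taken from the description; a real preset list would be larger.
END_IDENTIFIERS = (".cn", ".com", ".html")


def is_complete_url(url: str) -> bool:
    """Return True if the recognized URL ends with a preset end identifier."""
    if not url:
        return False  # a blank recognition result is never complete
    return url.lower().endswith(END_IDENTIFIERS)
```

With this rule, `https://baike` is treated as incomplete and the method proceeds to step S400.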

In one of the embodiments, the feature information includes text feature information, and obtaining the associated URL of the feature information fed back by a third-party Internet search engine includes:

Performing word segmentation on the text feature information to obtain multiple segmented words; extracting the network words among the segmented words according to a preset network feature word database; pushing the network words to the third-party Internet search engine; and receiving the associated URLs of the network words found by the third-party Internet search engine.

Word segmentation divides a complete paragraph or sentence into multiple segmented words. The segmented words are looked up in the preset network feature word database to see whether any of them are network words, and the network words are then sent to the third-party Internet search engine to find the corresponding associated URLs. The preset network feature word database is built from historical experience and can be updated continuously in daily use. Network words can be company entities, product names, celebrity names, popular Internet locations, and so on. Generally speaking, network words are words for which related content can be found on the Internet.
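The lookup against the network feature word database can be illustrated as follows. This sketch swaps in a simple technique, greedy longest-phrase matching over whitespace-separated tokens, as a stand-in for a real word segmenter; the application does not specify which segmentation algorithm is used. The function name `extract_network_words` and the `max_words` parameter are hypothetical.

```python
import re
from typing import Iterable, List


def extract_network_words(text: str, network_words: Iterable[str], max_words: int = 4) -> List[str]:
    """Return the phrases in text that appear in the network word database.

    At each position the longest phrase (up to max_words tokens) is tried first.
    """
    tokens = re.findall(r"\w+", text)
    lexicon = {word.lower() for word in network_words}
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase.lower() in lexicon:
                found.append(phrase)
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts here, move on by one token
    return found
```

The returned phrases are what would be pushed to the search engine as queries.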

In one of the embodiments, the feature information includes graphic feature information, and obtaining the associated URL of the feature information fed back by the third-party Internet search engine includes: pushing the graphic feature information to the third-party Internet search engine; and receiving the associated URL of the graphic feature information found by the third-party Internet search engine. The associated URL is obtained by the third-party Internet search engine by first searching for the product information or company entity name associated with the graphic feature information and then searching on that product information or company entity name.

The graphic feature information may be, for example, a company trademark or a product's distinctive appearance. Based on the graphic feature information, the associated product information or company entity name is identified, and the URL associated with that product information or company entity name is then searched for. Take liquor as an example: many liquor producers use uniquely shaped bottles. When the feature information includes bottle shape data, a third-party Internet search engine can use big data to find the liquor product and/or the company that produces it from the bottle shape data, and then search for the websites associated with that product and/or company.

In one of the embodiments, matching the associated URL with the incomplete URL to obtain the target URL includes: performing similarity matching between the associated URL and the incomplete URL, and selecting the associated URL with the highest similarity matching result as the target URL.

Since there may be multiple associated URLs, in this embodiment, the method of similarity matching is used to select the URL with the highest similarity from the associated URLs as the target URL to achieve efficient and accurate identification of the target URL.
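Similarity matching can be sketched with the standard library's `difflib.SequenceMatcher`, whose `ratio()` gives a score between 0 and 1. The choice of this particular similarity measure is an assumption made for illustration, as is the function name `best_match`; the application does not name a specific metric.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence


def best_match(incomplete_url: str, associated_urls: Sequence[str]) -> Optional[str]:
    """Return the associated URL most similar to the incomplete URL, or None."""
    if not associated_urls:
        return None

    def score(candidate: str) -> float:
        # ratio() is 2 * matches / total length of both strings.
        return SequenceMatcher(None, incomplete_url, candidate).ratio()

    return max(associated_urls, key=score)
```

A minimum score could also be required before accepting a candidate, so that an unrelated URL is not returned when none of the results is close.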

In order to further explain the technical solution of the above-mentioned URL identification method in detail, a specific application example will be used for description below.

In an application example, the user sends a picture of a makeup product to the server through the terminal. The picture is a screenshot of a makeup product introduction. The server receives the picture to be recognized and uses the OCR tool to identify the URL https://ABCD in it. The server analyzes the URL and determines that it does not carry a URL end identification character, so it is an incomplete URL. The server then uses the OCR tool to extract the feature information carried in the picture, including "XX Beauty Cream" and a product shape that resembles a face. Using big data, it searches for websites related to XX Beauty Cream or to the face-like product shape and obtains three associated URLs: 1. https://ABMP.com; 2. https://ATMP.com; 3. https://ABCDMPQ.com. Matching these three associated URLs with the incomplete URL https://ABCD yields the target URL https://ABCDMPQ.com.

It should be understood that although the steps in the flowcharts of FIGS. 2-3 are displayed in sequence as indicated by the arrows, they are not necessarily executed in that order. Unless specifically stated herein, the execution of these steps is not strictly limited in order, and they can be executed in other orders. Moreover, at least some of the steps in FIGS. 2-3 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily executed at the same time but can be executed at different times, and they are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

As shown in Fig. 4, a device for identifying a website address includes:

The first recognition module 200 is configured to obtain the picture to be recognized, and use the OCR tool to recognize the URL carried in the picture to be recognized.

The second recognition module 400 is used for extracting the characteristic information carried in the picture to be recognized by the OCR tool when the recognized website is an incomplete website.

The search module 600 is used to obtain the associated website address of the characteristic information fed back by the third-party Internet search engine.

The URL matching module 800 is used to match an associated URL with an incomplete URL to obtain a target URL.

In the above URL recognition device, the first recognition module 200 uses the OCR tool to recognize the URL in the picture to be recognized. If the directly recognized URL is incomplete, the second recognition module 400 recognizes the feature information in the picture, the search module 600 accesses the third-party Internet search engine to obtain the associated URLs related to the feature information, and the URL matching module 800 matches the directly recognized URL with the associated URLs to obtain the complete URL. The URL carried in the picture can therefore be identified accurately.

In one of the embodiments, the first recognition module 200 is also used to: perform image grayscale processing and edge detection on the picture to be recognized, and perform straight line detection based on the Hough transform; perform a Radon transform on the line detection result, calculate the projection area in each direction, find the angle at which the projection area is narrowest, and use that angle as the tilt correction angle for tilt correction; binarize the grayscale image after tilt correction, and determine the area carrying the URL information based on the horizontal and vertical projections obtained after binarization; cut out the area carrying the URL information, and scale the cut image to the preset size; and recognize the URL carried in the zoomed image through the OCR tool.

In one of the embodiments, the first recognition module 200 is further configured to: use mathematical morphology processing and connected area analysis to extract independent characters from the zoomed image; use the extracted independent characters as sub-images; and recognize the URL carried in the sub-images through the OCR tool.

In one of the embodiments, the above-mentioned URL recognition device further includes a judgment module, which is used to perform a website analysis on the recognized website and determine whether the last identification character in the recognized website is a preset website ending identification character.

In one of the embodiments, the feature information includes text feature information, and the search module 600 is also used to: perform word segmentation on the text feature information to obtain multiple segmented words; extract the network words among the segmented words according to a preset network feature word database; push the network words to a third-party Internet search engine; and receive the associated URLs of the network words found by the third-party Internet search engine.

In one of the embodiments, the feature information includes graphic feature information, and the search module 600 is also used to: push the graphic feature information to a third-party Internet search engine; and receive the associated URL of the graphic feature information found by the third-party Internet search engine. The associated URL is obtained by the third-party Internet search engine by first searching for the product information or company entity name associated with the graphic feature information and then searching on that product information or company entity name.

In one of the embodiments, the URL matching module 800 is used to perform similarity matching between associated URLs and incomplete URLs; the associated URL with the highest similarity matching result is selected as the target URL.

For the specific limitations of the URL recognition device, refer to the limitations of the URL identification method above, which are not repeated here. Each module in the above URL recognition device can be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, the processor of the computer device in the form of hardware, or may be stored in the memory of the computer device in the form of software, so that the processor can call them and execute the operations corresponding to each module.

In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure diagram may be as shown in FIG. 5. The computer equipment includes a processor, a memory, a network interface and a database connected through a system bus. Among them, the processor of the computer device is used to provide calculation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The database of the computer equipment is used to store data such as characteristic information and associated web addresses. The network interface of the computer device is used to communicate with an external terminal through a network connection. When the computer program is executed by the processor, a method for identifying a website address is realized.

Those skilled in the art can understand that the structure shown in FIG. 5 is only a block diagram of the part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied. A specific computer device may include more or fewer parts than shown in the figure, combine some parts, or have a different arrangement of parts.

A computer device includes a memory and one or more processors. The memory stores computer-readable instructions. When the computer-readable instructions are executed by the one or more processors, the one or more processors implement the steps of the URL identification method provided in any of the embodiments of the present application.

One or more non-volatile computer-readable storage media store computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the one or more processors implement the steps of the URL identification method provided in any of the embodiments of the present application.

A person of ordinary skill in the art can understand that all or part of the processes in the above method embodiments can be implemented by computer-readable instructions instructing the relevant hardware. The instructions can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. To keep the description concise, not all possible combinations of the technical features in the above embodiments are described. However, as long as a combination of these technical features contains no contradiction, it should be considered to fall within the scope of this specification.

The above-mentioned embodiments only express several implementation manners of the present application, and the description is relatively specific and detailed, but it should not be understood as a limitation on the scope of the invention patent. It should be pointed out that for those of ordinary skill in the art, without departing from the concept of this application, several modifications and improvements can be made, and these all fall within the protection scope of this application. Therefore, the scope of protection of the patent of this application shall be subject to the appended claims.

Claims

A method for identifying URLs, comprising:

Obtain the picture to be recognized, and identify the URL carried in the picture to be recognized through an optical identifier tool;

When the recognized URL is an incomplete URL, use the optical identifier tool to extract the feature information carried in the picture to be recognized;

Obtain the associated URL of the feature information fed back by a third-party Internet search engine; and

Match the associated URL with the incomplete URL to obtain a target URL.
The method according to claim 1, wherein the obtaining the picture to be recognized, and recognizing the URL carried in the picture to be recognized by an optical identifier tool comprises:

Perform image grayscale processing and edge detection on the picture to be recognized, and perform straight line detection based on Hough transform;

Perform a Radon transform on the straight-line detection result, calculate the projection area in each direction, find the angle at which the width of the projection area is smallest, and use that angle as the tilt correction angle for tilt correction processing;

Binarize the tilt-corrected grayscale image, and determine the area carrying the URL information based on the horizontal projection and vertical projection obtained after the binarization process;

Crop the area carrying the URL information, and scale the cropped image according to a preset size; and

Recognize the URL carried in the scaled image through the optical identifier tool.
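
As a hedged illustration of the projection step recited above, the following minimal Python sketch locates and crops the foreground area of an already binarized image by means of its horizontal and vertical projections. The names `projection_bounds`, `locate_text_region`, `crop` and the `min_count` threshold are assumptions introduced for this example only; the application does not name any functions or thresholds, and the tilt correction and scaling steps are not shown.

```python
from typing import List, Tuple

Binary = List[List[int]]  # 1 = foreground (ink), 0 = background


def projection_bounds(profile: List[int], min_count: int = 1) -> Tuple[int, int]:
    """Return the first and last index whose projection value reaches min_count."""
    hits = [i for i, value in enumerate(profile) if value >= min_count]
    if not hits:
        raise ValueError("no foreground pixels found")
    return hits[0], hits[-1]


def locate_text_region(img: Binary) -> Tuple[int, int, int, int]:
    """Return (top, bottom, left, right) of the foreground, inclusive."""
    horizontal = [sum(row) for row in img]      # horizontal projection: one sum per row
    vertical = [sum(col) for col in zip(*img)]  # vertical projection: one sum per column
    top, bottom = projection_bounds(horizontal)
    left, right = projection_bounds(vertical)
    return top, bottom, left, right


def crop(img: Binary, box: Tuple[int, int, int, int]) -> Binary:
    """Cut out the rectangle described by box (inclusive bounds)."""
    top, bottom, left, right = box
    return [row[left:right + 1] for row in img[top:bottom + 1]]
```

In practice a `min_count` above 1 may help ignore isolated noise pixels; that choice is an assumption of this sketch and is not taken from the application.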
The method according to claim 2, wherein before said recognizing the web address carried in the image to be recognized after the zooming process by the optical identifier tool, the method further comprises:

Use mathematical morphology processing and connected region analysis to extract independent characters from the image to be recognized after the scaling process; and

Use the extracted independent characters as sub-images;

wherein recognizing the URL carried in the image to be recognized after the scaling process through the optical identifier tool comprises:

Identify the URL carried in the sub-images through the optical identifier tool.
The method according to claim 3, wherein said using mathematical morphology processing technology and connected region analysis technology to extract independent characters from said image to be recognized after said scaling process comprises:

Perform mathematical morphology processing on the image to be recognized after the scaling process to obtain a retained image, the mathematical morphology processing including image dilation, image erosion, opening operation, closing operation, connected region analysis, noise removal, and abnormal region removal;

Perform connected region analysis on the retained image to determine the connected regions in the retained image;

Perform horizontal dilation on each connected region, and perform connected region analysis again on the horizontally dilated connected regions to obtain new connected regions;

Solve the circumscribed rectangle of each new connected region; and

Extract independent characters according to the circumscribed rectangle.
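
The horizontal dilation, connected region analysis, and circumscribed rectangle steps recited above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the image is a binary grid, regions are 4-connected, and the names `dilate_horizontally`, `bounding_boxes` and the `radius` parameter are hypothetical and do not come from the application.

```python
from collections import deque
from typing import List, Tuple

Binary = List[List[int]]         # 1 = foreground, 0 = background
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def dilate_horizontally(img: Binary, radius: int = 1) -> Binary:
    """Set a pixel if any pixel within `radius` columns in the same row is set."""
    width = len(img[0])
    return [
        [int(any(row[max(0, x - radius):min(width, x + radius + 1)]))
         for x in range(width)]
        for row in img
    ]


def bounding_boxes(img: Binary) -> List[Box]:
    """Label 4-connected regions by BFS and return their circumscribed rectangles."""
    height, width = len(img), len(img[0])
    seen = [[False] * width for _ in range(height)]
    boxes: List[Box] = []
    for y in range(height):
        for x in range(width):
            if not img[y][x] or seen[y][x]:
                continue
            top, left, bottom, right = y, x, y, x
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and img[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return sorted(boxes, key=lambda box: box[1])  # left-to-right order
```

Running `bounding_boxes` on the output of `dilate_horizontally` merges strokes separated by a narrow horizontal gap into one region, so a character whose parts are slightly apart can be extracted as a single sub-image. The rectangles are then measured on the dilated image and are slightly wider than the original strokes.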
The method according to claim 1, wherein before the extracting of the feature information carried in the picture to be recognized by the optical identifier tool when the recognized URL is an incomplete URL, the method further comprises:

Perform URL analysis on the recognized URL to determine whether the last identification character in the recognized URL is a preset URL end identification character.
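
One way to picture the end-identifier check recited above is the short Python sketch below. The claim only states that the end identification character is preset; the particular suffixes in `DEFAULT_END_IDENTIFIERS` and the function name `is_complete_url` are assumptions chosen for illustration.

```python
from typing import Iterable

# Hypothetical preset end identifiers; the application does not list them.
DEFAULT_END_IDENTIFIERS = (".com", ".cn", ".net", ".org", ".html", ".htm", "/")


def is_complete_url(url: str,
                    end_identifiers: Iterable[str] = DEFAULT_END_IDENTIFIERS) -> bool:
    """Return True when the recognized URL ends with one of the preset identifiers."""
    candidate = url.strip().lower()
    return any(candidate.endswith(identifier) for identifier in end_identifiers)
```

Under these assumed identifiers, a recognized string such as `www.example.co` would be treated as incomplete and would trigger the feature-information path, while `www.example.com` would not.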
The method according to claim 1, wherein the feature information includes text feature information, and the obtaining of the associated URL of the feature information fed back by a third-party Internet search engine comprises:

Perform word segmentation processing on the text feature information to obtain multiple segmented words;

Extract network words from the multiple segmented words according to a preset network feature word database;

Push the network words to the third-party Internet search engine; and

Receive the associated URL of the network words found by the third-party Internet search engine.
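
The word segmentation and network word extraction recited above might be sketched as follows. This is an illustration under several assumptions: a simple regular expression stands in for a real word segmenter (Chinese text would need a dedicated one), the lexicon passed in plays the role of the preset network feature word database, and the search engine is represented by a caller-supplied function because the application does not specify any search API. All names are hypothetical.

```python
import re
from typing import Callable, List, Set


def segment(text: str) -> List[str]:
    """Naive stand-in for word segmentation: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_network_words(text: str, lexicon: Set[str]) -> List[str]:
    """Keep segmented words found in the lexicon, in order, without duplicates."""
    seen: Set[str] = set()
    result: List[str] = []
    for word in segment(text):
        if word in lexicon and word not in seen:
            seen.add(word)
            result.append(word)
    return result


def associated_urls(text: str, lexicon: Set[str],
                    search: Callable[[str], List[str]]) -> List[str]:
    """Push the network words to the supplied search function and return its URLs."""
    query = " ".join(extract_network_words(text, lexicon))
    return search(query) if query else []
```

Passing the search function in as a parameter keeps the sketch independent of any particular search engine and makes the extraction logic easy to exercise with a stub.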
The method according to claim 1, wherein the feature information includes graphic feature information, and the obtaining of the associated URL of the feature information fed back by a third-party Internet search engine comprises:

Push the graphic feature information to the third-party Internet search engine; and

Receive the associated URL of the graphic feature information found by the third-party Internet search engine, wherein the third-party Internet search engine obtains the associated URL by searching for the product information or company entity name associated with the graphic feature information and then looking up that product information or company entity name.
The method according to claim 1, wherein the matching of the associated URL with the incomplete URL to obtain the target URL comprises:

Perform similarity matching on the associated URL and the incomplete URL; and

Select the associated URL with the highest similarity matching result as the target URL.
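
The similarity matching recited above can be illustrated with the minimal Python sketch below. The claim does not name a similarity measure; this sketch assumes the ratio computed by `difflib.SequenceMatcher` from the Python standard library, and the function names `similarity` and `pick_target_url` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, Optional


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] based on matching subsequences."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pick_target_url(incomplete_url: str,
                    associated_urls: Iterable[str]) -> Optional[str]:
    """Return the associated URL most similar to the incomplete URL, or None."""
    return max(
        associated_urls,
        key=lambda candidate: similarity(incomplete_url, candidate),
        default=None,
    )
```

Because an incomplete URL is typically a truncated prefix of the full address, a measure that rewards long shared runs of characters tends to favor the correct candidate; other measures, such as edit distance, would fit the claim equally well.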
A URL recognition device, comprising:

The first recognition module is used to obtain the picture to be recognized, and to recognize the URL carried in the picture to be recognized through an optical identifier tool;

The second recognition module is used to extract the feature information carried in the picture to be recognized by the optical identifier tool when the recognized URL is an incomplete URL;

The search module is used to obtain the associated URL of the feature information fed back by a third-party Internet search engine; and

The URL matching module is used to match the associated URL with the incomplete URL to obtain a target URL.
The device according to claim 9, wherein the first recognition module is further configured to: perform image grayscale processing and edge detection on the picture to be recognized, and perform straight line detection based on Hough transform; perform a Radon transform on the straight-line detection result, calculate the projection area in each direction, find the angle at which the width of the projection area is smallest, and use that angle as the tilt correction angle for tilt correction processing; binarize the tilt-corrected grayscale image, and determine the area carrying the URL information based on the horizontal projection and vertical projection obtained after the binarization process; crop the area carrying the URL information, and scale the cropped image according to a preset size; and recognize, through the optical identifier tool, the URL carried in the scaled image.
The device according to claim 10, wherein the first recognition module is further configured to: use mathematical morphology processing technology and connected region analysis technology to extract independent characters from the image to be recognized after the scaling process; use the extracted independent characters as sub-images; and identify the URL carried in the sub-images through the optical identifier tool.
The device according to claim 11, wherein the first recognition module is further configured to: perform mathematical morphology processing on the image to be recognized after the scaling process to obtain a retained image; perform connected region analysis on the retained image to determine the connected regions in the retained image; perform horizontal dilation on each connected region, and perform connected region analysis again on the horizontally dilated connected regions to obtain new connected regions; solve the circumscribed rectangle of each new connected region; and extract independent characters according to the circumscribed rectangle; wherein the mathematical morphology processing includes image dilation, image erosion, opening operation, closing operation, connected region analysis, noise removal, and abnormal region removal.
The device according to claim 9, wherein the feature information includes text feature information; the search module is further configured to: perform word segmentation processing on the text feature information to obtain multiple segmented words; extract network words from the multiple segmented words according to a preset network feature word database; push the network words to a third-party Internet search engine; and receive the associated URL of the network words found by the third-party Internet search engine.
The device according to claim 9, wherein the feature information includes graphic feature information; the search module is further configured to: push the graphic feature information to a third-party Internet search engine; and receive the associated URL of the graphic feature information found by the third-party Internet search engine, wherein the associated URL is obtained by the third-party Internet search engine searching for the product information or company entity name associated with the graphic feature information and then looking up that product information or company entity name.
A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:

Obtain the picture to be recognized, and identify the URL carried in the picture to be recognized through an optical identifier tool;

When the recognized URL is an incomplete URL, use the optical identifier tool to extract the feature information carried in the picture to be recognized;

Obtain the associated URL of the feature information fed back by a third-party Internet search engine; and

Match the associated URL with the incomplete URL to obtain a target URL.
The computer device according to claim 15, wherein the processor further executes the following steps when executing the computer-readable instruction:

Perform image grayscale processing and edge detection on the picture to be recognized, and perform straight line detection based on Hough transform;

Perform a Radon transform on the straight-line detection result, calculate the projection area in each direction, find the angle at which the width of the projection area is smallest, and use that angle as the tilt correction angle for tilt correction processing;

Binarize the tilt-corrected grayscale image, and determine the area carrying the URL information based on the horizontal projection and vertical projection obtained after the binarization process;

Crop the area carrying the URL information, and scale the cropped image according to a preset size; and

Recognize the URL carried in the scaled image through the optical identifier tool.
The computer device according to claim 16, wherein the processor further executes the following steps when executing the computer readable instruction:

Use mathematical morphology processing technology and connected region analysis technology to extract independent characters from the image to be recognized after the scaling process;

Use the extracted independent characters as sub-images; and

Identify the URL carried in the sub-image through an optical identifier tool.
One or more non-volatile computer-readable storage media storing computer-readable instructions, which when executed by one or more processors, cause the one or more processors to perform the following steps:

Obtain the picture to be recognized, and identify the URL carried in the picture to be recognized through an optical identifier tool;

When the recognized URL is an incomplete URL, use the optical identifier tool to extract the feature information carried in the picture to be recognized;

Obtain the associated URL of the feature information fed back by a third-party Internet search engine; and

Match the associated URL with the incomplete URL to obtain a target URL.
The storage medium according to claim 18, wherein the following steps are further executed when the computer-readable instructions are executed by the processor:

Perform image grayscale processing and edge detection on the picture to be recognized, and perform straight line detection based on Hough transform;

Perform a Radon transform on the straight-line detection result, calculate the projection area in each direction, find the angle at which the width of the projection area is smallest, and use that angle as the tilt correction angle for tilt correction processing;

Binarize the tilt-corrected grayscale image, and determine the area carrying the URL information based on the horizontal projection and vertical projection obtained after the binarization process;

Crop the area carrying the URL information, and scale the cropped image according to a preset size; and

Recognize the URL carried in the scaled image through the optical identifier tool.
The storage medium according to claim 19, wherein the following steps are further executed when the computer-readable instructions are executed by the processor:

Use mathematical morphology processing technology and connected region analysis technology to extract independent characters from the image to be recognized after the scaling process;

Use the extracted independent characters as sub-images; and

Identify the URL carried in the sub-image through an optical identifier tool.