WO2020258669A1 - Website identification method and apparatus, and computer device and storage medium - Google Patents

Publication number
WO2020258669A1
Authority
WO
WIPO (PCT)
Prior art keywords
url
image
recognized
picture
website
Prior art date
Application number
PCT/CN2019/118243
Other languages
French (fr)
Chinese (zh)
Inventor
王建华
何四燕
金志敏
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020258669A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words

Definitions

  • This application relates to a method, device, computer equipment and storage medium for identifying a website address.
  • a method, device, computer equipment, and storage medium for identifying a website address are provided.
  • a method for identifying URLs including:
  • a web address recognition device includes:
  • the first recognition module is configured to obtain the picture to be recognized, and recognize the URL carried in the picture to be recognized through the OCR tool;
  • the second recognition module is used to extract the characteristic information carried in the picture to be recognized through the OCR tool when the recognized website is an incomplete website;
  • the search module is used to obtain the associated URL of the feature information fed back by a third-party Internet search engine.
  • the URL matching module is used to match the associated URL with the incomplete URL to obtain a target URL.
  • a computer device including a memory and one or more processors; the memory stores computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to execute the following steps:
  • one or more non-volatile computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to execute the following steps:
  • Fig. 1 is an application scenario diagram of a method for identifying a website address according to one or more embodiments.
  • Fig. 2 is a schematic flowchart of a method for identifying a website address according to one or more embodiments.
  • Fig. 3 is a schematic flowchart of a method for identifying a website address in another embodiment.
  • Fig. 4 is a block diagram of a website identification device according to one or more embodiments.
  • Figure 5 is a block diagram of a computer device according to one or more embodiments.
  • the URL identification method provided in this application can be applied to the application environment as shown in FIG. 1.
  • the terminal 102 communicates with the server 104 through the network.
  • the terminal 102 sends the picture to be recognized to the server 104 via the network.
  • the server 104 receives the picture to be recognized, and uses the OCR tool to recognize the URL contained in the picture to be recognized.
  • when the recognized URL is a complete URL, the server 104 feeds the complete URL back to the terminal 102, or directly visits the URL and returns the access feedback data to the terminal 102, so that the user can browse the information corresponding to the URL through the terminal.
  • when the recognized URL is an incomplete URL, the server 104 extracts the feature information carried in the picture to be recognized through the OCR tool, obtains the associated URL of the feature information fed back by a third-party Internet search engine, and matches the associated URL with the incomplete URL to obtain the target URL.
  • the server 104 can then send the target URL to the terminal 102, or directly access the target URL and return the access feedback data to the terminal 102.
  • the terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server 104 may be implemented as an independent server or a server cluster composed of multiple servers.
  • a method for identifying a website address is provided. Taking the method applied to the server in FIG. 1 as an example for description, the method includes the following steps:
  • S200 Obtain a picture to be recognized, and use an OCR tool to recognize the URL carried in the picture to be recognized.
  • OCR (optical character recognition) is a process in which electronic devices (such as scanners or digital cameras) examine characters printed on paper, determine their shapes by detecting dark and light patterns, and then use character recognition methods to translate the shapes into computer text.
  • the picture to be recognized refers to the picture on which URL recognition is to be performed.
  • the picture to be recognized can be a picture obtained by a screenshot operation on the terminal, a picture downloaded from the Internet, or a picture exchanged while chatting through a social application; the URL carried in the picture is recognized through the OCR tool. It should be pointed out that when the picture to be recognized does not carry a URL, the URL obtained is blank.
  • take the chat scenario as an example. User A shares a food article with friend B: user A operates the terminal to take a screenshot that contains the URL of the food article and sends the screenshot to friend B. The picture containing the URL of the food article is the picture to be recognized.
  • take the Internet download scenario as another example. User A finds a product introduction picture on the Internet that carries the URL of the product introduction webpage. User A downloads the picture and recognizes the URL it carries through the URL identification method of this application. The product introduction picture is the picture to be recognized.
  • the feature information mainly includes text feature information and graphic logo information.
  • for text feature information, the text data in the picture to be recognized is collected into a text data set, the text data is cleaned, and keywords are extracted.
  • keyword recognition can draw on historical experience data to obtain a set of website-related keywords, such as Ping An XX, Phoenix XX, Sina XX, etc.
  • the graphic logo information can specifically be a brand's trademark, the shape of an article, etc.
  • the query for URLs associated with the feature information can be completed by a third-party Internet search engine: the server pushes the feature information to the third-party Internet search engine, which uses Internet big-data query technology to find the URLs of content related to the feature information.
  • for example, when the extracted feature information is "Ping An Technology X", the server communicates with the search engine server via the Internet, sends "Ping An Technology X" to it, and obtains the URLs related to "Ping An Technology X".
  • when the extracted feature information is the trademark of a certain brand, the trademark is sent to the search engine server and the websites related to the brand can be queried. These include the brand's official website and the URLs of related advertisements, product introductions, and related news. There can be one or more associated URLs.
  • the third-party search engine can be any common web search engine, such as the Baidu search engine or the Google search engine. The server accesses the search engine, sends the feature information to it, and receives the returned data to obtain the associated URLs of the feature information.
  • S800 Match the associated URL with the incomplete URL to obtain the target URL.
  • because the incomplete URL is part of the complete URL (the target URL), it can serve as a matching condition for the complete URL: the incomplete URL is matched against the URLs returned by the query.
  • if the match succeeds, the target URL has been found among the returned URLs, and the browser is called to open the target URL, which achieves efficient and accurate identification of the URL.
  • the above URL identification method uses the OCR tool to identify the URL in the picture to be recognized. If the directly identified URL is incomplete, the method identifies the feature information in the picture, accesses a third-party Internet search engine to obtain the associated URLs related to the feature information, and matches the directly identified URL with the associated URLs to obtain a complete URL, so that the URL carried in the picture can be identified accurately.
  • step S200 includes:
  • S210 Perform image grayscale processing and edge detection on the picture to be recognized, and perform straight line detection based on Hough transform.
  • S220 Perform Radon transformation on the straight line detection result, calculate the projection area in each direction, find the angle when the projection area has the smallest width, and use the searched angle as a tilt correction angle for tilt correction processing.
  • S230 Binarize the grayscale image after the tilt correction, and determine the area carrying the website information based on the horizontal projection and the vertical projection obtained after the binarization process.
  • S240 Cut the area carrying the URL information, and perform zoom processing on the cut image according to a preset size.
  • the tilt correction specifically includes grayscale processing of the image, Canny edge detection, straight line detection based on the Hough transform, a Radon transform on the straight line detection result, and calculation of the projection area in each direction; the angle at which the projection area has the smallest width is the tilt direction, and the original input business card is then rotated and corrected by this angle.
  • Clipping includes binarization of the grayscale image after tilt correction, with the binarization threshold determined by the maximum between-class variance method. The area of the business card is then determined based on the horizontal projection and the vertical projection, with the threshold for this step determined empirically, and the business card area is cut out.
  • Scaling includes: scaling the cut-out business card area to the initially set size, using bilinear interpolation when scaling.
  • the picture to be recognized is pre-processed first so that OCR recognition can be performed more efficiently and accurately.
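
The binarization and projection part of this pre-processing (step S230) can be illustrated with a minimal pure-Python sketch. It assumes a grayscale image stored as nested lists with dark text on a light background; the function names `otsu_threshold`, `binarize` and `text_region` are hypothetical and do not come from the application. Tilt correction (Hough and Radon transforms) and scaling are left out, since they would normally rely on an image processing library.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # rows of grayscale values in 0..255


def otsu_threshold(gray: Image) -> int:
    """Return the threshold that maximizes the between-class variance."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance is undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(gray: Image, threshold: int) -> Image:
    """Mark dark pixels (assumed to be text) as 1 and the background as 0."""
    return [[1 if value <= threshold else 0 for value in row] for row in gray]


def text_region(binary: Image) -> Optional[Tuple[int, int, int, int]]:
    """Locate the text area as (top, bottom, left, right) from the projections."""
    horizontal = [sum(row) for row in binary]      # one sum per image row
    vertical = [sum(col) for col in zip(*binary)]  # one sum per image column
    rows = [i for i, s in enumerate(horizontal) if s > 0]
    cols = [j for j, s in enumerate(vertical) if s > 0]
    if not rows or not cols:
        return None  # no foreground pixels, nothing to cut
    return rows[0], rows[-1], cols[0], cols[-1]
```

The sketch treats any row or column with at least one foreground pixel as part of the text area; an empirically chosen projection threshold, as the text mentions, could replace the `s > 0` test.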
  • before recognizing the URL carried in the scaled picture through the OCR tool, the method further includes: using mathematical morphology processing and connected area analysis to extract independent characters from the scaled picture, and using the extracted independent characters as sub-images. Recognizing the URL carried in the scaled picture through the OCR tool then includes: recognizing the URL carried in the sub-images through the OCR tool.
  • mathematical morphology processing is performed on the binarization result to preserve the real character areas.
  • Mathematical morphology processing includes image dilation, image erosion, opening, closing, connected area analysis, noise removal, and abnormal area removal. Connected area analysis is performed on the binarized result after the real characters are retained, each connected area is dilated horizontally, and the connected areas are then analyzed again to obtain the circumscribed rectangle of each new connected area. Finally, the block areas are extracted as sub-images based on the circumscribed rectangles. There may be multiple sub-images. When there are multiple sub-images, the URL fragment carried in each sub-image is obtained separately, and the fragments are combined in order to obtain the URL carried in the picture to be recognized.
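
As a rough illustration of the horizontal dilation and connected area analysis described above, the following sketch works on a binary image stored as nested lists (1 for character pixels). The names `dilate_horizontally` and `bounding_boxes`, the 4-neighbour connectivity and the `radius` parameter are assumptions made for the example, not details given in the application.

```python
from collections import deque
from typing import List, Tuple

Binary = List[List[int]]             # 1 = character pixel, 0 = background
Box = Tuple[int, int, int, int]      # (top, left, bottom, right), inclusive


def dilate_horizontally(binary: Binary, radius: int = 1) -> Binary:
    """Spread every foreground pixel `radius` pixels to the left and right."""
    height, width = len(binary), len(binary[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if binary[y][x]:
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    out[y][nx] = 1
    return out


def bounding_boxes(binary: Binary) -> List[Box]:
    """Return the circumscribed rectangle of each 4-connected area, left to right."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes: List[Box] = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first search over one connected area.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < height and 0 <= nx < width and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return sorted(boxes, key=lambda box: (box[1], box[0]))
```

Dilating before the second analysis merges strokes that sit close together into one block, so each rectangle returned by `bounding_boxes(dilate_horizontally(binary))` can be cut out as a sub-image and passed to the OCR tool in left-to-right order.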
  • before step S400, the method further includes:
  • S300 Analyze the recognized URL and determine whether the last identification character in the recognized URL is a preset URL end identification character.
  • the preset URL end identification character is a standard character string set according to industry conventions, such as .cn, .com or .html. For example, https://baike does not end with such an identifier, so it is an incomplete URL. When the recognized URL is incomplete, the method proceeds to step S400.
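
The check in S300 can be sketched as a simple suffix test. The tuple `PRESET_END_IDENTIFIERS` below contains only the three examples named in the text, and the function name `is_complete_url` is hypothetical; a real deployment would presumably use a longer preset list.

```python
# Assumed preset list: only the examples mentioned in the text.
PRESET_END_IDENTIFIERS = (".cn", ".com", ".html")


def is_complete_url(url: str) -> bool:
    """Return True if the recognized URL ends with a preset end identifier."""
    candidate = url.strip().rstrip("/").lower()
    if not candidate:
        return False  # a blank recognition result is never complete
    # str.endswith accepts a tuple and checks every suffix in it.
    return candidate.endswith(PRESET_END_IDENTIFIERS)
```

With this sketch, `is_complete_url("https://baike")` is false, so the method would continue with step S400. Note that a pure suffix test also rejects valid URLs whose path does not end in one of the preset identifiers, which is a limitation of the simplification rather than of the method.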
  • when the feature information includes text feature information, obtaining the associated URL of the feature information fed back by a third-party Internet search engine includes: performing word segmentation on the text feature information to obtain multiple segmented words; extracting the network words among the segmented words according to a preset network feature word database; pushing the network words to the third-party Internet search engine; and receiving the associated URLs of the network words found by the third-party Internet search engine.
  • Word segmentation refers to reasonably dividing a complete paragraph or sentence into multiple words. The segmented words are looked up in the preset network feature word database to see whether any of them are network words, and the network words are then sent to the third-party Internet search engine to find the corresponding associated URLs.
  • the preset network feature word database is a database constructed from historical experience, and it can be continuously updated in daily use.
  • the network words can specifically be company entities, product names, celebrity names, Internet celebrity locations, etc.
  • Network words generally refer to words for which related content can be found on the Internet.
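
A minimal sketch of the network word extraction might look as follows. It assumes the text has already been segmented by some word segmentation tool and that the preset network feature word database can be represented as an in-memory set; the function name `extract_network_words` and the sample words are purely illustrative.

```python
from typing import Iterable, List, Set


def extract_network_words(segmented_words: Iterable[str], network_word_db: Set[str]) -> List[str]:
    """Keep the segmented words found in the preset database, in order, without repeats."""
    found: List[str] = []
    seen: Set[str] = set()
    for word in segmented_words:
        if word in network_word_db and word not in seen:
            seen.add(word)
            found.append(word)
    return found
```

For example, `extract_network_words(["今天", "平安科技", "发布", "新品", "平安科技"], {"平安科技", "新浪"})` yields `["平安科技"]`, which would then be pushed to the third-party Internet search engine.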
  • when the feature information includes graphic feature information, obtaining the associated URL of the feature information fed back by the third-party Internet search engine includes: pushing the graphic feature information to the third-party Internet search engine; and receiving the associated URL of the graphic feature information found by the third-party Internet search engine. The associated URL of the graphic feature is obtained by the third-party Internet search engine first finding the product information or company entity name associated with the graphic feature information and then searching on that product information or company entity name.
  • the graphic feature information may specifically be a company trademark, the iconic appearance of a product, etc. Based on the graphic feature information, the associated product information or company entity name is identified, and then the URL associated with the product information or company entity name is searched. Take liquor as an example. At present, many liquor companies use uniquely shaped bottles. When the feature information includes bottle shape data, the third-party Internet search engine can use big data to find the liquor product and/or the company that produces it from the bottle shape data, and then further search for the website associated with the product and/or company.
  • matching the associated URL with the incomplete URL to obtain the target URL includes: performing similarity matching between the associated URL and the incomplete URL, and selecting the associated URL with the highest similarity matching result as the target URL.
  • the method of similarity matching is used to select the URL with the highest similarity from the associated URLs as the target URL to achieve efficient and accurate identification of the target URL.
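
The application does not name a specific similarity measure. As one possible illustration, the sketch below scores each associated URL against the incomplete URL with `difflib.SequenceMatcher` from the Python standard library and picks the highest score; the function names `rank_candidates` and `pick_target` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, List, Optional, Tuple


def rank_candidates(incomplete: str, associated: Iterable[str]) -> List[Tuple[float, str]]:
    """Score every associated URL against the incomplete URL, best first."""
    scored = [
        (SequenceMatcher(None, incomplete, url).ratio(), url)
        for url in associated
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


def pick_target(incomplete: str, associated: Iterable[str]) -> Optional[str]:
    """Return the associated URL with the highest similarity, or None if there is none."""
    ranked = rank_candidates(incomplete, associated)
    return ranked[0][1] if ranked else None
```

Other measures, such as requiring the incomplete URL to be a prefix of the candidate, would fit the description equally well; a stricter deployment might also reject candidates whose best score falls below some minimum.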
  • the user sends a picture carrying makeup products to the server through the terminal.
  • the picture is a screenshot picture of a makeup product introduction.
  • the server receives the picture to be recognized and uses the OCR tool to recognize the URL https://ABCD in the picture.
  • the server analyzes the URL and determines that it does not carry a URL end identification character, so it is an incomplete URL.
  • the server uses the OCR tool to extract the feature information carried in the picture to be recognized, including "XX Beauty Cream" and a product shape that resembles a face. It uses big data to search for websites related to XX Beauty Cream or to the face-like product shape, and obtains the associated URLs 1. https://ABMP.com; 2. https://ATMP.com; 3. https://ABCDMPQ.com. Matching these 3 associated URLs with the incomplete URL https://ABCD gives the target URL https://ABCDMPQ.com.
  • a device for identifying a website address includes:
  • the first recognition module 200 is configured to obtain the picture to be recognized, and use the OCR tool to recognize the URL carried in the picture to be recognized.
  • the second recognition module 400 is used for extracting the characteristic information carried in the picture to be recognized by the OCR tool when the recognized website is an incomplete website.
  • the search module 600 is used to obtain the associated website address of the characteristic information fed back by the third-party Internet search engine.
  • the URL matching module 800 is used to match an associated URL with an incomplete URL to obtain a target URL.
  • the first recognition module 200 uses the OCR tool to recognize the URL in the picture to be recognized. If the directly recognized URL is incomplete, the second recognition module 400 recognizes the characteristic information in the picture to be recognized, and the search module 600 accesses the third-party Internet The search engine obtains the associated website address related to the characteristic information, and the website address matching module 800 matches the directly identified website address with the associated website address to obtain a complete website address, which can accurately identify the website address carried.
  • the first recognition module 200 is also used to perform image grayscale processing and edge detection on the picture to be recognized, and perform straight line detection based on the Hough transform; perform a Radon transform on the straight line detection result, calculate the projection area in each direction, find the angle at which the projection area has the smallest width, and use that angle as the tilt correction angle for tilt correction processing; binarize the grayscale image after tilt correction, and determine the area carrying the URL information based on the horizontal projection and the vertical projection obtained after binarization; cut out the area carrying the URL information, and scale the cut image to the preset size; and use the OCR tool to recognize the URL carried in the scaled image.
  • the first recognition module 200 is further configured to use mathematical morphology processing and connected area analysis to extract independent characters from the scaled picture to be recognized; use the extracted independent characters as sub-images; and use the OCR tool to recognize the URL carried in the sub-images.
  • the above-mentioned URL recognition device further includes a judgment module, which is used to perform a website analysis on the recognized website and determine whether the last identification character in the recognized website is a preset website ending identification character.
  • when the feature information includes text feature information, the search module 600 is also used to perform word segmentation on the text feature information to obtain multiple segmented words; extract the network words among the segmented words according to a preset network feature word database; push the network words to a third-party Internet search engine; and receive the associated URLs of the network words found by the third-party Internet search engine.
  • when the feature information includes graphic feature information, the search module 600 is also used to push the graphic feature information to a third-party Internet search engine and receive the associated URL of the graphic feature information found by the third-party Internet search engine. The associated URL of the graphic feature is obtained by the third-party Internet search engine first finding the product information or company entity name associated with the graphic feature information and then searching on that product information or company entity name.
  • the URL matching module 800 is used to perform similarity matching between associated URLs and incomplete URLs; the associated URL with the highest similarity matching result is selected as the target URL.
  • Each module in the above-mentioned web address recognition device can be implemented in whole or in part by software, hardware, or a combination thereof.
  • the foregoing modules may be embedded in or independent of the processor of the computer device in hardware form, or may be stored in the memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 5.
  • the computer equipment includes a processor, a memory, a network interface and a database connected through a system bus. Among them, the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the database of the computer equipment is used to store data such as characteristic information and associated web addresses.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection. When the computer program is executed by the processor, a method for identifying a website address is realized.
  • FIG. 5 is only a block diagram of the part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution is applied. A specific computer device may include more or fewer parts than shown in the figure, combine some parts, or have a different arrangement of parts.
  • a computer device includes a memory and one or more processors. The memory stores computer-readable instructions which, when executed by the one or more processors, implement the steps of the URL identification method provided in any embodiment of the present application.
  • One or more non-volatile computer-readable storage media store computer-readable instructions which, when executed by one or more processors, implement the steps of the URL identification method provided in any embodiment of the present application.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science
  • Databases & Information Systems
  • Theoretical Computer Science
  • Physics & Mathematics
  • General Physics & Mathematics
  • Data Mining & Analysis
  • General Engineering & Computer Science
  • Computer Vision & Pattern Recognition
  • Multimedia
  • Information Retrieval, Db Structures And Fs Structures Therefor
  • Information Transfer Between Computers

Abstract

A website identification method, comprising: identifying a website in a picture to be identified with an OCR tool; if the directly identified website is incomplete, identifying feature information in said picture; obtaining an associated website related to the feature information by accessing a third-party internet search engine; and matching the directly identified website with the associated website to obtain a complete website. The website carried by said picture can be efficiently and accurately identified.

Description

Website identification method, device, computer equipment and storage medium
Cross-reference to related applications
This application claims priority to Chinese patent application No. 2019105613705, filed with the Chinese Patent Office on June 26, 2019 and entitled "网址识别方法、装置、计算机设备和存储介质" (Website identification method, device, computer equipment and storage medium), the entire content of which is incorporated herein by reference.
Technical field
This application relates to a method, device, computer equipment and storage medium for identifying a website address.
Background
With the development of science and technology, the Internet has penetrated deeply into people's daily lives. People can query data, buy goods, socialize and so on through the Internet, which brings great convenience.
When users browse web pages through a browser, they usually enter the URL manually. On a mobile phone with a small screen, input is inconvenient, and a long URL is easy to mistype or to enter with characters missing, which is laborious and time-consuming. To address this, there is already technology for obtaining a URL directly from a picture. In an application scenario based on this technology, user A receives a picture shared by user B that carries the URL of a news article user B recommends to user A. After receiving the shared picture, user A can use the URL recognition technology to recognize the URL carried in the picture, extract it, and input it into the browser of user A's terminal, and user A can then read the news article.
However, although the above technology can recognize and extract a URL from a picture, when the URL carried in the picture is abnormal (for example, part of the URL is covered or the URL is misprinted) or incomplete, the traditional technology cannot obtain the corresponding accurate URL.
Summary of the invention
According to various embodiments disclosed in the present application, a method, device, computer equipment, and storage medium for identifying a website address are provided.
A method for identifying a URL, including:
obtaining a picture to be recognized, and recognizing the URL carried in the picture to be recognized through an OCR (Optical Character Recognition) tool;
when the recognized URL is an incomplete URL, extracting the feature information carried in the picture to be recognized through the OCR tool;
obtaining the associated URL of the feature information fed back by a third-party Internet search engine; and
matching the associated URL with the incomplete URL to obtain a target URL.
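
The four steps of the method can be tied together in a short sketch. The OCR tool and the third-party search engine are passed in as plain callables because the application does not specify their interfaces; the names `identify_url`, `ocr_url`, `ocr_features` and `search`, the end-identifier tuple, and the use of `difflib.SequenceMatcher` as the matching measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, List, Optional

# Assumed preset list of URL end identifiers (illustrative only).
END_IDENTIFIERS = (".cn", ".com", ".html")


def identify_url(
    picture: bytes,
    ocr_url: Callable[[bytes], str],
    ocr_features: Callable[[bytes], List[str]],
    search: Callable[[List[str]], Iterable[str]],
) -> Optional[str]:
    """Return the URL carried in the picture, completing it via search if needed."""
    # Step 1: recognize the URL carried in the picture.
    recognized = ocr_url(picture).strip()
    if not recognized:
        return None  # the picture carries no URL
    if recognized.lower().endswith(END_IDENTIFIERS):
        return recognized  # already a complete URL

    # Step 2: extract feature information from the same picture.
    features = ocr_features(picture)

    # Step 3: ask the search engine for URLs associated with the features.
    candidates = list(search(features))
    if not candidates:
        return None

    # Step 4: match the associated URLs against the incomplete URL.
    return max(
        candidates,
        key=lambda url: SequenceMatcher(None, recognized, url).ratio(),
    )
```

Passing the external services in as callables keeps the sketch runnable with stand-in functions, for example lambdas that return a fixed incomplete URL and a fixed list of search results.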
一种网址识别装置,包括:A web address recognition device includes:
第一识别模块,用于获取待识别图片,通过OCR工具识别所述待识别图片中携带的网址;The first recognition module is configured to obtain the picture to be recognized, and recognize the URL carried in the picture to be recognized through the OCR tool;
第二识别模块,用于当识别得到的网址为不完整的网址时,通过OCR工具提取待识别图片中携带的特征信息;The second recognition module is used to extract the characteristic information carried in the picture to be recognized through the OCR tool when the recognized website is an incomplete website;
查找模块,用于获取第三方互联网搜索引擎反馈的所述特征信息的关联网址;及The search module is used to obtain the associated URL of the feature information fed back by a third-party Internet search engine; and
网址匹配模块,用于将所述关联网址与所述不完整的网址匹配,得到目标网址。The URL matching module is used to match the associated URL with the incomplete URL to obtain a target URL.
A computer device, including a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
obtaining a picture to be recognized, and recognizing the URL carried in the picture to be recognized by means of an OCR tool;
when the recognized URL is an incomplete URL, extracting the feature information carried in the picture to be recognized by means of the OCR tool;
obtaining the associated URLs of the feature information returned by a third-party Internet search engine; and
matching the associated URLs against the incomplete URL to obtain a target URL.
One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
obtaining a picture to be recognized, and recognizing the URL carried in the picture to be recognized by means of an OCR tool;
when the recognized URL is an incomplete URL, extracting the feature information carried in the picture to be recognized by means of the OCR tool;
obtaining the associated URLs of the feature information returned by a third-party Internet search engine; and
matching the associated URLs against the incomplete URL to obtain a target URL.
The details of one or more embodiments of the present application are set forth in the drawings and description below. Other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
Brief Description of the Drawings
In order to describe the technical solutions in the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative work.
FIG. 1 is an application scenario diagram of a URL identification method according to one or more embodiments.
FIG. 2 is a schematic flowchart of a URL identification method according to one or more embodiments.
FIG. 3 is a schematic flowchart of a URL identification method in another embodiment.
FIG. 4 is a block diagram of a URL identification apparatus according to one or more embodiments.
FIG. 5 is a block diagram of a computer device according to one or more embodiments.
Detailed Description
In order to make the technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application and are not intended to limit it.
The URL identification method provided in this application can be applied in the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 through a network. The terminal 102 sends the picture to be recognized to the server 104 through the network. The server 104 receives the picture and recognizes the URL carried in it by means of the OCR tool. When the recognized URL is a complete URL, the server 104 either feeds the complete URL back to the terminal 102, or directly accesses the URL and returns the data obtained from the access to the terminal 102, so that the user can browse the information corresponding to the URL on the terminal. When the recognized URL is an incomplete URL, the server 104 extracts the feature information carried in the picture by means of the OCR tool, obtains the associated URLs of the feature information returned by a third-party Internet search engine, and matches the associated URLs against the incomplete URL to obtain the target URL. The server 104 can then feed the target URL back to the terminal 102, or directly access the URL and return the data obtained from the access to the terminal 102. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device. The server 104 may be implemented as an independent server or as a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, a URL identification method is provided. Taking the application of the method to the server in FIG. 1 as an example, the method includes the following steps:
S200: Obtain a picture to be recognized, and recognize the URL carried in the picture to be recognized by means of an OCR tool.
OCR is the process by which an electronic device (for example, a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates the shapes into computer text using a character recognition method. The picture to be recognized is the picture on which URL identification is performed. It may be a picture obtained by a screenshot operation on the terminal, a picture downloaded from the Internet, or a picture exchanged while chatting in a social application. The URL carried in the picture is recognized by the OCR tool. Note that when the picture to be recognized carries no URL, the URL obtained in this step is blank. Taking a chat scenario as an example, user A wants to share a food article with friend B. User A takes a screenshot on the terminal that captures the URL of the food article and sends the screenshot to friend B's terminal. In this scenario, the picture containing the URL of the food article is the picture to be recognized. Taking an Internet download scenario as another example, user A comes across a product introduction picture on the Internet that carries the URL of the product's detailed introduction page. User A downloads the product introduction picture and recognizes the URL it carries using the URL identification method of this application. In this scenario, the product introduction picture is the picture to be recognized.
S400: When the recognized URL is an incomplete URL, extract the feature information carried in the picture to be recognized by means of the OCR tool.
Generally, URLs are generated according to industry rules, so analyzing the recognized URL makes it possible to determine accurately whether it is a complete URL. When the URL is incomplete, the OCR tool is applied to the picture again to identify feature information, which mainly includes text feature information and graphic mark information. For text feature information, the text data in the picture is collected into a text data set, the text data can be cleaned, and keywords are extracted from it. Keyword identification can draw on a set of URL-related keywords obtained from historical experience data, for example Ping An XX, Phoenix XX, and Sina XX. Graphic mark information may be, for example, a brand's trademark or the shape of an article.
S600: Obtain the associated URLs of the feature information returned by a third-party Internet search engine.
The lookup of URLs associated with the feature information can be performed by a third-party Internet search engine. The server pushes the feature information to the search engine, which uses Internet big-data query technology to find URLs of content related to the feature information. For example, when the extracted feature information is "Ping An Technology X", the server communicates with the search engine server over the Internet, sends "Ping An Technology X" to it, and obtains the URLs related to "Ping An Technology X". When the extracted feature information is the trademark of a brand, the trademark is sent to the search engine server, which can return the URLs related to that brand, including the brand's official website, related advertisements, product introductions, and related news. There may be one associated URL or several. The third-party search engine may be any common web search engine, such as Baidu or Google. The server accesses the search engine, sends it the feature information, and receives the returned data to obtain the associated URLs of the feature information.
S800: Match the associated URLs against the incomplete URL to obtain the target URL.
Because the incomplete URL is part of the complete URL (the target URL), it can serve as a matching condition for the complete URL. The incomplete URL is matched against the URLs returned by the query. A successful match indicates that the target URL has been found among the returned URLs, and a browser is invoked to open the target URL, so that the URL is identified efficiently and accurately.
In the URL identification method above, the URL in the picture to be recognized is recognized by the OCR tool. If the directly recognized URL is incomplete, the feature information in the picture is then identified, the associated URLs related to the feature information are obtained by accessing a third-party Internet search engine, and the directly recognized URL is matched against the associated URLs to obtain the complete URL. The URL carried in the picture can thus be identified accurately.
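The flow of steps S200 to S800 can be sketched as follows. This is a minimal illustration, not the implementation described in the application: the function `identify_url` and all of its parameters are hypothetical names, and the OCR, search, and matching steps are passed in as callables so that no particular OCR library or search engine API is assumed.

```python
def identify_url(image, recognize_url, extract_features, search,
                 is_complete, pick_best):
    """Sketch of S200-S800 with every external step injected as a callable.

    All parameter names are hypothetical placeholders:
      recognize_url(image)      -> URL string found by OCR (may be empty)
      extract_features(image)   -> list of feature strings found by OCR
      search(feature)           -> list of associated URLs from a search engine
      is_complete(url)          -> True if the URL is judged complete
      pick_best(url, urls)      -> the associated URL that best matches url
    """
    url = recognize_url(image)                      # S200
    if not url or is_complete(url):
        return url                                  # blank or already complete
    candidates = []
    for feature in extract_features(image):         # S400
        candidates.extend(search(feature))          # S600
    return pick_best(url, candidates)               # S800
```

Injecting the steps keeps the control flow visible and lets each stage be replaced by a stub when testing.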
As shown in FIG. 3, in one embodiment, step S200 includes:
S210: Perform grayscale conversion and edge detection on the picture to be recognized, and perform straight-line detection based on the Hough transform.
S220: Perform a Radon transform on the straight-line detection result, calculate the projection area in each direction, find the angle at which the width of the projection area is smallest, and use that angle as the tilt correction angle for tilt correction.
S230: Binarize the tilt-corrected grayscale image, and determine the area carrying the URL information based on the horizontal and vertical projections obtained after binarization.
S240: Crop the area carrying the URL information, and scale the cropped image to a preset size.
S250: Recognize the URL carried in the scaled image by means of the OCR tool.
Tilt correction proceeds as follows. After grayscale conversion and Canny edge detection, straight lines are detected based on the Hough transform. A Radon transform is then applied to the straight-line detection result and the projection area in each direction is calculated. The angle at which the projection area is narrowest is the tilt direction, and the original input picture, here a business card, is rotated by this angle to correct it. Cropping proceeds as follows. The tilt-corrected grayscale image is binarized, with the binarization threshold determined by the maximum between-class variance method (Otsu's method). The business card area is then determined from the horizontal and vertical projections, with the projection threshold determined empirically, and the business card area is cropped out. Scaling proceeds as follows. The cropped business card area is scaled proportionally to the initially set size, using bilinear interpolation. In this embodiment, the picture to be recognized is preprocessed in this way so that OCR recognition can be performed more efficiently and accurately.
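Two of the preprocessing steps above can be illustrated in plain Python. This is a simplified sketch under stated assumptions, not the procedure of the application: `estimate_skew_angle` replaces the Hough and Radon stages with a direct search over integer angles on a list of foreground point coordinates, choosing the angle whose projection is narrowest, and `otsu_threshold` computes the maximum between-class variance threshold from a flat list of gray levels. Both function names and the integer angle range are assumptions made for illustration.

```python
import math


def estimate_skew_angle(points, max_angle=45):
    """Return the integer angle in degrees whose projection is narrowest.

    points is a list of (x, y) foreground coordinates. For each candidate
    angle, every point is projected onto the axis perpendicular to that
    direction, and the spread (width) of the projection is measured.
    """
    best_angle, best_width = 0, float("inf")
    for angle in range(-max_angle, max_angle + 1):
        theta = math.radians(angle)
        sin_t, cos_t = math.sin(theta), math.cos(theta)
        # Signed distance of each point from a line through the origin
        # that runs in the candidate direction.
        offsets = [-x * sin_t + y * cos_t for x, y in points]
        width = max(offsets) - min(offsets)
        if width < best_width:
            best_angle, best_width = angle, width
    return best_angle


def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes between-class variance.

    Pixels with a value <= the returned level form one class, the rest
    form the other.
    """
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_level, best_var = 0, -1.0
    weight_low, sum_low = 0, 0
    for level in range(256):
        weight_low += hist[level]
        sum_low += level * hist[level]
        if weight_low == 0:
            continue
        weight_high = total - weight_low
        if weight_high == 0:
            break
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_var:
            best_level, best_var = level, variance
    return best_level
```

A production pipeline would more likely rely on an image processing library for these steps; the pure-Python form is used here only to keep the logic visible.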
In one embodiment, before the URL carried in the scaled picture to be recognized is recognized by the OCR tool, the method further includes: extracting independent characters from the scaled picture using mathematical morphology processing and connected-region analysis, and taking the extracted independent characters as sub-images. Recognizing the URL carried in the scaled picture by the OCR tool then includes recognizing the URL carried in the sub-images by the OCR tool.
The mathematical morphology processing is applied to the binarized image in order to retain the true character regions. It includes image dilation, image erosion, opening, closing, connected-region analysis, noise removal, and removal of abnormal regions. Connected-region analysis is performed on the binarized image in which only the true characters remain, each connected region is dilated horizontally, and connected-region analysis is performed again to obtain the bounding rectangle of each new connected region. Finally, each text block region is extracted as a sub-image according to its bounding rectangle. There may be several sub-images. In that case, the URL fragment carried in each sub-image is obtained separately, and the fragments are combined in order to obtain the URL carried in the picture to be recognized.
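The horizontal dilation and connected-region steps can be sketched on a small binary grid. This is an illustrative sketch only: the function names, the 4-connectivity choice, and the dilation radius are assumptions, and the image is represented as a list of rows of 0/1 values rather than a real image type.

```python
from collections import deque


def dilate_horizontal(binary, radius=1):
    """Spread each foreground pixel `radius` columns to the left and right."""
    rows, cols = len(binary), len(binary[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if binary[r][c]:
                for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                    out[r][cc] = 1
    return out


def connected_boxes(binary):
    """Return bounding boxes (left, top, right, bottom) of 4-connected regions.

    Boxes are sorted by their left edge so that sub-images can later be
    read in left-to-right order.
    """
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            left, top, right, bottom = c, r, c, r
            while queue:
                y, x = queue.popleft()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return sorted(boxes)


image = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
separate = connected_boxes(image)                    # two character-sized regions
merged = connected_boxes(dilate_horizontal(image))   # one text-block region
```

Dilating horizontally before the second analysis joins neighboring characters on the same line, so each resulting bounding box covers a text block instead of a single character.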
As shown in FIG. 3, in one embodiment, before step S400, the method further includes:
S300: Analyze the recognized URL, and determine whether the last identification characters of the recognized URL are preset URL end identification characters.
The preset URL end identification characters are standard character strings set according to industry conventions, such as .cn, .com, or .html. For example, https://baike is an incomplete URL. When the URL is incomplete, the method proceeds to step S400.
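A minimal sketch of the check in step S300 follows, assuming the preset end identifiers are exactly the three examples named above. The constant `END_MARKERS` and the function `is_complete_url` are hypothetical names introduced for illustration.

```python
# Assumed preset list, taken from the examples in the text.
END_MARKERS = (".cn", ".com", ".html")


def is_complete_url(url, end_markers=END_MARKERS):
    """Return True if the URL ends with one of the preset end identifiers."""
    # str.endswith accepts a tuple and returns True if any suffix matches.
    return url.strip().lower().endswith(end_markers)


print(is_complete_url("https://baike"))  # the incomplete example from the text
```

With a check this simple, a URL that has a path or query string after the domain would also be classified as incomplete unless its final characters happen to match a marker, so a real deployment would likely need a richer rule set.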
In one embodiment, the feature information includes text feature information, and obtaining the associated URLs of the feature information returned by a third-party Internet search engine includes:
performing word segmentation on the text feature information to obtain multiple segmented words; extracting the network words among the segmented words according to a preset network feature word database; pushing the network words to the third-party Internet search engine; and receiving the associated URLs of the network words found by the third-party Internet search engine.
Word segmentation means dividing a complete passage or sentence into multiple words in a reasonable way. The segmented words are looked up in the preset network feature word database to see whether any of them are network words, and the network words are then sent to the third-party Internet search engine to find the corresponding associated URLs. The preset network feature word database is built from historical experience and can be updated continuously during everyday use. Network words may be, for example, company entities, product names, celebrity names, or popular locations. In general, a network word is a word for which related content can be found on the Internet.
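The segmentation and lookup steps can be sketched as follows. The application does not name a segmentation algorithm, so forward maximum matching against the word database is used here purely as an assumed stand-in. The function names and the sample database entries are hypothetical.

```python
def segment(text, vocabulary, max_len):
    """Forward maximum matching: take the longest vocabulary word at each position.

    Characters that start no vocabulary word are emitted as single-character
    tokens.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words


def extract_network_words(text, network_words):
    """Return the segmented words that appear in the network word database."""
    max_len = max((len(word) for word in network_words), default=1)
    return [w for w in segment(text, network_words, max_len) if w in network_words]


# Hypothetical database entries, for illustration only.
NETWORK_WORDS = {"平安科技", "美颜霜"}
keywords = extract_network_words("欢迎访问平安科技官网", NETWORK_WORDS)
```

The resulting `keywords` list is what would be pushed to the search engine in the next step.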
In one embodiment, the feature information includes graphic feature information, and obtaining the associated URLs of the feature information returned by a third-party Internet search engine includes: pushing the graphic feature information to the third-party Internet search engine, and receiving the associated URLs of the graphic feature information found by the third-party Internet search engine. The search engine obtains these associated URLs by finding the product information or company entity name associated with the graphic feature information and then searching for that product information or company entity name.
Graphic feature information may be, for example, a company trademark or the distinctive shape of a product. Based on the graphic feature information, the associated product information or company entity name is identified, and the URLs associated with that product information or company entity name are then looked up. Taking liquor as an example, many liquor companies currently use bottles with distinctive shapes. When the feature information includes bottle shape data, the third-party Internet search engine can use that data, together with big-data techniques, to find the liquor product and/or the company that produces it, and then find the URLs associated with the product and/or the company.
In one embodiment, matching the associated URLs against the incomplete URL to obtain the target URL includes: performing similarity matching between the associated URLs and the incomplete URL, and selecting the associated URL with the highest similarity as the target URL.
Because there may be several associated URLs, this embodiment uses similarity matching to select the most similar of them as the target URL, so that the target URL is identified efficiently and accurately.
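The application does not specify a similarity measure. As one possible sketch, the ratio computed by `difflib.SequenceMatcher` from the Python standard library is assumed here as a stand-in. The function names `similarity` and `pick_target_url` are hypothetical, and the sample URLs are the placeholder URLs used in the application example of this description.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1]; an assumed stand-in for the unspecified measure."""
    return SequenceMatcher(None, a, b).ratio()


def pick_target_url(incomplete_url, associated_urls):
    """Return the associated URL most similar to the incomplete URL.

    Returns None when there are no associated URLs to choose from.
    """
    if not associated_urls:
        return None
    return max(associated_urls, key=lambda url: similarity(incomplete_url, url))


candidates = ["https://ABMP.com", "https://ATMP.com", "https://ABCDMPQ.com"]
target = pick_target_url("https://ABCD", candidates)
print(target)
```

Because the incomplete URL is expected to be a fragment of the target, a stricter variant could first keep only the candidates that contain the fragment and fall back to the similarity ranking when none do.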
To explain the technical solution of the above URL identification method in further detail, a specific application example is described below.
In one application example, a user sends a picture of a cosmetic product to the server through the terminal. The picture is a screenshot of the product's introduction. The server receives the picture to be recognized and recognizes the URL https://ABCD carried in it by means of the OCR tool. The server analyzes the URL and determines that it does not carry URL end identification characters, so it is an incomplete URL. The server then uses the OCR tool to extract the feature information carried in the picture, which includes "XX beauty cream" and a product shape resembling a human face. A big-data search is performed for URLs related to XX beauty cream or to products shaped like a human face, yielding the associated URLs: 1. https://ABMP.com; 2. https://ATMP.com; 3. https://ABCDMPQ.com. These three associated URLs are matched against the incomplete URL https://ABCD, and the target URL is determined to be https://ABCDMPQ.com.
It should be understood that although the steps in the flowcharts of FIGS. 2 and 3 are shown in sequence as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict restriction on the order of execution, and the steps may be executed in other orders. Moreover, at least some of the steps in FIGS. 2 and 3 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and they are not necessarily executed sequentially but may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
As shown in FIG. 4, a URL identification apparatus includes:
a first recognition module 200, configured to obtain a picture to be recognized and recognize the URL carried in the picture to be recognized by means of an OCR tool;
a second recognition module 400, configured to extract the feature information carried in the picture to be recognized by means of the OCR tool when the recognized URL is an incomplete URL;
a search module 600, configured to obtain the associated URLs of the feature information returned by a third-party Internet search engine; and
a URL matching module 800, configured to match the associated URLs against the incomplete URL to obtain the target URL.
In the URL identification apparatus above, the first recognition module 200 recognizes the URL in the picture to be recognized by means of the OCR tool. If the directly recognized URL is incomplete, the second recognition module 400 identifies the feature information in the picture, the search module 600 obtains the associated URLs related to the feature information by accessing a third-party Internet search engine, and the URL matching module 800 matches the directly recognized URL against the associated URLs to obtain the complete URL. The URL carried in the picture can thus be identified accurately.
In one embodiment, the first recognition module 200 is further configured to: perform grayscale conversion and edge detection on the picture to be recognized, and perform straight-line detection based on the Hough transform; perform a Radon transform on the straight-line detection result, calculate the projection area in each direction, find the angle at which the width of the projection area is smallest, and use that angle as the tilt correction angle for tilt correction; binarize the tilt-corrected grayscale image, and determine the area carrying the URL information based on the horizontal and vertical projections obtained after binarization; crop the area carrying the URL information, and scale the cropped image to a preset size; and recognize the URL carried in the scaled image by means of the OCR tool.
In one embodiment, the first recognition module 200 is further configured to: extract independent characters from the scaled picture to be recognized using mathematical morphology processing and connected-region analysis; take the extracted independent characters as sub-images; and recognize the URL carried in the sub-images by means of the OCR tool.
In one embodiment, the URL identification apparatus further includes a judgment module, configured to analyze the recognized URL and determine whether the last identification characters of the recognized URL are preset URL end identification characters.
In one embodiment, the feature information includes text feature information, and the search module 600 is further configured to: perform word segmentation on the text feature information to obtain multiple segmented words; extract the network words among the segmented words according to a preset network feature word database; push the network words to the third-party Internet search engine; and receive the associated URLs of the network words found by the third-party Internet search engine.
In one embodiment, the feature information includes graphic feature information, and the search module 600 is further configured to: push the graphic feature information to the third-party Internet search engine; and receive the associated URLs of the graphic feature information found by the third-party Internet search engine. The search engine obtains these associated URLs by finding the product information or company entity name associated with the graphic feature information and then searching for that product information or company entity name.
In one embodiment, the URL matching module 800 is configured to perform similarity matching between the associated URLs and the incomplete URL, and to select the associated URL with the highest similarity as the target URL.
For the specific limitations of the URL identification apparatus, refer to the limitations of the URL identification method above; they are not repeated here. Each module of the URL identification apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke them and perform the corresponding operations.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 5. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database stores data such as feature information and associated URLs. The network interface is used to communicate with external terminals through a network connection. When executed by the processor, the computer program implements a URL identification method.
A person skilled in the art can understand that the structure shown in FIG. 5 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
A computer device includes a memory and one or more processors. The memory stores computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to implement the steps of the URL identification method provided in any embodiment of this application.
One or more non-volatile computer-readable storage media store computer-readable instructions that, when executed by one or more processors, cause the one or more processors to implement the steps of the URL identification method provided in any embodiment of this application.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions that direct the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of this application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of this application, all of which fall within the protection scope of this application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (20)

  1. A URL identification method, comprising:
    obtaining a picture to be recognized, and recognizing, by an optical identifier tool, a URL carried in the picture to be recognized;
    when the recognized URL is an incomplete URL, extracting, by the optical identifier tool, feature information carried in the picture to be recognized;
    obtaining an associated URL of the feature information fed back by a third-party Internet search engine; and
    matching the associated URL against the incomplete URL to obtain a target URL.
  2. The method according to claim 1, wherein obtaining the picture to be recognized and recognizing, by the optical identifier tool, the URL carried in the picture to be recognized comprises:
    performing image grayscale processing and edge detection on the picture to be recognized, and performing straight-line detection based on the Hough transform;
    performing a Radon transform on the straight-line detection result, calculating the projection region in each direction, finding the angle at which the width of the projection region is smallest, and performing tilt correction using the found angle as the tilt correction angle;
    binarizing the tilt-corrected grayscale image, and determining the region carrying URL information based on the horizontal projection and the vertical projection obtained after binarization;
    cropping the region carrying URL information, and scaling the cropped image to a preset size; and
    recognizing, by the optical identifier tool, the URL carried in the scaled image.
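As a non-limiting illustration of the tilt-correction step, the following minimal Python sketch searches a range of candidate angles and keeps the one at which the projection of the foreground pixels is narrowest. It is a simplified stand-in for the Radon-transform search recited above, not the implementation of this application; the function names `projection_width` and `estimate_skew`, the search range, and the step size are assumptions made for the example.

```python
import math
from typing import Iterable, List, Tuple

Point = Tuple[float, float]  # (x, y) coordinates of a foreground pixel


def projection_width(points: Iterable[Point], angle_deg: float) -> float:
    """Extent of the points measured perpendicular to lines at `angle_deg`."""
    t = math.radians(angle_deg)
    # Signed distance of each point from a line through the origin at angle t.
    proj: List[float] = [-x * math.sin(t) + y * math.cos(t) for x, y in points]
    return max(proj) - min(proj)


def estimate_skew(points: List[Point], max_angle: float = 45.0,
                  step: float = 0.5) -> float:
    """Return the candidate angle (degrees) with the narrowest projection.

    The range [-max_angle, max_angle] and the step are assumed values.
    """
    best_angle, best_width = 0.0, float("inf")
    steps = int(round(2 * max_angle / step))
    for i in range(steps + 1):
        angle = -max_angle + i * step
        width = projection_width(points, angle)
        if width < best_width - 1e-9:  # keep the first clear minimum
            best_angle, best_width = angle, width
    return best_angle


# Synthetic "text line": pixels on a straight line tilted by 10 degrees.
tilted_line = [(float(x), x * math.tan(math.radians(10.0))) for x in range(101)]
skew = estimate_skew(tilted_line)
```

In such a sketch the input points would typically be edge or line pixels, and the image would then be rotated back by the estimated angle before binarization.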
  3. The method according to claim 2, wherein, before recognizing, by the optical identifier tool, the URL carried in the scaled picture to be recognized, the method further comprises:
    extracting independent characters from the scaled picture to be recognized using mathematical morphology processing and connected-region analysis; and
    using the extracted independent characters as sub-images;
    and wherein recognizing, by the optical identifier tool, the URL carried in the scaled picture to be recognized comprises:
    recognizing, by the optical identifier tool, the URL carried in the sub-images.
  4. The method according to claim 3, wherein extracting independent characters from the scaled picture to be recognized using mathematical morphology processing and connected-region analysis comprises:
    performing mathematical morphology processing on the scaled picture to be recognized to obtain a retained image, the mathematical morphology processing comprising image dilation, image erosion, opening, closing, connected-region analysis, noise removal, and abnormal-region removal;
    performing connected-region analysis on the retained image to determine the connected regions in the retained image;
    performing horizontal dilation on each connected region, and performing connected-region analysis again on the horizontally dilated connected regions to obtain new connected regions;
    determining the circumscribed rectangle of each new connected region; and
    extracting the independent characters according to the circumscribed rectangles.
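The horizontal-dilation and connected-region steps above can be sketched in plain Python as follows. This is an illustrative sketch only, operating on a binary image stored as a list of rows; the function names `dilate_horizontal` and `bounding_rectangles`, the dilation radius, and the choice of 4-connectivity are assumptions and are not specified in this application.

```python
from collections import deque
from typing import List, Tuple

Binary = List[List[int]]            # 1 = foreground pixel, 0 = background
Box = Tuple[int, int, int, int]     # (top, left, bottom, right), inclusive


def dilate_horizontal(binary: Binary, radius: int = 1) -> Binary:
    """Spread each foreground pixel `radius` pixels to the left and right."""
    h, w = len(binary), len(binary[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if binary[y][x]:
                for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                    out[y][nx] = 1
    return out


def bounding_rectangles(binary: Binary) -> List[Box]:
    """Label 4-connected regions and return their circumscribed rectangles,
    ordered from left to right."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes: List[Box] = []
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:  # breadth-first flood fill of one region
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return sorted(boxes, key=lambda b: b[1])


# Two fragments of one character (x=0 and x=2) plus a separate character (x=6).
row = [[1, 0, 1, 0, 0, 0, 1, 0]]
raw_boxes = bounding_rectangles(row)
merged_boxes = bounding_rectangles(dilate_horizontal(row, radius=1))
```

Here the dilation bridges the one-pixel gap between the first two fragments, so the second pass yields one rectangle per character rather than one per fragment.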
  5. The method according to claim 1, wherein, before extracting, by the optical identifier tool, the feature information carried in the picture to be recognized when the recognized URL is an incomplete URL, the method further comprises:
    analyzing the recognized URL to determine whether the last identification character of the recognized URL is a preset URL end identification character.
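One way to picture the completeness check is the short Python sketch below. It assumes, purely for illustration, that the preset URL end identification characters are a small set of domain suffixes; the tuple `PRESET_END_IDENTIFIERS` and the function `is_complete_url` are hypothetical names, and this application does not specify the actual preset values.

```python
from typing import Sequence

# Hypothetical preset end identifiers; the real preset list is not given.
PRESET_END_IDENTIFIERS = (".com", ".cn", ".net", ".org")


def is_complete_url(url: str,
                    end_identifiers: Sequence[str] = PRESET_END_IDENTIFIERS
                    ) -> bool:
    """Return True when the recognized text ends with a preset identifier."""
    cleaned = url.strip().lower().rstrip("/")
    return cleaned.endswith(tuple(end_identifiers))


complete = is_complete_url("www.example.com")
truncated = is_complete_url("www.example.co")
```

A URL that fails this check would be treated as incomplete and passed on to the feature-extraction step. A suffix test of this kind does not handle URLs that carry a path or query string, which would need a separate rule.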
  6. The method according to claim 1, wherein the feature information comprises text feature information, and obtaining the associated URL of the feature information fed back by the third-party Internet search engine comprises:
    performing word segmentation on the text feature information to obtain a plurality of segmented words;
    extracting network words from the plurality of segmented words according to a preset network feature word database;
    pushing the network words to the third-party Internet search engine; and
    receiving the associated URL of the network words found by the third-party Internet search engine.
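The segmentation and filtering steps can be illustrated with the following Python sketch. It makes two simplifying assumptions: words are split on non-word characters (Chinese text would need a dedicated word segmenter), and the preset network feature word database is a small in-memory set. The names `segment` and `extract_network_words` and the sample database are hypothetical.

```python
import re
from typing import List, Set

# Hypothetical stand-in for the preset network feature word database.
NETWORK_FEATURE_WORDS: Set[str] = {"bank", "insurance", "mall"}


def segment(text: str) -> List[str]:
    """Naive segmentation: lower-case the text and split on non-word chars."""
    return re.findall(r"\w+", text.lower())


def extract_network_words(text: str, database: Set[str]) -> List[str]:
    """Keep segmented words found in the database, without duplicates,
    in order of first appearance."""
    kept: List[str] = []
    for word in segment(text):
        if word in database and word not in kept:
            kept.append(word)
    return kept


query_terms = extract_network_words(
    "Visit Example Bank online mall today, bank", NETWORK_FEATURE_WORDS)
```

The resulting terms are what would be pushed to the search engine; the search request itself is omitted because this application does not name a particular engine or interface.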
  7. The method according to claim 1, wherein the feature information comprises graphic feature information, and obtaining the associated URL of the feature information fed back by the third-party Internet search engine comprises:
    pushing the graphic feature information to the third-party Internet search engine; and
    receiving the associated URL of the graphic feature information found by the third-party Internet search engine, wherein the associated URL of the graphic feature information is obtained by the third-party Internet search engine looking up product information or a company entity name associated with the graphic feature information and then searching on the product information or the company entity name.
  8. The method according to claim 1, wherein matching the associated URL against the incomplete URL to obtain the target URL comprises:
    performing similarity matching between the associated URLs and the incomplete URL; and
    selecting the associated URL with the highest similarity matching result as the target URL.
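A minimal sketch of the similarity-matching step is given below, using `difflib.SequenceMatcher` from the Python standard library. The similarity measure is an assumption chosen for illustration, since this application does not state which measure is used; `select_target_url` is a hypothetical name and the URLs are made-up examples.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def select_target_url(incomplete_url: str, associated_urls: List[str]) -> str:
    """Return the associated URL most similar to the incomplete URL."""
    if not associated_urls:
        raise ValueError("no associated URLs to match against")
    return max(associated_urls, key=lambda u: similarity(incomplete_url, u))


candidates = ["www.sample.org", "www.example-shop.com", "www.another-site.net"]
target = select_target_url("www.example-shop.co", candidates)
```

With these inputs the truncated string differs from the second candidate by a single trailing character, so that candidate scores highest.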
  9. A URL identification apparatus, comprising:
    a first recognition module, configured to obtain a picture to be recognized and to recognize, by an optical identifier tool, a URL carried in the picture to be recognized;
    a second recognition module, configured to extract, by the optical identifier tool, feature information carried in the picture to be recognized when the recognized URL is an incomplete URL;
    a search module, configured to obtain an associated URL of the feature information fed back by a third-party Internet search engine; and
    a URL matching module, configured to match the associated URL against the incomplete URL to obtain a target URL.
  10. The apparatus according to claim 9, wherein the first recognition module is further configured to: perform image grayscale processing and edge detection on the picture to be recognized, and perform straight-line detection based on the Hough transform; perform a Radon transform on the straight-line detection result, calculate the projection region in each direction, find the angle at which the width of the projection region is smallest, and perform tilt correction using the found angle as the tilt correction angle; binarize the tilt-corrected grayscale image, and determine the region carrying URL information based on the horizontal projection and the vertical projection obtained after binarization; crop the region carrying URL information, and scale the cropped image to a preset size; and recognize, by the optical identifier tool, the URL carried in the scaled image.
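By way of example only, the binarization, projection, and cropping operations of the first recognition module can be sketched in plain Python as follows. The fixed threshold of 128, the treatment of dark pixels as foreground, and the function names are assumptions made for the sketch; this application does not specify them.

```python
from typing import List, Tuple

Gray = List[List[int]]  # grayscale image as rows of 0..255 values


def binarize(gray: Gray, threshold: int = 128) -> List[List[int]]:
    """Mark dark pixels (below the assumed threshold) as foreground (1)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def projection_bounds(profile: List[int]) -> Tuple[int, int]:
    """First and last index of a projection profile with a non-zero count."""
    hits = [i for i, count in enumerate(profile) if count > 0]
    if not hits:
        raise ValueError("no foreground pixels found")
    return hits[0], hits[-1]


def locate_text_region(gray: Gray,
                       threshold: int = 128) -> Tuple[int, int, int, int]:
    """Return (top, bottom, left, right) of the foreground region, inclusive."""
    binary = binarize(gray, threshold)
    horizontal = [sum(row) for row in binary]        # foreground count per row
    vertical = [sum(col) for col in zip(*binary)]    # foreground count per column
    top, bottom = projection_bounds(horizontal)
    left, right = projection_bounds(vertical)
    return top, bottom, left, right


def crop(gray: Gray, box: Tuple[int, int, int, int]) -> Gray:
    """Cut out the located region; scaling to a preset size is omitted."""
    top, bottom, left, right = box
    return [row[left:right + 1] for row in gray[top:bottom + 1]]


# White 5x6 image with a dark block in rows 1-2, columns 2-4.
image = [[255] * 6 for _ in range(5)]
for y in (1, 2):
    for x in (2, 3, 4):
        image[y][x] = 0
region = locate_text_region(image)
cropped = crop(image, region)
```

A real picture would usually contain several text lines, in which case the projection profiles would be split at their empty gaps rather than reduced to a single bounding range.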
  11. The apparatus according to claim 10, wherein the first recognition module is further configured to: extract independent characters from the scaled picture to be recognized using mathematical morphology processing and connected-region analysis; use the extracted independent characters as sub-images; and recognize, by the optical identifier tool, the URL carried in the sub-images.
  12. The apparatus according to claim 11, wherein the first recognition module is further configured to: perform mathematical morphology processing on the scaled picture to be recognized to obtain a retained image; perform connected-region analysis on the retained image to determine the connected regions in the retained image; perform horizontal dilation on each connected region, and perform connected-region analysis again on the horizontally dilated connected regions to obtain new connected regions; determine the circumscribed rectangle of each new connected region; and extract the independent characters according to the circumscribed rectangles; wherein the mathematical morphology processing comprises image dilation, image erosion, opening, closing, connected-region analysis, noise removal, and abnormal-region removal.
  13. The apparatus according to claim 9, wherein the feature information comprises text feature information, and the search module is further configured to: perform word segmentation on the text feature information to obtain a plurality of segmented words; extract network words from the plurality of segmented words according to a preset network feature word database; push the network words to the third-party Internet search engine; and receive the associated URL of the network words found by the third-party Internet search engine.
  14. The apparatus according to claim 9, wherein the feature information comprises graphic feature information, and the search module is further configured to: push the graphic feature information to the third-party Internet search engine; and receive the associated URL of the graphic feature information found by the third-party Internet search engine, wherein the associated URL of the graphic feature information is obtained by the third-party Internet search engine looking up product information or a company entity name associated with the graphic feature information and then searching on the product information or the company entity name.
  15. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the following steps:
    obtaining a picture to be recognized, and recognizing, by an optical identifier tool, a URL carried in the picture to be recognized;
    when the recognized URL is an incomplete URL, extracting, by the optical identifier tool, feature information carried in the picture to be recognized;
    obtaining an associated URL of the feature information fed back by a third-party Internet search engine; and
    matching the associated URL against the incomplete URL to obtain a target URL.
  16. The computer device according to claim 15, wherein the processor, when executing the computer-readable instructions, further performs the following steps:
    performing image grayscale processing and edge detection on the picture to be recognized, and performing straight-line detection based on the Hough transform;
    performing a Radon transform on the straight-line detection result, calculating the projection region in each direction, finding the angle at which the width of the projection region is smallest, and performing tilt correction using the found angle as the tilt correction angle;
    binarizing the tilt-corrected grayscale image, and determining the region carrying URL information based on the horizontal projection and the vertical projection obtained after binarization;
    cropping the region carrying URL information, and scaling the cropped image to a preset size; and
    recognizing, by the optical identifier tool, the URL carried in the scaled image.
  17. The computer device according to claim 16, wherein the processor, when executing the computer-readable instructions, further performs the following steps:
    extracting independent characters from the scaled picture to be recognized using mathematical morphology processing and connected-region analysis;
    using the extracted independent characters as sub-images; and
    recognizing, by the optical identifier tool, the URL carried in the sub-images.
  18. One or more non-volatile computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
    obtaining a picture to be recognized, and recognizing, by an optical identifier tool, a URL carried in the picture to be recognized;
    when the recognized URL is an incomplete URL, extracting, by the optical identifier tool, feature information carried in the picture to be recognized;
    obtaining an associated URL of the feature information fed back by a third-party Internet search engine; and
    matching the associated URL against the incomplete URL to obtain a target URL.
  19. The storage medium according to claim 18, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed:
    performing image grayscale processing and edge detection on the picture to be recognized, and performing straight-line detection based on the Hough transform;
    performing a Radon transform on the straight-line detection result, calculating the projection region in each direction, finding the angle at which the width of the projection region is smallest, and performing tilt correction using the found angle as the tilt correction angle;
    binarizing the tilt-corrected grayscale image, and determining the region carrying URL information based on the horizontal projection and the vertical projection obtained after binarization;
    cropping the region carrying URL information, and scaling the cropped image to a preset size; and
    recognizing, by the optical identifier tool, the URL carried in the scaled image.
  20. The storage medium according to claim 19, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed:
    extracting independent characters from the scaled picture to be recognized using mathematical morphology processing and connected-region analysis;
    using the extracted independent characters as sub-images; and
    recognizing, by the optical identifier tool, the URL carried in the sub-images.
PCT/CN2019/118243 2019-06-26 2019-11-14 Website identification method and apparatus, and computer device and storage medium WO2020258669A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910561370.5A CN110414518A (en) 2019-06-26 2019-06-26 Network address recognition methods, device, computer equipment and storage medium
CN201910561370.5 2019-06-26

Publications (1)

Publication Number Publication Date
WO2020258669A1 true WO2020258669A1 (en) 2020-12-30

Family

ID=68359744

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118243 WO2020258669A1 (en) 2019-06-26 2019-11-14 Website identification method and apparatus, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN110414518A (en)
WO (1) WO2020258669A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051876A (en) * 2021-04-02 2021-06-29 网易(杭州)网络有限公司 Malicious website identification method and device, storage medium and electronic equipment

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110414518A (en) * 2019-06-26 2019-11-05 平安科技(深圳)有限公司 Network address recognition methods, device, computer equipment and storage medium
CN111046365B (en) * 2019-12-16 2023-05-05 腾讯科技(深圳)有限公司 Face image transmission method, numerical value transfer method, device and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103425993A (en) * 2012-05-22 2013-12-04 腾讯科技(深圳)有限公司 Method and system for recognizing images
CN103488983A (en) * 2013-09-13 2014-01-01 复旦大学 Business card OCR data correction method and system based on knowledge base
US20150046483A1 (en) * 2012-04-25 2015-02-12 Tencent Technology (Shenzhen) Company Limited Method, system and computer storage medium for visual searching based on cloud service
CN106709488A (en) * 2016-12-20 2017-05-24 深圳市深信服电子科技有限公司 Business card identification method and device
CN110414518A (en) * 2019-06-26 2019-11-05 平安科技(深圳)有限公司 Network address recognition methods, device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9148675B2 (en) * 2013-06-05 2015-09-29 Tveyes Inc. System for social media tag extraction

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051876A (en) * 2021-04-02 2021-06-29 网易(杭州)网络有限公司 Malicious website identification method and device, storage medium and electronic equipment
CN113051876B (en) * 2021-04-02 2024-04-23 杭州网易智企科技有限公司 Malicious website identification method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN110414518A (en) 2019-11-05

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19935573; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19935573; Country of ref document: EP; Kind code of ref document: A1)