WO2016000511A1 - Method and apparatus for mining rare Internet resources - Google Patents

Method and apparatus for mining rare Internet resources

Info

Publication number
WO2016000511A1
WO2016000511A1 PCT/CN2015/080803 CN2015080803W WO2016000511A1 WO 2016000511 A1 WO2016000511 A1 WO 2016000511A1 CN 2015080803 W CN2015080803 W CN 2015080803W WO 2016000511 A1 WO2016000511 A1 WO 2016000511A1
Authority
WO
WIPO (PCT)
Prior art keywords
rare
term
type
terms
webpage
Prior art date
Application number
PCT/CN2015/080803
Other languages
English (en)
French (fr)
Inventor
王智广
Original Assignee
北京奇虎科技有限公司
奇智软件(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京奇虎科技有限公司, 奇智软件(北京)有限公司
Publication of WO2016000511A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • The present invention relates to the field of Internet applications, and in particular to a method and apparatus for mining rare Internet resources.
  • The Internet is full of resources, some abundant and some rare. Certain personal names, names of remote places, phone numbers and the like are rare. "Rare" here means that few valuable web pages cover the content.
  • When users submit such queries to a search engine, the engine often returns few results and frequently cannot satisfy the user's need.
  • Satisfying users' rare queries is an important indicator of search engine coverage and directly affects the search experience. Most search engine companies therefore treat rare queries (also called rare coverage) as a separate evaluation criterion.
  • The rare-query scheme currently in use analyzes users' query history and singles out the queries that returned few results.
  • For those queries, the web pages indexed by the search engine are analyzed, and some pages are mined from them to supplement the results shown.
  • The present invention is proposed in view of the above problems, to provide a method and a corresponding apparatus for mining rare Internet resources that overcome, or at least partially solve, those problems.
  • An embodiment of the present invention provides a method for mining rare Internet resources, including:
  • performing word segmentation on the web pages indexed by a search engine to generate a plurality of terms;
  • searching the plurality of terms for rare terms whose number of occurrences is below a count threshold and which are meaningful, where "meaningful" means that the term carries the meaning of a content word and can indicate event content;
  • finding the web pages that contain a rare term and defining them as rare web pages;
  • processing the rare web pages to mine rare network resources.
  • An embodiment of the present invention further provides an apparatus for mining rare Internet resources, including:
  • a word segmentation module configured to perform word segmentation on the web pages indexed by a search engine to generate a plurality of terms;
  • a search module configured to search the plurality of terms for rare terms whose number of occurrences is below a count threshold and which are meaningful, where "meaningful" means that the term carries the meaning of a content word and can indicate event content;
  • the search module being further configured to find the web pages that contain a rare term and define them as rare web pages;
  • a mining module configured to process the rare web pages and mine rare network resources.
  • A computer program comprising computer-readable code which, when run on a computing device, causes the computing device to perform any of the above methods for mining rare Internet resources.
  • A computer-readable medium in which the above computer program is stored.
  • In the embodiments of the present invention, rare network resources are mined from all the web pages indexed by the search engine; the coverage is broad and gives users richer data support.
  • Further, the indexed web pages are processed by means such as word segmentation to find the rare web pages, from which the rare network resources are then mined. This greatly improves mining efficiency and avoids the repeated steps of mining every web page individually.
  • Moreover, because the method mines the web pages indexed by the search engine, it does not need to analyze users' previous query history and can supply rare network resources on a user's first query. There is no lag, which improves the user experience.
  • In summary, the embodiments can proactively mine high-quality rare resources before a user queries and can return relatively rich results when the user does query.
  • FIG. 1 is a flowchart of a method for mining rare Internet resources according to an embodiment of the present invention;
  • FIG. 2 is a schematic structural diagram of an apparatus for mining rare Internet resources;
  • FIG. 3 is a block diagram schematically showing a computing device for performing the method for mining rare Internet resources according to the present invention; and
  • FIG. 4 schematically shows a storage unit for holding or carrying program code that implements the method for mining rare Internet resources according to the present invention.
  • FIG. 1 shows the processing flow of a method for mining rare Internet resources according to an embodiment of the present invention. Referring to FIG. 1, the method includes at least steps S102 to S108:
  • Step S102: perform word segmentation on the web pages indexed by the search engine to generate a plurality of terms.
  • Step S104: among the terms generated in step S102, find the rare terms whose number of occurrences is below the count threshold and which are meaningful, where "meaningful" means that the term carries the meaning of a content word and can indicate event content.
  • The "event" in step S104 is a broad concept covering, for example, time events, location events, person events and contact-information events. Because a term can indicate event content, it may correspondingly be a word with event meaning such as a time, a place, a person, a phone number or an email address.
  • Step S106: find the web pages that contain the rare terms found in step S104 and define them as rare web pages.
  • Step S108: process the rare web pages defined in step S106 to mine rare network resources.
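Steps S102 to S106 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: it assumes each page has already been segmented into a list of terms, it counts the number of pages containing a term as that term's "number of occurrences", and the names `mine_rare_pages`, `count_threshold` and `is_meaningful` are hypothetical.

```python
from collections import Counter


def mine_rare_pages(pages, count_threshold, is_meaningful):
    """pages maps a page id to its already-segmented list of terms (S102 assumed done)."""
    # Count, for every term, how many pages contain it.
    page_count = Counter()
    for terms in pages.values():
        page_count.update(set(terms))

    # S104: rare terms appear on fewer pages than the threshold and are meaningful.
    rare_terms = {
        term for term, n in page_count.items()
        if n < count_threshold and is_meaningful(term)
    }

    # S106: a rare web page is any page that contains at least one rare term.
    rare_pages = {}
    for page_id, terms in pages.items():
        hits = sorted(rare_terms.intersection(terms))
        if hits:
            rare_pages[page_id] = hits

    # S108 (spam removal, deduplication) would post-process rare_pages.
    return rare_terms, rare_pages


pages = {
    "p1": ["news", "about", "010-88067082"],
    "p2": ["news", "about", "company"],
    "p3": ["news", "company", "x1"],
}
# Toy meaningfulness test (an assumption): only terms containing a hyphen count.
rare_terms, rare_pages = mine_rare_pages(pages, 2, lambda t: "-" in t)
print(rare_terms)   # {'010-88067082'}
print(rare_pages)   # {'p1': ['010-88067082']}
```

The term `x1` is just as infrequent as the phone number but is rejected by the meaningfulness test, which is the distinction step S104 draws.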
  • In the embodiments of the present invention, rare network resources are mined from all the web pages indexed by the search engine; the coverage is broad and gives users richer data support. Further, the indexed web pages are processed by means such as word segmentation to find the rare web pages, from which the rare network resources are then mined; this greatly improves mining efficiency and avoids the repeated steps of mining every web page individually. Moreover, because the method mines the web pages indexed by the search engine, it does not need to analyze users' previous query history and can supply rare network resources on a user's first query, so there is no lag and the user experience improves. In summary, the embodiments can proactively mine high-quality rare resources before a user queries and can return relatively rich results when the user does query.
  • Step S104 may use various means to obtain the rare terms whose number of occurrences is below the count threshold and which are meaningful. This embodiment provides one of them:
  • first, compute the frequency with which each term occurs in the web pages;
  • next, filter out the N terms with the highest frequency;
  • finally, among the remaining terms, find the rare terms whose number of occurrences is below the count threshold and which are meaningful.
  • Because rare terms by nature occur seldom, the N most frequent terms can be filtered out directly, which greatly reduces the subsequent computation. In practice, all terms can be sorted by frequency from high to low and the top N removed.
  • From a grammatical point of view, the terms that occur most often in text usually do not carry content-word meaning, for example common modal particles, conjunctions, auxiliary words and type-denoting names.
  • Modal particles are words that strengthen the tone of an utterance, such as 啊, 哇 and 呀 ("ah", "wow", "ya"). They have no specific meaning of their own and only add emphasis.
  • Conjunctions connect subjects, predicates, objects and so on. Common conjunctions include 和 ("and"), 或者 ("or") and 除非 ("unless").
  • Auxiliary words usually assist the predicate, such as the particle 地 used with a verb.
  • Type-denoting names name a category of things, but the category itself cannot indicate specific event content and does not discriminate between pages, for example "company", "team" and "association".
  • After the top N terms are filtered out, the remaining terms are further screened by whether they are meaningful. N is an integer whose value depends on the actual situation.
  • For each remaining term, its type and the number of web pages containing it are computed. A term whose number of occurrences in the web pages is below the threshold and which is meaningful according to its type is a rare term.
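The top-N filter and the count threshold described above can be illustrated as follows. This is a hedged sketch: the function name `rare_candidates` and the parameter values are hypothetical, and the type-based meaningfulness check is left out so that only the frequency logic is shown.

```python
from collections import Counter


def rare_candidates(term_lists, top_n, count_threshold):
    """Drop the top_n most frequent terms, then keep terms below count_threshold."""
    freq = Counter(term for terms in term_lists for term in terms)
    # Sort by frequency from high to low and take the first top_n terms.
    frequent = {term for term, _ in freq.most_common(top_n)}
    # Only the remaining terms are compared against the threshold.
    return {
        term for term, count in freq.items()
        if term not in frequent and count < count_threshold
    }


term_lists = [
    ["的", "的", "公司", "赵公明"],
    ["的", "公司", "新闻"],
    ["的", "公司", "新闻", "010-88067082"],
]
candidates = rare_candidates(term_lists, top_n=2, count_threshold=2)
print(sorted(candidates))  # ['010-88067082', '赵公明']
```

Here 的 and 公司 are removed as the two most frequent terms, and 新闻 survives that filter but fails the threshold because it occurs twice.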
  • As mentioned above, step S108 processes the rare web pages. The embodiment of the present invention provides one specific way of doing so: spam removal on the rare web pages.
  • Spam removal is needed because many rare web pages are of low quality, so spam pages and duplicate pages must be filtered out.
  • Pages flooded with rare terms must be filtered out.
  • Rare-term flooding means that the ratio of the number of rare terms to the total number of terms on a page exceeds a certain threshold. Traditional spam, such as scraped content and cheating pages, must also be removed.
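The flooding test is a simple ratio check. The sketch below is illustrative only; the function name `is_flooded` and the threshold value 0.3 are assumptions, since the source does not give a concrete threshold.

```python
def is_flooded(page_terms, rare_terms, max_ratio):
    """Return True when rare terms make up more than max_ratio of the page's terms."""
    if not page_terms:
        return False
    rare_count = sum(1 for term in page_terms if term in rare_terms)
    return rare_count / len(page_terms) > max_ratio


rare_terms = {"r1", "r2"}
print(is_flooded(["a", "b", "r1", "r2"], rare_terms, 0.3))  # True  (2 of 4 terms are rare)
print(is_flooded(["a", "b", "c", "r1"], rare_terms, 0.3))   # False (1 of 4 terms is rare)
```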
  • Specifically, the embodiment of the present invention implements spam removal through deduplication.
  • Deduplication of ordinary web pages generally computes a signature for the page (the longest-sentence signature is one kind); pages with the same signature are treated as the same page.
  • Deduplication of rare web pages additionally signs the rare terms: a page signature is computed for the rare web page and a term signature is generated for the rare terms it contains, and among multiple pages whose page signature and term signature are both identical, only one is kept. This is done because some pages share the same longest-sentence signature yet contain different rare terms.
  • With this embodiment, pages are identified as the same page only when the page signatures are identical and the rare-term signatures are identical. Deduplicating on the rare-term signature safeguards the quality of the mined rare web pages.
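The two-signature rule can be sketched as a grouping on the pair (page signature, term signature). The example assumes the signatures have already been computed and stored under the hypothetical keys `page_sig` and `term_sig`; keeping the first page seen is also an assumption, since the source only says that one of the pages is kept.

```python
def deduplicate(pages):
    """Keep one page per (page signature, rare-term signature) pair."""
    kept = {}
    for page in pages:
        key = (page["page_sig"], page["term_sig"])
        # setdefault stores the page only if the pair has not been seen yet.
        kept.setdefault(key, page)
    return list(kept.values())


pages = [
    {"url": "a", "page_sig": "s1", "term_sig": "t1"},
    {"url": "b", "page_sig": "s1", "term_sig": "t1"},  # duplicate of "a"
    {"url": "c", "page_sig": "s1", "term_sig": "t2"},  # same sentence signature, different rare terms
]
unique_pages = deduplicate(pages)
print([page["url"] for page in unique_pages])  # ['a', 'c']
```

Page "c" survives even though its page signature matches "a", which is the case the rare-term signature is meant to handle.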
  • The method for mining rare Internet resources provided by the embodiment of the present invention is now illustrated with a specific example.
  • The web page indexed by the search engine in this example contains the following text: "Xinhuanet, Beijing, April 8 (reporter Xiao Ming): Zhao Gongming, leader of Country A, held talks with Pepe, president of Country B, at the Great Hall of the People on the 8th. The two heads of state exchanged in-depth views on developing bilateral relations and agreed to promote friendly bilateral exchanges and cooperation for greater development. Zhao Gongming hoped that the talks between the two countries would achieve substantive progress at an early date."
  • First, word segmentation is performed on the indexed web page, splitting its text into terms that carry some meaning.
  • Next, the total number of terms after segmentation is counted, the high-frequency terms are filtered out, and the terms that may be rare are output. The high-frequency terms to filter are obtained by analyzing some web pages in advance and recording the common terms; a term that hits this list is not output as a rare candidate. Common examples are 的 and 公司 ("company"), as well as punctuation marks. These are filtered for the convenience of later processing, since keeping them would cause a great deal of computation and waste resources. Such common non-rare terms are few in number, but the number of pages containing them is very large, so filtering them out greatly reduces the subsequent computation.
  • Then the output terms are tallied, computing each term's type and the number of pages containing it.
  • Based on an analysis of users' query needs, the terms in this embodiment are divided into six types. Finer subdivisions can of course be made on this basis according to users' needs.
  • The first is the decimal type, such as "0.5" and "8.3米" (8.3 meters).
  • The second is the English string type, such as "abc".
  • The third is the phone number type, such as "010-88067082".
  • The fourth is the Chinese character type, such as 北京 (Beijing) and 新华网 (Xinhuanet). The fifth is the email type, such as 123456@126.com. The sixth is "other".
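A rough classifier for the six types can be written with regular expressions. The patterns below are assumptions made for illustration (the source gives only examples, not rules), and the labels `decimal`, `english`, `phone`, `chinese`, `email` and `other` are hypothetical names for the six types.

```python
import re

# Patterns are tried in order; the first full match decides the type.
_PATTERNS = [
    ("email", re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+")),
    ("phone", re.compile(r"\d{3,4}-\d{7,8}")),
    ("decimal", re.compile(r"\d+\.\d+.*")),       # allows a trailing unit, e.g. "8.3米"
    ("english", re.compile(r"[A-Za-z]+")),
    ("chinese", re.compile(r"[\u4e00-\u9fff]+")),  # common CJK ideographs only
]


def term_type(term):
    """Return the type label of a term, or 'other' when no pattern matches."""
    for name, pattern in _PATTERNS:
        if pattern.fullmatch(term):
            return name
    return "other"


for sample in ["0.5", "8.3米", "abc", "010-88067082", "北京", "123456@126.com", "?!"]:
    print(sample, term_type(sample))
```

A production system would need broader phone and email patterns; the point here is only that each term gets exactly one type, which later decides whether it is treated as meaningful.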
  • Next, a term whose number of occurrences is below a certain threshold and which is meaningful is taken as a rare term, and the web pages containing it are defined as rare web pages.
  • Meaningful terms include phone numbers, email addresses, personal names, place names, item names and so on; which terms are valuable can be determined as needed. Most terms that occur rarely are meaningless, for example some decimals, strings of piled-up letters, and typos, so terms must be differentiated by quality. Judging term quality is difficult; this embodiment classifies the terms and decides from users' historical query needs which types of term are valuable.
  • Finally, spam removal is performed on the rare web pages. Pages flooded with rare terms must be filtered out; rare-term flooding means that the ratio of the number of rare terms to the total number of terms on a page exceeds a certain threshold.
  • Deduplication of ordinary web pages generally computes a signature for the page (the longest-sentence signature is one kind); pages with the same signature are treated as the same page.
  • Deduplication of rare web pages additionally signs the rare terms, because some pages share the same longest-sentence signature yet contain different rare terms.
  • Pages whose page signature and rare-term signature are both identical are the same page.
  • Among identical pages under the same primary domain, the one of better quality is kept.
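The last rule, keeping the better-quality copy among identical pages under one primary domain, might look like the sketch below. The source does not say how quality is measured or how the primary domain is extracted, so the `quality` score and the `main_domain` field are hypothetical inputs assumed to be precomputed.

```python
def keep_best_per_domain(pages):
    """Among identical pages on one primary domain, keep the highest-quality one."""
    best = {}
    for page in pages:
        key = (page["main_domain"], page["page_sig"], page["term_sig"])
        current = best.get(key)
        if current is None or page["quality"] > current["quality"]:
            best[key] = page
    return list(best.values())


pages = [
    {"url": "a.example.com/1", "main_domain": "example.com",
     "page_sig": "s1", "term_sig": "t1", "quality": 0.4},
    {"url": "b.example.com/1", "main_domain": "example.com",
     "page_sig": "s1", "term_sig": "t1", "quality": 0.9},
    {"url": "other.org/1", "main_domain": "other.org",
     "page_sig": "s1", "term_sig": "t1", "quality": 0.2},
]
best_pages = keep_best_per_domain(pages)
print([page["url"] for page in best_pages])  # ['b.example.com/1', 'other.org/1']
```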
  • Based on the same inventive concept, an embodiment of the present invention further provides an apparatus for mining rare Internet resources, which supports the mining method of any of the above preferred embodiments or a combination of them.
  • FIG. 2 is a schematic structural diagram of an apparatus for mining rare Internet resources. Referring to FIG. 2, the apparatus includes at least:
  • a word segmentation module 210 configured to perform word segmentation on the web pages indexed by the search engine to generate a plurality of terms;
  • a search module 220, coupled to the word segmentation module 210 and configured to find, among the terms generated by the word segmentation module 210, the rare terms whose number of occurrences is below the count threshold and which are meaningful, where "meaningful" means that the term carries the meaning of a content word and can indicate event content;
  • the search module 220 being further configured to find the web pages that contain a rare term and define them as rare web pages;
  • a mining module 230, coupled to the search module 220 and configured to process the rare web pages defined by the search module 220 to mine rare network resources.
  • In a preferred embodiment, the search module 220 may also be configured to: compute the frequency with which the terms occur in the web pages; filter out the N terms with the highest frequency; and find, among the remaining terms, the rare terms whose number of occurrences is below the count threshold and which are meaningful.
  • In a preferred embodiment, the search module 220 may also be configured to, for each remaining term: compute the term's type and the number of web pages containing it;
  • and find the terms whose number of occurrences in the web pages is below the threshold and which are meaningful according to their type, these being the rare terms.
  • In a preferred embodiment, the types of the N terms with the higher frequency of occurrence include at least one of the following: modal particles; conjunctions; auxiliary words; type-denoting names.
  • In a preferred embodiment, the type of each remaining term includes at least one of the following: decimal type; string type; phone number type; Chinese character type; email type.
  • In a preferred embodiment, the mining module 230 may also be configured to perform spam removal on the rare web pages.
  • In a preferred embodiment, the mining module 230 may also be configured to perform deduplication on the rare web pages.
  • In a preferred embodiment, the mining module 230 may also be configured to: compute a page signature for each rare web page and generate a term signature for the rare terms in it;
  • and keep one of the multiple web pages whose page signature and term signature are both identical.
  • In a preferred embodiment, the meaningful terms include any of the following: a phone number, an email address, a personal name, a place name, an item name.
  • With the method and apparatus provided by the embodiments of the present invention, rare network resources are mined from all the web pages indexed by the search engine; the coverage is broad and gives users richer data support.
  • Further, the indexed web pages are processed by means such as word segmentation to find the rare web pages, from which the rare network resources are then mined. This greatly improves mining efficiency and avoids the repeated steps of mining every web page individually.
  • Moreover, because the method mines the web pages indexed by the search engine, it does not need to analyze users' previous query history and can supply rare network resources on a user's first query. There is no lag, which improves the user experience.
  • In summary, the embodiments can proactively mine high-quality rare resources before a user queries and can return relatively rich results when the user does query.
  • Those skilled in the art will understand that the modules in the devices of the embodiments can be adaptively changed and placed in one or more devices different from the embodiment.
  • The modules, units or components of the embodiments may be combined into one module, unit or component, and may further be divided into a plurality of sub-modules, sub-units or sub-components.
  • Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination.
  • Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.
  • The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two.
  • Those skilled in the art will appreciate that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the apparatus for mining rare Internet resources according to embodiments of the present invention.
  • The invention can also be implemented as a device or apparatus program (e.g., a computer program and a computer program product) for performing part or all of the methods described herein.
  • Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
  • For example, FIG. 3 shows a computing device that can implement the method for mining rare Internet resources according to the present invention.
  • The computing device conventionally includes a processor 310 and a computer program product or computer-readable medium in the form of a memory 320.
  • The memory 320 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read-Only Memory), an EPROM, a hard disk or a ROM.
  • The memory 320 has a storage space 330 for program code 331 for performing any of the method steps described above.
  • For example, the storage space 330 for program code may include individual program codes 331 that each implement one of the steps of the above method.
  • The program code can be read from or written to one or more computer program products.
  • Such computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks.
  • Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 4.
  • The storage unit may have storage segments, storage spaces and the like arranged similarly to the memory 320 in the computing device of FIG. 3.
  • The program code may, for example, be compressed in an appropriate form.
  • Typically, the storage unit includes computer-readable code 331', i.e., code readable by a processor such as 310, which, when run by a computing device, causes the computing device to perform each step of the method described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

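A method and apparatus for mining rare Internet resources. The method includes: performing word segmentation on the web pages indexed by a search engine to generate a plurality of terms (S102); searching the plurality of terms for rare terms whose number of occurrences is below a count threshold and which are meaningful, where "meaningful" means that the term carries the meaning of a content word and can indicate event content (S104); finding the web pages that contain a rare term and defining them as rare web pages (S106); and processing the rare web pages to mine rare network resources (S108). High-quality rare resources are mined proactively before a user queries, so that relatively rich results can be provided when the user does query.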

Description

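Method and apparatus for mining rare Internet resources
Technical Field
The present invention relates to the field of Internet applications, and in particular to a method and apparatus for mining rare Internet resources.
Background
The Internet is full of resources of every kind. Some are abundant and some are rare; certain personal names, names of remote places and phone numbers, for example, are rare. "Rare" means that few valuable web pages cover the content, so when a user submits such a query to a search engine, the engine usually returns few results and often cannot satisfy the user's need.
Satisfying users' rare queries is an important indicator of search engine coverage and directly affects the search experience. Most search engine companies therefore treat rare queries (also called rare coverage) as a separate evaluation criterion.
The rare-query scheme currently in use analyzes users' query history, singles out the queries that returned few results, analyzes the web pages indexed by the search engine for those queries, and mines some pages from them to supplement the results shown.
The drawback of this scheme is lag: before a user has queried, it is not known which queries are rare, so the user's first query gives a poor experience. Coverage is also limited to the queries that users have already issued.
Summary of the Invention
In view of the above problems, the present invention is proposed to provide a method and a corresponding apparatus for mining rare Internet resources that overcome, or at least partially solve, those problems.
According to one aspect of the present invention, an embodiment provides a method for mining rare Internet resources, including:
performing word segmentation on the web pages indexed by a search engine to generate a plurality of terms;
searching the plurality of terms for rare terms whose number of occurrences is below a count threshold and which are meaningful, where "meaningful" means that the term carries the meaning of a content word and can indicate event content;
finding the web pages that contain a rare term and defining them as rare web pages;
processing the rare web pages to mine rare network resources.
According to another aspect of the present invention, an embodiment further provides an apparatus for mining rare Internet resources, including:
a word segmentation module configured to perform word segmentation on the web pages indexed by a search engine to generate a plurality of terms;
a search module configured to search the plurality of terms for rare terms whose number of occurrences is below a count threshold and which are meaningful, where "meaningful" means that the term carries the meaning of a content word and can indicate event content;
the search module being further configured to find the web pages that contain a rare term and define them as rare web pages;
a mining module configured to process the rare web pages to mine rare network resources.
According to yet another aspect of the present invention, a computer program is provided, comprising computer-readable code which, when run on a computing device, causes the computing device to perform any of the above methods for mining rare Internet resources.
According to still another aspect of the present invention, a computer-readable medium is provided in which the above computer program is stored.
In the embodiments of the present invention, rare network resources are mined from all the web pages indexed by the search engine; the coverage is broad and gives users richer data support. Further, the indexed web pages are processed by means such as word segmentation to find the rare web pages, from which the rare network resources are then mined; this greatly improves mining efficiency and avoids the repeated steps of mining every web page individually. Moreover, because the method mines the web pages indexed by the search engine, it does not need to analyze users' previous query history and can supply rare network resources on a user's first query, so there is no lag and the user experience improves. In summary, the embodiments can proactively mine high-quality rare resources before a user queries and can return relatively rich results when the user does query.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features and advantages of the invention are more readily apparent, specific embodiments of the invention are set out below.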
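Brief Description of the Drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art on reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
FIG. 1 is a flowchart of a method for mining rare Internet resources according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an apparatus for mining rare Internet resources;
FIG. 3 is a block diagram schematically showing a computing device for performing the method for mining rare Internet resources according to the present invention; and
FIG. 4 schematically shows a storage unit for holding or carrying program code that implements the method for mining rare Internet resources according to the present invention.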
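Detailed Description
The present invention is further described below with reference to the drawings and specific embodiments.
To solve the above technical problem, an embodiment of the present invention provides a method for mining rare Internet resources. FIG. 1 shows the processing flow of the method according to an embodiment of the present invention. Referring to FIG. 1, the method includes at least steps S102 to S108:
Step S102: perform word segmentation on the web pages indexed by the search engine to generate a plurality of terms.
Step S104: among the terms generated by the segmentation in step S102, find the rare terms whose number of occurrences is below the count threshold and which are meaningful, where "meaningful" means that the term carries the meaning of a content word and can indicate event content.
The "event" in step S104 is a broad concept covering, for example, time events, location events, person events and contact-information events. Because a term can indicate event content, it may correspondingly be a word with event meaning such as a time, a place, a person, a phone number or an email address.
Step S106: find the web pages that contain the rare terms found in step S104 and define them as rare web pages.
Step S108: process the rare web pages defined in step S106 to mine rare network resources.
In the embodiments of the present invention, rare network resources are mined from all the web pages indexed by the search engine; the coverage is broad and gives users richer data support. Further, the indexed web pages are processed by means such as word segmentation to find the rare web pages, from which the rare network resources are then mined; this greatly improves mining efficiency and avoids the repeated steps of mining every web page individually. Moreover, because the method mines the web pages indexed by the search engine, it does not need to analyze users' previous query history and can supply rare network resources on a user's first query, so there is no lag and the user experience improves. In summary, the embodiments can proactively mine high-quality rare resources before a user queries and can return relatively rich results when the user does query.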
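Step S104 may use various means to obtain the rare terms whose number of occurrences is below the count threshold and which are meaningful. This embodiment provides one of them:
First, compute the frequency with which each term occurs in the web pages;
next, filter out the N terms with the highest frequency;
finally, among the remaining terms, find the rare terms whose number of occurrences is below the count threshold and which are meaningful.
Because rare terms by nature occur seldom, the N most frequent terms can be filtered out directly, which greatly reduces the subsequent computation. In practice, all terms can be sorted by frequency from high to low and the top N removed.
From a grammatical point of view, the terms that occur most often in text usually do not carry content-word meaning, for example common modal particles, conjunctions, auxiliary words and type-denoting names. Modal particles are words that strengthen the tone of an utterance, such as 啊, 哇 and 呀 ("ah", "wow", "ya"); they have no specific meaning of their own and only add emphasis. Conjunctions connect subjects, predicates, objects and so on; common ones include 和 ("and"), 或者 ("or") and 除非 ("unless"). Auxiliary words usually assist the predicate, such as the particle 地 used with a verb. Type-denoting names name a category of things, but the category itself cannot indicate specific event content and does not discriminate between pages, for example "company", "team" and "association".
After the top N terms are filtered out, the remaining terms are further screened by whether they are meaningful. N is an integer whose value depends on the actual situation.
To find, among the remaining terms, the rare terms whose number of occurrences is below the count threshold and which are meaningful, the following operations are performed for each remaining term:
compute the term's type and the number of web pages containing it;
a term whose number of occurrences in the web pages is below the threshold and which is meaningful according to its type is a rare term.
Each remaining term can be of many types; five are given here as examples, and finer subdivisions can be made on this basis according to users' needs. The first is the decimal type, such as "0.5" and "8.3米" (8.3 meters); the second is the English string type, such as "abc"; the third is the phone number type, such as "010-88001235"; the fourth is the Chinese character type, such as 北京 (Beijing) and 新华网 (Xinhuanet); the fifth is the email type, such as 123456@126.com.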
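As mentioned above, step S108 processes the rare web pages. The embodiment of the present invention provides one specific way of doing so: spam removal on the rare web pages. Spam removal is needed because many rare web pages are of low quality, so spam pages and duplicate pages must be filtered out. Pages flooded with rare terms must be filtered; rare-term flooding means that the ratio of the number of rare terms to the total number of terms on a page exceeds a certain threshold. Traditional spam, such as scraped content and cheating pages, must also be removed.
Specifically, the embodiment implements spam removal through deduplication. Deduplication of ordinary web pages generally computes a signature for the page (the longest-sentence signature is one kind); pages with the same signature are treated as the same page. Deduplication of rare web pages additionally signs the rare terms: a page signature is computed for the rare web page and a term signature is generated for the rare terms it contains, and among multiple pages whose page signature and term signature are both identical, only one is kept. This is done because some pages share the same longest-sentence signature yet contain different rare terms. With this embodiment, pages are identified as the same page only when the page signatures are identical and the rare-term signatures are identical. Deduplicating on the rare-term signature safeguards the quality of the mined rare web pages.
An example of signing the rare terms follows. The rare terms contained in a web page are sorted and joined into a single string, and the signature of that string is computed. For instance, if a page contains term1, term2 and term3, the three rare terms are first sorted and then joined with a specific separator, e.g. term1/term2/term3, and the signature computed over this string serves as the page's rare-term signature.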
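The sort-join-sign procedure in this example can be sketched as follows. The source does not name a hash function, so SHA-256 from Python's `hashlib` is an assumption, as are the function name `rare_term_signature` and the choice to collapse repeated terms with `set` before sorting.

```python
import hashlib


def rare_term_signature(rare_terms, separator="/"):
    """Sort the page's rare terms, join them with a separator, and hash the string."""
    joined = separator.join(sorted(set(rare_terms)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


sig_a = rare_term_signature(["term3", "term1", "term2"])
sig_b = rare_term_signature(["term1", "term2", "term3"])
sig_c = rare_term_signature(["term1", "term2"])
print(sig_a == sig_b)  # True: the order in which terms appear on the page does not matter
print(sig_a == sig_c)  # False: a different set of rare terms gives a different signature
```

Sorting before joining is what makes the signature independent of where the rare terms appear on the page, so two copies of a page with rearranged content still collide.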
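If several identical pages still remain after deduplication, then among identical pages under the same primary domain, the one of better quality is kept.
The method for mining rare Internet resources provided by the embodiment of the present invention is now illustrated with a specific example. The web page indexed by the search engine in this example contains the following text: "Xinhuanet, Beijing, April 8 (reporter Xiao Ming): Zhao Gongming, leader of Country A, held talks with Pepe, president of Country B, at the Great Hall of the People on the 8th. The two heads of state exchanged in-depth views on developing bilateral relations and agreed to promote friendly bilateral exchanges and cooperation for greater development. Zhao Gongming hoped that the talks between the two countries would achieve substantive progress at an early date."
First, word segmentation is performed on the indexed web page, splitting its text into terms that carry some meaning. After segmentation the original Chinese text reads “新华-网北京4月8日电(记者小明)A国-领导人赵-公明8日在人民-大-会-堂同B总统佩佩举行-会谈。两国元首就发展两国关系深入交换-意见,一致-决定推动双边友好交流合作取得更大发展。赵-公明希望两国合谈早日取得实质-性进展。”, where blanks and "-" act as term separators.
Next, the total number of terms after segmentation is counted, the high-frequency terms are filtered out, and the terms that may be rare are output. The high-frequency terms to filter are obtained by analyzing some web pages in advance and recording the common terms; a term that hits this list is not output as a rare candidate. Common examples are 的 and 公司 ("company"), as well as punctuation marks. These are filtered for the convenience of later processing, since keeping them would cause a great deal of computation and waste resources. Such common non-rare terms are few in number, but the number of pages containing them is very large, so filtering them out greatly reduces the subsequent computation.
Then the output terms are tallied, computing each term's type and the number of pages containing it. Based on an analysis of users' query needs, the terms in this embodiment are divided into six types; finer subdivisions can of course be made on this basis according to users' needs. The first is the decimal type, such as "0.5" and "8.3米" (8.3 meters); the second is the English string type, such as "abc"; the third is the phone number type, such as "010-88067082"; the fourth is the Chinese character type, such as 北京 (Beijing) and 新华网 (Xinhuanet); the fifth is the email type, such as 123456@126.com; the sixth is "other".
Next, a term whose number of occurrences is below a certain threshold and which is meaningful is taken as a rare term, and the web pages containing it are defined as rare web pages. Meaningful terms include phone numbers, email addresses, personal names, place names, item names and so on; which terms are valuable can be determined as needed. Most terms that occur rarely are meaningless, for example some decimals, strings of piled-up letters, and typos, so terms must be differentiated by quality. Judging term quality is difficult; this embodiment classifies the terms and decides from users' historical query needs which types of term are valuable.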
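The source does not say how historical queries are turned into a list of valuable types. One possible reading, shown here purely as an illustration, is to count how often each type appears in past queries and keep the types whose share reaches a minimum; the function name `valuable_types`, the `min_share` parameter and the sample log are all hypothetical.

```python
from collections import Counter


def valuable_types(query_types, min_share):
    """Return the term types that account for at least min_share of past queries."""
    counts = Counter(query_types)
    total = sum(counts.values())
    if total == 0:
        return set()
    return {name for name, count in counts.items() if count / total >= min_share}


# Each entry is the type of one historical query (assumed to be classified already).
query_log = ["phone", "phone", "chinese", "email", "decimal",
             "chinese", "chinese", "phone", "chinese", "phone"]
print(sorted(valuable_types(query_log, 0.2)))  # ['chinese', 'phone']
```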
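Finally, spam removal is performed on the rare web pages. Many rare web pages are of low quality, so spam pages and duplicate pages must be filtered out. Pages flooded with rare terms must be filtered; rare-term flooding means that the ratio of the number of rare terms to the total number of terms on a page exceeds a certain threshold. Traditional spam, such as scraped content and cheating pages, must also be removed. Deduplication of ordinary web pages generally computes a signature for the page (the longest-sentence signature is one kind), and pages with the same signature are treated as the same page. Deduplication of rare web pages additionally signs the rare terms, because some pages share the same longest-sentence signature yet contain different rare terms. Pages whose page signature and rare-term signature are both identical are the same page. Among identical pages under the same primary domain, the one of better quality is kept.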
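Based on the same inventive concept, an embodiment of the present invention further provides an apparatus for mining rare Internet resources, which supports the mining method of any of the above preferred embodiments or a combination of them. FIG. 2 is a schematic structural diagram of the apparatus. Referring to FIG. 2, the apparatus includes at least:
a word segmentation module 210 configured to perform word segmentation on the web pages indexed by the search engine to generate a plurality of terms;
a search module 220, coupled to the word segmentation module 210 and configured to find, among the terms generated by the word segmentation module 210, the rare terms whose number of occurrences is below the count threshold and which are meaningful, where "meaningful" means that the term carries the meaning of a content word and can indicate event content;
the search module 220 being further configured to find the web pages that contain a rare term and define them as rare web pages;
a mining module 230, coupled to the search module 220 and configured to process the rare web pages defined by the search module 220 to mine rare network resources.
In a preferred embodiment, the search module 220 may also be configured to:
compute the frequency with which the terms occur in the web pages;
filter out, from the terms, the N terms with the highest frequency;
find, among the remaining terms, the rare terms whose number of occurrences is below the count threshold and which are meaningful.
In a preferred embodiment, the search module 220 may also be configured to:
for each remaining term,
compute the term's type and the number of web pages containing it;
find the terms whose number of occurrences in the web pages is below the threshold and which are meaningful according to their type, these being the rare terms.
In a preferred embodiment, the types of the N terms with the higher frequency of occurrence include at least one of the following:
modal particles;
conjunctions;
auxiliary words;
type-denoting names.
In a preferred embodiment, the type of each remaining term includes at least one of the following:
decimal type;
string type;
phone number type;
Chinese character type;
email type.
In a preferred embodiment, the mining module 230 may also be configured to perform spam removal on the rare web pages.
In a preferred embodiment, the mining module 230 may also be configured to perform deduplication on the rare web pages.
In a preferred embodiment, the mining module 230 may also be configured to:
compute a page signature for each rare web page and generate a term signature for the rare terms in it;
keep one of the multiple web pages whose page signature and term signature are both identical.
In a preferred embodiment, the meaningful terms include any of the following: a phone number, an email address, a personal name, a place name, an item name.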
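The method and apparatus for mining rare Internet resources provided by the embodiments of the present invention achieve the following beneficial effects:
In the embodiments of the present invention, rare network resources are mined from all the web pages indexed by the search engine; the coverage is broad and gives users richer data support. Further, the indexed web pages are processed by means such as word segmentation to find the rare web pages, from which the rare network resources are then mined; this greatly improves mining efficiency and avoids the repeated steps of mining every web page individually. Moreover, because the method mines the web pages indexed by the search engine, it does not need to analyze users' previous query history and can supply rare network resources on a user's first query, so there is no lag and the user experience improves. In summary, the embodiments can proactively mine high-quality rare resources before a user queries and can return relatively rich results when the user does query.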
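Numerous specific details are set forth in the description provided here. It will be understood, however, that embodiments of the invention can be practiced without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all the features of a single foregoing disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the devices of the embodiments can be adaptively changed and placed in one or more devices different from the embodiment. The modules, units or components of the embodiments may be combined into one module, unit or component, and may further be divided into a plurality of sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described herein include certain features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will appreciate that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the apparatus for mining rare Internet resources according to embodiments of the present invention. The invention can also be implemented as a device or apparatus program (e.g., a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
For example, FIG. 3 shows a computing device that can implement the method for mining rare Internet resources according to the present invention. The computing device conventionally includes a processor 310 and a computer program product or computer-readable medium in the form of a memory 320. The memory 320 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read-Only Memory), an EPROM, a hard disk or a ROM. The memory 320 has a storage space 330 for program code 331 for performing any of the method steps described above. For example, the storage space 330 for program code may include individual program codes 331 that each implement one of the steps of the above method. The program code can be read from or written to one or more computer program products. Such computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 4. The storage unit may have storage segments, storage spaces and the like arranged similarly to the memory 320 in the computing device of FIG. 3. The program code may, for example, be compressed in an appropriate form. Typically, the storage unit includes computer-readable code 331', i.e., code readable by a processor such as 310, which, when run by a computing device, causes the computing device to perform each step of the method described above.
References herein to "one embodiment", "an embodiment" or "one or more embodiments" mean that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Note also that instances of the phrase "in one embodiment" do not necessarily all refer to the same embodiment.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.
It should also be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the inventive subject matter. Many modifications and variations will therefore be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the disclosure made here is illustrative and not restrictive, and the scope of the invention is defined by the appended claims.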

Claims (20)

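  1. A method for mining rare Internet resources, comprising:
    performing word segmentation on web pages indexed by a search engine to generate a plurality of terms;
    searching the plurality of terms for rare terms whose number of occurrences is below a count threshold and which are meaningful, wherein "meaningful" means that a term carries the meaning of a content word and can indicate event content;
    finding the web pages that contain a rare term and defining them as rare web pages;
    processing the rare web pages to mine rare network resources.
  2. The method according to claim 1, wherein searching the plurality of terms for rare terms whose number of occurrences is below the count threshold and which are meaningful comprises:
    computing the frequency with which the plurality of terms occur in the web pages;
    filtering out, from the plurality of terms, the N terms with the higher frequency of occurrence;
    searching the remaining terms for rare terms whose number of occurrences is below the count threshold and which are meaningful.
  3. The method according to claim 1 or 2, wherein searching the remaining terms for rare terms whose number of occurrences is below the count threshold and which are meaningful comprises:
    for each remaining term,
    computing the type of the term and the number of web pages containing the term;
    finding the terms whose number of occurrences in the web pages is below the threshold and which are meaningful according to their type, these being the rare terms.
  4. The method according to any one of claims 1 to 3, wherein the types of the N terms with the higher frequency of occurrence include at least one of the following:
    modal particles;
    conjunctions;
    auxiliary words;
    type-denoting names.
  5. The method according to any one of claims 1 to 4, wherein the type of the term includes at least one of the following:
    decimal type;
    string type;
    phone number type;
    Chinese character type;
    email type.
  6. The method according to any one of claims 1 to 5, wherein processing the rare web pages comprises: performing spam removal on the rare web pages.
  7. The method according to any one of claims 1 to 6, wherein performing spam removal on the rare web pages comprises: performing deduplication on the rare web pages.
  8. The method according to any one of claims 1 to 7, wherein performing deduplication on the rare web pages comprises:
    computing a page signature for the rare web page and generating a term signature for the rare terms in the rare web page;
    keeping one of the multiple web pages whose page signature and term signature are both identical.
  9. The method according to any one of claims 1 to 8, wherein the meaningful terms include any one of the following: a phone number, an email address, a personal name, a place name, an item name.
  10. An apparatus for mining rare Internet resources, comprising:
    a word segmentation module configured to perform word segmentation on web pages indexed by a search engine to generate a plurality of terms;
    a search module configured to search the plurality of terms for rare terms whose number of occurrences is below a count threshold and which are meaningful, wherein "meaningful" means that a term carries the meaning of a content word and can indicate event content;
    the search module being further configured to find the web pages that contain a rare term and define them as rare web pages;
    a mining module configured to process the rare web pages to mine rare network resources.
  11. The apparatus according to claim 10, wherein the search module is further configured to:
    compute the frequency with which the plurality of terms occur in the web pages;
    filter out, from the plurality of terms, the N terms with the higher frequency of occurrence;
    search the remaining terms for rare terms whose number of occurrences is below the count threshold and which are meaningful.
  12. The apparatus according to claim 10 or 11, wherein the search module is further configured to:
    for each remaining term,
    compute the type of the term and the number of web pages containing the term;
    find the terms whose number of occurrences in the web pages is below the threshold and which are meaningful according to their type, these being the rare terms.
  13. The apparatus according to any one of claims 10 to 12, wherein the types of the N terms with the higher frequency of occurrence include at least one of the following:
    modal particles;
    conjunctions;
    auxiliary words;
    type-denoting names.
  14. The apparatus according to any one of claims 10 to 13, wherein the type of the term includes at least one of the following:
    decimal type;
    string type;
    phone number type;
    Chinese character type;
    email type.
  15. The apparatus according to any one of claims 10 to 14, wherein the mining module is further configured to perform spam removal on the rare web pages.
  16. The apparatus according to any one of claims 10 to 15, wherein the mining module is further configured to perform deduplication on the rare web pages.
  17. The apparatus according to any one of claims 10 to 16, wherein the mining module is further configured to:
    compute a page signature for the rare web page and generate a term signature for the rare terms in the rare web page;
    keep one of the multiple web pages whose page signature and term signature are both identical.
  18. The apparatus according to any one of claims 10 to 17, wherein the meaningful terms include any one of the following: a phone number, an email address, a personal name, a place name, an item name.
  19. A computer program comprising computer-readable code which, when run on a computing device, causes the computing device to perform the method for mining rare Internet resources according to any one of claims 1-9.
  20. A computer-readable medium storing the computer program according to claim 19.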
PCT/CN2015/080803 2014-06-30 2015-06-04 Method and apparatus for mining rare Internet resources WO2016000511A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410307883.0 2014-06-30
CN201410307883.0A CN104050294A (zh) 2014-06-30 2014-06-30 Method and apparatus for mining rare Internet resources

Publications (1)

Publication Number Publication Date
WO2016000511A1 (zh) 2016-01-07

Family

ID=51503126

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/080803 WO2016000511A1 (zh) 2014-06-30 2015-06-04 Method and apparatus for mining rare Internet resources

Country Status (2)

Country Link
CN (1) CN104050294A (zh)
WO (1) WO2016000511A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
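CN104050294A (zh) * 2014-06-30 2014-09-17 北京奇虎科技有限公司 Method and apparatus for mining rare Internet resources
CN107729327A (zh) * 2017-09-30 2018-02-23 联想(北京)有限公司 Interpretation method and interpretation apparatus
CN110889274B (zh) * 2018-08-17 2022-02-08 北大方正集团有限公司 Information quality evaluation method, apparatus, device and computer-readable storage medium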

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
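CN101149739A (zh) * 2007-08-24 2008-03-26 中国科学院计算技术研究所 Internet-oriented method and system for mining meaningful strings
CN102479191A (zh) * 2010-11-22 2012-05-30 阿里巴巴集团控股有限公司 Method and apparatus for providing multi-granularity word segmentation results
CN103841173A (zh) * 2012-11-27 2014-06-04 大连灵动科技发展有限公司 Vertical web spider
CN104050294A (zh) * 2014-06-30 2014-09-17 北京奇虎科技有限公司 Method and apparatus for mining rare Internet resources
CN104484388A (zh) * 2014-12-10 2015-04-01 北京奇虎科技有限公司 Method and apparatus for screening scarce-information pages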

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
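CN101710343A (zh) * 2009-12-11 2010-05-19 北京中机科海科技发展有限公司 System and method for automatic ontology construction based on text mining
CN101968801A (zh) * 2010-09-21 2011-02-09 上海大学 Method for extracting keywords from a single text
CN102682085A (zh) * 2012-04-18 2012-09-19 北京十分科技有限公司 Method for web page deduplication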

Also Published As

Publication number Publication date
CN104050294A (zh) 2014-09-17

Similar Documents

Publication Publication Date Title
US10552462B1 (en) Systems and methods for tokenizing user-annotated names
US7424421B2 (en) Word collection method and system for use in word-breaking
US7461056B2 (en) Text mining apparatus and associated methods
Wang et al. Mining geographic knowledge using location aware topic model
US9507867B2 (en) Discovery engine
US20140180934A1 (en) Systems and Methods for Using Non-Textual Information In Analyzing Patent Matters
WO2015149533A1 (zh) Method and apparatus for word segmentation based on web page content classification
US20180181646A1 (en) System and method for determining identity relationships among enterprise data entities
CN110020422A (zh) Method, apparatus and server for determining feature words
JP2005085285A5 (zh)
US11361030B2 (en) Positive/negative facet identification in similar documents to search context
US20160283593A1 (en) Salient terms and entities for caption generation and presentation
CN106933787A (zh) Method for computing the similarity of judgment documents, search apparatus and computer device
US10083398B2 (en) Framework for annotated-text search using indexed parallel fields
Vani et al. Investigating the impact of combined similarity metrics and POS tagging in extrinsic text plagiarism detection system
US10417285B2 (en) Corpus generation based upon document attributes
CN109063184B (zh) Multilingual news text clustering method, storage medium and terminal device
CN104978332A (zh) Method and apparatus for generating user-generated-content tag data, and related methods and apparatuses
Rao et al. External & intrinsic plagiarism detection: VSM & discourse markers based approach
WO2016000511A1 (zh) Method and apparatus for mining rare Internet resources
WO2022105178A1 (zh) Keyword extraction method and related apparatus
CN110738048B (zh) Keyword extraction method, apparatus and terminal device
CN109902148B (zh) Method for automatically completing the company names of address-book contacts
CN107577667B (zh) Entity word processing method and apparatus
CN103092838A (zh) Method and apparatus for obtaining English words

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15814490

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15814490

Country of ref document: EP

Kind code of ref document: A1