WO2019196209A1 - Event information analysis method, readable storage medium, terminal device and apparatus - Google Patents

Event information analysis method, readable storage medium, terminal device and apparatus

Info

Publication number
WO2019196209A1
WO2019196209A1 PCT/CN2018/093346 CN2018093346W WO2019196209A1 WO 2019196209 A1 WO2019196209 A1 WO 2019196209A1 CN 2018093346 W CN2018093346 W CN 2018093346W WO 2019196209 A1 WO2019196209 A1 WO 2019196209A1
Authority
WO
WIPO (PCT)
Prior art keywords
search result
initial
initial search
extended
keyword
Prior art date
Application number
PCT/CN2018/093346
Other languages
French (fr)
Chinese (zh)
Inventor
陈一恋
汪伟
王晓伟
罗傲雪
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2019196209A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Definitions

  • The present application belongs to the field of computer technology, and in particular relates to an event information analysis method, a computer readable storage medium, a terminal device and an apparatus.
  • The embodiments of the present application provide an event information analysis method, a computer readable storage medium, a terminal device and an apparatus, so as to solve the problem that existing event information analysis methods yield limited search results and have low analysis efficiency.
  • a first aspect of the embodiment of the present application provides an event information analysis method, which may include:
  • An extended keyword is selected in the initial search result, where the extended keyword is a word whose similarity with the initial keyword is greater than a preset similarity threshold;
  • Extracting target event statements from the initial search result and the extended search result, where a target event statement is a statement including an event keyword and a preset matching field, and the event keyword is the initial keyword or the extended keyword;
  • the matching field in the target event statement is determined as the event body corresponding to the target event statement.
  • a second aspect of the embodiments of the present application provides a computer readable storage medium storing computer readable instructions that, when executed by a processor, implement the steps of the event information analysis method described above.
  • a third aspect of the embodiments of the present application provides an event information analysis terminal device including a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, the processor implementing the steps of the event information analysis method described above when executing the computer readable instructions.
  • a fourth aspect of the embodiments of the present application provides an event information analyzing apparatus, which may include a module for implementing the steps of the event information analyzing method.
  • Compared with the prior art, the embodiments of the present application have the following beneficial effects: on the basis of the initial keyword, extended keywords are further introduced to expand it, so that broader search results can be obtained; and because regular expressions are used to match event bodies automatically, analysis efficiency is greatly improved.
  • FIG. 1 is a flowchart of an embodiment of an event information analysis method according to an embodiment of the present application
  • FIG. 2 is a schematic flow chart of an initial search result storage process
  • FIG. 3 is a schematic flow chart of filtering out extended keywords in initial search results
  • FIG. 4 is a schematic flow chart of a process of selecting an extended search result
  • FIG. 5 is a structural diagram of an embodiment of an event information analysis apparatus according to an embodiment of the present application.
  • FIG. 6 is a schematic block diagram of an event information analysis terminal device according to an embodiment of the present application.
  • an embodiment of an event information analysis method in an embodiment of the present application may include:
  • Step S101 Acquire an initial search result corresponding to the preset initial keyword by using a preset web search engine.
  • one or more network search engines may be used to perform automatic search in the Internet according to actual conditions.
  • When searching, a search scope may be specified so that only certain websites are searched. For example, to search for financial information, one or more financial websites may be designated as the search scope and the search performed only within that scope. Alternatively, no search scope is specified and the entire Internet is searched.
  • The initial keyword can be determined according to the actual field of analysis. For example, in the field of quantitative investment, investors care about the debt situation of their investees, especially any debt defaults by the investees, so “debt default” can be used as the initial keyword to search for related webpage content, that is, the initial search result.
  • Note that the number of initial search results obtained by the search may be extremely large, and storing all of them would consume huge storage resources. Therefore, in this embodiment, the number of initial search results is set in advance and denoted PageNum, and only search results within that number are stored.
  • The value of PageNum can be determined according to the storage capacity of the preset storage medium. The two are positively correlated: the larger the storage capacity, the larger the value of PageNum, and the smaller the storage capacity, the smaller the value of PageNum.
  • the specific storage process may include the steps shown in FIG. 2:
  • Step S1011 Perform a hash operation on the initial search result to obtain a hash value of the initial search result.
  • In this embodiment, the complete content of the webpage may be hashed, but that consumes a large amount of time. To speed up the operation, only the summary content of the webpage may be hashed instead, specifically:
  • First, the summary content of the initial search result is obtained according to the following formula: SubContent = Head(PageContent) ∪ Tail(PageContent)
  • PageContent is the webpage text in the initial search result
  • Head(PageContent) is the first M characters of the webpage text in the initial search result
  • Tail(PageContent) is the last N characters of the webpage text in the initial search result
  • M and N are integers greater than 1
  • SubContent is the summary content of the initial search result.
  • Then, the hash value of the initial search result is calculated according to the following formula: Key = Hash(SubContent) = Hash[Head(PageContent) ∪ Tail(PageContent)]
  • Hash is a preset hash function
  • Key is the hash value of the initial search result.
  • Step S1012 Search for a hash value of the initial search result in a preset hash value set.
  • The hash value set records the hash values of the webpages already stored in the preset storage medium. Each of these hash values is calculated in a manner similar to step S1011; for details, refer to step S1011, which is not repeated here.
  • If the lookup fails, that is, the hash value of the initial search result is not found in the hash value set, step S1013 is performed. If the lookup succeeds, that is, the hash value of the initial search result is found in the hash value set, step S1014 is performed.
  • Step S1013 Add a hash value of the initial search result into the hash value set, and store the initial search result in the storage medium.
  • The process of adding the hash value of the initial search result to the hash value set may be expressed as:
  • HashList = HashList ∪ {Key}, where HashList denotes the hash value set.
  • Step S1014 Discard the initial search result.
  • If the hash value of the initial search result is found in the hash value set, content identical to the initial search result is already stored in the storage medium and need not be stored again.
  • Step S102 Filter out the extended keyword in the initial search result.
  • the extended keyword is a word whose similarity with the initial keyword is greater than a preset similarity threshold.
  • step S102 may include the steps as shown in FIG. 3:
  • Step S1021 Calculate the literal overlap between each word in the initial search result and the initial keyword.
  • The literal overlap between each word in the initial search result and the initial keyword may be calculated according to the following formula:
  • w is any word in the initial search result
  • Step S1022 Calculate the search overlap between each word in the initial search result and the initial keyword.
  • The search overlap between each word in the initial search result and the initial keyword may be calculated according to the following formula:
  • Step S1023 Calculate the similarity between each word in the initial search result and the initial keyword.
  • The similarity between each word in the initial search result and the initial keyword may be calculated according to the following formula:
  • Step S1024 Determine each word whose similarity with the initial keyword is greater than the similarity threshold as an extended keyword.
  • For example, if the initial keyword is “debt default”, the above process may determine extended keywords such as “debt dispute”, “debt litigation”, “debt storm”, “debt collapse”, “debt warning”, “debt rights protection” and other similar words.
  • Step S103 Acquire an extended search result corresponding to the extended keyword by using the network search engine.
  • The extended search results obtained by the search may be extremely numerous, and storing all of them would consume huge storage resources. Therefore, in this embodiment, only a part of the extended search results is selected and stored through the steps shown in FIG. 4:
  • Step S1031 Calculate the importance score of each extended keyword.
  • the importance scores of each of the extended keywords may be separately calculated according to the following formula:
  • freq(ew) is the frequency at which ew appears in the initial search result
  • Freq(ew) is the frequency at which ew appears in the preset sample corpus. This frequency is obtained from large-scale statistics over language material that has actually occurred in real language use; it is a fixed value and can be obtained directly by table lookup
  • ExWord is the set composed of the extended keywords. Max[Freq(ExWord)] is the maximum of the frequencies at which the extended keywords appear in the sample corpus, namely:
  • ew_s is the extended keyword with serial number s, 1 ≤ s ≤ S, S is the number of extended keywords, ln is the natural logarithm function, and Score(ew) is the importance score of ew.
  • The importance score of an extended keyword is positively correlated with its frequency in the initial search result and negatively correlated with its frequency in the sample corpus. That is, the less frequently an extended keyword occurs in normal language use and the more frequently it appears in the initial search result, the higher its importance score.
  • Step S1032 Calculate the interception number of the extended search results corresponding to each extended keyword, that is, how many of those results are kept.
  • The interception number of the extended search results corresponding to each extended keyword may be calculated according to the following formula:
  • is a preset scale factor
  • PageNum is the preset number of the initial search results
  • ExPageNum(ew) is the interception number of the extended search results corresponding to ew.
  • Step S1033 Obtain the extended search results corresponding to each extended keyword according to the interception number.
  • Step S104 Extract the target event statements from the initial search result and the extended search result.
  • the target event statement is a statement including an event keyword and a preset matching field, and the event keyword is the initial keyword or the extended keyword.
  • the matching field is a candidate event body.
  • the matching field may be a specific company, organization, institution name, or the like.
  • An event body database may be set up in advance to hold the event bodies that may be involved. For example, all bond issuance information can be extracted from the database of the bond management department, or obtained from other third-party agencies or from the network, and the specific issuers identified and saved in the event body database.
  • The event body database is continually updated. For example, when an issued bond has been fully honored as scheduled, that bond can no longer default; if its issuer has no other outstanding bonds, the issuer can be deleted from the event body database. Likewise, if a newly issued bond appears and its issuer is not stored in the event body database, the issuer can be added to the event body database.
  • Step S105 Match the target event statement against a preset regular expression.
  • If the match succeeds, step S106 is performed.
  • Step S106 Determine the matching field in the target event statement as the event body corresponding to the target event statement.
  • Here, event_keyword is the event keyword and keyword is the matching field.
  • The above is positive matching, that is, determining that a matching field is an event body. Negative matching may also be performed, that is, determining that a matching field is not an event body.
  • This method makes the extraction of debt default events more accurate and efficient and benefits users, especially investment institutions such as banks, which can thereby obtain rapid warnings.
  • On the basis of the initial keyword, the embodiments of the present application further introduce extended keywords to expand it, so that broader search results can be obtained; and because regular expressions are used to match event bodies automatically, analysis efficiency is greatly improved.
  • FIG. 5 is a structural diagram of an embodiment of an event information analysis apparatus provided by an embodiment of the present application.
  • an event information analyzing apparatus may include:
  • the initial search module 501 is configured to obtain an initial search result corresponding to the preset initial keyword by using a preset web search engine;
  • the extended keyword screening module 502 is configured to filter out an extended keyword in the initial search result, where the extended keyword is a word whose similarity with the initial keyword is greater than a preset similarity threshold;
  • An extended search module 503, configured to obtain, by using the network search engine, an extended search result corresponding to the extended keyword
  • the target event statement extraction module 504 is configured to extract target event statements from the initial search result and the extended search result, where a target event statement is a statement including an event keyword and a preset matching field, and the event keyword is the initial keyword or the extended keyword;
  • a regular matching module 505 configured to match the target event statement by using a preset regular expression
  • the event body determining module 506 is configured to determine, when the matching is successful, the matching field in the target event statement as an event body corresponding to the target event statement.
  • the event information analyzing apparatus may further include:
  • a hash operation module configured to perform a hash operation on the initial search result to obtain a hash value of the initial search result
  • a hash value searching module configured to search a preset hash value set for the hash value of the initial search result, where the hash value set records the hash values of webpages already stored in a preset storage medium;
  • a search result storage module configured to, if the hash value of the initial search result is not found in the hash value set, add the hash value of the initial search result to the hash value set and store the initial search result in the storage medium;
  • a search result discarding module configured to discard the initial search result if a hash value of the initial search result is found in the hash value set.
  • the hash operation module may include:
  • a summary content obtaining unit configured to obtain the summary content of the initial search result according to the following formula: SubContent = Head(PageContent) ∪ Tail(PageContent)
  • PageContent is the webpage text in the initial search result
  • Head(PageContent) is the first M characters of the webpage text in the initial search result
  • Tail(PageContent) is the last N characters of the webpage text in the initial search result
  • M and N are integers greater than 1
  • SubContent is the summary content of the initial search result;
  • a hash value calculation unit for calculating the hash value of the initial search result according to the following formula: Key = Hash(SubContent) = Hash[Head(PageContent) ∪ Tail(PageContent)]
  • Hash is a preset hash function
  • Key is the hash value of the initial search result.
  • the extended keyword screening module may include:
  • a literal overlap calculation unit configured to respectively calculate a literal overlap between each word in the initial search result and the initial keyword according to the following formula:
  • w is any word in the initial search result; the remaining terms of the formula are the initial keyword, the number of words that w and the initial keyword have in common, the larger of the number of words contained in w and the number contained in the initial keyword, and the literal overlap between w and the initial keyword;
  • a similarity calculation unit configured to respectively calculate a similarity between each word in the initial search result and the initial keyword according to the following formula:
  • an extended keyword determining unit configured to determine, as the extended keyword, a word whose similarity with the initial keyword is greater than the similarity threshold.
  • the extended search module may include:
  • the importance score calculation unit is configured to separately calculate the importance scores of each of the extended keywords according to the following formula:
  • the interception number calculation unit is configured to separately calculate the number of interception of the extended search result corresponding to each of the extended keywords according to the following formula:
  • ew_s is the extended keyword with serial number s, 1 ≤ s ≤ S, S is the number of extended keywords, the coefficient in the formula is a preset proportional coefficient, PageNum is the preset number of initial search results, and ExPageNum(ew) is the interception number of the extended search results corresponding to ew;
  • an extended search result obtaining unit configured to respectively obtain extended search results corresponding to the respective extended keywords according to the intercepted number.
  • FIG. 6 is a schematic block diagram of an event information analysis terminal device provided by an embodiment of the present application. For convenience of description, only parts related to the embodiment of the present application are shown.
  • the event information analysis terminal device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server.
  • the event information analysis terminal device 6 may include a processor 60, a memory 61, and computer readable instructions 62 stored in the memory 61 and executable on the processor 60, such as computer readable instructions for performing the event information analysis method described above.
  • When the processor 60 executes the computer readable instructions 62, the steps in the foregoing event information analysis method embodiments are implemented, such as steps S101 to S106 described above.
  • Alternatively, the processor 60, when executing the computer readable instructions 62, implements the functions of the modules/units in the foregoing apparatus embodiments, such as the functions of modules 501 through 506 described above.
  • The functional units in the embodiments of the present application, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and includes a number of computer readable instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application.
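The regular-expression matching of steps S105 and S106 can be illustrated with a minimal Python sketch. The group names event_keyword and keyword follow the names used in the text; everything else is an assumption made for illustration: the pattern itself, the requirement that the candidate event body appears before the event keyword in the statement, the helper names build_pattern and find_event_body, and the sample company names.

```python
import re


def build_pattern(event_keywords, event_bodies):
    """Build a hypothetical pattern: a candidate event body followed,
    later in the same statement, by an event keyword."""
    kw = "|".join(re.escape(k) for k in event_keywords)
    body = "|".join(re.escape(b) for b in event_bodies)
    # Named groups mirror the names in the text:
    #   keyword       -> the matching field (candidate event body)
    #   event_keyword -> the initial or extended keyword
    return re.compile(rf"(?P<keyword>{body}).*?(?P<event_keyword>{kw})")


def find_event_body(statement, pattern):
    """Step S105: match the statement. Step S106: on success, return
    the matching field as the event body; otherwise return None."""
    match = pattern.search(statement)
    return match.group("keyword") if match else None


# Illustrative data only; these names are not from the source.
pattern = build_pattern(
    ["debt default", "debt dispute"],
    ["Acme Corp", "Globex Ltd"],
)
print(find_event_body("Acme Corp was reported to be in debt default.", pattern))
```

In practice the candidate bodies would come from the event body database described above, and a separate pattern would be needed for statements in which the keyword precedes the body.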

Abstract

The present application belongs to the technical field of computers, and relates in particular to an event information analysis method, a computer-readable storage medium, a terminal device and an apparatus. The method comprises: first, acquiring initial search results corresponding to a preset initial keyword by means of a preset network search engine, and screening extension keywords out from the initial search results; next, acquiring extension search results corresponding to the extension keywords by means of the network search engine, and then extracting target event sentences in the initial search results and the extension search results; and finally, matching the target event sentences by means of a preset regular expression, and if the matching is successful, determining matching fields in the target event sentences to be event main-bodies which correspond to the target event sentences. Since extension keywords are introduced, broader search results may be obtained, automatic matching for event main-bodies is achieved by using the regular expression, and analysis efficiency is thus greatly improved.

Description

Event information analysis method, readable storage medium, terminal device and apparatus
The present application claims priority to Chinese Patent Application No. 201810305412.4, entitled "An Event Information Analysis Method, Computer Readable Storage Medium, and Terminal Device", filed with the Chinese Patent Office on April 8, 2018, the entire contents of which are incorporated herein by reference.
Technical field
The present application belongs to the field of computer technology, and in particular relates to an event information analysis method, a computer readable storage medium, a terminal device and an apparatus.
Background
With the continuous development of the social economy, people are increasingly aware of the importance of information, and the ability to obtain information quickly and accurately has become one of the important factors in whether an enterprise succeeds. Analyzing and organizing the news events on the Internet is an effective way to obtain information. However, a simple keyword search often yields limited, insufficiently representative search results. Moreover, after the search results are obtained, the event bodies in them, for example specific companies, organizations and institutions, often need to be analyzed further, and at present this analysis is mainly done manually by professionals, so analysis efficiency is low.
Technical problem
In view of this, the embodiments of the present application provide an event information analysis method, a computer readable storage medium, a terminal device and an apparatus, so as to solve the problem that existing event information analysis methods yield limited search results and have low analysis efficiency.
Technical solution
A first aspect of the embodiments of the present application provides an event information analysis method, which may include:
acquiring an initial search result corresponding to a preset initial keyword through a preset web search engine;
filtering out an extended keyword from the initial search result, where the extended keyword is a word whose similarity with the initial keyword is greater than a preset similarity threshold;
acquiring an extended search result corresponding to the extended keyword through the web search engine;
extracting target event statements from the initial search result and the extended search result, where a target event statement is a statement including an event keyword and a preset matching field, and the event keyword is the initial keyword or the extended keyword;
matching the target event statement against a preset regular expression;
if the match succeeds, determining the matching field in the target event statement as the event body corresponding to the target event statement.
A second aspect of the embodiments of the present application provides a computer readable storage medium storing computer readable instructions that, when executed by a processor, implement the steps of the event information analysis method described above.
A third aspect of the embodiments of the present application provides an event information analysis terminal device including a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, the processor implementing the steps of the event information analysis method described above when executing the computer readable instructions.
A fourth aspect of the embodiments of the present application provides an event information analysis apparatus, which may include modules for implementing the steps of the event information analysis method described above.
Beneficial effects
Compared with the prior art, the embodiments of the present application have the following beneficial effects: on the basis of the initial keyword, extended keywords are further introduced to expand it, so that broader search results can be obtained; and because regular expressions are used to match event bodies automatically, analysis efficiency is greatly improved.
Description of the drawings
FIG. 1 is a flowchart of an embodiment of an event information analysis method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of the initial search result storage process;
FIG. 3 is a schematic flow chart of filtering out extended keywords from the initial search results;
FIG. 4 is a schematic flow chart of the process of selecting extended search results;
FIG. 5 is a structural diagram of an embodiment of an event information analysis apparatus according to an embodiment of the present application;
FIG. 6 is a schematic block diagram of an event information analysis terminal device according to an embodiment of the present application.
Embodiments of the invention
Referring to FIG. 1, an embodiment of an event information analysis method in the embodiments of the present application may include:
Step S101: Acquire an initial search result corresponding to the preset initial keyword through a preset web search engine.
In this embodiment, one or more web search engines may be used, as the situation requires, to search the Internet automatically. When searching, a search scope may be specified so that only certain websites are searched. For example, to search for financial information, one or more financial websites may be designated as the search scope and the search performed only within that scope. Alternatively, no search scope is specified and the entire Internet is searched.
The initial keyword can be determined according to the actual field of analysis. For example, in the field of quantitative investment, investors care about the debt situation of their investees, especially any debt defaults by the investees, so “debt default” can be used as the initial keyword to search for related webpage content, that is, the initial search result.
Note that the number of initial search results obtained by the search may be extremely large, and storing all of them would consume huge storage resources. Therefore, in this embodiment, the number of initial search results is set in advance and denoted PageNum, and only search results within that number are stored. The value of PageNum can be determined according to the storage capacity of the preset storage medium. The two are positively correlated: the larger the storage capacity, the larger the value of PageNum, and the smaller the storage capacity, the smaller the value of PageNum.
The specific storage process may include the steps shown in FIG. 2:
Step S1011: perform a hash operation on the initial search result to obtain the hash value of the initial search result.
In this embodiment, the hash may be computed over the complete content of the web page, but doing so takes a large amount of time. To speed up the computation, the hash may instead be computed over only a digest of the web page, specifically:
First, the digest content of the initial search result is obtained according to the following formula:
SubContent=Head(PageContent)∪Tail(PageContent)
where PageContent is the body text of the web page in the initial search result, Head(PageContent) is the first M characters of that body text, Tail(PageContent) is the last N characters of that body text, M and N are both integers greater than 1, and SubContent is the digest content of the initial search result.
Then, the hash value of the initial search result is calculated according to the following formula:
Key=Hash(SubContent)=Hash[Head(PageContent)∪Tail(PageContent)]
where Hash is a preset hash function and Key is the hash value of the initial search result.
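The digest-then-hash computation of step S1011 can be sketched in Python as follows. This is a minimal illustration, not the applicant's implementation: the function name digest_key, the defaults M=N=64, the choice of SHA-256 as the preset hash function, and the reading of “∪” as string concatenation are all assumptions.

```python
import hashlib


def digest_key(page_content: str, m: int = 64, n: int = 64) -> str:
    """Return Key = Hash(Head(PageContent) ∪ Tail(PageContent)).

    m and n play the role of M and N; SHA-256 is an assumed hash function.
    """
    head = page_content[:m]     # Head(PageContent): first M characters
    tail = page_content[-n:]    # Tail(PageContent): last N characters
    sub_content = head + tail   # "∪" read as concatenation (assumption)
    return hashlib.sha256(sub_content.encode("utf-8")).hexdigest()
```

Because only the head and tail are hashed, two pages that differ only in the middle of the body receive the same key. That is the trade-off the text accepts in exchange for a faster computation.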
Step S1012: look up the hash value of the initial search result in a preset hash value set.
The hash value set records the hash values of the web pages already stored in the preset storage medium. Each of these hash values is computed in the same way as in step S1011, so the details are not repeated here.
If the lookup fails, that is, the hash value of the initial search result is not found in the hash value set, step S1013 is performed. If the lookup succeeds, that is, the hash value is found in the hash value set, step S1014 is performed.
Step S1013: add the hash value of the initial search result to the hash value set, and store the initial search result in the storage medium.
If the hash value set is denoted HashList, adding the hash value of the initial search result to the set can be expressed as:
HashList=HashList∪{Key}
Step S1014: discard the initial search result.
If the hash value of the initial search result is found in the hash value set, content identical to the initial search result is already stored in the storage medium, and it does not need to be stored again.
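Steps S1012 to S1014, together with the PageNum cap described earlier, can be sketched as a small in-memory store. The class name PageStore, the method offer, and the use of a Python set and list in place of the preset storage medium are illustrative assumptions, as is the SHA-256 digest helper.

```python
import hashlib


def digest_key(page_content: str, m: int = 64, n: int = 64) -> str:
    # Hash of the first m and last n characters (assumed digest scheme).
    sub_content = page_content[:m] + page_content[-n:]
    return hashlib.sha256(sub_content.encode("utf-8")).hexdigest()


class PageStore:
    """Keeps at most page_num distinct pages, deduplicated by digest hash."""

    def __init__(self, page_num: int):
        self.page_num = page_num   # PageNum: preset number of results to keep
        self.hash_list = set()     # HashList: hashes of pages already stored
        self.pages = []            # stand-in for the storage medium

    def offer(self, page_content: str) -> bool:
        """Return True if the page was stored, False if it was discarded."""
        if len(self.pages) >= self.page_num:
            return False                   # beyond PageNum: do not store
        key = digest_key(page_content)     # step S1011
        if key in self.hash_list:          # step S1012: lookup succeeded
            return False                   # step S1014: discard duplicate
        self.hash_list.add(key)            # step S1013: HashList ∪ {Key}
        self.pages.append(page_content)
        return True
```

A set gives constant-time membership checks on average, which suits a lookup that runs once per search result.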
Step S102: filter extended keywords out of the initial search results.
An extended keyword is a word whose similarity to the initial keyword is greater than a preset similarity threshold.
Specifically, step S102 may include the steps shown in FIG. 3:
Step S1021: calculate the literal overlap between each word in the initial search results and the initial keyword.
For example, the literal overlap between each word in the initial search results and the initial keyword can be calculated according to the following formula:
LitOverlap(w, kw)=SameChar(w, kw)/MaxChar(w, kw)
where w is any word in the initial search results, kw is the initial keyword, SameChar(w, kw) is the number of characters that w and kw have in common, MaxChar(w, kw) is the larger of the number of characters in w and the number of characters in kw, and LitOverlap(w, kw) is the literal overlap between w and kw. (kw, LitOverlap, SameChar and MaxChar are stand-in names for the symbols in formula images PCTCN2018093346-appb-000001 to -000008.)
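The literal overlap of step S1021 can be sketched as below, assuming it is the ratio of shared characters to the longer word's length. The text does not say how repeated characters are counted, so the multiset intersection used here is an assumption, as is the function name.

```python
from collections import Counter


def literal_overlap(w: str, kw: str) -> float:
    """Shared characters of w and kw divided by the longer length."""
    if not w or not kw:
        return 0.0
    # Multiset intersection: a character repeated in both words counts
    # as many times as it occurs in both (assumed counting rule).
    shared = sum((Counter(w) & Counter(kw)).values())
    return shared / max(len(w), len(kw))
```

For example, 债务纠纷 and 债务违约 share the two characters 债 and 务 out of four, so their literal overlap under this sketch is 0.5.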
Step S1022: calculate the search overlap between each word in the initial search results and the initial keyword.
For example, the search overlap between each word in the initial search results and the initial keyword can be calculated according to the following formula:
SearchOverlap(w, kw)=SamePage(w, kw)/MaxPage(w, kw)
where w is any word in the initial search results, kw is the initial keyword, SamePage(w, kw) is the number of pages in the initial search results on which w and kw both appear, MaxPage(w, kw) is the larger of the number of pages on which w appears and the number of pages on which kw appears in the initial search results, and SearchOverlap(w, kw) is the search overlap between w and kw. (kw, SearchOverlap, SamePage and MaxPage are stand-in names for the symbols in formula images PCTCN2018093346-appb-000009 to -000015.)
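The search overlap of step S1022 can be sketched in the same way, assuming it is the ratio of co-occurrence pages to the larger of the two page counts. Representing the initial search results as a list of page strings and testing occurrence by substring are simplifying assumptions.

```python
def search_overlap(w: str, kw: str, pages: list[str]) -> float:
    """Pages containing both w and kw, divided by the larger page count."""
    pages_w = sum(1 for p in pages if w in p)      # pages on which w appears
    pages_kw = sum(1 for p in pages if kw in p)    # pages on which kw appears
    both = sum(1 for p in pages if w in p and kw in p)
    denom = max(pages_w, pages_kw)
    return both / denom if denom else 0.0
```

Dividing by the larger count keeps the value in the range 0 to 1 and penalizes a word that appears on many pages unrelated to the initial keyword.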
Step S1023: calculate the similarity between each word in the initial search results and the initial keyword.
For example, the similarity between each word in the initial search results and the initial keyword can be calculated according to the following formula:
Sim(w, kw)=k1×LitOverlap(w, kw)+k2×SearchOverlap(w, kw)
where w is any word in the initial search results, kw is the initial keyword, LitOverlap(w, kw) is the literal overlap calculated in step S1021, SearchOverlap(w, kw) is the search overlap calculated in step S1022, k1 and k2 are preset weight coefficients with k1+k2=1, and Sim(w, kw) is the similarity between w and kw. (kw, Sim, LitOverlap and SearchOverlap are stand-in names for the symbols in formula images PCTCN2018093346-appb-000016 to -000018.)
Step S1024: determine the words whose similarity to the initial keyword is greater than the similarity threshold as the extended keywords.
For example, if the initial keyword is “debt default” (债务违约), the above process may yield extended keywords such as “debt dispute” (债务纠纷), “debt litigation” (债务诉讼), “debt storm” (债务风暴), “debt collapse” (债务崩塌), “debt warning” (债务预警) and “debt rights protection” (债务维权).
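Steps S1021 to S1024 can be combined into one sketch. It assumes the similarity is a weighted sum of the two overlaps with k1+k2=1. The function names, the weights k1=k2=0.5, the threshold 0.4, and the sample pages are all hypothetical values chosen for illustration.

```python
from collections import Counter


def literal_overlap(w: str, kw: str) -> float:
    # Shared characters over the longer length (assumed form of S1021).
    shared = sum((Counter(w) & Counter(kw)).values())
    return shared / max(len(w), len(kw)) if w and kw else 0.0


def search_overlap(w: str, kw: str, pages: list[str]) -> float:
    # Co-occurrence pages over the larger page count (assumed form of S1022).
    both = sum(1 for p in pages if w in p and kw in p)
    denom = max(sum(w in p for p in pages), sum(kw in p for p in pages))
    return both / denom if denom else 0.0


def similarity(w, kw, pages, k1=0.5, k2=0.5):
    # Step S1023: weighted sum with k1 + k2 = 1 (assumed form).
    return k1 * literal_overlap(w, kw) + k2 * search_overlap(w, kw, pages)


def extended_keywords(candidates, kw, pages, threshold=0.4, k1=0.5, k2=0.5):
    # Step S1024: keep words whose similarity exceeds the threshold.
    return [w for w in candidates
            if w != kw and similarity(w, kw, pages, k1, k2) > threshold]


pages = ["债务违约引发债务纠纷", "债务违约公告", "债务纠纷升级", "天气预报"]
print(extended_keywords(["债务纠纷", "天气预报", "债务违约"], "债务违约", pages))
```

The candidate list is assumed to come from a separate word-segmentation step, which the text does not describe.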
Step S103: obtain, through the web search engine, the extended search results corresponding to the extended keywords.
Note that the number of extended search results returned may be extremely large, and storing all of them would consume a great deal of storage. In this embodiment, therefore, only part of the extended search results may be selected for storage, through the steps shown in FIG. 4:
Step S1031: calculate the importance score of each extended keyword.
For example, the importance score of each extended keyword can be calculated according to the following formula:
[Formula image PCTCN2018093346-appb-000019: definition of Score(ew)]
where ew is any one of the extended keywords; freq(ew) is the number of times ew occurs in the initial search results; Freq(ew) is the number of times ew occurs in a preset sample corpus, a fixed value derived from large-scale statistics over language material that has actually occurred in real use, which can be obtained directly, for example by table lookup; ExWord is the set of all extended keywords; and max[Freq(ExWord)] is the maximum of the frequencies with which the extended keywords occur in the sample corpus, namely:
max[Freq(ExWord)]=max[Freq(ew_1),Freq(ew_2),...,Freq(ew_s),...,Freq(ew_S)]
where ew_s is the extended keyword with index s, 1≤s≤S, S is the number of extended keywords, ln is the natural logarithm function, and Score(ew) is the importance score of ew.
As the above shows, the importance score of an extended keyword is positively correlated with its frequency in the initial search results and negatively correlated with its frequency in the sample corpus. In other words, the less often an extended keyword occurs in normal language use and the more often it occurs in the initial search results, the higher its importance score.
Step S1032: calculate, for each extended keyword, the number of corresponding extended search results to retain.
For example, this number can be calculated for each extended keyword according to the following formula:
[Formula image PCTCN2018093346-appb-000020: definition of ExPageNum(ew)]
where α is a preset proportionality coefficient, PageNum is the preset number of initial search results, and ExPageNum(ew) is the number of extended search results to retain for ew.
Step S1033: obtain the extended search results corresponding to each extended keyword according to its retained number.
As the above shows, the number of extended search results retained for an extended keyword is positively correlated with its importance score: the higher the importance score of an extended keyword, the more of its extended search results are retained.
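The exact formulas for Score(ew) and ExPageNum(ew) appear only as images, so the sketch below uses hypothetical stand-ins that merely satisfy the stated properties. The assumed score is freq(ew)×ln(1+max[Freq(ExWord)]/Freq(ew)), which rises with freq(ew) and falls with Freq(ew). The assumed allocation gives each keyword a share of α×PageNum proportional to its score. Neither form should be read as the formula of the application.

```python
import math


def importance_scores(freq_in_results: dict, freq_in_corpus: dict) -> dict:
    """Assumed score: freq(ew) * ln(1 + max[Freq(ExWord)] / Freq(ew)).

    Corpus frequencies are assumed to be positive.
    """
    max_corpus = max(freq_in_corpus[ew] for ew in freq_in_results)
    return {ew: f * math.log(1 + max_corpus / freq_in_corpus[ew])
            for ew, f in freq_in_results.items()}


def retained_counts(scores: dict, page_num: int, alpha: float = 1.0) -> dict:
    """Assumed allocation: alpha * PageNum * Score(ew) / sum of all scores."""
    total = sum(scores.values())
    if total == 0:
        return {ew: 0 for ew in scores}
    # int() rounds down, so the counts never exceed alpha * page_num in total.
    return {ew: int(alpha * page_num * s / total) for ew, s in scores.items()}


freq_results = {"债务纠纷": 10, "债务诉讼": 10, "债务风暴": 2}    # freq(ew)
freq_corpus = {"债务纠纷": 100, "债务诉讼": 1000, "债务风暴": 100}  # Freq(ew)
scores = importance_scores(freq_results, freq_corpus)
print(retained_counts(scores, page_num=100))
```

With these sample frequencies, the keyword that is frequent in the search results but rare in the corpus receives the largest share, matching the correlation described in the text.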
Step S104: extract the target event sentences from the initial search results and the extended search results.
A target event sentence is a sentence that contains both an event keyword and a preset matching field, where the event keyword is the initial keyword or an extended keyword.
The matching field is a candidate event subject; specifically, it may be the name of a particular company, organization, institution or the like. Preferably, an event subject database may be set up in advance to hold all event subjects that may be involved. For example, all bond issuance information can be extracted from the database of a bond regulator, or obtained from other third-party institutions or from the network, and the specific issuers identified from it are then saved in the event subject database.
Note that the event subject database is continually updated. For example, once an issued bond has been fully honored on schedule, it can no longer default; if its issuer has no other outstanding bonds, the issuer can be removed from the event subject database. Conversely, if a newly issued bond appears whose issuer is not yet in the event subject database, that issuer can be added to the database.
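The extraction of target event sentences in step S104 can be sketched as a filter over sentences. The function name, the sentence-splitting rule, and the use of an in-memory set in place of the event subject database are assumptions made for illustration.

```python
import re


def target_event_sentences(text: str, event_keywords: set, subjects: set) -> list:
    """Return (sentence, matched subjects, matched keywords) for each sentence
    that contains at least one event keyword and at least one matching field."""
    # Split on Chinese and ASCII sentence-ending marks and on newlines
    # (assumed rule; "." is left out to avoid splitting decimals).
    sentences = [s.strip() for s in re.split(r"[。!?!?\n]+", text) if s.strip()]
    hits = []
    for sentence in sentences:
        keywords = sorted(k for k in event_keywords if k in sentence)
        fields = sorted(x for x in subjects if x in sentence)
        if keywords and fields:
            hits.append((sentence, fields, keywords))
    return hits


text = "公司A发生债务违约的公告。今天天气很好。公司B业绩增长。"
print(target_event_sentences(text, {"债务违约", "债务纠纷"}, {"公司A", "公司B"}))
```

Only the first sentence qualifies here: the third mentions a known subject but contains no event keyword.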
Step S105: match the target event sentence against a preset regular expression.
If the match succeeds, step S106 is performed.
Step S106: determine the matching field in the target event sentence as the event subject corresponding to the target event sentence.
For example, the following regular expression can be used for matching:
“(.*)+keyword+(.*event_keyword)”
where event_keyword is the event keyword and keyword is the matching field.
If a target event sentence is:
“公司A发生债务违约的公告” (“Announcement of Company A's debt default”), the above regular expression matches it, so the event subject can be determined to be “Company A” (公司A); that is, it is Company A that has defaulted on its debt.
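The positive match of steps S105 and S106 can be sketched with Python's re module. The sketch assumes that the “+” signs in the pattern above denote concatenation of the pattern fragments with the matching field and the event keyword. The function name and the use of re.escape are illustrative choices.

```python
import re


def match_event_subject(sentence: str, keyword: str, event_keyword: str):
    """Return keyword if the matching field precedes the event keyword
    in the sentence, otherwise None."""
    # "(.*)" + keyword + "(.*event_keyword)", with both values escaped
    # so that special characters in names are matched literally.
    pattern = "(.*)" + re.escape(keyword) + "(.*" + re.escape(event_keyword) + ")"
    return keyword if re.search(pattern, sentence) else None


print(match_event_subject("公司A发生债务违约的公告", "公司A", "债务违约"))
```

The pattern requires the matching field to appear before the event keyword, so a sentence in which the company name follows the event keyword does not match.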
Note that the above is a positive match, which determines that a matching field is the event subject. Optionally, a negative match can also be performed, which determines that a matching field is not the event subject.
For example, the following regular expression can be used for negative matching:
“(.*:|:)+keyword+(.*关于)”
where 关于 means “regarding”.
If a target event sentence is:
“公司A:银行B关于本公司发生债务违约的公告” (“Company A: Bank B's announcement regarding this company's debt default”), the above regular expression matches it with “Bank B” (银行B) as the matching field, so it can be determined that Bank B is not the event subject corresponding to the target event sentence; that is, it is not Bank B that has defaulted on its debt.
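The negative match can be sketched in the same style, again assuming that “+” denotes concatenation. Accepting the full-width colon alongside the ASCII colon is an added assumption, and the function name is hypothetical.

```python
import re


def is_excluded_subject(sentence: str, keyword: str) -> bool:
    """Return True if keyword follows a colon and precedes 关于 ("regarding"),
    which marks it as the announcing party rather than the event subject."""
    # "(.*:|:)" + keyword + "(.*关于)"; the class [::] also accepts
    # the full-width colon (assumption).
    pattern = "(.*[::]|[::])" + re.escape(keyword) + "(.*关于)"
    return re.search(pattern, sentence) is not None


sentence = "公司A:银行B关于本公司发生债务违约的公告"
print(is_excluded_subject(sentence, "银行B"), is_excluded_subject(sentence, "公司A"))
```

In the example sentence only 银行B is preceded by a colon and followed by 关于, so only it is excluded as an event subject.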
In the above analysis of debt default events, this approach makes the extraction of debt default events more accurate and efficient. This benefits users, especially investment institutions such as banks, by enabling rapid early warning.
In summary, the embodiments of the present application expand the initial keyword with extended keywords, which yields broader search results, and use regular expressions to match event subjects automatically, which greatly improves the efficiency of analysis.
Corresponding to the event information analysis method described in the above embodiments, FIG. 5 shows a structural diagram of an embodiment of an event information analysis apparatus provided by an embodiment of the present application.
In this embodiment, an event information analysis apparatus may include:
an initial search module 501, configured to obtain, through a preset web search engine, initial search results corresponding to a preset initial keyword;
an extended keyword filtering module 502, configured to filter extended keywords out of the initial search results, an extended keyword being a word whose similarity to the initial keyword is greater than a preset similarity threshold;
an extended search module 503, configured to obtain, through the web search engine, extended search results corresponding to the extended keywords;
a target event sentence extraction module 504, configured to extract target event sentences from the initial search results and the extended search results, a target event sentence being a sentence that contains an event keyword and a preset matching field, and the event keyword being the initial keyword or an extended keyword;
a regular expression matching module 505, configured to match the target event sentence against a preset regular expression;
an event subject determination module 506, configured to determine, if the match succeeds, the matching field in the target event sentence as the event subject corresponding to the target event sentence.
Further, the event information analysis apparatus may also include:
a hash operation module, configured to perform a hash operation on the initial search result to obtain the hash value of the initial search result;
a hash value lookup module, configured to look up the hash value of the initial search result in a preset hash value set, the hash value set recording the hash values of the web pages already stored in a preset storage medium;
a search result storage module, configured to, if the hash value of the initial search result is not found in the hash value set, add the hash value of the initial search result to the hash value set and store the initial search result in the storage medium;
a search result discarding module, configured to discard the initial search result if its hash value is found in the hash value set.
Further, the hash operation module may include:
a digest content obtaining unit, configured to obtain the digest content of the initial search result according to the following formula:
SubContent=Head(PageContent)∪Tail(PageContent)
where PageContent is the body text of the web page in the initial search result, Head(PageContent) is the first M characters of that body text, Tail(PageContent) is the last N characters of that body text, M and N are both integers greater than 1, and SubContent is the digest content of the initial search result;
a hash value calculation unit, configured to calculate the hash value of the initial search result according to the following formula:
Key=Hash(SubContent)=Hash[Head(PageContent)∪Tail(PageContent)]
where Hash is a preset hash function and Key is the hash value of the initial search result.
Further, the extended keyword filtering module may include:
a literal overlap calculation unit, configured to calculate the literal overlap between each word in the initial search results and the initial keyword according to the following formula:
LitOverlap(w, kw)=SameChar(w, kw)/MaxChar(w, kw)
where w is any word in the initial search results, kw is the initial keyword, SameChar(w, kw) is the number of characters that w and kw have in common, MaxChar(w, kw) is the larger of the number of characters in w and the number of characters in kw, and LitOverlap(w, kw) is the literal overlap between w and kw (kw, LitOverlap, SameChar and MaxChar are stand-in names for the symbols in formula images PCTCN2018093346-appb-000021 to -000028);
a search overlap calculation unit, configured to calculate the search overlap between each word in the initial search results and the initial keyword according to the following formula:
SearchOverlap(w, kw)=SamePage(w, kw)/MaxPage(w, kw)
where w is any word in the initial search results, kw is the initial keyword, SamePage(w, kw) is the number of pages in the initial search results on which w and kw both appear, MaxPage(w, kw) is the larger of the number of pages on which w appears and the number of pages on which kw appears in the initial search results, and SearchOverlap(w, kw) is the search overlap between w and kw (kw, SearchOverlap, SamePage and MaxPage are stand-in names for the symbols in formula images PCTCN2018093346-appb-000029 to -000035);
a similarity calculation unit, configured to calculate the similarity between each word in the initial search results and the initial keyword according to the following formula:
Sim(w, kw)=k1×LitOverlap(w, kw)+k2×SearchOverlap(w, kw)
where w is any word in the initial search results, kw is the initial keyword, LitOverlap(w, kw) is the literal overlap between w and kw, SearchOverlap(w, kw) is the search overlap between w and kw, k1 and k2 are preset weight coefficients with k1+k2=1, and Sim(w, kw) is the similarity between w and kw (kw, Sim, LitOverlap and SearchOverlap are stand-in names for the symbols in formula images PCTCN2018093346-appb-000036 to -000038);
an extended keyword determination unit, configured to determine the words whose similarity to the initial keyword is greater than the similarity threshold as the extended keywords.
Further, the extended search module may include:
an importance score calculation unit, configured to calculate the importance score of each extended keyword according to the following formula:
[Formula image PCTCN2018093346-appb-000039: definition of Score(ew)]
where ew is any one of the extended keywords, freq(ew) is the number of times ew occurs in the initial search results, Freq(ew) is the number of times ew occurs in a preset sample corpus, ExWord is the set of all extended keywords, max[Freq(ExWord)] is the maximum of the frequencies with which the extended keywords occur in the sample corpus, ln is the natural logarithm function, and Score(ew) is the importance score of ew;
a retained-number calculation unit, configured to calculate, for each extended keyword, the number of corresponding extended search results to retain according to the following formula:
[Formula image PCTCN2018093346-appb-000040: definition of ExPageNum(ew)]
where ew_s is the extended keyword with index s, 1≤s≤S, S is the number of extended keywords, α is a preset proportionality coefficient, PageNum is the preset number of initial search results, and ExPageNum(ew) is the number of extended search results to retain for ew;
an extended search result obtaining unit, configured to obtain the extended search results corresponding to each extended keyword according to its retained number.
FIG. 6 shows a schematic block diagram of an event information analysis terminal device provided by an embodiment of the present application. For ease of description, only the parts related to the embodiment of the present application are shown.
In this embodiment, the event information analysis terminal device 6 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer or a cloud server. The event information analysis terminal device 6 may include a processor 60, a memory 61, and computer readable instructions 62 that are stored in the memory 61 and executable on the processor 60, for example computer readable instructions for performing the event information analysis method described above. When the processor 60 executes the computer readable instructions 62, it implements the steps of the event information analysis method embodiments above, for example steps S101 to S106 shown in FIG. 1. Alternatively, when the processor 60 executes the computer readable instructions 62, it implements the functions of the modules and units in the apparatus embodiments above, for example the functions of modules 501 to 506 shown in FIG. 5.
If the functional units in the embodiments of the present application are implemented as software functional units and sold or used as independent products, they may be stored in a computer readable storage medium. On this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, or all or part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes a number of computer readable instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or part of the steps of the methods described in the embodiments of the present application.

Claims (20)

  1. An event information analysis method, characterized by comprising:
    obtaining, through a preset web search engine, initial search results corresponding to a preset initial keyword;
    filtering extended keywords out of the initial search results, an extended keyword being a word whose similarity to the initial keyword is greater than a preset similarity threshold;
    obtaining, through the web search engine, extended search results corresponding to the extended keywords;
    extracting target event sentences from the initial search results and the extended search results, a target event sentence being a sentence that contains an event keyword and a preset matching field, and the event keyword being the initial keyword or an extended keyword;
    matching the target event sentence against a preset regular expression;
    if the match succeeds, determining the matching field in the target event sentence as the event subject corresponding to the target event sentence.
  2. The event information analysis method according to claim 1, characterized in that, after obtaining, through the preset web search engine, the initial search results corresponding to the preset initial keyword, the method further comprises:
    performing a hash operation on the initial search result to obtain the hash value of the initial search result;
    looking up the hash value of the initial search result in a preset hash value set, the hash value set recording the hash values of the web pages already stored in a preset storage medium;
    if the hash value of the initial search result is not found in the hash value set, adding the hash value of the initial search result to the hash value set and storing the initial search result in the storage medium;
    if the hash value of the initial search result is found in the hash value set, discarding the initial search result.
  3. The event information analysis method according to claim 2, characterized in that performing the hash operation on the initial search result to obtain the hash value of the initial search result comprises:
    obtaining the digest content of the initial search result according to the following formula:
    SubContent=Head(PageContent)∪Tail(PageContent)
    where PageContent is the body text of the web page in the initial search result, Head(PageContent) is the first M characters of that body text, Tail(PageContent) is the last N characters of that body text, M and N are both integers greater than 1, and SubContent is the digest content of the initial search result;
    calculating the hash value of the initial search result according to the following formula:
    Key=Hash(SubContent)=Hash[Head(PageContent)∪Tail(PageContent)]
    where Hash is a preset hash function and Key is the hash value of the initial search result.
  4. The event information analysis method according to claim 1, characterized in that filtering the extended keywords out of the initial search results comprises:
    calculating the literal overlap between each word in the initial search results and the initial keyword according to the following formula:
    LitOverlap(w, kw)=SameChar(w, kw)/MaxChar(w, kw)
    where w is any word in the initial search results, kw is the initial keyword, SameChar(w, kw) is the number of characters that w and kw have in common, MaxChar(w, kw) is the larger of the number of characters in w and the number of characters in kw, and LitOverlap(w, kw) is the literal overlap between w and kw (kw, LitOverlap, SameChar and MaxChar are stand-in names for the symbols in formula images PCTCN2018093346-appb-100001 to -100008);
    根据下式分别计算所述初始搜索结果中的各个词语与所述初始关键词之间的搜索重叠度:Calculating the search overlap between each word in the initial search result and the initial keyword according to the following formula:
    Figure PCTCN2018093346-appb-100009
    Figure PCTCN2018093346-appb-100009
    其中,
    Figure PCTCN2018093346-appb-100010
    为在所述初始搜索结果中w与
    Figure PCTCN2018093346-appb-100011
    共同出现的页面数,
    Figure PCTCN2018093346-appb-100012
    为在所述初始搜索结果中w出现的页面数和
    Figure PCTCN2018093346-appb-100013
    出现的页面数的最大值,
    Figure PCTCN2018093346-appb-100014
    为w与
    Figure PCTCN2018093346-appb-100015
    之间的搜索重叠度;
    among them,
    Figure PCTCN2018093346-appb-100010
    For the initial search results in w and
    Figure PCTCN2018093346-appb-100011
    The number of pages that appear together,
    Figure PCTCN2018093346-appb-100012
    The number of pages that appear in the initial search results and
    Figure PCTCN2018093346-appb-100013
    The maximum number of pages that appear,
    Figure PCTCN2018093346-appb-100014
    For w with
    Figure PCTCN2018093346-appb-100015
    Search overlap between;
    根据下式分别计算所述初始搜索结果中的各个词语与所述初始关键词之间的相似度:Calculating the similarity between each word in the initial search result and the initial keyword according to the following formula:
    Figure PCTCN2018093346-appb-100016
    Figure PCTCN2018093346-appb-100016
    其中,k 1、k 2均为预设的权重系数,且k 1+k 2=1,
    Figure PCTCN2018093346-appb-100017
    为w与
    Figure PCTCN2018093346-appb-100018
    之间的相似度;
    Where k 1 and k 2 are preset weight coefficients, and k 1 +k 2 =1,
    Figure PCTCN2018093346-appb-100017
    For w with
    Figure PCTCN2018093346-appb-100018
    Similarity between
    将与所述初始关键词的相似度大于所述相似度阈值的词语确定为所述扩展关键词。A word having a degree of similarity to the initial keyword greater than the similarity threshold is determined as the extended keyword.
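The keyword-expansion step of this claim can be illustrated with the sketch below. The formulas themselves are images that are not reproduced in this text, so the code rests on assumptions drawn from the symbol definitions: each overlap is taken as the shared count divided by the stated maximum, and the similarity as the weighted sum k1 × literal overlap + k2 × search overlap. Counting shared characters as a multiset intersection, representing each page as a set of already segmented words, and all function names are likewise hypothetical choices.

```python
from collections import Counter


def literal_overlap(w: str, w0: str) -> float:
    """Shared characters divided by the longer word's length (assumed ratio)."""
    shared = sum((Counter(w) & Counter(w0)).values())
    return shared / max(len(w), len(w0))


def search_overlap(w: str, w0: str, pages: list) -> float:
    """Co-occurrence pages divided by the larger per-word page count (assumed ratio).

    Each page is assumed to be a set of segmented words.
    """
    pages_w = sum(1 for p in pages if w in p)
    pages_w0 = sum(1 for p in pages if w0 in p)
    both = sum(1 for p in pages if w in p and w0 in p)
    denom = max(pages_w, pages_w0)
    return both / denom if denom else 0.0


def similarity(w, w0, pages, k1=0.5, k2=0.5):
    """Weighted sum of the two overlaps, with k1 + k2 = 1 (assumed form)."""
    return k1 * literal_overlap(w, w0) + k2 * search_overlap(w, w0, pages)


def extended_keywords(candidates, w0, pages, threshold, k1=0.5, k2=0.5):
    """Keep candidates whose similarity to the initial keyword exceeds the threshold."""
    return [w for w in candidates
            if w != w0 and similarity(w, w0, pages, k1, k2) > threshold]
```

Combining a character-level measure with a co-occurrence measure lets a candidate qualify either because it looks like the initial keyword or because it tends to appear on the same pages.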
  5. The event information analysis method according to any one of claims 1 to 4, wherein obtaining, through the web search engine, the extended search results corresponding to the extended keywords comprises:
    calculating an importance score for each of the extended keywords according to the following formula:
    Figure PCTCN2018093346-appb-100019
    where ew is any one of the extended keywords, freq(ew) is the frequency with which ew appears in the initial search result, Freq(ew) is the frequency with which ew appears in a preset sample corpus, ExWord is the set of all the extended keywords, max[Freq(ExWord)] is the maximum of the frequencies with which the extended keywords appear in the sample corpus, ln is the natural logarithm function, and Score(ew) is the importance score of ew;
    calculating, according to the following formula, the number of extended search results to be taken for each of the extended keywords:
    Figure PCTCN2018093346-appb-100020
    where ew_s is the extended keyword with index s, 1≤s≤S, S is the number of extended keywords, α is a preset proportionality coefficient, PageNum is the preset number of initial search results, and ExPageNum(ew) is the number of extended search results to be taken for ew;
    obtaining the extended search results corresponding to each of the extended keywords according to the respective numbers to be taken.
  6. A computer readable storage medium storing computer readable instructions, wherein the computer readable instructions, when executed by a processor, implement the following steps:
    obtaining, through a preset web search engine, an initial search result corresponding to a preset initial keyword;
    filtering out extended keywords from the initial search result, an extended keyword being a word whose similarity to the initial keyword is greater than a preset similarity threshold;
    obtaining, through the web search engine, extended search results corresponding to the extended keywords;
    extracting target event sentences from the initial search result and the extended search results, a target event sentence being a sentence that contains an event keyword and a preset matching field, the event keyword being the initial keyword or one of the extended keywords;
    matching the target event sentence against a preset regular expression;
    if the match succeeds, determining the matching field in the target event sentence as the event subject corresponding to the target event sentence.
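The last three steps of this claim (sentence extraction, regular-expression matching, and event subject determination) can be sketched as below. The claim does not disclose the preset regular expression or the form of the matching field, so the pattern here is purely hypothetical: it treats a run of capitalised words ending in "Group", "Corp" or "Ltd" as the matching field. The function name and the sentence-splitting rule are also assumptions.

```python
import re

# Hypothetical preset regular expression: the named group "subject" plays the
# role of the matching field (here, a company-like name).
SUBJECT_PATTERN = re.compile(
    r"(?P<subject>(?:[A-Z][A-Za-z]*\s)+(?:Group|Corp|Ltd))\b"
)


def extract_event_subjects(pages, event_keywords):
    """Return (subject, keyword, sentence) triples for matching sentences."""
    results = []
    for page in pages:
        # Naive sentence split on Western and Chinese end punctuation.
        for sentence in re.split(r"[.!?。!?]+", page):
            sentence = sentence.strip()
            # Keep only sentences containing an initial or extended keyword.
            keyword = next((k for k in event_keywords if k in sentence), None)
            if keyword is None:
                continue
            match = SUBJECT_PATTERN.search(sentence)
            if match:
                # Match succeeded: the matching field is the event subject.
                results.append((match.group("subject"), keyword, sentence))
    return results
```

For example, calling `extract_event_subjects(["Acme Steel Group is now in default."], ["default"])` yields one triple whose subject is `"Acme Steel Group"`; a sentence that mentions the keyword but no company-like name is skipped.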
  7. The computer readable storage medium according to claim 6, wherein, after the initial search result corresponding to the preset initial keyword is obtained through the preset web search engine, the steps further comprise:
    performing a hash operation on the initial search result to obtain a hash value of the initial search result;
    looking up the hash value of the initial search result in a preset hash value set, the hash value set being used to record the hash values of web pages already stored in a preset storage medium;
    if the hash value of the initial search result is not found in the hash value set, adding the hash value of the initial search result to the hash value set and storing the initial search result in the storage medium;
    if the hash value of the initial search result is found in the hash value set, discarding the initial search result.
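The store-or-discard logic of this claim amounts to set-based deduplication, sketched below. The class name `PageStore`, the in-memory list standing in for the storage medium, and the use of SHA-256 over the whole page text are assumptions made for illustration; the claim leaves the hash operation unspecified.

```python
import hashlib


class PageStore:
    """Keeps each distinct page once, tracking hashes of stored pages."""

    def __init__(self):
        self.seen_hashes = set()  # the preset hash value set
        self.pages = []           # stand-in for the preset storage medium

    def add(self, page_content: str) -> bool:
        """Store the page if its hash is new; return False if it was discarded."""
        key = hashlib.sha256(page_content.encode("utf-8")).hexdigest()
        if key in self.seen_hashes:
            return False          # hash found: discard the search result
        self.seen_hashes.add(key)
        self.pages.append(page_content)
        return True
```

A set gives constant-time average lookup, so checking each new search result does not slow down as the number of stored pages grows.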
  8. The computer readable storage medium according to claim 7, wherein performing the hash operation on the initial search result to obtain the hash value of the initial search result comprises:
    obtaining the summary content of the initial search result according to the following formula:
    SubContent=Head(PageContent)∪Tail(PageContent)
    where PageContent is the body text of the web page in the initial search result, Head(PageContent) is the first M characters of that body text, Tail(PageContent) is the last N characters of that body text, M and N are both integers greater than 1, and SubContent is the summary content of the initial search result;
    calculating the hash value of the initial search result according to the following formula:
    Key=Hash(SubContent)=Hash[Head(PageContent)∪Tail(PageContent)]
    where Hash is a preset hash function and Key is the hash value of the initial search result.
  9. The computer readable storage medium according to claim 6, wherein filtering out the extended keywords from the initial search result comprises:
    calculating the literal overlap between each word in the initial search result and the initial keyword according to the following formula:
    Figure PCTCN2018093346-appb-100021
    where w is any word in the initial search result, [Figure PCTCN2018093346-appb-100022] is the initial keyword, [Figure PCTCN2018093346-appb-100023] is the number of characters that w and [Figure PCTCN2018093346-appb-100024] have in common, [Figure PCTCN2018093346-appb-100025] is the larger of the number of characters in w and the number of characters in [Figure PCTCN2018093346-appb-100026], and [Figure PCTCN2018093346-appb-100027] is the literal overlap between w and [Figure PCTCN2018093346-appb-100028];
    calculating the search overlap between each word in the initial search result and the initial keyword according to the following formula:
    where [Figure PCTCN2018093346-appb-100030] is the number of pages in the initial search result on which w and [Figure PCTCN2018093346-appb-100031] appear together, [Figure PCTCN2018093346-appb-100032] is the larger of the number of pages in the initial search result on which w appears and the number of pages on which [Figure PCTCN2018093346-appb-100033] appears, and [Figure PCTCN2018093346-appb-100034] is the search overlap between w and [Figure PCTCN2018093346-appb-100035];
    calculating the similarity between each word in the initial search result and the initial keyword according to the following formula:
    Figure PCTCN2018093346-appb-100036
    where k1 and k2 are both preset weight coefficients with k1+k2=1, and [Figure PCTCN2018093346-appb-100037] is the similarity between w and [Figure PCTCN2018093346-appb-100038];
    determining the words whose similarity to the initial keyword is greater than the similarity threshold as the extended keywords.
  10. The computer readable storage medium according to any one of claims 6 to 9, wherein obtaining, through the web search engine, the extended search results corresponding to the extended keywords comprises:
    calculating an importance score for each of the extended keywords according to the following formula:
    Figure PCTCN2018093346-appb-100039
    where ew is any one of the extended keywords, freq(ew) is the frequency with which ew appears in the initial search result, Freq(ew) is the frequency with which ew appears in a preset sample corpus, ExWord is the set of all the extended keywords, max[Freq(ExWord)] is the maximum of the frequencies with which the extended keywords appear in the sample corpus, ln is the natural logarithm function, and Score(ew) is the importance score of ew;
    calculating, according to the following formula, the number of extended search results to be taken for each of the extended keywords:
    Figure PCTCN2018093346-appb-100040
    where ew_s is the extended keyword with index s, 1≤s≤S, S is the number of extended keywords, α is a preset proportionality coefficient, PageNum is the preset number of initial search results, and ExPageNum(ew) is the number of extended search results to be taken for ew;
    obtaining the extended search results corresponding to each of the extended keywords according to the respective numbers to be taken.
  11. An event information analysis terminal device, comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer readable instructions, implements the following steps:
    obtaining, through a preset web search engine, an initial search result corresponding to a preset initial keyword;
    filtering out extended keywords from the initial search result, an extended keyword being a word whose similarity to the initial keyword is greater than a preset similarity threshold;
    obtaining, through the web search engine, extended search results corresponding to the extended keywords;
    extracting target event sentences from the initial search result and the extended search results, a target event sentence being a sentence that contains an event keyword and a preset matching field, the event keyword being the initial keyword or one of the extended keywords;
    matching the target event sentence against a preset regular expression;
    if the match succeeds, determining the matching field in the target event sentence as the event subject corresponding to the target event sentence.
  12. The event information analysis terminal device according to claim 11, wherein, after the initial search result corresponding to the preset initial keyword is obtained through the preset web search engine, the steps further comprise:
    performing a hash operation on the initial search result to obtain a hash value of the initial search result;
    looking up the hash value of the initial search result in a preset hash value set, the hash value set being used to record the hash values of web pages already stored in a preset storage medium;
    if the hash value of the initial search result is not found in the hash value set, adding the hash value of the initial search result to the hash value set and storing the initial search result in the storage medium;
    if the hash value of the initial search result is found in the hash value set, discarding the initial search result.
  13. The event information analysis terminal device according to claim 12, wherein performing the hash operation on the initial search result to obtain the hash value of the initial search result comprises:
    obtaining the summary content of the initial search result according to the following formula:
    SubContent=Head(PageContent)∪Tail(PageContent)
    where PageContent is the body text of the web page in the initial search result, Head(PageContent) is the first M characters of that body text, Tail(PageContent) is the last N characters of that body text, M and N are both integers greater than 1, and SubContent is the summary content of the initial search result;
    calculating the hash value of the initial search result according to the following formula:
    Key=Hash(SubContent)=Hash[Head(PageContent)∪Tail(PageContent)]
    where Hash is a preset hash function and Key is the hash value of the initial search result.
  14. The event information analysis terminal device according to claim 11, wherein filtering out the extended keywords from the initial search result comprises:
    calculating the literal overlap between each word in the initial search result and the initial keyword according to the following formula:
    Figure PCTCN2018093346-appb-100041
    where w is any word in the initial search result, [Figure PCTCN2018093346-appb-100042] is the initial keyword, [Figure PCTCN2018093346-appb-100043] is the number of characters that w and [Figure PCTCN2018093346-appb-100044] have in common, [Figure PCTCN2018093346-appb-100045] is the larger of the number of characters in w and the number of characters in [Figure PCTCN2018093346-appb-100046], and [Figure PCTCN2018093346-appb-100047] is the literal overlap between w and [Figure PCTCN2018093346-appb-100048];
    calculating the search overlap between each word in the initial search result and the initial keyword according to the following formula:
    Figure PCTCN2018093346-appb-100049
    where [Figure PCTCN2018093346-appb-100050] is the number of pages in the initial search result on which w and [Figure PCTCN2018093346-appb-100051] appear together, [Figure PCTCN2018093346-appb-100052] is the larger of the number of pages in the initial search result on which w appears and the number of pages on which [Figure PCTCN2018093346-appb-100053] appears, and [Figure PCTCN2018093346-appb-100054] is the search overlap between w and [Figure PCTCN2018093346-appb-100055];
    calculating the similarity between each word in the initial search result and the initial keyword according to the following formula:
    Figure PCTCN2018093346-appb-100056
    where k1 and k2 are both preset weight coefficients with k1+k2=1, and [Figure PCTCN2018093346-appb-100057] is the similarity between w and [Figure PCTCN2018093346-appb-100058];
    determining the words whose similarity to the initial keyword is greater than the similarity threshold as the extended keywords.
  15. The event information analysis terminal device according to any one of claims 11 to 14, wherein obtaining, through the web search engine, the extended search results corresponding to the extended keywords comprises:
    calculating an importance score for each of the extended keywords according to the following formula:
    Figure PCTCN2018093346-appb-100059
    where ew is any one of the extended keywords, freq(ew) is the frequency with which ew appears in the initial search result, Freq(ew) is the frequency with which ew appears in a preset sample corpus, ExWord is the set of all the extended keywords, max[Freq(ExWord)] is the maximum of the frequencies with which the extended keywords appear in the sample corpus, ln is the natural logarithm function, and Score(ew) is the importance score of ew;
    calculating, according to the following formula, the number of extended search results to be taken for each of the extended keywords:
    Figure PCTCN2018093346-appb-100060
    where ew_s is the extended keyword with index s, 1≤s≤S, S is the number of extended keywords, α is a preset proportionality coefficient, PageNum is the preset number of initial search results, and ExPageNum(ew) is the number of extended search results to be taken for ew;
    obtaining the extended search results corresponding to each of the extended keywords according to the respective numbers to be taken.
  16. An event information analysis apparatus, comprising:
    an initial search module, configured to obtain, through a preset web search engine, an initial search result corresponding to a preset initial keyword;
    an extended keyword filtering module, configured to filter out extended keywords from the initial search result, an extended keyword being a word whose similarity to the initial keyword is greater than a preset similarity threshold;
    an extended search module, configured to obtain, through the web search engine, extended search results corresponding to the extended keywords;
    a target event sentence extraction module, configured to extract target event sentences from the initial search result and the extended search results, a target event sentence being a sentence that contains an event keyword and a preset matching field, the event keyword being the initial keyword or one of the extended keywords;
    a regular expression matching module, configured to match the target event sentence against a preset regular expression;
    an event subject determining module, configured to, if the match succeeds, determine the matching field in the target event sentence as the event subject corresponding to the target event sentence.
  17. The event information analysis apparatus according to claim 16, further comprising:
    a hash operation module, configured to perform a hash operation on the initial search result to obtain a hash value of the initial search result;
    a hash value lookup module, configured to look up the hash value of the initial search result in a preset hash value set, the hash value set being used to record the hash values of web pages already stored in a preset storage medium;
    a search result storage module, configured to, if the hash value of the initial search result is not found in the hash value set, add the hash value of the initial search result to the hash value set and store the initial search result in the storage medium;
    a search result discarding module, configured to discard the initial search result if the hash value of the initial search result is found in the hash value set.
  18. The event information analysis apparatus according to claim 17, wherein the hash operation module comprises:
    a summary content obtaining unit, configured to obtain the summary content of the initial search result according to the following formula:
    SubContent=Head(PageContent)∪Tail(PageContent)
    where PageContent is the body text of the web page in the initial search result, Head(PageContent) is the first M characters of that body text, Tail(PageContent) is the last N characters of that body text, M and N are both integers greater than 1, and SubContent is the summary content of the initial search result;
    a hash value calculation unit, configured to calculate the hash value of the initial search result according to the following formula:
    Key=Hash(SubContent)=Hash[Head(PageContent)∪Tail(PageContent)]
    where Hash is a preset hash function and Key is the hash value of the initial search result.
  19. The event information analysis apparatus according to claim 16, wherein the extended keyword filtering module comprises:
    a literal overlap calculation unit, configured to calculate the literal overlap between each word in the initial search result and the initial keyword according to the following formula:
    Figure PCTCN2018093346-appb-100061
    where w is any word in the initial search result, [Figure PCTCN2018093346-appb-100062] is the initial keyword, [Figure PCTCN2018093346-appb-100063] is the number of characters that w and [Figure PCTCN2018093346-appb-100064] have in common, [Figure PCTCN2018093346-appb-100065] is the larger of the number of characters in w and the number of characters in [Figure PCTCN2018093346-appb-100066], and [Figure PCTCN2018093346-appb-100067] is the literal overlap between w and [Figure PCTCN2018093346-appb-100068];
    a search overlap calculation unit, configured to calculate the search overlap between each word in the initial search result and the initial keyword according to the following formula:
    Figure PCTCN2018093346-appb-100069
    where [Figure PCTCN2018093346-appb-100070] is the number of pages in the initial search result on which w and [Figure PCTCN2018093346-appb-100071] appear together, [Figure PCTCN2018093346-appb-100072] is the larger of the number of pages in the initial search result on which w appears and the number of pages on which [Figure PCTCN2018093346-appb-100073] appears, and [Figure PCTCN2018093346-appb-100074] is the search overlap between w and [Figure PCTCN2018093346-appb-100075];
    a similarity calculation unit, configured to calculate the similarity between each word in the initial search result and the initial keyword according to the following formula:
    Figure PCTCN2018093346-appb-100076
    where k1 and k2 are both preset weight coefficients with k1+k2=1, and [Figure PCTCN2018093346-appb-100077] is the similarity between w and [Figure PCTCN2018093346-appb-100078];
    an extended keyword determining unit, configured to determine the words whose similarity to the initial keyword is greater than the similarity threshold as the extended keywords.
  20. The event information analysis apparatus according to any one of claims 16 to 19, wherein the extended search module may comprise:
    an importance score calculation unit, configured to calculate an importance score for each of the extended keywords according to the following formula:
    Figure PCTCN2018093346-appb-100079
    where ew is any one of the extended keywords, freq(ew) is the frequency with which ew appears in the initial search result, Freq(ew) is the frequency with which ew appears in a preset sample corpus, ExWord is the set of all the extended keywords, max[Freq(ExWord)] is the maximum of the frequencies with which the extended keywords appear in the sample corpus, ln is the natural logarithm function, and Score(ew) is the importance score of ew;
    a number calculation unit, configured to calculate, according to the following formula, the number of extended search results to be taken for each of the extended keywords:
    Figure PCTCN2018093346-appb-100080
    where ew_s is the extended keyword with index s, 1≤s≤S, S is the number of extended keywords, α is a preset proportionality coefficient, PageNum is the preset number of initial search results, and ExPageNum(ew) is the number of extended search results to be taken for ew;
    an extended search result obtaining unit, configured to obtain the extended search results corresponding to each of the extended keywords according to the respective numbers to be taken.
PCT/CN2018/093346 2018-04-08 2018-06-28 Event information analysis method, readable storage medium, terminal device and apparatus WO2019196209A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810305412.4 2018-04-08
CN201810305412.4A CN108763272B (en) 2018-04-08 2018-04-08 A kind of event information analysis method, computer readable storage medium and terminal device

Publications (1)

Publication Number Publication Date
WO2019196209A1 true WO2019196209A1 (en) 2019-10-17

Family

ID=63981090

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/093346 WO2019196209A1 (en) 2018-04-08 2018-06-28 Event information analysis method, readable storage medium, terminal device and apparatus

Country Status (2)

Country Link
CN (1) CN108763272B (en)
WO (1) WO2019196209A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468321A (en) * 2021-09-01 2021-10-01 江苏金陵科技集团有限公司 Event aggregation analysis method and system based on big data

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763272B (en) * 2018-04-08 2019-09-17 平安科技(深圳)有限公司 A kind of event information analysis method, computer readable storage medium and terminal device
CN110458296B (en) * 2019-08-02 2023-08-29 腾讯科技(深圳)有限公司 Method and device for marking target event, storage medium and electronic device
CN111309299A (en) * 2020-01-15 2020-06-19 珠海格力智能装备有限公司 Industrial robot language processing method and device, storage medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017099454A1 (en) * 2015-12-08 2017-06-15 전자부품연구원 Keyword search method on basis of mind map and apparatus therefor
CN107229624A (en) * 2016-03-23 2017-10-03 百度在线网络技术(北京)有限公司 A kind of page provides method and the page provides device
CN107330111A (en) * 2017-07-07 2017-11-07 长沙沃本智能科技有限公司 The search method and device of domain body based on common version body
CN107590169A (en) * 2017-04-14 2018-01-16 南方科技大学 A kind of preprocess method and system of carrier gateway data
CN108763272A (en) * 2018-04-08 2018-11-06 平安科技(深圳)有限公司 A kind of event information analysis method, computer readable storage medium and terminal device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273404A (en) * 2017-04-26 2017-10-20 努比亚技术有限公司 Appraisal procedure, device and the computer-readable recording medium of search engine

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468321A (en) * 2021-09-01 2021-10-01 江苏金陵科技集团有限公司 Event aggregation analysis method and system based on big data
CN113468321B (en) * 2021-09-01 2022-01-04 江苏金陵科技集团有限公司 Event aggregation analysis method and system based on big data

Also Published As

Publication number Publication date
CN108763272A (en) 2018-11-06
CN108763272B (en) 2019-09-17

Similar Documents

Publication Publication Date Title
WO2019196209A1 (en) Event information analysis method, readable storage medium, terminal device and apparatus
Ramakrishna et al. Linguistic analysis of differences in portrayal of movie characters
US9106698B2 (en) Method and server for intelligent categorization of bookmarks
WO2021159613A1 (en) Text semantic similarity analysis method and apparatus, and computer device
US9558263B2 (en) Identifying and displaying relationships between candidate answers
US9311823B2 (en) Caching natural language questions and results in a question and answer system
KR102080362B1 (en) Query expansion
WO2020215667A1 (en) Text content quick duplicate removal method and apparatus, computer device, and storage medium
US9507867B2 (en) Discovery engine
US20150356091A1 (en) Method and system for identifying microblog user identity
WO2014000508A1 (en) Duplicated web page deletion method and device
WO2008014702A1 (en) Method and system of extracting new words
WO2014206241A1 (en) Document similarity calculation method, and method and device for detecting approximately duplicate documents
CN111460153A (en) Hot topic extraction method and device, terminal device and storage medium
WO2020248379A1 (en) Method for searching for similar network pages, and apparatus
WO2017096777A1 (en) Document normalization method, document searching method, corresponding apparatuses, device, and storage medium
WO2015085805A1 (en) Method and apparatus for determining core word of image cluster description text
WO2019218452A1 (en) Method, computer readable storage medium, terminal apparatus, and device for analyzing trending terms
Bar-Yossef et al. Efficient search engine measurements
Vetriselvi et al. RETRACTED ARTICLE: An improved key term weightage algorithm for text summarization using local context information and fuzzy graph sentence score
Hansen et al. Comparing open source search engine functionality, efficiency and effectiveness with respect to digital forensic search
CN111930949A (en) Search string processing method and device, computer readable medium and electronic equipment
CN109344397B (en) Text feature word extraction method and device, storage medium and program product
US10380195B1 (en) Grouping documents by content similarity
TWI636370B (en) Establishing chart indexing method and computer program product by text information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18914577

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 21/01/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18914577

Country of ref document: EP

Kind code of ref document: A1