WO2013107297A1 - Information aggregation method and device - Google Patents

Information aggregation method and device

Info

Publication number
WO2013107297A1
WO2013107297A1 PCT/CN2013/070051 CN2013070051W WO2013107297A1 WO 2013107297 A1 WO2013107297 A1 WO 2013107297A1 CN 2013070051 W CN2013070051 W CN 2013070051W WO 2013107297 A1 WO2013107297 A1 WO 2013107297A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
relationship
amount
distance
different amounts
Prior art date
Application number
PCT/CN2013/070051
Other languages
French (fr)
Chinese (zh)
Inventor
刘冰
Original Assignee
华为终端有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为终端有限公司
Publication of WO2013107297A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution

Definitions

  • The present invention relates to the field of information processing technologies, and in particular to an information aggregation method and apparatus.
  • Information aggregation combines different pieces of intrinsically linked information into one structure. For example, if a person name, a phone number, and an email address all belong to the same person, they can be combined into one larger block of information, the structure (person name, phone number, email address).
  • With information aggregation technology, users can be offered a one-stop personalized service that draws on multiple information sources. For example, a terminal device monitors the user's mail or short messages, automatically extracts information of interest (such as contact details and event information), generates a calendar event, a transaction reminder event, or an address book contact, and stores the information in the corresponding location, such as the calendar, the transaction reminder, or the contact list, to help the user process information and work more efficiently.
  • Information aggregation is a necessary prerequisite for information extraction, and aggregating information against a quantifiable standard is its core task. The choice of metric affects the quality of the aggregation and therefore the final result of information extraction.
  • Grammatical structure analysis applies grammatical principles to combine information according to its grammatical components.
  • Taking Chinese grammar as an example, the sentence components are subject, predicate, object, attributive, adverbial, and complement.
  • Each component places requirements on lexical attributes.
  • A noun can act as a subject.
  • A verb can act as a predicate.
  • An adjective modifies a noun, and so on.
  • Based on the attributes of the words, the sentence components can be aggregated.
  • However, the complexity of sentences and the diversity of components make grammatical structure analysis difficult to quantify.
  • The embodiments of the present invention provide an information aggregation method and apparatus for improving the accuracy of information aggregation.
  • An information aggregation method, including:
  • determining related information of information amounts in a file; calculating the distance between different information amounts according to the related information; and aggregating the different information amounts according to the calculated distances.
  • An information aggregation device, including:
  • an information determining unit, configured to determine related information of information amounts in a file;
  • a calculating unit, configured to calculate the distance between different information amounts according to the related information;
  • an aggregation unit, configured to aggregate the different information amounts according to the calculated distances.
  • The information aggregation method and apparatus determine the related information of the information amounts in the file and calculate the distance between different information amounts according to that related information, thereby quantizing the distance between different information amounts in the file. The quantized distance is then used to aggregate the different information amounts, which effectively improves the accuracy of information aggregation.
  • FIG. 1 is a flowchart of an information aggregation method according to an embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of an information aggregation apparatus according to an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of an aggregation unit in an embodiment of the present invention.
  • The information aggregation method and device of the embodiments determine the related information of the information amounts in a file and calculate the distance between different information amounts according to that related information, thereby quantizing the distance between different information amounts in the file. Aggregating the information amounts by the quantized distance effectively improves the accuracy of information aggregation.
  • The information aggregation method in the embodiments of the present invention can be applied to a terminal device or a server.
  • For example, the terminal device monitors the user's mail or short messages and automatically aggregates the information in them that the user pays attention to.
  • Step 101: Determine the related information of the information amounts in the file.
  • An information amount is a piece of information that the user pays attention to, for example a person name, a phone number, an email address, a conference topic, a meeting place, or the meeting content.
  • Each information amount consists of one or more character strings, and each information amount has its own related information.
  • This step may determine the related information of the different information amounts in the file. In other words, it obtains the related information corresponding to each piece of information in the file that the user pays attention to.
  • The file may be a mail or a short message of the user, or another kind of file; this is not limited in the embodiments of the present invention.
  • The file may be a mail or short message that the terminal device has just received, or one that is already stored on the terminal device; this is not limited in the embodiments of the present invention.
  • The related information may be location information, for example the paragraph position, the start position, and the end position of the information amount in the file.
  • The paragraph position indicates which natural paragraph of the file contains the information amount and is a constant; the start position and the end position indicate the position of the information amount within the sentence in the file.
  • For example, if the information amount is in the first paragraph of the file, the paragraph position is 1; if it is in the second paragraph, the paragraph position is 2; and so on.
  • Suppose the file has the following content: "Xiao Ming went to Beijing for a business trip today, his phone number is 12345678."
  • The information amounts that need attention are: Xiao Ming, he, phone, and 12345678.
  • The related information may also include other information, such as the grammatical attribute of the information amount.
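The location information described above can be sketched in Python as follows. The `Span` type, the `locate` helper, and the 1-based inclusive character counting are illustrative assumptions, not taken from the patent. The sketch uses the English rendering of the example sentence, so the pronoun appears as "his" and the positions would differ from those in the original Chinese text.

```python
from typing import NamedTuple, Optional


class Span(NamedTuple):
    """Location of one information amount: (paragraph, start, end), all 1-based."""
    paragraph: int
    start: int
    end: int


def locate(paragraphs: list, term: str) -> Optional[Span]:
    """Return the location of the first occurrence of `term`, or None if absent."""
    for number, text in enumerate(paragraphs, start=1):
        index = text.find(term)
        if index != -1:
            # Convert the 0-based index into 1-based, inclusive positions.
            return Span(number, index + 1, index + len(term))
    return None


# Hypothetical single-paragraph file holding the example sentence.
paragraphs = [
    "Xiao Ming went to Beijing for a business trip today, "
    "his phone number is 12345678."
]
spans = {term: locate(paragraphs, term)
         for term in ("Xiao Ming", "his", "phone", "12345678")}
```

Each entry of `spans` is a (paragraph position, start position, end position) triple of the kind the later steps consume.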
  • Step 102: Calculate the distance between different information amounts according to the related information.
  • A label value may be calculated for each information amount from its related information, yielding the label values of the different information amounts.
  • Because each information amount has its own related information, calculating the label value gives each information amount a corresponding label value, and the distance between different information amounts is then calculated from these label values.
  • In the following description, the related information includes only the location information, and the content of the file above is taken as an example.
  • L = paragraph position * label coefficient + (start position + end position) / 2    (1), where L is the label value of the information amount.
  • The label coefficient is included in formula (1) to ensure that the calculated label value of each information amount is unique.
  • The label coefficient can be the number of characters in the paragraph that contains the most characters among all paragraphs in the file.
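A minimal sketch of formula (1), assuming the label coefficient is the character count of the longest paragraph. The function names are hypothetical. The paragraph lengths (40 and 146) and the email location (2, 121, 124) are borrowed from the second example later in the text; the location (1, 1, 3) is made up for illustration.

```python
def label_coefficient(paragraphs: list) -> int:
    """Character count of the longest paragraph in the file."""
    return max(len(p) for p in paragraphs)


def label_value(paragraph: int, start: int, end: int, coefficient: int) -> float:
    """Formula (1): paragraph position * coefficient + midpoint of the span."""
    return paragraph * coefficient + (start + end) / 2


# Placeholder text: only the paragraph lengths (40 and 146 characters) matter.
paragraphs = ["a" * 40, "b" * 146]
coefficient = label_coefficient(paragraphs)

first = label_value(1, 1, 3, coefficient)        # hypothetical location (1, 1, 3)
second = label_value(2, 121, 124, coefficient)   # email address at (2, 121, 124)
```

With these inputs the coefficient is 146, so any label value in the second paragraph is larger than every label value in the first.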
  • If the label coefficient (max_size) is not the number of characters in the longest paragraph of the file but some other value, such as the number of characters in the current paragraph, the uniqueness of the label value cannot be guaranteed.
  • For example, the label value of an information amount in the first paragraph is 1 * 1000 + (start position + end position) / 2, so its value range is (1000, 2000): the start position of this paragraph is 1 and the end position is 1000, so the midpoint ranges over (1, 1000).
  • The label value of an information amount in the second paragraph is 2 * 500 + (start position + end position) / 2, so its value range is (1000, 1500): the start position of this paragraph is 1 and the end position is 500, so the midpoint ranges over (1, 500).
  • The range of label values for information amounts in the third paragraph is (1800, 2400). The range for the first paragraph therefore covers the range for the second paragraph, and the ranges for the first and third paragraphs overlap.
  • The label coefficient may also be a number greater than the number of characters in the longest paragraph of the file.
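The overlap described above can be checked numerically. The sketch below compares a per-paragraph coefficient with the file-wide maximum. The paragraph lengths 1000 and 500 come from the text; the third length, 600, is an assumption inferred from the stated range (1800, 2400). Function names are hypothetical.

```python
from itertools import combinations


def label_range(paragraph: int, length: int, coefficient: int) -> tuple:
    """Smallest and largest label value formula (1) can produce in one paragraph."""
    lowest = paragraph * coefficient + (1 + 1) / 2             # one character at position 1
    highest = paragraph * coefficient + (length + length) / 2  # one character at the end
    return lowest, highest


def overlaps(a: tuple, b: tuple) -> bool:
    """True if two closed ranges share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


def count_overlaps(ranges: list) -> int:
    """Number of paragraph pairs whose label ranges overlap."""
    return sum(overlaps(a, b) for a, b in combinations(ranges, 2))


lengths = [1000, 500, 600]  # third value is inferred, see the note above

# Coefficient = length of the current paragraph: ranges collide.
per_paragraph = [label_range(i, n, n) for i, n in enumerate(lengths, start=1)]
# Coefficient = length of the longest paragraph: ranges are disjoint.
global_max = [label_range(i, n, max(lengths)) for i, n in enumerate(lengths, start=1)]

collisions_per_paragraph = count_overlaps(per_paragraph)
collisions_global_max = count_overlaps(global_max)
```

Using the file-wide maximum separates the paragraphs because each paragraph's midpoints can never exceed one coefficient's worth of offset.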
  • Calculating the distance between different information amounts can be understood as calculating the distance between any two different information amounts in the file.
  • The absolute value of the difference between the label values of two information amounts may be taken as the distance between them; that is, the distance is calculated according to formula (2): d(A, B) = |L_A - L_B|    (2), where L_A and L_B are the label values of information amounts A and B.
  • The above calculation yields a plurality of distances. In other words, it quantizes the distance between the different information amounts in the file, so that the terminal device can accurately identify how far apart they are, which provides an accurate reference for information aggregation.
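As a small illustration of the pairwise calculation, the sketch below takes the absolute difference of label values for every pair of information amounts. The label values are hypothetical placeholders and `pairwise_distances` is an assumed helper name.

```python
from itertools import combinations


def pairwise_distances(labels: dict) -> dict:
    """Map each unordered pair of information amounts to |L_a - L_b|."""
    return {(a, b): abs(labels[a] - labels[b])
            for a, b in combinations(labels, 2)}


# Hypothetical label values, for illustration only.
labels = {"person name": 148.0, "phone number": 170.5, "email address": 414.5}
distances = pairwise_distances(labels)
```

For n information amounts this produces n(n-1)/2 distances, which is the "plurality of distances" the aggregation step works from.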
  • Step 103: Aggregate the different information amounts according to the calculated distances between them.
  • The information to be aggregated can be information of different but related categories, usually a person name, phone number, address, and mailbox; aggregation can also follow information categories defined by the user.
  • For example, d(Xiao Ming, 12345678) can be corrected to 5, the same value as d(he, phone).
  • The referential relationship and the peer relationship mentioned above can be determined from the grammatical attributes and the distance relationships of the information amounts.
  • Specifically, a referential or peer relationship between information amounts can first be determined from their grammatical attributes, and then confirmed further from the grammatical attributes together with the distance relationships. For example, "phone" and "12345678" are connected by the word "is", so they are determined to be in a peer relationship.
  • "Xiao Ming" is a person name.
  • "He" is a pronoun.
  • There are no other pronouns in the above text, so it can be determined that they have a referential relationship.
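One way to read the correction is that an information amount is measured through its stand-in: the pronoun that refers to a name, or the keyword that is the peer of a value. The sketch below follows that reading, which is an interpretation rather than the patent's stated procedure. The label values are hypothetical and chosen only so that d(he, phone) equals 5 as in the text.

```python
def distance(labels: dict, a: str, b: str) -> float:
    """Absolute difference between the label values of a and b."""
    return abs(labels[a] - labels[b])


def corrected_distance(labels: dict, a: str, b: str, stand_ins: dict) -> float:
    """Distance between a and b, measured between their stand-ins when they have one."""
    return distance(labels, stand_ins.get(a, a), stand_ins.get(b, b))


# Hypothetical label values; only the gap between "he" and "phone" mirrors the text.
labels = {"Xiao Ming": 25.5, "he": 35.0, "phone": 40.0, "12345678": 43.5}

# "he" refers to "Xiao Ming" (referential); "phone" is the peer of "12345678".
stand_ins = {"Xiao Ming": "he", "12345678": "phone"}

raw = distance(labels, "Xiao Ming", "12345678")
corrected = corrected_distance(labels, "Xiao Ming", "12345678", stand_ins)
```

Here the raw distance of 18.0 is replaced by 5.0, so the name and the number look as close as the pronoun and the keyword that stand for them.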
  • As another example, a file has the following contents:
  • The information amounts that the user needs to pay attention to are the person name, the phone number, and the email address.
  • The first paragraph has 40 characters and the second paragraph has 146 characters.
  • Each occurrence is identified by its paragraph value, the information amount, and its occurrence number within that paragraph.
  • For example, the third "Wang Zong" appearing in the second paragraph is written 2—Wang Zong—3, and so on.
  • The related information of the above information amounts in the file is:
  • Email address (2, 121, 124);
  • The distance between each pair of information amounts is calculated.
  • The specific calculation process is similar to that in the previous example and is not repeated here.
  • The referential relationships and the peer relationships between the information amounts are determined. (1) Determine the referential relationship of the pronoun "he".
  • The person name, phone number, and email address with the smallest distance between them are selected for aggregation, finally giving the following aggregation result:
  • The aggregation result may be saved in a corresponding file and/or displayed to the user for selection.
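The selection by smallest distance can be sketched as a nearest-name assignment: every phone number or email address is attached to the person name whose label value is closest. This is a simplified reading that skips the relationship-based correction, and all names, values, and label values below are hypothetical.

```python
def aggregate(items: list) -> list:
    """Attach each non-name item to the person name with the nearest label value.

    `items` is a list of (category, text, label_value) tuples.
    """
    names = [(text, label) for category, text, label in items if category == "name"]
    if not names:
        return []

    best = {}  # (name, category) -> (distance, text) of the closest candidate so far
    for category, text, label in items:
        if category == "name":
            continue
        name, name_label = min(names, key=lambda n: abs(n[1] - label))
        dist = abs(name_label - label)
        key = (name, category)
        if key not in best or dist < best[key][0]:
            best[key] = (dist, text)

    records = {name: {"name": name} for name, _ in names}
    for (name, category), (_, text) in best.items():
        records[name][category] = text
    return list(records.values())


items = [
    ("name", "Wang", 148.0),
    ("phone", "12345678", 170.5),
    ("name", "Li", 390.0),
    ("email", "li@example.com", 414.5),
    ("phone", "87654321", 200.0),  # farther from "Wang" than 12345678, so dropped
]
result = aggregate(items)
```

Each returned record corresponds to one (person name, phone number, email address) structure, with missing fields simply left out.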
  • The information aggregation method in the embodiment of the present invention determines the related information of the information amounts in the file and calculates the distance between different information amounts according to that related information, so that each distance has a specific value and the distances between information amounts in the file are quantized. Aggregating the information amounts by the quantized distance not only lets the terminal device aggregate information automatically but also effectively improves the accuracy of information aggregation, and thus provides an accurate information source for information extraction.
  • At the same time, because the accuracy of information aggregation is improved, more accurate services can be provided for the information that users need to pay attention to, thereby improving the user experience.
  • Correspondingly, an embodiment of the present invention further provides an information aggregation apparatus, which may be part of a device such as a terminal device or a server.
  • The terminal device may be an intelligent terminal device such as a mobile phone, a PDA, or a tablet computer.
  • The apparatus includes:
  • The information determining unit 201 is configured to determine the related information of the information amounts in the file.
  • An information amount is a piece of information that the user pays attention to, for example a person name, a phone number, an email address, a conference topic, a meeting place, or the meeting content.
  • Each information amount consists of one or more character strings, and each information amount has its own related information.
  • The information determining unit 201 may determine the related information of the different information amounts in the file.
  • The file may be a mail or a short message of the user, or another kind of file; this is not limited in the embodiments of the present invention.
  • The file may be a mail or short message that the terminal device has just received, or one that is already stored on the terminal device; this is not limited in the embodiments of the present invention.
  • The calculating unit 202 is configured to calculate the distance between different information amounts according to the related information.
  • The calculating unit 202 may first calculate the label value of each information amount according to its related information.
  • Because each information amount has its own related information, the calculating unit 202 obtains a label value for each information amount and then calculates the distance between different information amounts from these label values.
  • The aggregating unit 203 is configured to aggregate the different information amounts according to the calculated distances between them. In this embodiment, the aggregation process must take the distance between the information amounts into account and aggregate according to the principle of proximity.
  • The information to be aggregated can be information of different but related categories, usually a person name, phone number, address, and mailbox; aggregation can also follow information categories defined by the user.
  • The referential relationship and the peer relationship mentioned above can be determined from the grammatical attributes and the distance relationships of the information amounts.
  • Specifically, a referential or peer relationship between information amounts can first be determined from their grammatical attributes, and then confirmed further from the grammatical attributes together with the distance relationships.
  • Word segmentation can be used to first divide the continuous character string of each sentence in the file into separate words and then determine whether each word is an information amount that needs attention. For example, the categories of information that need attention can be predefined, the segmented words classified, and each word judged by its category.
  • Other ways can also be used to identify the information amounts in the file. For example, a vocabulary of terms that need attention can be set, and the contents of the file filtered against this vocabulary to find the information amounts that need attention.
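The vocabulary-filtering approach can be sketched with Python's standard `re` module. This covers only the second approach mentioned above, not word segmentation. The regular expressions for phone numbers and email addresses are a swapped-in convenience and, like the vocabulary entries and category names, are assumptions made for illustration.

```python
import re

# Hypothetical patterns for values that cannot be listed in a vocabulary.
PATTERNS = {
    "phone number": re.compile(r"\b\d{7,11}\b"),
    "email address": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

# Hypothetical vocabulary of terms that need attention, grouped by category.
VOCABULARY = {
    "person name": ["Xiao Ming"],
    "pronoun": ["he", "his"],
    "keyword": ["phone"],
}


def find_information_amounts(text: str) -> list:
    """Return (category, text, start, end) tuples with 1-based inclusive positions."""
    found = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((category, match.group(), match.start() + 1, match.end()))
    for category, words in VOCABULARY.items():
        for word in words:
            # Word boundaries keep "he" from matching inside "the" or "his".
            for match in re.finditer(rf"\b{re.escape(word)}\b", text):
                found.append((category, match.group(), match.start() + 1, match.end()))
    return sorted(found, key=lambda item: item[2])


sentence = ("Xiao Ming went to Beijing for a business trip today, "
            "his phone number is 12345678.")
found = find_information_amounts(sentence)
```

The start and end positions returned here are the same kind of location information that the label value calculation takes as input.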
  • The information determining unit 201 need only determine the related information in the file for the information amounts that need attention.
  • The related information may be location information, such as a paragraph position, a start position, and an end position.
  • The paragraph position indicates which natural paragraph of the file contains the information amount; the start position and the end position indicate the position of the information amount within the sentence in the file.
  • The related information may also include other information, such as the grammatical attribute of the information amount.
  • A specific structure of the calculating unit 202 includes a first calculating subunit and a second calculating subunit (not shown), where:
  • The first calculating subunit is configured to calculate the label value of each information amount according to the related information, specifically according to formula (1) above.
  • Because each information amount has its own related information, calculating the label value gives each information amount a corresponding label value.
  • The second calculating subunit is configured to calculate the distance between different information amounts according to the label values.
  • Calculating the distance between different information amounts can be understood as calculating the distance between any two different information amounts in the file.
  • The absolute value of the difference between the label values of two information amounts may be taken as the distance between them; that is, the distance between different information amounts is calculated according to formula (2) above.
  • FIG. 3 is a schematic diagram showing a specific structure of the aggregating unit in the embodiment of the present invention.
  • The aggregating unit includes:
  • The relationship determining subunit 301 is configured to determine whether there is a referential relationship and/or a peer relationship between different information amounts.
  • The modifying subunit 302 is configured to, when the relationship determining subunit 301 determines that there is a referential relationship and/or a peer relationship between different information amounts, correct the distances between different information amounts calculated by the calculating unit according to the referential relationship and/or peer relationship determined by the relationship determining subunit 301.
  • The merging subunit 303 is configured to, when the relationship determining subunit 301 determines that there is a referential relationship and/or a peer relationship between different information amounts, aggregate the information amounts corresponding to the minimum distance among the distances corrected by the modifying subunit 302.
  • The merging subunit 303 is further configured to, when the relationship determining subunit 301 determines that there is no referential relationship or peer relationship between different information amounts, aggregate the information amounts corresponding to the minimum distance among the distances calculated by the calculating unit.
  • The relationship determining subunit 301 can determine the referential relationship and the peer relationship from the grammatical attributes and the distance relationships of the information amounts.
  • Specifically, a referential relationship and/or a peer relationship between information amounts can first be determined from their grammatical attributes, and then confirmed further from the grammatical attributes together with the distance relationships.
  • The information aggregation apparatus of the embodiment of the present invention determines the related information of the information amounts in the file and calculates the distance between different information amounts according to that related information, thereby quantizing the distance between different information amounts in the file. Aggregating the information amounts by the quantized distance effectively improves the accuracy of information aggregation and thus provides an accurate information source for information extraction. At the same time, because the accuracy of information aggregation is improved, more accurate services can be provided for the information that users need to pay attention to, thereby improving the user experience.
  • The information aggregation method and device in the embodiments of the present invention can be applied to a terminal device or to a device such as a server, and can aggregate not only text information but also image information.
  • The embodiments in this specification are described in a progressive manner. For the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others.
  • The device embodiment is described relatively briefly; for the relevant parts, refer to the description of the method embodiment.
  • The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they can be located in one place or distributed across multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.
  • The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention relates to the technical field of information processing. Disclosed are an information aggregation method and device. The method comprises: determining related information about an information amount in a file; calculating the distance between different information amounts according to the related information; and aggregating the different information amounts according to the calculated distance between the different information amounts. The information aggregation accuracy can be increased using the present invention.

Description

Information aggregation method and device
Technical Field
The present invention relates to the field of information processing technologies, and in particular to an information aggregation method and apparatus.
Background
Information aggregation combines different pieces of intrinsically linked information into one structure. For example, if a person name, a phone number, and an email address all belong to the same person, they can be combined into one larger block of information, the structure (person name, phone number, email address). Information aggregation technology makes it possible to offer users a one-stop personalized service that draws on multiple information sources. For example, a terminal device monitors the user's mail or short messages, automatically extracts information of interest (such as contact details and event information), generates a calendar event, a transaction reminder event, or an address book contact, and stores the information in the corresponding location, such as the calendar, the transaction reminder, or the contact list, to help the user process information and work more efficiently.
Information aggregation is a necessary prerequisite for information extraction, and aggregating information against a quantifiable standard is its core task. The choice of metric affects the quality of the aggregation and therefore the final result of information extraction.
In the prior art, a common method of information aggregation is grammatical structure analysis, which applies grammatical principles to combine information according to its grammatical components. Taking Chinese grammar as an example, the sentence components are subject, predicate, object, attributive, adverbial, and complement. Each component places requirements on lexical attributes: a noun can act as a subject, a verb can act as a predicate, an adjective modifies a noun, and so on. Based on the attributes of the words, the sentence components can be aggregated. However, the complexity of sentences and the diversity of components make grammatical structure analysis difficult to quantify. The principle of proximity in grammatical analysis, for example, is a very hard problem for a terminal device: because far and near have no quantified definition, the terminal device does not know what counts as far and what counts as near. Since grammatical structure analysis is difficult to quantify, the accuracy of information aggregation is low.
Summary of the Invention
In view of the above problems in the prior art, the embodiments of the present invention provide an information aggregation method and apparatus to improve the accuracy of information aggregation.
To this end, the embodiments of the present invention provide the following technical solutions.
An information aggregation method, including:
determining related information of information amounts in a file;
calculating the distance between different information amounts according to the related information;
aggregating the different information amounts according to the calculated distances between them.
An information aggregation device, including:
an information determining unit, configured to determine related information of information amounts in a file;
a calculating unit, configured to calculate the distance between different information amounts according to the related information;
an aggregation unit, configured to aggregate the different information amounts according to the calculated distances between them.
The information aggregation method and apparatus provided by the embodiments of the present invention determine the related information of the information amounts in the file and calculate the distance between different information amounts according to that related information, thereby quantizing the distance between different information amounts in the file. Aggregating the information amounts by the quantized distance effectively improves the accuracy of information aggregation.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of an information aggregation method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an information aggregation apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an aggregation unit in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The information aggregation method and device of the embodiments of the present invention determine the related information of information amounts in a file and calculate the distances between different information amounts according to that related information. The distances between different information amounts in the file are thereby quantified, and the quantified distances are used to aggregate the different information amounts, which effectively improves the accuracy of information aggregation.
The information aggregation method of the embodiments of the present invention can be applied to a terminal device or a server. For example, a terminal device monitors the user's emails or short messages and automatically aggregates the information in them that the user cares about.
FIG. 1 is a flowchart of the information aggregation method according to an embodiment of the present invention. The method includes the following steps:
Step 101: Determine related information of information amounts in a file.
An information amount is a piece of information that the user cares about, for example a person's name, a phone number, or an email address, or a meeting topic, a meeting place, or meeting content. Each information amount consists of one or more character strings and has its own corresponding related information. In this embodiment, this step may determine the related information of different information amounts in the file. It can equally be understood as obtaining the related information that corresponds to the information the user cares about in the file, or obtaining the related information that corresponds to the information amounts in the file.
The file may be the user's email or short message, or another kind of file; the embodiments of the present invention do not limit this. In this embodiment, the file may be an email or short message that the terminal device has just received, or one that is already stored on the terminal device; the embodiments of the present invention do not limit this either.
In practice, sentence segmentation can be used: the continuous character string in each sentence of the file is first split into words, and each word is then checked to determine whether it is an information amount of interest. For example, categories of information amounts of interest can be defined in advance, the segmented words tagged with categories, and each word's category used to decide whether it is an information amount of interest. Other approaches can also be used to identify the information amounts in the file. For example, vocabularies of terms of interest can be set up and the file content filtered against them to find the information amounts of interest.
There may of course be still other ways to identify the information amounts in a file; the embodiments of the present invention do not limit this.
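The vocabulary-filtering approach can be sketched in Python as follows. This is only an illustration: the `VOCABULARY` table, the category names, the phone-number pattern, and the sample sentence are all assumptions made for the sketch, not part of the described method.

```python
import re

# Hypothetical vocabulary of terms of interest, keyed by category (illustrative only).
VOCABULARY = {
    "person": ["小明"],
    "pronoun": ["他"],
    "phone_term": ["电话"],
}
# Assumed pattern for a phone number: a run of 7 to 11 digits.
PHONE_NUMBER = re.compile(r"\d{7,11}")


def find_information_amounts(text):
    """Return (category, matched string, character index) tuples in text order."""
    found = []
    for category, terms in VOCABULARY.items():
        for term in terms:
            for match in re.finditer(re.escape(term), text):
                found.append((category, term, match.start()))
    for match in PHONE_NUMBER.finditer(text):
        found.append(("phone_number", match.group(), match.start()))
    return sorted(found, key=lambda item: item[2])


sentence = "小明今天到北京出差，他的电话是12345678。"
items = find_information_amounts(sentence)
for category, term, index in items:
    print(category, term, index)
```

A real implementation would more likely rely on word segmentation and category tagging, as the text describes; plain string matching is used here only to keep the sketch self-contained.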
In the embodiments of the present invention, the related information may be position information, for example the paragraph position, start position, and end position of the information amount in the file. The paragraph position is the index of the natural paragraph of the file in which the information amount appears and is a constant; the start position and end position give the position of the information amount within the sentence that contains it.
If an information amount is in the first paragraph of the file, its paragraph position is 1; if it is in the second paragraph, its paragraph position is 2; and so on.
For example, suppose the file contains the following text: "小明今天到北京出差，他的电话是12345678。" ("Xiao Ming is going to Beijing on business today; his phone number is 12345678.")
The information amounts of interest here are: 小明 (Xiao Ming), 他 (he), 电话 (phone), and 12345678.
Assume that the text is in the n-th paragraph of the file, that each Chinese character occupies two positions, that each digit occupies one position, and that positions start at 1. The related information of each information amount in the file is then as follows:
小明 (n, 1, 4);
他 (n, 21, 22);
电话 (n, 25, 28);
12345678 (n, 31, 38).
The related information may of course include other information as well, such as the grammatical attributes of the information amount.
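The position triples listed above can be reproduced with a short Python sketch. The width rule used here (one position for an ASCII character, two for any other character, including Chinese punctuation) is an assumption consistent with the example, and the helper names `char_width` and `locate` are hypothetical.

```python
def char_width(ch):
    """Assumed width rule: ASCII characters take 1 position, all others take 2."""
    return 1 if ord(ch) < 128 else 2


def locate(text, term, paragraph):
    """Return (paragraph, start, end) of the first occurrence of term, 1-based."""
    index = text.index(term)
    start = sum(char_width(ch) for ch in text[:index]) + 1
    end = start + sum(char_width(ch) for ch in term) - 1
    return (paragraph, start, end)


sentence = "小明今天到北京出差，他的电话是12345678。"
n = 1  # paragraph position, chosen arbitrarily for the illustration
positions = {term: locate(sentence, term, n) for term in ["小明", "他", "电话", "12345678"]}
for term, position in positions.items():
    print(term, position)
```

With n = 1 the sketch yields (1, 1, 4), (1, 21, 22), (1, 25, 28), and (1, 31, 38), matching the list above.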
Step 102: Calculate the distances between different information amounts according to the related information.
Specifically, a label value can first be calculated for each information amount from its related information. Because each information amount has its own corresponding related information, every information amount thereby obtains a corresponding label value. The distances between different information amounts are then calculated from the label values.
The following description assumes that the related information includes only position information and uses the text of the file above as an example.
For example, the label value of an information amount can be calculated by formula (1):
L = paragraph position * label coefficient + (start position + end position) / 2    (1)
where L is the label value of the information amount.
The label coefficient is included in formula (1) to ensure that the label values calculated for the information amounts are unique. In practice, the label coefficient can be the number of characters in the longest paragraph of the file, that is, the maximum character count over all paragraphs. For convenience, the label coefficient is denoted max_size. For example, if the file has three natural paragraphs with n1, n2, and n3 characters respectively, then max_size = max(n1, n2, n3).
If max_size takes some other value instead, for example the character count of the current paragraph, the uniqueness of the label values cannot be guaranteed.
For example, suppose there are three paragraphs of text with 1000, 500, and 600 characters respectively. If max_size is taken as the character count of the current paragraph, the following happens:
The label value of an information amount in the first paragraph is 1 * 1000 + (start position + end position) / 2. The start position of this paragraph is 1 and the end position is 1000, so the midpoint term lies in (1, 1000) and the label value lies in (1000, 2000);
The label value of an information amount in the second paragraph is 2 * 500 + (start position + end position) / 2. The start position of this paragraph is 1 and the end position is 500, so the midpoint term lies in (1, 500) and the label value lies in (1000, 1500);
In the same way, the label values of information amounts in the third paragraph lie in (1800, 2400).
Thus the range of label values for the first paragraph covers the range for the second paragraph, and the label values for the first paragraph overlap with those for the third paragraph.
The label coefficient may of course also be any number greater than the character count of the longest paragraph in the file.
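The effect of the label coefficient can be checked numerically. The Python sketch below applies formula (1) to the three-paragraph example (1000, 500, and 600 characters) and compares the two choices of coefficient. The helpers `label_range` and `ranges_overlap` are hypothetical, and the sketch makes the simplifying assumption that the midpoint term of an information amount lies between 1 and the length of its paragraph.

```python
def label_value(paragraph, start, end, label_coefficient):
    """Formula (1): paragraph position * label coefficient + midpoint of the span."""
    return paragraph * label_coefficient + (start + end) / 2


def label_range(paragraph, paragraph_length, label_coefficient):
    """Smallest and largest label value an information amount in this paragraph can take."""
    low = label_value(paragraph, 1, 1, label_coefficient)
    high = label_value(paragraph, paragraph_length, paragraph_length, label_coefficient)
    return (low, high)


def ranges_overlap(a, b):
    """True if the closed intervals a and b share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


lengths = [1000, 500, 600]  # character counts of the three paragraphs
max_size = max(lengths)

# Coefficient = length of the current paragraph: the ranges collide.
per_paragraph = [label_range(p, n, n) for p, n in enumerate(lengths, start=1)]
# Coefficient = max_size for every paragraph: the ranges stay apart.
shared = [label_range(p, n, max_size) for p, n in enumerate(lengths, start=1)]

print(per_paragraph)
print(shared)
```

With the per-paragraph coefficient, the first range overlaps both the second and the third, as the text describes; with the shared max_size the three ranges are disjoint.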
According to formula (1), the label values of the information amounts above are:
L(小明) = n * max_size + 5/2;
L(他) = n * max_size + 43/2;
L(电话) = n * max_size + 53/2;
L(12345678) = n * max_size + 69/2.
In this embodiment, calculating the distances between different information amounts means calculating the distance between any two different information amounts in the file. The absolute value of the difference between the label values of two information amounts can be taken as the distance between them, that is, the distance is calculated by formula (2):
d(x, y) = |L(x) - L(y)|    (2)
where x and y denote two different information amounts.
According to formula (2), the distances between the information amounts above are:
d(小明, 他) = 19;
d(小明, 电话) = 24;
d(小明, 12345678) = 32;
d(他, 电话) = 5;
d(他, 12345678) = 13;
d(电话, 12345678) = 8.
In this embodiment, the calculation above yields multiple distances. In other words, the calculation quantifies the distances between different information amounts in the file, so that the terminal device can determine accurately how near or far different information amounts are from each other, which provides an accurate basis for information aggregation.
Step 103: Aggregate the different information amounts according to the calculated distances between them.
During aggregation, the distances between different information amounts are taken into account and aggregation follows the nearest-neighbor principle. The information to be aggregated may be information of different categories that is related, typically person names, phone numbers, addresses, and email addresses; it may also be aggregated according to information categories specified by the user.
Because referential relationships (such as between 他 and 小明) and/or peer relationships (such as between 电话 and 12345678) can exist between different information amounts, the distances between the related information amounts can first be corrected according to the referential and/or peer relationships. The smallest distance is then selected, and the information amounts corresponding to that distance are aggregated.
For example, among the distances obtained above, d(小明, 12345678) = 32. Because 小明 and 他 have a referential relationship, 12345678 and 电话 have a peer relationship, and d(他, 电话) = 5, d(小明, 12345678) can be corrected to 5, the same value as d(他, 电话). The corrected distance between 小明 and 12345678 is then compared with the calculated distances between 12345678 and the other person names, and the smallest value is selected for aggregation. That is, the phone number 12345678 is aggregated with the person name nearest to it.
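Formulas (1) and (2) and the correction step can be sketched in Python as follows. The values `n = 1` and `max_size = 40` are assumptions made for the illustration (the coefficient cancels out for information amounts in the same paragraph), and the `refers_to` and `peer_of` tables stand for relationships that are assumed to have been determined already.

```python
from itertools import combinations

max_size = 40  # assumed label coefficient; it cancels out within a single paragraph
n = 1          # assumed paragraph position

# (paragraph, start, end) of each information amount, as listed in the text.
positions = {
    "小明": (n, 1, 4),
    "他": (n, 21, 22),
    "电话": (n, 25, 28),
    "12345678": (n, 31, 38),
}


def label_value(paragraph, start, end):
    """Formula (1)."""
    return paragraph * max_size + (start + end) / 2


def distance(x, y):
    """Formula (2): absolute difference of the two label values."""
    return abs(label_value(*positions[x]) - label_value(*positions[y]))


distances = {frozenset(pair): distance(*pair) for pair in combinations(positions, 2)}

# Relationships assumed to have been determined beforehand (hypothetical inputs).
refers_to = {"他": "小明"}       # pronoun -> person name
peer_of = {"电话": "12345678"}   # term -> value

# Correction: the distance between the name and the value is replaced by the
# distance between the pronoun and the term that stand in for them.
for pronoun, name in refers_to.items():
    for term, value in peer_of.items():
        distances[frozenset((name, value))] = distances[frozenset((pronoun, term))]

print(distances[frozenset(("小明", "他"))])        # 19.0
print(distances[frozenset(("小明", "12345678"))])  # 5.0 after the correction
```

Keying the distance table by `frozenset` makes the lookup symmetric, so d(x, y) and d(y, x) share one entry.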
The referential and peer relationships can be determined from the grammatical attributes of the information amounts and the distances between them. That is, the grammatical attributes alone can indicate a referential or peer relationship between different information amounts, and the grammatical attributes combined with the distances can refine that judgment. For example, 电话 and 12345678 are joined by the linking word 是 ("is"), so they can be determined to be in a peer relationship. Likewise, 小明 is a person name and 他 is a pronoun, and the text above contains no other pronoun, so they can be determined to have a referential relationship. If the text contained other pronouns, the distance between each pronoun and 小明 would be used, and the nearest pronoun would be taken to refer to 小明. Conversely, if the text contained other person names, the distance between each person name and the pronoun 他 would be used, and the nearest person name would be taken as the referent of 他. When there are several person names and several pronouns, the referential relationships between them can be determined in the same way.
If no referential or peer relationship exists between the information amounts, the calculated distances need not be corrected; the information amounts corresponding to the smallest calculated distance are aggregated directly.
In the example above, the information amounts all appear in one paragraph of the file. The following example illustrates the aggregation process when the information amounts are in different paragraphs.
For example, a file contains the following text:
王总明天到北京出差，他的电话是12345678。 (Mr. Wang is going to Beijing on business tomorrow; his phone number is 12345678.)
王总将会与张总开会，会议期间不方便接听电话，有急事可以找王总他的秘书小王，他的电话是87654321，或者给王总直接发邮件或者直接发邮件给王总，他的邮件地址是：abc@domain.com。 (Mr. Wang will meet with Mr. Zhang and cannot conveniently take calls during the meeting. For urgent matters, contact Mr. Wang's secretary Xiao Wang, whose phone number is 87654321, or email Mr. Wang directly; his email address is abc@domain.com.)
For this text, the information amounts that the user cares about are person names, phone numbers, and email addresses.
The text has two paragraphs and mentions three people: 王总 (Mr. Wang), 张总 (Mr. Zhang), and 小王 (Xiao Wang). 王总 appears in both paragraphs, and three times in the second paragraph;
there are three occurrences of 他 (he): one in the first paragraph and two in the second; there are two phone numbers: 12345678 and 87654321;
and there is one email address: abc@domain.com.
Assume that each Chinese character occupies two character positions, each Chinese punctuation mark occupies two character positions, and each ASCII character occupies one character position.
For this text, the related information of the information amounts in the file is determined first, as follows. The information amounts in the first paragraph are:
王总, 他, 电话, 12345678;
The information amounts in the second paragraph are:
王总, 张总, 电话 (first occurrence), 王总 (second occurrence), 小王, 他 (first occurrence), 电话 (second occurrence), 87654321, 王总 (third occurrence), 他 (second occurrence), 邮件地址 (email address), abc@domain.com.
In this text, the first paragraph has 40 characters and the second paragraph has 146 characters.
Set max_size = 134.
Because 王总 appears four times and 他 appears three times among the information amounts, repeated information amounts are distinguished with the following notation: paragraph number_information amount_occurrence index. For example, the third occurrence of 王总 in the second paragraph is written 2_王总_3, and so on.
The related information of these information amounts in the file is:
1_王总_1 (1, 1, 4);
1_他_1 (1, 21, 22);
1_电话_1 (1, 25, 28);
12345678 (1, 31, 38);
2_王总_1 (2, 1, 4);
张总 (2, 11, 14);
2_电话_1 (2, 39, 42);
2_王总_2 (2, 55, 58);
小王 (2, 65, 68);
2_他_1 (2, 71, 72);
2_电话_2 (2, 75, 78);
87654321 (2, 81, 88);
2_王总_3 (2, 97, 100);
2_他_2 (2, 113, 114);
邮件地址 (2, 121, 124);
abc@domain.com (2, 129, 132).
Then the distance between each pair of information amounts is calculated with the distance formula defined above. The calculation is similar to that in the previous example and is not described item by item here.
After the distances between the information amounts are obtained, the referential and peer relationships between the information amounts are determined.
(1) Determine the referents of the pronoun 他.
d(1_王总_1, 1_他_1) = |(1 + 4)/2 - (21 + 22)/2| = 19;
d(1_他_1, 2_王总_1) = |[134 + (21 + 22)/2] - [2 * 134 + (1 + 4)/2]| = 115.
The distances between this pronoun and the other person names are calculated in the same way.
This yields six distances, because there are six person-name occurrences (a repeated name is counted once per occurrence). From the smallest of these six distances, it can be determined that 他 in the first paragraph refers to 王总 in the first paragraph; that is, 他 and 王总 have a referential relationship.
In the same way, it can be determined that the first 他 in the second paragraph refers to 小王 and the second 他 refers to 王总. The referential relationships of the personal pronouns in the text are thereby determined.
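The nearest-name rule for resolving pronouns can be sketched in Python using the position triples and the value max_size = 134 from this example. The occurrence labels follow the form paragraph_item_index, and the helper `nearest_name` is a hypothetical name for the illustration.

```python
MAX_SIZE = 134  # label coefficient used in the text

# Occurrence label -> (paragraph, start, end), copied from the list in the text.
names = {
    "1_王总_1": (1, 1, 4),
    "2_王总_1": (2, 1, 4),
    "张总": (2, 11, 14),
    "2_王总_2": (2, 55, 58),
    "小王": (2, 65, 68),
    "2_王总_3": (2, 97, 100),
}
pronouns = {
    "1_他_1": (1, 21, 22),
    "2_他_1": (2, 71, 72),
    "2_他_2": (2, 113, 114),
}


def label_value(position):
    """Formula (1) applied to a (paragraph, start, end) triple."""
    paragraph, start, end = position
    return paragraph * MAX_SIZE + (start + end) / 2


def nearest_name(pronoun_position):
    """Return the name occurrence whose label value is closest to the pronoun's."""
    target = label_value(pronoun_position)
    return min(names, key=lambda name: abs(label_value(names[name]) - target))


referents = {pronoun: nearest_name(position) for pronoun, position in pronouns.items()}
for pronoun, name in referents.items():
    print(pronoun, "->", name)
```

The sketch resolves 1_他_1 to 1_王总_1, 2_他_1 to 小王, and 2_他_2 to 2_王总_3, in agreement with the text. It uses distance only; the text also allows grammatical attributes to inform the decision.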
(2) Determine the peer relationships between terms and their values.
For example, 1_电话_1 and 12345678, 2_电话_2 and 87654321, and 邮件地址 and abc@domain.com.
Using the referential and peer relationships determined above, it can be established that 他 in the first paragraph refers to 王总 and that 电话 is 12345678. Correcting the calculated distances accordingly gives: d(1_王总_1, 12345678) = d(1_他_1, 1_电话_1) = 5.
Then the distances between 12345678 and the other names are calculated, and the smallest of these distances determines whom the phone number belongs to.
After the referential and peer relationships are determined, the person names, phone numbers, and email addresses among the related information amounts with the smallest distances are aggregated, finally giving the following aggregation result:
王总, 12345678, abc@domain.com;
小王, 87654321;
张总.
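The whole aggregation for this example can be sketched in Python as follows. The position triples and max_size = 134 come from the text; the `refers_to` and `peer_of` tables encode the relationships determined in steps (1) and (2). Taking the minimum of the raw and corrected distances is an assumption of this sketch; the text only states that the corrected distance replaces the raw one.

```python
MAX_SIZE = 134  # label coefficient used in the text

positions = {
    "1_王总_1": (1, 1, 4), "1_他_1": (1, 21, 22), "1_电话_1": (1, 25, 28),
    "12345678": (1, 31, 38), "2_王总_1": (2, 1, 4), "张总": (2, 11, 14),
    "2_王总_2": (2, 55, 58), "小王": (2, 65, 68), "2_他_1": (2, 71, 72),
    "2_电话_2": (2, 75, 78), "87654321": (2, 81, 88), "2_王总_3": (2, 97, 100),
    "2_他_2": (2, 113, 114), "邮件地址": (2, 121, 124),
    "abc@domain.com": (2, 129, 132),
}
# Relationships determined in steps (1) and (2) of the text.
refers_to = {"1_他_1": "1_王总_1", "2_他_1": "小王", "2_他_2": "2_王总_3"}
peer_of = {"1_电话_1": "12345678", "2_电话_2": "87654321", "邮件地址": "abc@domain.com"}
name_occurrences = ["1_王总_1", "2_王总_1", "张总", "2_王总_2", "小王", "2_王总_3"]


def label_value(item):
    """Formula (1)."""
    paragraph, start, end = positions[item]
    return paragraph * MAX_SIZE + (start + end) / 2


def distance(x, y):
    """Formula (2)."""
    return abs(label_value(x) - label_value(y))


def person(occurrence):
    """Strip the paragraph and occurrence indexes: '2_王总_3' -> '王总'."""
    parts = occurrence.split("_")
    return parts[1] if len(parts) == 3 else occurrence


result = {person(name): [] for name in name_occurrences}
for term, value in peer_of.items():
    # Raw distance from the value to every name occurrence.
    candidates = {name: distance(name, value) for name in name_occurrences}
    # Corrected distance: a pronoun near the term stands in for its referent.
    for pronoun, name in refers_to.items():
        candidates[name] = min(candidates[name], distance(pronoun, term))
    owner = min(candidates, key=candidates.get)
    result[person(owner)].append(value)

for name, values in result.items():
    print(name, values)
```

The sketch assigns 12345678 and abc@domain.com to 王总, 87654321 to 小王, and nothing to 张总. For 87654321 the correction matters: the uncorrected distances would place the number closer to 2_王总_3 (14) than to 小王 (18), whereas the corrected distance to 小王 is 5.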
In practice, after obtaining this aggregation result, the terminal device can save it to a corresponding file and/or present it to the user for selection or other operations.
Thus, the information aggregation method of the embodiments of the present invention determines the related information of information amounts in a file and calculates the distances between different information amounts according to that related information, so that every distance between information amounts has a concrete value. The distances between different information amounts in the file are thereby quantified, and the quantified distances are used to aggregate the different information amounts. This not only lets a terminal device aggregate information automatically but also effectively improves the accuracy of information aggregation, which in turn provides an accurate information source for information extraction. Because the accuracy of information aggregation is improved, more accurate services can be provided for the information the user cares about, which improves the user experience.
Correspondingly, an embodiment of the present invention further provides an information aggregation device, which may be part of a terminal device, a server, or similar equipment. The terminal device may be a smart terminal such as a mobile phone, a PDA, or a tablet computer.
FIG. 2 is a schematic structural diagram of the device.
In this embodiment, the device includes:
An information determining unit 201, configured to determine related information of information amounts in a file. In this embodiment, an information amount is a piece of information that the user cares about, for example a person's name, a phone number, or an email address, or a meeting topic, a meeting place, or meeting content. Each information amount consists of one or more character strings and has its own corresponding related information. In this embodiment, the information determining unit 201 may determine the related information of different information amounts in the file.
The file may be the user's email or short message, or another kind of file; the embodiments of the present invention do not limit this. In this embodiment, the file may be an email or short message that the terminal device has just received, or one that is already stored on the terminal device; the embodiments of the present invention do not limit this either.
计算单元 202, 用于根据所述相关信息计算不同信息量之间的距离。 在本 实施例中,计算单元 202可以先根据信息量的相关信息计算该信息量的标签数 值, 在本实施例中, 可以理解为, 通过计算单元 202计算标签数值, 从而使每 个信息量可以获得其对应的标签数值, 然后根据计算得到的标签数值计算不 同信息量之间的距离。 聚合单元 203,用于根据计算得到的不同信息量之间的距离对不同的信息 量进行聚合。 在本实施例中, 在聚合过程中, 需要考虑不同信息量之间的距 离, 按照就近原则进行聚合。 需要聚合的信息可以是不同类别且具有关联性 的信息, 通常是人名、 电话、 地址、 邮箱这类信息, 也可以按照用户制定的 信息类别聚合。 The calculating unit 202 is configured to calculate a distance between different information amounts according to the related information. In this embodiment, the calculation unit 202 may first calculate the label value of the information amount according to the information about the information amount. In this embodiment, it may be understood that the calculation unit 202 calculates the label value, so that each information amount can be Obtain the corresponding label value, and then calculate the distance between different information quantities based on the calculated label value. The aggregating unit 203 is configured to aggregate different amounts of information according to the calculated distance between different amounts of information. In this embodiment, in the aggregation process, it is necessary to consider the distance between different amounts of information, and perform aggregation according to the principle of proximity. The information that needs to be aggregated may be different categories and related information, usually information such as person name, phone number, address, and mailbox, or may be aggregated according to the information category defined by the user.
由于不同信息量之间会存在指代关系 (比如 "他" 和 "小明" )和 /或对 等关系 (比如 "电话" 和 "12345678" ) , 因此, 可以先根据指代关系和 / 或对等关系对相关的信息量之间的距离进行修正, 然后, 选择一个最小的距 离, 将该距离对应的信息量聚合。  Since there are reference relationships (such as "he" and "xiaoming") and/or peer relationships (such as "telephone" and "12345678") between different amounts of information, they can be based on referential relationships and/or The relationship corrects the distance between the related information quantities, and then selects a minimum distance to aggregate the amount of information corresponding to the distance.
上述指代关系和对等关系的判断可以根据各信息量的语法属性及距离关 系来确定。 在本实施例中, 可以理解为, 根据各信息量的语法属性可以判断 不同信息量之间的指代关系或对等关系, 进一步的, 还可以根据各信息量的 语法属性和距离关系来进一步的判断不同信息量之间的指代关系或对等关 系。  The above-mentioned referential relationship and the judgment of the peer relationship can be determined according to the grammatical attributes and distance relationships of the respective information amounts. In this embodiment, it can be understood that the referential relationship or the peer relationship between different information amounts can be determined according to the syntax attribute of each information amount, and further, according to the syntax attribute and the distance relationship of each information amount, further The judgment of a referential relationship or a peer relationship between different amounts of information.
在本发明实施例中, 可以利用句子切分技术, 先将文件中每个句子中的 连续字符串切分为不同的词, 然后再确定其中的每个词是否为需要关注的信 息量。 比如可以预先定义一些需要关注的信息量的类别, 对切分后的分词进 行类别标注, 然后根据各词的类别确定其是否为需要关注的信息量。 除此之 外, 还可以利用其它方式来识别文件中的信息量, 比如, 可以设置一些需要 关注的词汇表, 然后, 根据这些词汇表过滤文件中的内容, 找出其中需要关 注的信息量。  In the embodiment of the present invention, the sentence segmentation technique can be used to first divide the continuous character string in each sentence in the file into different words, and then determine whether each of the words is the amount of information to be concerned. For example, it is possible to predefine categories of information that need to be focused on, classify the segmented word segments, and then determine whether it is the amount of information to be concerned according to the category of each word. In addition, other ways can be used to identify the amount of information in the file. For example, you can set some vocabulary that needs attention, and then filter the contents of the file according to these vocabularies to find out the amount of information that needs to be paid attention to.
Information amounts in the file may of course be identified in still other ways, which this embodiment of the present invention does not limit.
The information determining unit 201 may determine related information in the file only for the information amounts that need attention.
The related information may be position information, such as a paragraph position, a start position, and an end position. The paragraph position indicates the natural paragraph of the file in which the information amount is located; the start position and the end position indicate the position of the information amount within the sentence in which it appears. The related information may also include other information, such as the grammatical attribute of the information amount.
In this embodiment of the present invention, one specific structure of the calculating unit 202 includes a first calculating subunit and a second calculating subunit (not shown), where:
The first calculating subunit is configured to calculate the tag value of each information amount according to the related information, specifically according to formula (1) above. Because each information amount has its own related information, calculating the tag values gives each information amount its own corresponding tag value.
The second calculating subunit is configured to calculate the distance between different information amounts according to the tag values. Here, calculating the distance between different information amounts means calculating the distance between any two different information amounts in the file. In this embodiment, the absolute value of the difference between the tag values of two different information amounts may be taken as the distance between them, that is, the distance is calculated according to formula (2) above.
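The two calculations can be sketched as below, using the formula as stated in the claims: tag value L = paragraph position * tag coefficient + (start position + end position) / 2, and distance = absolute difference of two tag values. The class and function names are hypothetical, and the sketch assumes the start and end positions are character offsets that never exceed the length of the longest paragraph, so that a tag coefficient equal to that length keeps different paragraphs apart.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Position:
    paragraph: int  # index of the natural paragraph in the file
    start: int      # start offset of the information amount
    end: int        # end offset of the information amount


def tag_coefficient(paragraphs: Sequence[str]) -> int:
    """Smallest coefficient allowed by the claims: length of the longest paragraph."""
    return max(len(p) for p in paragraphs)


def tag_value(pos: Position, coefficient: float) -> float:
    """L = paragraph position * tag coefficient + (start position + end position) / 2."""
    return pos.paragraph * coefficient + (pos.start + pos.end) / 2


def distance(a: Position, b: Position, coefficient: float) -> float:
    """Distance = absolute value of the difference between the two tag values."""
    return abs(tag_value(a, coefficient) - tag_value(b, coefficient))
```

For example, with a coefficient of 100, an information amount at offsets 10 to 13 in paragraph 0 has tag value 11.5, one at offsets 4 to 11 in paragraph 1 has tag value 107.5, and their distance is 96.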
For the detailed calculation of the tag values of the information amounts and of the distances between different information amounts, refer to the description of the information aggregation method in the foregoing embodiment of the present invention; details are not repeated here.
FIG. 3 is a schematic diagram of a specific structure of the aggregating unit in this embodiment of the present invention. In this embodiment, the aggregating unit includes:
a relationship determining subunit 301, configured to determine whether a referential relationship and/or a peer relationship exists between different information amounts;
a correcting subunit 302, configured to, when the relationship determining subunit 301 determines that a referential relationship and/or a peer relationship exists between different information amounts, correct the distances between different information amounts calculated by the calculating unit according to the referential relationship and/or peer relationship determined by the relationship determining subunit; and
a merging subunit 303, configured to, when the relationship determining subunit 301 determines that a referential relationship and/or a peer relationship exists between different information amounts, aggregate the information amounts corresponding to the minimum distance among the distances corrected by the correcting subunit 302. In this embodiment, the merging subunit 303 is further configured to, when the relationship determining subunit 301 determines that no referential relationship and/or peer relationship exists between different information amounts, aggregate the information amounts corresponding to the minimum distance among the distances calculated by the calculating unit.
The relationship determining subunit 301 may determine the referential relationship and the peer relationship from the grammatical attributes of the information amounts and the distance relationship between them. In this embodiment, the referential relationship and/or peer relationship between different information amounts may be determined from their grammatical attributes alone, or, as a further refinement, from their grammatical attributes together with the distance relationship. For details, refer to the description in the foregoing embodiment of the present invention; details are not repeated here.
Likewise, for the specific processing of the correcting subunit 302 and the merging subunit 303, refer to the description in the foregoing embodiment of the present invention; details are not repeated here.
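The cooperation of the three subunits can be sketched as a single function. This is only an illustration under stated assumptions: the `related` callback stands in for the relationship determining subunit, and multiplying the distance of a related pair by a `correction` factor (0 by default) is a hypothetical correction rule chosen for this sketch, not the rule of the embodiment, whose details are given in the method description.

```python
from itertools import combinations
from typing import Callable, Dict, Optional, Tuple


def closest_pair(
    tag_values: Dict[str, float],
    related: Callable[[str, str], bool],
    correction: float = 0.0,  # hypothetical factor applied to related pairs
) -> Optional[Tuple[str, str]]:
    """Return the pair of information amounts with the smallest (corrected) distance."""
    best_pair, best_dist = None, float("inf")
    for a, b in combinations(tag_values, 2):
        d = abs(tag_values[a] - tag_values[b])  # distance from the tag values
        if related(a, b):                       # referential or peer relationship found
            d *= correction                     # correct the distance
        if d < best_dist:
            best_pair, best_dist = (a, b), d
    return best_pair
```

When `related` never returns true, no distance is corrected and the function simply returns the pair with the smallest calculated distance, which matches the second behavior of the merging subunit.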
The information aggregation apparatus of this embodiment of the present invention determines related information of the information amounts in a file and calculates the distances between different information amounts according to that related information, thereby quantifying the distances between different information amounts in the file. Aggregating the information amounts by these quantified distances effectively improves the accuracy of information aggregation and in turn provides an accurate information source for information extraction. Because aggregation accuracy is improved, more accurate services can be provided for the information the user needs to attend to, which improves the user experience.
It should be noted that the information aggregation method and apparatus of the embodiments of the present invention may be applied to a terminal device or to a device such as a server, and can aggregate not only text information but also image information.
The embodiments in this specification are described in a progressive manner. For parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus embodiment is described relatively briefly because it is basically similar to the method embodiment; for related details, refer to the corresponding parts of the method embodiment. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement the embodiments without creative effort.
A person of ordinary skill in the art can understand that all or part of the processes of the apparatus in the foregoing embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the apparatus embodiments described above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The foregoing is merely a specific implementation of the present invention, but the protection scope of the present invention is not limited to it. Any change or replacement that a person skilled in the art can readily conceive of within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the protection scope of the claims.

Claims

1. An information aggregation method, comprising:
determining related information of information amounts in a file;
calculating distances between different information amounts according to the related information; and
aggregating the different information amounts according to the calculated distances between the different information amounts.
2. The method according to claim 1, wherein the information amount is information of interest to a user.
3. The method according to claim 1, wherein the determining related information of information amounts in a file comprises:
determining position information of the information amount in the file, the position information comprising a paragraph position, a start position, and an end position, wherein the paragraph position indicates the natural paragraph position of the information amount in the file, and the start position and the end position indicate the position of the information amount within the sentence in which it is located in the file.
4. The method according to claim 3, wherein the calculating distances between different information amounts according to the related information comprises:
calculating a tag value of each information amount according to the related information, to obtain the tag values corresponding to the different information amounts; and
calculating the distances between the different information amounts according to the tag values.
5. The method according to claim 4, wherein
the calculating a tag value of each information amount according to the related information comprises:
calculating the tag value of the information amount by using the following formula: L = paragraph position * tag coefficient + (start position + end position) / 2; and
the calculating the distances between the different information amounts according to the tag values comprises:
taking the absolute value of the difference between the tag values corresponding to different information amounts as the distance between the different information amounts.
6. The method according to claim 5, wherein the tag coefficient is greater than or equal to the number of characters in the paragraph that contains the most characters among all paragraphs in the file.
7. The method according to any one of claims 1 to 6, wherein the aggregating the different information amounts according to the calculated distances between the different information amounts comprises:
determining whether a referential relationship and/or a peer relationship exists between different information amounts;
when it is determined that a referential relationship and/or a peer relationship exists between different information amounts, correcting the distances according to the referential relationship and/or the peer relationship; and
aggregating the information amounts corresponding to the minimum distance among the corrected distances.
8. The method according to claim 7, wherein the aggregating the different information amounts according to the calculated distances between the different information amounts further comprises:
when it is determined that no referential relationship and/or peer relationship exists between different information amounts, aggregating the information amounts corresponding to the minimum distance among the calculated distances between the different information amounts.
9. The method according to claim 8, wherein the determining whether a referential relationship and/or a peer relationship exists between different information amounts comprises:
determining the referential relationship and/or peer relationship between different information amounts according to the grammatical attributes of the information amounts.
10. The method according to claim 8, wherein the determining whether a referential relationship and/or a peer relationship exists between different information amounts further comprises:
determining the referential relationship and/or peer relationship between different information amounts according to the grammatical attributes of the information amounts and the distance relationship between them.
11. An information aggregation apparatus, comprising:
an information determining unit, configured to determine related information of information amounts in a file;
a calculating unit, configured to calculate distances between different information amounts according to the related information; and
an aggregating unit, configured to aggregate the different information amounts according to the calculated distances between the different information amounts.
12. The apparatus according to claim 11, wherein
the information determining unit is specifically configured to determine position information of the information amount in the file, the position information comprising a paragraph position, a start position, and an end position; the paragraph position indicates the natural paragraph position of the information amount in the file; the start position and the end position indicate the position of the information amount within the sentence in which it is located in the file; and the information amount is information of interest to a user.
13. The apparatus according to claim 12, wherein the calculating unit comprises:
a first calculating subunit, configured to calculate a tag value of each information amount according to the related information, to obtain the tag values corresponding to the different information amounts; and
a second calculating subunit, configured to calculate the distances between the different information amounts according to the tag values.
14. The apparatus according to claim 12, wherein
the first calculating subunit is specifically configured to calculate the tag value of the information amount by using the following formula: L = paragraph position * tag coefficient + (start position + end position) / 2; and
the second calculating subunit is specifically configured to take the absolute value of the difference between the tag values corresponding to different information amounts as the distance between the different information amounts.
15. The apparatus according to any one of claims 11 to 14, wherein the aggregating unit specifically comprises:
a relationship determining subunit, configured to determine whether a referential relationship and/or a peer relationship exists between different information amounts;
a correcting subunit, configured to, when the relationship determining subunit 301 determines that a referential relationship and/or a peer relationship exists between different information amounts, correct the distances between different information amounts calculated by the calculating unit according to the referential relationship and/or peer relationship determined by the relationship determining subunit; and
a merging subunit, configured to, when the relationship determining subunit determines that a referential relationship and/or a peer relationship exists between different information amounts, aggregate the information amounts corresponding to the minimum distance among the distances corrected by the correcting subunit.
16. The apparatus according to claim 15, wherein the merging subunit is further configured to, when the relationship determining subunit determines that no referential relationship and/or peer relationship exists between different information amounts, aggregate the information amounts corresponding to the minimum distance among the distances between different information amounts calculated by the calculating unit.
17. The apparatus according to claim 15, wherein the relationship determining subunit is further configured to determine the referential relationship and/or peer relationship between different information amounts according to the grammatical attributes of the information amounts.
18. The apparatus according to claim 15, wherein the relationship determining subunit is further configured to determine the referential relationship and/or peer relationship between different information amounts according to the grammatical attributes of the information amounts and the distance relationship between them.
PCT/CN2013/070051 2012-01-20 2013-01-05 Information aggregation method and device WO2013107297A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210018912.2 2012-01-20
CN201210018912.2A CN103218371B (en) 2012-01-20 2012-01-20 information aggregation method and device

Publications (1)

Publication Number Publication Date
WO2013107297A1 true WO2013107297A1 (en) 2013-07-25

Family

ID=48798612

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/070051 WO2013107297A1 (en) 2012-01-20 2013-01-05 Information aggregation method and device

Country Status (2)

Country Link
CN (1) CN103218371B (en)
WO (1) WO2013107297A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175327A (en) * 2019-05-11 2019-08-27 复旦大学 A kind of data privacy quantitative estimation method based on privacy information detection

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101246484A (en) * 2007-02-15 2008-08-20 刘二中 Electric text similarity processing method and system convenient for query
CN102012936A (en) * 2010-12-07 2011-04-13 中国电信股份有限公司 Massive data aggregation method and system based on cloud computing platform

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011113384A2 (en) * 2011-04-26 2011-09-22 华为终端有限公司 Method and terminal device for information processing

Also Published As

Publication number Publication date
CN103218371B (en) 2017-04-26
CN103218371A (en) 2013-07-24


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13738916

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13738916

Country of ref document: EP

Kind code of ref document: A1