WO2013107308A1 - Method and apparatus for aggregating information - Google Patents

Method and apparatus for aggregating information

Info

Publication number
WO2013107308A1
Authority
WO
WIPO (PCT)
Prior art keywords
distance
information
amount
information amount
text
Prior art date
Application number
PCT/CN2013/070146
Other languages
French (fr)
Chinese (zh)
Inventor
黄波
Original Assignee
华为终端有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为终端有限公司
Publication of WO2013107308A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition

Definitions

  • The present invention relates to the field of information recognition, and in particular, to a method and apparatus for aggregating information.
  • Aggregating information means combining different pieces of intrinsically related information into one structure. For example, if a person's name, a phone number, and an email address all belong to the same person, they can be combined into one large information block, forming the structure (person's name, phone number, email address). With information aggregation technology, users can be provided with a one-stop personalized service that draws on multiple information sources.
  • Aggregating information is an important part of information extraction.
  • The core of aggregating information is the use of a quantifiable standard. The choice of metric affects the quality of the aggregation and therefore the final result of information extraction.
  • A common method of information aggregation is the location label method. The method first locates the words in the text so that each information amount has a unique location label in the text, uses the location labels to obtain a distance that expresses how near or far two information amounts are from each other, and finally aggregates the information amounts according to this quantified near-far relationship to obtain a structure.
  • The location label method in the prior art provides a quantified standard between information amounts, but it considers only the positions of the information amounts and the distances between them, and aggregates according to the quantified near-far relationship, that is, the distance. When an information amount has one information amount before it and one after it at equal distances, the location label method offers no rigorous solution. If aggregation is performed at random in that case, aggregating the information amount with the preceding one or with the following one may yield completely different results, so the aggregation result may be inaccurate, and supplying inaccurate information amounts to the subsequent information extraction process affects the accuracy of the entire extraction.
  • Embodiments of the present invention provide a method and apparatus for aggregating information.
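  • The technical solution is as follows. A method for aggregating information, the method comprising:
  • obtaining a text to be aggregated;
  • obtaining a location label of each information amount in the text;
  • calculating the distance between every two information amounts according to the location labels;
  • when a first distance and a second distance are equal, correcting the first distance and the second distance according to a grammatical structure, wherein the first distance is the distance between a first information amount and a second information amount among the information amounts, and the second distance is the distance between the first information amount and a third information amount among the information amounts;
  • aggregating the information amounts according to the corrected first distance and second distance to obtain a structure.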
  • An apparatus for aggregating information, comprising:
  • a text acquisition module, configured to acquire a text to be aggregated;
  • a location label obtaining module, configured to acquire a location label of each information amount in the text;
  • a calculation module, configured to calculate the distance between every two information amounts according to the location labels;
  • a correction module, configured to correct a first distance and a second distance according to a grammatical structure when the first distance and the second distance are equal, wherein the first distance is the distance between a first information amount and a second information amount among the information amounts, and the second distance is the distance between the first information amount and a third information amount among the information amounts;
  • an aggregation module, configured to aggregate the information amounts according to the corrected first distance and second distance to obtain a structure.
  • An embodiment of the present invention provides a method and apparatus for aggregating information: acquiring a location label of each information amount in the text; calculating the distance between every two information amounts according to the location labels; when a first distance and a second distance are equal, correcting the first distance and the second distance according to a grammatical structure, wherein the first distance is the distance between a first information amount and a second information amount among the information amounts, and the second distance is the distance between the first information amount and a third information amount among the information amounts; and aggregating the information amounts according to the corrected first distance and second distance to obtain a structure.
  • When the distances between information amounts are equal, the distances are corrected according to the grammatical structure and the information amounts are aggregated according to the corrected distances. Taking the grammatical structure into account on top of aggregation by location label improves the accuracy of information aggregation.
  • FIG. 1 is a flowchart of a method for aggregating information according to an embodiment of the present invention
  • FIG. 2 is a flowchart of a method for aggregating information according to an embodiment of the present invention
  • FIG. 3 is a schematic structural diagram of an apparatus for aggregating information according to an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of an apparatus for aggregating information according to an embodiment of the present invention.
  • FIG. 1 is a flowchart of a method for aggregating information according to an embodiment of the present invention.
  • This embodiment can be implemented on terminals including mobile phones, personal computers, and tablets, and can also be applied to servers, for example to monitor users' emails or short messages and automatically aggregate the information that interests them.
  • The embodiment specifically includes:
  • The text may be data including character strings, punctuation marks, line feeds, and the like.
  • The text may be text currently received by the terminal, or text already saved by the terminal user and stored in the terminal. This example is described only for the case in which the text is currently received by the terminal.
  • The text may be a user's email or short message, or another file, which is not limited in this embodiment of the present invention.
  • An information amount is a string in the text that has a certain attribute and meaning, for example a person's name, a phone number, or an email address. These strings are the information that is useful for information extraction or that users care about. Besides names, phone numbers, and email addresses, they can also be conference topics, meeting locations, meeting content, and so on.
  • Word segmentation techniques can be used to first divide the continuous string of each sentence in the text into separate words, and then determine whether each word is an information amount that needs attention. For example, categories of information that need attention can be predefined, the segmented words classified, and each word judged according to its category.
  • Other ways can also be used to identify the information amounts in the text. For example, a vocabulary that needs attention can be configured and the text filtered against it to find the information amounts that need attention.
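The vocabulary-filtering approach described above can be sketched as follows. This is only an illustration under stated assumptions: the watch list `WATCH_PATTERNS`, the function `find_information_amounts`, and the use of regular expressions are all hypothetical and are not taken from the patent text.

```python
import re

# Hypothetical watch list: category name -> regular expression (illustrative only).
WATCH_PATTERNS = {
    "phone": re.compile(r"\b\d{3}-\d{4}\b"),
    "email": re.compile(r"[\w.]+@[\w.]+\.\w+"),
}

def find_information_amounts(text):
    """Return (category, matched string, start offset) tuples sorted by offset."""
    found = []
    for category, pattern in WATCH_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((category, match.group(), match.start()))
    return sorted(found, key=lambda item: item[2])

sample = "Call Li at 555-0134 or write to li@example.com."
amounts = find_information_amounts(sample)
```

Each returned tuple keeps the offset of the match, which is the raw material a location label would be built from.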
  • The method of this embodiment applies when there are three or more information amounts.
  • Each information amount has a unique position in the text.
  • The position is identified by a location label.
  • The location label includes the natural paragraph position of the information amount in the text, its starting position, and its ending position, and can take the form (paragraph position, starting position, ending position).
  • The paragraph position is the natural paragraph in which the information amount appears: its value is 1 if the information amount is in the first paragraph of the text, 2 if it is in the second paragraph, and so on.
  • The maximum number of characters in a paragraph is a constant, recorded as max_size.
  • The starting position is the position at which the information amount starts in the text.
  • The ending position is the position at which the information amount ends in the text.
  • The starting position and the ending position are the coordinates of the information amount within the paragraph.
  • In the example, each Chinese character occupies two positions (for example, bytes) and each digit occupies one position, so the starting position is 1 and the ending position is 23. Note that the starting and ending positions of an information amount also depend on the encoding used in the paragraph. For example, in ASCII encoding each English character occupies one byte.
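A minimal sketch of building a (paragraph position, starting position, ending position) label is shown here. It assumes that paragraphs are separated by line feeds and that positions are counted in characters, 1-based and inclusive. The patent's own example counts bytes, so these conventions, and the names `LocationLabel` and `label_of`, are assumptions made for illustration.

```python
from typing import NamedTuple

class LocationLabel(NamedTuple):
    paragraph: int  # 1-based natural paragraph number
    start: int      # 1-based position of the first character within the paragraph
    end: int        # 1-based position of the last character within the paragraph

def label_of(text, amount):
    """Locate the first occurrence of `amount`; return None if it is absent."""
    for number, paragraph in enumerate(text.split("\n"), start=1):
        index = paragraph.find(amount)
        if index != -1:
            return LocationLabel(number, index + 1, index + len(amount))
    return None

text = "Meeting notice\nContact Zhang San, phone 555-0134"
label = label_of(text, "555-0134")
```

Counting in bytes instead would only change how `start` and `end` are measured; the shape of the label stays the same.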
  • The location label value of each information amount may first be calculated from its location label, and the distance between every two information amounts is calculated from the location label values.
  • When the first distance and the second distance are equal, the first distance and the second distance are corrected according to a grammatical structure, where the first distance is the distance between the first information amount and the second information amount among the information amounts, and the second distance is the distance between the first information amount and the third information amount among the information amounts.
  • The first information amount, the second information amount, and the third information amount merely denote any three of the acquired information amounts that stand in the positional relationship described in this embodiment.
  • The grammatical structure refers to the vocabulary attributes or sentence components of the first information amount, the second information amount, and the third information amount.
  • When the first distance is greater than the second distance, the second information amount is farther from the first information amount and the third information amount is closer to it, so the first information amount and the third information amount are aggregated to obtain a structure. When the first distance is smaller than the second distance, the third information amount is farther from the first information amount and the second information amount is closer to it, so the first information amount and the second information amount are aggregated to obtain a structure.
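The comparison rule above can be written as a small helper. This is a sketch only: the function name `choose_partner` and the string return values are hypothetical, and a tie is returned as `None` because the rule itself does not decide that case.

```python
def choose_partner(first_distance, second_distance):
    """Pick which neighbor the first information amount is aggregated with.

    first_distance:  distance between the first and second information amounts.
    second_distance: distance between the first and third information amounts.
    """
    if first_distance < second_distance:
        return "second"  # the second information amount is closer
    if first_distance > second_distance:
        return "third"   # the third information amount is closer
    return None          # equal distances: the rule alone cannot decide
```

The `None` branch is exactly the situation that the grammatical correction of the two distances is meant to resolve.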
  • The first distance and the second distance are corrected to avoid inaccurate information aggregation caused by the two distances being equal.
  • The information amounts are aggregated according to the corrected first distance and second distance to obtain an aggregated structure.
  • The specific process of the aggregation is the same as in the prior art.
  • Aggregation refers to classifying and sorting the information amounts, so that in the subsequent information extraction process classified and sorted information is fed back to the user instead of disordered information.
  • The structure is a general term for the results obtained after the information amounts are aggregated. A large number of information amounts need to be classified and sorted, and the structures arranged or combined according to preset rules are returned.
  • The structure may be saved in a corresponding file, and/or displayed directly to a terminal user or a server user for selection and other operations.
  • FIG. 2 is a flowchart of a method for aggregating information according to an embodiment of the present invention. Referring to Figure 2, the embodiment specifically includes:
  • The text in step 201 is the same as that in step 101, and details are not described here again.
  • The characters in the text are recognized according to a saved dictionary. Recognition lets the terminal read the characters in the text and compose them into words or sentences, and the subsequent steps are performed on the recognized words or sentences.
  • The terminal acquires three or more information amounts in the text according to preset keywords. The information amounts may be words, numbers, letters, and the like.
  • This embodiment is described by taking three or more information amounts as an example. In other embodiments, when only one information amount is acquired, no aggregation is needed and the information amount itself can be used as a structure; when two information amounts are acquired, they can be aggregated according to the existing aggregation principle to obtain a structure.
  • Acquisition of the information amounts may be triggered in, but is not limited to, the following situations:
  • The terminal extracts from received text: when a text is received, the information amounts in it are acquired and aggregated, and the aggregated structure can be saved to a corresponding file and/or displayed directly to the terminal user or server user for selection and other operations.
  • The terminal extracts from locally saved text at intervals: at each interval it acquires the information amounts in the text and aggregates them, and the aggregated structure can be saved to a corresponding file and/or displayed directly to the terminal user or server user for selection and other operations.
  • The vocabulary attribute refers to noun, adjective, verb, adverb, and so on.
  • The sentence component refers to subject, predicate, object, and so on.
  • Chinese grammar is taken as an example.
  • An information amount whose vocabulary attribute is noun can serve as a subject or an object, and an information amount whose vocabulary attribute is verb can serve as a predicate.
  • The information amounts in the text are analyzed against the vocabulary attributes defined in a Chinese grammar library to obtain the vocabulary attribute of each information amount. The sentence component of each information amount is determined from its vocabulary attribute and from the classification or definition of that attribute in the Chinese grammar library.
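The mapping from vocabulary attribute to possible sentence components can be sketched as a lookup table. Only the noun and verb rows come from the description; the table name `CANDIDATE_COMPONENTS`, the function name, and the empty result for unlisted attributes are assumptions, and a real grammar library would be far richer.

```python
# Rows taken from the description: a noun may be a subject or an object,
# and a verb may be a predicate. Everything else is left unspecified.
CANDIDATE_COMPONENTS = {
    "noun": ("subject", "object"),
    "verb": ("predicate",),
}

def sentence_components(vocabulary_attribute):
    """Return the sentence components an information amount could serve as."""
    return CANDIDATE_COMPONENTS.get(vocabulary_attribute, ())
```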
  • Step 204 is the same as step 102, and details are not described here again.
  • The location label gives the coordinates of the information amount in the text, and the location label value of the information amount can be derived from it. Based on the example of step 102, the location label value of the information amount is:
  • When the first distance and the second distance are equal, the first distance and the second distance are corrected according to a grammatical structure, where the first distance is the distance between the first information amount and the second information amount of the at least two information amounts, and the second distance is the distance between the first information amount and the third information amount of the at least two information amounts;
  • That the first distance and the second distance are equal can be understood as the second information amount and the third information amount lying respectively before and after the first information amount.
  • The grammatical structure refers to the vocabulary attributes or sentence components of the first information amount, the second information amount, and the third information amount.
  • When the first distance and the second distance are equal, the tightness between the first information amount and the second information amount and the tightness between the first information amount and the third information amount are obtained according to the grammatical structure and the sentence components or vocabulary attributes of the first, second, and third information amounts, and the first distance and the second distance are corrected according to the obtained tightness.
  • Steps 203 to 206 take as an example the case in which the vocabulary attribute is obtained first and the sentence component is derived from it.
  • Step 203 may be replaced.
  • Step 206 is replaced by: when the first distance and the second distance are equal, correcting the first distance and the second distance according to the grammatical structure and the vocabulary attributes of the first information amount, the second information amount, and the third information amount.
  • Specifically, the tightness between the first information amount and the second information amount and the tightness between the first information amount and the third information amount are obtained according to the grammatical structure and the vocabulary attributes of the first, second, and third information amounts, and the first distance and the second distance are corrected according to the obtained tightness.
  • The terminal may pre-store the correspondence between sentence components, vocabulary attributes, and tightness, and obtain the tightness corresponding to an information amount according to its sentence component or vocabulary attribute. The tightness may be set with reference to the grammar of the language: different sentence components correspond to different tightness values, as do different vocabulary attributes. The specific values can be set by a technician and are not specifically limited in this embodiment.
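A pre-stored correspondence of this kind could look like the table below. This is a sketch under loud assumptions: the description leaves the values to the technician, so every number here is invented for illustration, and keying the table by a pair of vocabulary attributes is one possible design rather than the patent's stated data layout.

```python
# Hypothetical table: (vocabulary attribute A, vocabulary attribute B) -> tightness.
# The numbers are placeholders chosen for illustration only.
TIGHTNESS = {
    ("adjective", "noun"): 0.9,
    ("noun", "verb"): 0.6,
    ("noun", "adverb"): 0.2,
}

def tightness(attr_a, attr_b, default=0.0):
    """Look up the tightness of a pair of attributes regardless of order."""
    return TIGHTNESS.get((attr_a, attr_b), TIGHTNESS.get((attr_b, attr_a), default))
```

Looking the pair up in both orders keeps the table half the size; an order-sensitive table would be needed if "before" and "after" should be treated differently.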
  • The tightness corresponding to each information amount is obtained from the sentence component or vocabulary attribute determined for it, and the distances between information amounts are corrected according to the tightness. The specific correction process may include: if the tightness between the first information amount and the second information amount is greater than the tightness between the first information amount and the third information amount, subtracting a disturbance value from the first distance and/or adding a disturbance value to the second distance, so that the corrected first distance and second distance are no longer equal, and aggregating the information amounts according to the corrected first distance and second distance.
  • The value of the disturbance amount can be adjusted for different grammatical components, and a suitable disturbance amount can be selected to ensure that the distances between information amounts are unique.
  • The difference in tightness may also be expressed in other ways, such as multiplying or dividing by a disturbance coefficient, as long as the corrected first distance and second distance are no longer equal and reflect the difference in tightness.
  • Correcting the distances in this way takes into account both the "before and after" and the "far and near" quantitative measures between information amounts, and redefines the distance between information amounts by increasing or decreasing it by a disturbance amount.
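The correction step can be sketched as follows. The description states the case in which the first pair is tighter; handling the opposite case symmetrically and leaving the distances untouched when the tightness values are also equal are assumptions, as are the function and parameter names.

```python
def correct_distances(first, second, tight_12, tight_13, disturbance=0.25):
    """Break a tie between two equal distances using tightness.

    first / second: distances from the first information amount to the
    second and third information amounts. tight_12 / tight_13: tightness
    of those two pairs. Returns the (possibly corrected) distances.
    """
    if first != second or tight_12 == tight_13:
        return first, second                # no tie, or no grammatical preference
    if tight_12 > tight_13:
        return first - disturbance, second  # pull the tighter pair closer
    return first, second - disturbance      # assumed symmetric case

corrected = correct_distances(2.0, 2.0, tight_12=0.9, tight_13=0.4)
```

Subtracting from the tighter pair is one of the options the text allows; adding the disturbance to the looser pair, or scaling by a coefficient, would serve equally well as long as the two results differ.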
  • The three or more information amounts are aggregated according to the corrected first distance and second distance to obtain an aggregated structure.
  • Step 207 is the same as step 105, and details are not described here again.
  • The method further includes:
  • Upon receiving an extraction request for the information amounts, the terminal returns the aggregated information.
  • Returning the aggregated information improves the accuracy and efficiency of information extraction.
  • When the distances between information amounts are equal, the distances are corrected according to the grammatical structure and the information amounts are aggregated according to the corrected distances. Taking the grammatical structure into account on top of aggregation by location label improves the accuracy of information aggregation and the performance of subsequent information extraction.
  • The information amounts obtained from the above text are: Shanghai, come, water, from, at sea.
  • The information amounts before and after "water" are "self" and "from".
  • The distance between "water" and "self" is the same as the distance between "water" and "from". It is therefore impossible to decide which of the two the information amount "water" should be aggregated with.
  • The corrected distance is:
  • The corrected distance between "water" and "self" is the original distance minus the disturbance amount.
  • The corrected distance between "water" and "from" is the original distance.
  • Selecting a suitable disturbance amount, such as 0.25, makes the corrected distances between an information amount and the information amounts before and after it differ. The corrected distance can then describe the tightness between information amounts.
  • The value of the disturbance amount is 0.25.
  • The order of aggregation can be judged from the corrected distances.
  • "Water" should be aggregated with "self".
  • The result of the information aggregation is: Shanghai, tap water, from, at sea.
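The example can be replayed numerically as a rough sketch. The location label values in `label_values` are hypothetical, since the source does not list them; they are chosen only so that "water" is equally far from "self" and "from". The disturbance of 0.25 is the value given in the example, and the `tighter` argument stands in for the grammatical judgment.

```python
DISTURBANCE = 0.25  # value used in the example above

# Hypothetical location label values; the source does not list them.
label_values = {"self": 3.5, "water": 5.0, "from": 6.5}

def corrected_neighbor_distances(center, before, after, tighter):
    """Return corrected distances from `center` to its two neighbors."""
    d_before = abs(label_values[center] - label_values[before])
    d_after = abs(label_values[center] - label_values[after])
    if d_before == d_after:
        # Tie: shorten the distance to the grammatically tighter neighbor.
        if tighter == before:
            d_before -= DISTURBANCE
        else:
            d_after -= DISTURBANCE
    return {before: d_before, after: d_after}

corrected = corrected_neighbor_distances("water", "self", "from", tighter="self")
partner = min(corrected, key=corrected.get)
```

With these assumed values the tie between the two neighbors is broken in favor of "self", matching the aggregation result stated in the example.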
  • FIG. 3 is a schematic structural diagram of an apparatus for aggregating information according to an embodiment of the present invention.
  • The apparatus includes:
  • a text obtaining module 301, configured to acquire a text to be aggregated;
  • a location label obtaining module 302, configured to acquire a location label of each information amount in the text;
  • a calculation module 303, configured to calculate the distance between every two information amounts according to the location labels;
  • a correction module 304, configured to correct a first distance and a second distance according to a grammatical structure when the first distance and the second distance are equal, wherein the first distance is the distance between a first information amount and a second information amount among the information amounts, and the second distance is the distance between the first information amount and a third information amount among the information amounts;
  • an aggregation module 305, configured to aggregate the information amounts according to the corrected first distance and second distance to obtain a structure.
  • The apparatus further includes:
  • a vocabulary identification module 306, configured to acquire the vocabulary attribute of each information amount in the text;
  • the correction module 304 is further configured to, when the first distance and the second distance are equal, correct the first distance and the second distance according to a grammatical structure and the vocabulary attributes of the first information amount, the second information amount, and the third information amount;
  • the vocabulary identification module 306 is configured to acquire the vocabulary attribute of each information amount in the text and determine the sentence component of the information amount according to the obtained vocabulary attribute;
  • the correction module 304 is further configured to, when the first distance and the second distance are equal, correct the first distance and the second distance according to a grammatical structure and the sentence components of the first information amount, the second information amount, and the third information amount.
  • The correction module 304 is specifically configured to, when the first distance and the second distance are equal, obtain the tightness between the first information amount and the second information amount and the tightness between the first information amount and the third information amount according to a grammatical structure and the vocabulary attributes of the first, second, and third information amounts, and correct the first distance and the second distance according to the obtained tightness;
  • the correction module 304 is further configured to, when the first distance and the second distance are equal, obtain the tightness between the first information amount and the second information amount and the tightness between the first information amount and the third information amount according to a grammatical structure and the sentence components of the first, second, and third information amounts, and correct the first distance and the second distance according to the obtained tightness.
  • The location label includes the natural paragraph position, the starting position, and the ending position of the information amount in the text.
  • The distance between information amounts x and y is calculated from L(x) and L(y), where L(x) and L(y) are the location label value of information amount x and the location label value of information amount y, respectively;
  • location label value = paragraph position × maximum number of characters in a paragraph + (starting position + ending position) / 2, where the paragraph position is the natural paragraph position of the information amount in the text.
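The location label value formula can be checked with a few lines of code. The value of `MAX_SIZE` is an arbitrary placeholder for the constant max_size, and taking the distance as the absolute difference of the two label values is an assumption: the source names L(x) and L(y) but the distance expression itself is not legible in this copy.

```python
MAX_SIZE = 1000  # placeholder for max_size, the maximum number of characters in a paragraph

def label_value(paragraph, start, end, max_size=MAX_SIZE):
    """Location label value: paragraph * max_size + (start + end) / 2."""
    return paragraph * max_size + (start + end) / 2

def distance(label_x, label_y):
    """Assumed distance: absolute difference of the two location label values."""
    return abs(label_value(*label_x) - label_value(*label_y))

value = label_value(1, 1, 23)             # label (paragraph 1, start 1, end 23)
gap = distance((1, 1, 4), (1, 7, 10))     # two amounts in the same paragraph
```

Because the paragraph position is multiplied by max_size, any two information amounts in different paragraphs end up farther apart than two amounts in the same paragraph, provided max_size really bounds the paragraph length.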
  • When the distances between information amounts are equal, the distances are corrected according to the grammatical structure and the information amounts are aggregated according to the corrected distances. Taking the grammatical structure into account on top of aggregation by location label improves the accuracy of information aggregation and the performance of subsequent information extraction.
  • A person skilled in the art may understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing related hardware, and the program may be stored in a computer-readable storage medium.
  • The storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

Disclosed are a method and an apparatus for aggregating information, belonging to the field of information identification. The method comprises: obtaining a text to be aggregated; obtaining a location label of an information amount in the text; according to the location label, calculating the distance between every two information amounts; when a first distance is equal to a second distance, correcting the first distance and the second distance according to a grammatical structure, the first distance being the distance between a first information amount and a second information amount of the information amounts, and the second distance being the distance between the first information amount and a third information amount of the information amounts; and aggregating the information amounts according to the corrected first distance and second distance, to obtain a structural body.

Description

Method and apparatus for aggregating information
This application claims priority to Chinese Patent Application No. 201210018940.4, entitled "Method and Apparatus for Aggregating Information", filed with the Chinese Patent Office on January 20, 2012, which is incorporated herein by reference in its entirety.
Technical field
The present invention relates to the field of information recognition, and in particular, to a method and apparatus for aggregating information.
Background
Aggregating information means combining different pieces of intrinsically related information into one structure. For example, if a person's name, a phone number, and an email address all belong to the same person, they can be combined into one large information block, forming the structure (person's name, phone number, email address). With information aggregation technology, users can be provided with a one-stop personalized service that draws on multiple information sources.
Aggregating information is an important part of information extraction, and its core is the use of a quantifiable standard. The choice of metric affects the quality of the aggregation and therefore the final result of information extraction. A common method of information aggregation is the location label method. The method first locates the words in the text so that each information amount has a unique location label in the text, uses the location labels to obtain a distance that expresses how near or far two information amounts are from each other, and finally aggregates the information amounts according to this quantified near-far relationship to obtain a structure.
In the process of implementing the present invention, the inventor found that the prior art has at least the following problem: the location label method in the prior art provides a quantified standard between information amounts, but it considers only the positions of the information amounts and the distances between them, and aggregates according to the quantified near-far relationship, that is, the distance. When an information amount has one information amount before it and one after it at equal distances, the location label method offers no rigorous solution. If aggregation is performed at random in that case, aggregating the information amount with the preceding one or with the following one may yield completely different results, so the aggregation result may be inaccurate, and supplying inaccurate information amounts to the subsequent information extraction process affects the accuracy of the entire extraction.
Summary of the invention
Embodiments of the present invention provide a method and apparatus for aggregating information. The technical solution is as follows:
A method for aggregating information, the method comprising:
acquiring text to be aggregated;
acquiring location labels of information amounts in the text;
calculating, according to the location labels, the distance between every two information amounts;
when a first distance and a second distance are equal, correcting the first distance and the second distance according to a grammatical structure, where the first distance is the distance between a first information amount and a second information amount among the information amounts, and the second distance is the distance between the first information amount and a third information amount among the information amounts;
aggregating the information amounts according to the corrected first distance and second distance to obtain a structure.
An apparatus for aggregating information, the apparatus comprising:
a text acquisition module, configured to acquire text to be aggregated;
a location label acquisition module, configured to acquire location labels of information amounts in the text;
a calculation module, configured to calculate, according to the location labels, the distance between every two information amounts;
a correction module, configured to correct a first distance and a second distance according to a grammatical structure when the first distance and the second distance are equal, where the first distance is the distance between a first information amount and a second information amount among the information amounts, and the second distance is the distance between the first information amount and a third information amount among the information amounts;
an aggregation module, configured to aggregate the information amounts according to the corrected first distance and second distance to obtain a structure.
Embodiments of the present invention provide a method and apparatus for aggregating information: location labels of the information amounts in the text are acquired; the distance between every two information amounts is calculated according to the location labels; when a first distance and a second distance are equal, the first distance and the second distance are corrected according to a grammatical structure, where the first distance is the distance between a first information amount and a second information amount among the information amounts, and the second distance is the distance between the first information amount and a third information amount among the information amounts; and the information amounts are aggregated according to the corrected first distance and second distance to obtain a structure. When distances between information amounts turn out to be equal, the embodiments correct the distances according to the grammatical structure and aggregate the information amounts according to the corrected distances. Aggregation based on location labels thus also takes the grammatical structure into account, which improves the accuracy of information aggregation.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for aggregating information according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for aggregating information according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an apparatus for aggregating information according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an apparatus for aggregating information according to an embodiment of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
FIG. 1 is a flowchart of a method for aggregating information according to an embodiment of the present invention. The embodiment may be implemented on terminals such as mobile phones, personal computers, and tablet computers, and may also be applied to a server, for example to automatically aggregate the information a user cares about when monitoring the user's emails or short messages. Referring to FIG. 1, the embodiment specifically includes:
101. Acquire the text to be aggregated.
In this embodiment, the text may be data that includes character strings, punctuation marks, line breaks, and the like.
It should be noted that the text may be text currently received by the terminal, or text that is already stored on the terminal and is specified by the terminal user. This embodiment is described using only the case in which the text is text currently received by the terminal. The text may be the user's email or short message, or may be another file, which is not limited in the embodiments of the present invention.
102. Acquire the location label of each information amount in the text.
An information amount is a character string in the file that has a specific attribute and meaning, for example a person's name, a telephone number, or an email address. Such strings are useful material for information extraction, or are information the user cares about; besides names, telephone numbers, and email addresses, they may also be a meeting subject, a meeting place, meeting content, and so on. In practice, sentence segmentation may be used: the continuous character string in each sentence of the file is first split into words, and it is then determined whether each word is an information amount of interest. For example, categories of information amounts of interest may be predefined, the segmented words may be labeled with categories, and whether a word is an information amount of interest is then determined from its category. Information amounts in the file may also be identified in other ways; for example, vocabularies of interest may be set up and the content of the file filtered against them to find the information amounts of interest.
Of course, there may be still other ways to identify the information amounts in the file, which is not limited in the embodiments of the present invention.
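The category-based identification described above can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be already segmented into words, and the lexicon, the category names, and the digit pattern for telephone numbers are hypothetical and do not come from the embodiment.

```python
import re

# Hypothetical lexicon mapping known words to categories (illustration only).
LEXICON = {"小明": "person_name", "他": "pronoun", "电话": "keyword"}

# Categories treated as "of interest" in this sketch (an assumption).
CATEGORIES_OF_INTEREST = {"person_name", "pronoun", "keyword", "phone_number"}


def categorize(word: str) -> str:
    """Label one segmented word with a category."""
    if word in LEXICON:
        return LEXICON[word]
    # Assumed rule: a run of 7 to 11 digits is treated as a telephone number.
    if re.fullmatch(r"\d{7,11}", word):
        return "phone_number"
    return "other"


def information_amounts(words):
    """Keep only the words whose category is of interest, in text order."""
    tagged = ((word, categorize(word)) for word in words)
    return [(word, cat) for word, cat in tagged if cat in CATEGORIES_OF_INTEREST]


# Pre-segmented words of a sample sentence (segmentation itself is out of scope here).
segmented = ["小明", "今天", "到", "北京", "出差", "他", "的", "电话", "是", "12345678"]
found = information_amounts(segmented)
print(found)
```

A real system would replace the toy lexicon with a proper segmenter and tagger; the sketch only shows the "label, then filter by category" flow.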
The method of this embodiment applies when there are three or more information amounts.
Each information amount has a unique position in the text. In this embodiment the position is identified by a location label. Preferably, the location label contains the natural paragraph position of the information amount in the text, its start position, and its end position, and may take the form (paragraph position, start position, end position). The paragraph position is the natural paragraph in which the information amount is located: the value is 1 if the information amount is in the first paragraph of the text, 2 if it is in the second paragraph, and so on. The maximum number of characters in a paragraph is a constant, denoted max_size, which is usually taken as the largest number of characters contained in any paragraph of the text. For example, if the text has three paragraphs of 100 bytes, 500 bytes, and 1000 bytes, then max_size = max(100, 500, 1000) = 1000. The start position and the end position are where the information amount begins and ends in the text, expressed as coordinates within its paragraph.
For example: "小明今天到北京出差,他的电话是12345678。" (Xiao Ming is travelling to Beijing on business today; his telephone number is 12345678.) Assume that this sentence is in the n-th paragraph of the text. In the GB2312 encoding, each Chinese character occupies two position units (for example, bytes) and each digit occupies one; the start position is 1 and the end position is 23. It should be noted that the start and end positions of an information amount also depend on the encoding used for the paragraph; in ASCII, for example, each English character occupies one byte.
The information amounts and their location labels are then:
小明 (Xiao Ming): (n, 1, 4); 他 (he): (n, 21, 22); 电话 (telephone): (n, 25, 28); 12345678: (n, 31, 38).
103. Calculate, according to the location labels, the distance between every two information amounts.
Specifically, a location label value may first be calculated for each information amount from its location label, and the distance between every two information amounts may then be calculated from the location label values. The location label value is calculated as: location label value = paragraph position × maximum number of characters in a paragraph + (start position + end position) / 2.
The distance between two information amounts is calculated as: distance = |L(x) - L(y)|, where L(x) and L(y) are the location label values of information amount x and information amount y, respectively.
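A minimal sketch of the two formulas above follows. The labels are the ones from the example in step 102; the paragraph number n = 1 and max_size = 1000 are assumed values chosen only for illustration, and exact fractions are used so that the halves do not introduce rounding.

```python
from fractions import Fraction


def label_value(label, max_size):
    """Location label value = paragraph position * max_size + (start + end) / 2."""
    paragraph, start, end = label
    return paragraph * max_size + Fraction(start + end, 2)


def distance(label_x, label_y, max_size):
    """Distance = |L(x) - L(y)|."""
    return abs(label_value(label_x, max_size) - label_value(label_y, max_size))


n, max_size = 1, 1000  # assumed values for the illustration
labels = {
    "小明": (n, 1, 4),
    "他": (n, 21, 22),
    "电话": (n, 25, 28),
    "12345678": (n, 31, 38),
}

print(distance(labels["小明"], labels["他"], max_size))   # 19
print(distance(labels["他"], labels["电话"], max_size))   # 5
```

Because both items sit in the same paragraph, the paragraph term cancels and only the midpoints of the start and end positions matter; for items in different paragraphs, max_size keeps them far apart.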
104. When a first distance and a second distance are equal, correct the first distance and the second distance according to a grammatical structure, where the first distance is the distance between a first information amount and a second information amount among the information amounts, and the second distance is the distance between the first information amount and a third information amount among the information amounts.
The terms first information amount, second information amount, and third information amount merely refer to any three of the acquired information amounts that stand in the positional relationship described in this embodiment. The grammatical structure refers to the lexical attributes, sentence components, and the like of the first, second, and third information amounts.
It can be understood that when the first distance is greater than the second distance, the third information amount is closer to the first information amount and the second information amount is farther from it, so the first and third information amounts are aggregated to obtain the structure. When the first distance is less than the second distance, the third information amount is farther from the first information amount and the second information amount is closer to it, so the first and second information amounts are aggregated to obtain the structure.
When, for the first information amount, the first distance and the second distance are equal, the two distances are corrected according to the already determined lexical attributes or sentence components of the first, second, and third information amounts. This avoids the inaccurate aggregation that would otherwise result from the two distances being equal.
105. Aggregate the information amounts according to the corrected first distance and second distance to obtain the aggregated structure.
The aggregation itself proceeds as in the prior art. Aggregation here means classifying and organizing the information amounts so that, in the subsequent information extraction, what is fed back to the user is classified and organized information rather than disordered information.
A structure is the general term for the result of aggregating information amounts. A large number of information amounts need to be classified and organized, and structures arranged or combined according to preset rules are returned.
In practice, after the terminal device obtains the structure, it may save the structure to a corresponding file and/or present it directly to the terminal user or server user for selection or other operations.
With the method provided in this embodiment, when distances between information amounts turn out to be equal, the distances are corrected according to the grammatical structure and the information amounts are aggregated according to the corrected distances. Aggregation based on location labels thus also takes the grammatical structure into account, which improves the accuracy of information aggregation and the performance of subsequent information extraction.
FIG. 2 is a flowchart of a method for aggregating information according to an embodiment of the present invention. Referring to FIG. 2, the embodiment specifically includes:
201. Acquire the text to be aggregated;
The text in step 201 is the same as in step 101, and details are not repeated here.
Specifically, after the text is received, the characters in it are recognized against a stored dictionary, so that the terminal can learn the characters in the text, assemble them into words or sentences, and carry out the subsequent process on the recognized words or sentences.
202. Acquire three or more information amounts according to preset keywords;
In this embodiment, the terminal acquires three or more information amounts from the text according to preset keywords. The information amounts may be words, and may also be numbers, letters, and the like.
It can be understood that this embodiment is described for the case in which three or more information amounts are acquired. In other embodiments, when only one information amount is acquired, no aggregation is needed and that information amount may be used as the structure; when two are acquired, they may be aggregated according to existing aggregation principles to obtain the structure.
It should be noted that acquisition of the information amounts may be triggered in, but is not limited to, the following cases:
(1) The terminal performs information extraction on received text. When text is received, the information amounts in it are acquired and aggregated, and the aggregated structure may be saved to a corresponding file and/or presented directly to the terminal user or server user for selection or other operations.
(2) The terminal performs information extraction on locally stored text at regular intervals. At each interval, the information amounts in the text are acquired and aggregated, and the aggregated structure may be saved to a corresponding file and/or presented directly to the terminal user or server user for selection or other operations.
203. Acquire the lexical attributes of the three or more information amounts in the text, and acquire the sentence components of the three or more information amounts according to the acquired attributes;
Lexical attributes are parts of speech such as noun, adjective, verb, and adverb, and sentence components are roles such as subject, predicate, and object. Taking Chinese grammar as an example, an information amount whose lexical attribute is noun can generally serve as a subject or object, and one whose lexical attribute is verb can serve as a predicate. In this embodiment, the information amounts in the text are analyzed against the lexical attributes defined in a Chinese grammar library to obtain the lexical attribute of each information amount, and the sentence component of each information amount is then obtained from its lexical attribute and from how the grammar library classifies or defines that attribute.
204. Acquire the location label of each information amount in the text;
Step 204 is the same as step 102, and details are not repeated here.
205. Calculate, according to the acquired location labels, the distance between every two information amounts;
A location label gives the coordinates of an information amount in the text, and the location label value of the information amount can be obtained from it. Continuing the example in step 102, the location label values of the information amounts are:
L(小明) = n × max_size + 5/2
L(他) = n × max_size + 43/2
L(电话) = n × max_size + 53/2
L(12345678) = n × max_size + 69/2
The distances between these information amounts are therefore:
d(小明, 他) = 19
d(电话, 12345678) = 8
d(他, 电话) = 5
206. When a first distance and a second distance are equal, correct the first distance and the second distance according to a grammatical structure, where the first distance is the distance between a first information amount and a second information amount among the information amounts, and the second distance is the distance between the first information amount and a third information amount among the information amounts;
For a text, equality of the first distance and the second distance can be understood as the second and third information amounts being located immediately before and after the first information amount, respectively.
The grammatical structure refers to the lexical attributes, sentence components, and the like of the first, second, and third information amounts.
When the first distance and the second distance are equal, the tightness between the first and second information amounts and the tightness between the first and third information amounts are obtained according to the grammatical structure and the sentence components or lexical attributes of the first, second, and third information amounts, and the first distance and the second distance are corrected according to the obtained tightness values.
Steps 203 to 206 of this embodiment are described for the case in which the lexical attributes are acquired first and the sentence components are then acquired from them. Optionally, in another embodiment, step 203 may be replaced with: acquire the lexical attributes of the information amounts in the text. Correspondingly, step 206 is replaced with: when the first distance and the second distance are equal, correct the first distance and the second distance according to the grammatical structure and the lexical attributes of the first, second, and third information amounts. Specifically, when the first distance and the second distance are equal, the tightness between the first and second information amounts and the tightness between the first and third information amounts are obtained according to the grammatical structure and the lexical attributes of the first, second, and third information amounts, and the first distance and the second distance are corrected according to the obtained tightness values.
The terminal may store in advance a correspondence between sentence components, lexical attributes, and tightness, and look up the tightness for an information amount from this correspondence according to its sentence component or lexical attribute. The tightness may be set with reference to the grammar of the language: different sentence components correspond to different tightness values, and different lexical attributes correspond to different tightness values. The specific values may be set by a technician and are not specifically limited in this embodiment.
The tightness corresponding to the already determined sentence component or lexical attribute of each information amount is obtained, and the distances between the information amounts are corrected according to that tightness. The correction may proceed as follows. When the tightness between the first and second information amounts is greater than the tightness between the first and third information amounts, a disturbance value is subtracted from the first distance and/or added to the second distance, so that the corrected first distance and second distance are no longer equal, and the information is aggregated according to the corrected distances. When the tightness between the first and second information amounts is less than the tightness between the first and third information amounts, a disturbance value is added to the first distance and/or subtracted from the second distance, so that the corrected first distance and second distance are no longer equal, and the information is aggregated according to the corrected distances. The disturbance value may be adjusted for different grammatical components; choosing a suitable disturbance value ensures that the distances between information amounts remain unique. It should be noted that the difference in tightness may also be expressed in other ways, for example by multiplying or dividing by a disturbance coefficient, as long as the corrected first distance and second distance are no longer equal and the difference in tightness is reflected. Correcting the distances according to the grammatical structure brings the "before or after" and "near or far" relationships between information amounts into the quantitative measure: the distance between information amounts is redefined by adding or subtracting a disturbance value.
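The additive correction described above can be sketched as a small function. The name `correct_tie`, the default disturbance value of 0.25, and the numeric tightness arguments are illustrative assumptions. Leaving the distances unchanged when the two tightness values are themselves equal is also an assumption, since the embodiment does not address that case.

```python
def correct_tie(d1, d2, tightness_12, tightness_13, disturbance=0.25):
    """Return the corrected (first distance, second distance).

    d1: distance between the first and second information amounts.
    d2: distance between the first and third information amounts.
    The correction is applied only when d1 == d2.
    """
    if d1 != d2 or tightness_12 == tightness_13:
        # No tie to break, or no tightness difference to express (assumed behaviour).
        return d1, d2
    if tightness_12 > tightness_13:
        # The second information amount is bound more tightly: pull it closer.
        return d1 - disturbance, d2 + disturbance
    # The third information amount is bound more tightly: pull it closer.
    return d1 + disturbance, d2 - disturbance


print(correct_tie(3, 3, tightness_12=2, tightness_13=1))  # (2.75, 3.25)
```

This variant applies the disturbance to both distances; the text equally allows changing only one of them, or using a multiplicative coefficient instead.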
207. Aggregate the three or more information amounts according to the corrected first distance and second distance to obtain the aggregated structure.
Step 207 is the same as step 105, and details are not repeated here.
Optionally, after step 207 the method further includes:
When a request to extract information amounts is received, the terminal returns the aggregated information.
Aggregating the information, and returning the aggregated information when a request to extract information amounts or a request based on preset keywords is received, improves the accuracy and efficiency of information extraction.
With the method provided in this embodiment, when distances between information amounts turn out to be equal, the distances are corrected according to the grammatical structure and the information amounts are aggregated according to the corrected distances. Aggregation based on location labels thus also takes the grammatical structure into account, which improves the accuracy of information aggregation and the performance of subsequent information extraction.
An example based on the embodiments of the present invention is as follows. The text to be aggregated is: "上海自来水来自海上" (Shanghai tap water comes from the sea).
The information amounts acquired from this text are: 上海 (Shanghai), 自来 ("self-coming"), 水 (water), 来自 (comes from), 海上 (the sea).
Only the aggregation of the information amount "水" is described.
The information amounts immediately before and after "水" are "自来" and "来自", respectively. The distance between "水" and "自来" equals the distance between "水" and "来自", so distance alone cannot determine which information amount "水" should be aggregated with.
For "水", "自来" is a modifier and "来自" is a verb. Determining tightness from the lexical attributes shows that a modifier is bound more tightly to a noun than a verb is, so the tightness of "自来" with "水" is higher than that of "来自" with "水". The corrected distance between "水" and "自来" is therefore the original distance minus a positive disturbance value, and the corrected distance between "水" and "来自" is the original distance plus a positive disturbance value. A suitable disturbance value, such as 0.25, is chosen so that the corrected distances from this information amount to the preceding and following information amounts differ. The corrected distance describes how tightly the information amounts are bound. With a disturbance value of 0.25, the corrected distances in this example are: d(自来, 水) = 3 - 0.25 = 2.75; d(水, 来自) = 3 + 0.25 = 3.25. The corrected distances determine the order of aggregation: "水" should be aggregated with "自来". The result of the aggregation is: 上海 (Shanghai), 自来水 (tap water), 来自 (comes from), 海上 (the sea).
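The numbers in this example can be checked end to end with the sketch below. It assumes GB2312 byte positions counted from 1 and a single paragraph (so the paragraph term of the location label value cancels and is omitted). The tightness table and the attribute assignments are hypothetical values chosen to mirror the reasoning in the text.

```python
def label_values(words, encoding="gb2312"):
    """Map each word to (start + end) / 2 using 1-based byte positions."""
    values, position = {}, 1
    for word in words:
        end = position + len(word.encode(encoding)) - 1
        values[word] = (position + end) / 2
        position = end + 1
    return values


words = ["上海", "自来", "水", "来自", "海上"]  # segmentation given in the example
values = label_values(words)

# Hypothetical tightness of a neighbour with a noun, by lexical attribute.
TIGHTNESS_WITH_NOUN = {"modifier": 2, "verb": 1}
attributes = {"自来": "modifier", "来自": "verb"}

target, before, after = "水", "自来", "来自"
d_before = abs(values[target] - values[before])  # 3.0
d_after = abs(values[target] - values[after])    # 3.0

disturbance = 0.25
if d_before == d_after:
    t_before = TIGHTNESS_WITH_NOUN[attributes[before]]
    t_after = TIGHTNESS_WITH_NOUN[attributes[after]]
    if t_before > t_after:
        d_before, d_after = d_before - disturbance, d_after + disturbance
    elif t_before < t_after:
        d_before, d_after = d_before + disturbance, d_after - disturbance

partner = before if d_before < d_after else after
print(d_before, d_after, partner + target)  # 2.75 3.25 自来水
```

The midpoints are 6.5 for "自来", 9.5 for "水", and 12.5 for "来自", which is where the original distance of 3 on both sides comes from.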
FIG. 3 is a schematic structural diagram of an apparatus for aggregating information according to an embodiment of the present invention. Referring to FIG. 3, the apparatus includes:
a text acquisition module 301, configured to acquire the text to be aggregated;
a position label acquisition module 302, configured to acquire the position labels of the information amounts in the text;
a calculation module 303, configured to calculate the distance between every two information amounts according to the position labels;
a correction module 304, configured to correct a first distance and a second distance according to the grammatical structure when the first distance and the second distance are equal, where the first distance is the distance between a first information amount and a second information amount among the information amounts, and the second distance is the distance between the first information amount and a third information amount among the information amounts; and
an aggregation module 305, configured to aggregate the information amounts according to the corrected first distance and second distance to obtain a structure.
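The final aggregation step can be illustrated with a small Python sketch. It is an assumption-laden illustration rather than the apparatus itself: the function name `merge_with_nearest` is hypothetical, and string concatenation stands in for building a structure. It takes the already corrected distances as input and merges an information amount with whichever neighbour is nearer.

```python
from typing import List


def merge_with_nearest(items: List[str], index: int,
                       d_prev: float, d_next: float) -> List[str]:
    """Merge items[index] with whichever neighbour is nearer.

    d_prev and d_next are the corrected distances to the previous and next
    items. They must be unequal; a remaining tie is reported as an error.
    """
    if d_prev == d_next:
        raise ValueError("distances are still tied; correct them first")
    if not 0 < index < len(items) - 1:
        raise ValueError("index must have both a previous and a next item")
    if d_prev < d_next:
        merged = items[index - 1] + items[index]
        return items[:index - 1] + [merged] + items[index + 1:]
    merged = items[index] + items[index + 1]
    return items[:index] + [merged] + items[index + 2:]


# Example from the description, using the corrected distances 2.75 and 3.25.
items = ["上海", "自来", "水", "来自", "海上"]
print(merge_with_nearest(items, 2, 2.75, 3.25))
# ['上海', '自来水', '来自', '海上']
```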
Optionally, referring to FIG. 4, the apparatus further includes:
a lexical recognition module 306, configured to acquire the lexical attributes of the information amounts in the text;
correspondingly, the correction module 304 is further configured to, when the first distance and the second distance are equal, correct the first distance and the second distance according to the grammatical structure and the lexical attributes of the first information amount, the second information amount, and the third information amount;
or,
the lexical recognition module 306 is configured to acquire the lexical attributes of the information amounts in the text and to determine the sentence components of the information amounts according to the acquired attributes;
correspondingly, the correction module 304 is further configured to, when the first distance and the second distance are equal, correct the first distance and the second distance according to the grammatical structure and the sentence components of the first information amount, the second information amount, and the third information amount.
The correction module 304 is specifically configured to, when the first distance and the second distance are equal, obtain, according to the grammatical structure and the lexical attributes of the first information amount, the second information amount, and the third information amount, the closeness between the first information amount and the second information amount and the closeness between the first information amount and the third information amount, and to correct the first distance and the second distance according to the obtained closeness.
The correction module 304 is further configured to, when the first distance and the second distance are equal, obtain, according to the grammatical structure and the sentence components of the first information amount, the second information amount, and the third information amount, the closeness between the first information amount and the second information amount and the closeness between the first information amount and the third information amount, and to correct the first distance and the second distance according to the obtained closeness.
Preferably, the position label specifically includes: the natural paragraph position, the start position, and the end position of the information amount in the text.
The formula used by the calculation module 303 to calculate the distance between every two information amounts is: distance = |L(x) - L(y)|, where L(x) and L(y) are the position label value of information amount x and the position label value of information amount y, respectively.
The position label value is calculated as: position label value = paragraph position × maximum number of characters in a paragraph + (start position + end position) / 2, where the paragraph position is the natural paragraph position of the information amount in the text.
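The two formulas above translate directly into code. The following Python sketch is for illustration only: the function names and the sample positions are assumptions, and the maximum paragraph length of 1000 characters is an arbitrary value chosen for the example.

```python
def position_value(paragraph: int, start: int, end: int,
                   max_paragraph_chars: int) -> float:
    """Position label value: paragraph offset plus the midpoint of the span."""
    return paragraph * max_paragraph_chars + (start + end) / 2


def distance(value_x: float, value_y: float) -> float:
    """Distance between two information amounts: |L(x) - L(y)|."""
    return abs(value_x - value_y)


# Illustrative values only (not taken from the source).
MAX_CHARS = 1000
a = position_value(0, 10, 14, MAX_CHARS)  # 12.0, first paragraph
b = position_value(0, 20, 30, MAX_CHARS)  # 25.0, first paragraph
c = position_value(1, 0, 4, MAX_CHARS)    # 1002.0, second paragraph

print(distance(a, b))  # 13.0
print(distance(a, c))  # 990.0
```

As long as start and end positions stay below the maximum paragraph length, the paragraph term places the values of each paragraph in its own non-overlapping range, so a single number can encode both the paragraph and the position within it.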
With the apparatus provided in this embodiment, when the distances between information amounts are equal, the distances are corrected according to the grammatical structure, and the information amounts are aggregated according to the corrected distances. Aggregation based on position labels thereby also takes the grammatical structure into account, which improves the accuracy of information aggregation and the performance of subsequent information extraction.
A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A method for aggregating information, wherein the method comprises:
acquiring text to be aggregated;
acquiring position labels of information amounts in the text;
calculating the distance between every two information amounts according to the position labels;
when a first distance and a second distance are equal, correcting the first distance and the second distance according to a grammatical structure, wherein the first distance is the distance between a first information amount and a second information amount among the information amounts, and the second distance is the distance between the first information amount and a third information amount among the information amounts; and
aggregating the information amounts according to the corrected first distance and second distance to obtain a structure.
2. The method according to claim 1, wherein before the correcting of the first distance and the second distance according to the grammatical structure when the first distance and the second distance are equal, the method further comprises:
acquiring lexical attributes of the information amounts in the text;
and correspondingly, the correcting of the first distance and the second distance according to the grammatical structure when the first distance and the second distance are equal comprises: when the first distance and the second distance are equal, correcting the first distance and the second distance according to the grammatical structure and the lexical attributes of the first information amount, the second information amount, and the third information amount.
3. The method according to claim 1, wherein before the correcting of the first distance and the second distance according to the grammatical structure when the first distance and the second distance are equal, the method further comprises: acquiring lexical attributes of the information amounts in the text, and determining sentence components of the information amounts according to the acquired lexical attributes;
and correspondingly, the correcting of the first distance and the second distance according to the grammatical structure when the first distance and the second distance are equal comprises: when the first distance and the second distance are equal, correcting the first distance and the second distance according to the grammatical structure and the sentence components of the first information amount, the second information amount, and the third information amount.
4. The method according to claim 2, wherein the correcting of the first distance and the second distance according to the grammatical structure and the lexical attributes of the first information amount, the second information amount, and the third information amount when the first distance and the second distance are equal specifically comprises: when the first distance and the second distance are equal, obtaining, according to the grammatical structure and the lexical attributes of the first information amount, the second information amount, and the third information amount, the closeness between the first information amount and the second information amount and the closeness between the first information amount and the third information amount, and correcting the first distance and the second distance according to the obtained closeness.
5. The method according to claim 3, wherein the correcting of the first distance and the second distance according to the grammatical structure and the sentence components of the first information amount, the second information amount, and the third information amount when the first distance and the second distance are equal specifically comprises: when the first distance and the second distance are equal, obtaining, according to the grammatical structure and the sentence components of the first information amount, the second information amount, and the third information amount, the closeness between the first information amount and the second information amount and the closeness between the first information amount and the third information amount, and correcting the first distance and the second distance according to the obtained closeness.
6. The method according to any one of claims 1 to 5, wherein the position label specifically includes: the natural paragraph position, the start position, and the end position of the information amount in the text.
7. The method according to any one of claims 1 to 6, wherein the formula for calculating the distance between every two information amounts is: distance = |L(x) - L(y)|, where L(x) and L(y) are the position label value of information amount x and the position label value of information amount y, respectively;
and the position label value is calculated as: position label value = paragraph position × maximum number of characters in a paragraph + (start position + end position) / 2, where the paragraph position is the natural paragraph position of the information amount in the text.
8. An apparatus for aggregating information, wherein the apparatus comprises:
a text acquisition module, configured to acquire text to be aggregated;
a position label acquisition module, configured to acquire position labels of information amounts in the text;
a calculation module, configured to calculate the distance between every two information amounts according to the position labels;
a correction module, configured to correct a first distance and a second distance according to a grammatical structure when the first distance and the second distance are equal, wherein the first distance is the distance between a first information amount and a second information amount among the information amounts, and the second distance is the distance between the first information amount and a third information amount among the information amounts; and
an aggregation module, configured to aggregate the information amounts according to the corrected first distance and second distance to obtain a structure.
9. The apparatus according to claim 8, wherein the apparatus further comprises: a lexical recognition module, configured to acquire lexical attributes of the information amounts in the text;
and correspondingly, the correction module is further configured to, when the first distance and the second distance are equal, correct the first distance and the second distance according to the grammatical structure and the lexical attributes of the first information amount, the second information amount, and the third information amount.
10. The apparatus according to claim 8, wherein the apparatus further comprises:
a lexical recognition module, configured to acquire lexical attributes of the information amounts in the text and to determine sentence components of the information amounts according to the acquired attributes;
and correspondingly, the correction module is further configured to, when the first distance and the second distance are equal, correct the first distance and the second distance according to the grammatical structure and the sentence components of the first information amount, the second information amount, and the third information amount.
11. The apparatus according to claim 9, wherein the correction module is specifically configured to, when the first distance and the second distance are equal, obtain, according to the grammatical structure and the lexical attributes of the first information amount, the second information amount, and the third information amount, the closeness between the first information amount and the second information amount and the closeness between the first information amount and the third information amount, and to correct the first distance and the second distance according to the obtained closeness.
12. The apparatus according to claim 10, wherein the correction module is further configured to, when the first distance and the second distance are equal, obtain, according to the grammatical structure and the sentence components of the first information amount, the second information amount, and the third information amount, the closeness between the first information amount and the second information amount and the closeness between the first information amount and the third information amount, and to correct the first distance and the second distance according to the obtained closeness.
13. The apparatus according to any one of claims 8 to 12, wherein the position label specifically includes: the natural paragraph position, the start position, and the end position of the information amount in the text.
14. The apparatus according to any one of claims 8 to 13, wherein the formula used by the calculation module to calculate the distance between every two information amounts is: distance = |L(x) - L(y)|, where L(x) and L(y) are the position label value of information amount x and the position label value of information amount y, respectively;
and the position label value is calculated as: position label value = paragraph position × maximum number of characters in a paragraph + (start position + end position) / 2, where the paragraph position is the natural paragraph position of the information amount in the text.
PCT/CN2013/070146 2012-01-20 2013-01-07 Method and apparatus for aggregating information WO2013107308A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210018940.4A CN103218372B (en) 2012-01-20 2012-01-20 Method and device for aggregating information
CN201210018940.4 2012-01-20

Publications (1)

Publication Number Publication Date
WO2013107308A1 (en) 2013-07-25

Family

ID=48798617

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/070146 WO2013107308A1 (en) 2012-01-20 2013-01-07 Method and apparatus for aggregating information

Country Status (2)

Country Link
CN (1) CN103218372B (en)
WO (1) WO2013107308A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101110081A (en) * 2007-08-21 2008-01-23 北京大学 Method for extracting entity address message in text context
CN101178708A (en) * 2006-11-07 2008-05-14 北京酷讯科技有限公司 Automatic moulding plate information locating method for structured web page
CN102081660A (en) * 2011-01-13 2011-06-01 西北工业大学 Method for searching and sequencing keywords of XML documents based on semantic correlation

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080195646A1 (en) * 2007-02-12 2008-08-14 Microsoft Corporation Self-describing web data storage model
CN101599071B (en) * 2009-07-10 2012-04-18 华中科技大学 Automatic extraction method of conversation text topic
CN101963974A (en) * 2010-09-03 2011-02-02 深圳创维数字技术股份有限公司 EPG column generating method

Also Published As

Publication number Publication date
CN103218372A (en) 2013-07-24
CN103218372B (en) 2017-04-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13738388

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13738388

Country of ref document: EP

Kind code of ref document: A1