WO2006136055A1 - Text data mining method - Google Patents

Text data mining method

Info

Publication number
WO2006136055A1
Authority
WO
WIPO (PCT)
Prior art keywords
template
data
variable
text data
regular expression
Prior art date
Application number
PCT/CN2005/000894
Other languages
English (en)
Chinese (zh)
Inventor
Jin Li
Xiaojin Li
Zhaoming Deng
Wenbin Tang
Meipeng Guo
Mei Xiang
Original Assignee
Zte Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zte Corporation
Priority to CN2005800493417A (patent CN101151843B, zh)
Priority to PCT/CN2005/000894 (publication WO2006136055A1, fr)
Publication of WO2006136055A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures

Definitions

  • The present invention relates to data analysis and processing techniques, and in particular to a text data mining method.
  • The technical problem to be solved by the present invention is to provide a text data mining method in which text data in different formats can be analyzed by modifying only the template file, without developing new code or relying on expensive commercial data mining tools.
  • To solve this problem, the present invention provides the following solution:
  • A text data mining method includes the following steps:
  • The extracted original information is parsed into data values with a specified data name and data type.
  • The pre-made template file is generated according to the structure of the text data to be mined and the variable rules of the template language.
  • The template variable rule includes a variable name attribute and a variable type attribute.
  • Each template variable rule corresponds to one data item to be extracted from the text data to be mined.
  • The template file is compiled into a template object composed of regular expressions by a template compiler.
  • Scanning the template file further comprises filtering out the annotation information in it and masking the parts of the template file that are not template variable rules.
  • Masking the non-template-variable-rule parts of the template file means quoting those parts with the quotation mechanism of the regular expression syntax, so that they are matched literally.
  • Extracting the original information from the text data further includes storing the extracted original information sequentially in a temporary storage area.
  • The extracted original information is parsed into data values with the specified data name and data type according to the attributes of the template variable rule.
  • With the method of the invention, text data in different formats requires no code changes: only the template file needs to be modified, according to the template definition language, to adapt to a different data format, which greatly reduces the time spent on data analysis. The data information is mined with a regular-expression data matching algorithm, which is much more efficient than the traditional method. Moreover, converting the data values into the specified format makes subsequent processing easier.
  • The method of the present invention is suitable for concurrent data mining, making full use of the processing capability of the computer. It can also be applied quickly to systems built with different development tools, is simple to implement, and is inexpensive.
  • FIG. 1 is a schematic flowchart of a text data mining method according to the present invention.
  • FIG. 2 is a schematic diagram of pre-production of a template file according to the present invention.
  • FIG. 3 is a schematic flowchart of the compiler process according to the present invention.
  • FIG. 4 is a schematic diagram of compiling the template file.
  • FIG. 5 is a schematic diagram of text data mining.
  • FIG. 6 and FIG. 7 are schematic flowcharts of an embodiment of a text data mining method according to the present invention.
  • As shown in FIG. 1, a schematic flowchart of the text data mining method according to the present invention, a pre-made template file containing at least one template variable rule is first read (step 101). The template variable rule may include two attributes: the name of the variable and the type of the variable.
  • Each template variable rule corresponds to one data item to be extracted from the text data to be mined.
  • The template file is then compiled, according to the template variable rules, into a template object composed of regular expressions (step 102); the compilation is performed by a template compiler.
  • The text data to be mined is scanned according to the template object and data matching is performed on it (step 103); the matched original information in the text data is then extracted sequentially according to the regular expression (step 104).
  • The extracted original information may be stored sequentially in a temporary storage area.
  • Finally, the extracted original information is parsed into data values with the specified data name and data type according to the template variable rule (step 105); the parsing follows the variable names and variable types in the template variable rules.
  • The pre-made template file used in the present invention is not limited to any one template language. In other words, different template languages can be defined, and the template file written, according to the type of text data to be mined.
  • The template file generated in advance is then used to mine the text data.
  • An example of a template file prepared in advance is given below.
  • Templates can support annotations (comments) to make template files easier to maintain.
  • Annotations are explanatory text that is ignored during template compilation and use, but they are indispensable for the readability of the template.
  • Comment format comment content ⁇ "
  • The comment is similar to a multi-line comment in the Java language: it runs from the opening delimiter to the first " ⁇ " encountered, and everything in between is comment content.
  • The template variable rule requires at least two attributes: the variable name and the type of the variable.
  • The template variable rule format can be: "${VAR[;VAR_TYPE]}"
  • The variable name "VAR" follows rules similar to those for variable names in a computer language: it must begin with a letter or an underscore and consist of letters, digits, and underscores.
  • The variable type "VAR_TYPE" is an enumerated value, which can be S, N, D, A, and so on, corresponding to string, number, date, list, and so on.
  • Example: "${USERNAME;S}" represents a template variable rule with a variable named "USERNAME" and a data type of string.
  • One or more template variable rules can be defined in a template file. If no variable type is specified, the variable defaults to the string type. The template automatically converts the raw data information from the text data into data values of the specified type.
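  • The rule format described above can be illustrated with a minimal Python sketch. It assumes the rule delimiters are "${" and "}" (the delimiters are damaged in this copy of the text); the names `RULE`, `VariableRule` and `parse_rules` are hypothetical and do not come from the patent.

```python
import re
from typing import NamedTuple

# Assumption: a rule is written ${VAR[;VAR_TYPE]}. The name must start with a
# letter or underscore; the optional type is a single upper-case code such as
# S (string), N (number), D (date) or A (list).
RULE = re.compile(r"\$\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*(?:;\s*([A-Z])\s*)?\}")


class VariableRule(NamedTuple):
    name: str       # variable name attribute
    type_code: str  # variable type attribute


def parse_rules(template: str) -> list[VariableRule]:
    """Return every template variable rule found in the template, in order."""
    rules = []
    for match in RULE.finditer(template):
        # A rule with no explicit type defaults to the string type "S".
        rules.append(VariableRule(match.group(1), match.group(2) or "S"))
    return rules


if __name__ == "__main__":
    for rule in parse_rules("User: ${USERNAME; S} Age: ${AGE;N} Note: ${NOTE}"):
        print(rule.name, rule.type_code)
```

  • Whitespace around the separator is tolerated here because the examples in the text are written both with and without it; a stricter grammar would be an equally reasonable choice.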
  • The text data in this example is a real alarm message sent by a certain type of telecommunication device to the network management system.
  • The goal is to extract the alarm sequence number, alarm location, and so on from this text data.
  • Each template variable in the template file corresponds to one piece of data information to be extracted.
  • The template variable rules for the alarm sequence number and alarm location are as follows. Alarm sequence number: ${ALARMID;S}
  • The variable name of the alarm sequence number above is "ALARMID" and its variable type is string.
  • The alarm location is as follows:
  • Chassis ${Shelf;N}
  • The alarm location is composed of three template variables, namely "Rack", "Shelf", and "Slot", all of numeric type.
  • FIG. 3 is a schematic flowchart of the compiler process according to the present invention.
  • First, the template file is scanned and the template variable rules in it are recorded (step 201). During the scan, annotation information is filtered out, and the parts that are not template variable rules are quoted using the quotation mechanism of the regular expression syntax, which masks them.
  • Next, each template variable rule in the template file is replaced with a regular expression (step 202). Finally, the generated regular expression is compiled into a regular expression object (step 203).
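  • Steps 201 to 203 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's compiler: the rule delimiters "${" and "}" are assumed, the comment delimiters "{*" and "*}" are hypothetical placeholders (the real ones are lost in this copy), the per-type capture patterns in `CAPTURE` are invented, and Python's `re.escape` stands in for the regular-expression quotation that the text describes.

```python
import re

# Assumed rule syntax: ${VAR[;VAR_TYPE]}
RULE = re.compile(r"\$\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*(?:;\s*([A-Z])\s*)?\}")

# Hypothetical capture pattern for each type code; the source does not give them.
CAPTURE = {
    "S": r".+?",                # string: shortest run of characters
    "N": r"-?\d+(?:\.\d+)?",    # number: integer or decimal
    "D": r"[\d\-/:. ]+?",       # date: digits and common separators
    "A": r".+?",                # list: raw text, split later
}


def compile_template(template, comment_open="{*", comment_close="*}"):
    """Compile template text into (regex object, {variable name: type code})."""
    # Step 201: filter out comments (the delimiters are placeholders).
    comment = re.compile(
        re.escape(comment_open) + r".*?" + re.escape(comment_close), re.DOTALL
    )
    template = comment.sub("", template)

    parts, types, pos = [], {}, 0
    for match in RULE.finditer(template):
        name, code = match.group(1), match.group(2) or "S"
        # Quote the literal text before the rule so it is matched verbatim.
        parts.append(re.escape(template[pos:match.start()]))
        # Step 202: replace the rule with a named capturing group.
        parts.append(f"(?P<{name}>{CAPTURE.get(code, CAPTURE['S'])})")
        types[name] = code
        pos = match.end()
    parts.append(re.escape(template[pos:]))

    # Step 203: compile the generated expression into a regex object.
    return re.compile("".join(parts)), types
```

  • The lazy string pattern relies on literal text following the variable; a template that ends in a string variable would need an anchored or greedy pattern instead.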
  • FIG. 4 is a schematic diagram of compiling the template file.
  • The purpose of compiling a template file is to scan a template file written in the template language and compile it into a regular expression.
  • FIG. 4 shows an implementation based on the regular expression engine of the Java language. Other applications can use whatever regular expression engine is available in their development language.
  • FIG. 5 is a schematic diagram of text data mining: it shows how data information in the text data is extracted and mined using a template object.
  • First, the text data is scanned, and the original information in the text data is extracted through the regular expression object in the template.
  • The raw data information is then converted into data values of the specified type according to the template variable rule definitions in the template.
  • The text data mining results are shown in FIG. 5.
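  • The two mining stages, extraction of raw strings followed by type conversion, can be sketched in Python as below. The template object is represented here as a compiled regular expression with named groups plus a name-to-type map; that representation, the date format, the comma as list separator, and the function names `convert` and `mine` are all assumptions made for illustration.

```python
import re
from datetime import datetime


def convert(raw, code, date_format="%Y-%m-%d %H:%M:%S"):
    """Convert one raw string to the type named by its type code."""
    if code == "N":
        return float(raw) if "." in raw else int(raw)
    if code == "D":
        # The date format is an assumption; real alarm formats will differ.
        return datetime.strptime(raw.strip(), date_format)
    if code == "A":
        # Assumption: list items are comma-separated.
        return [item.strip() for item in raw.split(",")]
    return raw  # "S" and unknown codes stay strings


def mine(pattern, types, text):
    """Scan text, extract raw matches in order, and convert them by type."""
    records = []
    for match in pattern.finditer(text):
        raw = match.groupdict()  # raw strings, held before conversion
        records.append(
            {name: convert(value, types.get(name, "S")) for name, value in raw.items()}
        )
    return records


if __name__ == "__main__":
    # Invented sample data, not the alarm message from the patent figures.
    pattern = re.compile(r"Alarm (?P<ALARMID>\S+) rack (?P<Rack>\d+)")
    print(mine(pattern, {"ALARMID": "S", "Rack": "N"}, "Alarm A-1 rack 3"))
```

  • Keeping the raw strings separate from the converted values mirrors the temporary storage area mentioned in the text and makes conversion errors easier to trace to the offending match.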
  • the data mining process supports multi-threaded concurrent operations, which improves the utilization of computer resources.
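  • One way to picture the concurrent operation is a thread pool in which every worker mines a different piece of text with the same compiled pattern. The sketch below is only an assumption about how such concurrency might be arranged; the pattern, the sample field names and the functions `mine_one` and `mine_concurrently` are hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# A compiled pattern shared read-only by all worker threads.
PATTERN = re.compile(r"rack (?P<Rack>\d+) shelf (?P<Shelf>\d+) slot (?P<Slot>\d+)")


def mine_one(text):
    """Extract every alarm location in one text as a dict of integers."""
    return [
        {name: int(value) for name, value in match.groupdict().items()}
        for match in PATTERN.finditer(text)
    ]


def mine_concurrently(texts, workers=4):
    """Mine several texts in parallel threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(mine_one, texts))


if __name__ == "__main__":
    print(mine_concurrently(["rack 1 shelf 2 slot 3", "rack 4 shelf 5 slot 6"]))
```

  • In CPython the global interpreter lock limits the speed-up threads can give to CPU-bound matching, so a process pool may be the better fit there; in a Java implementation such as the one the text mentions, threads can run the matching in parallel.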
  • FIG. 6 and FIG. 7 are schematic flowcharts of an embodiment of a text data mining method according to the present invention.
  • First, a template file generated according to the structure of the text data to be mined and the template language variable rules is read (step 301). The template file is then scanned: template annotation information is filtered out, the template variable rules are recorded, and the non-template-variable-rule parts are quoted using the quotation mechanism of the regular expression syntax so that they are masked (step 302). The template variable rule parts of the template file are replaced with regular expressions (step 303), and the generated regular expression is compiled into a regular expression object (step 304). The text data to be mined is then scanned according to the template object and data matching is performed on it (step 305); the matched original information in the text data is extracted sequentially according to the regular expression (step 306); and the extracted original information is parsed into data values with the specified data name and data type according to the template variable rule (step 307).
  • The text data mining method according to the present invention is not limited to the applications listed in the specification and the embodiments. It can be applied to various fields suited to the invention, and other advantages and modifications can readily be made by those skilled in the art.
  • Therefore, without departing from the scope of the claims and their equivalents, the present invention is not limited to the specific details, representative devices, and illustrated examples shown and described herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a text data mining method comprising: reading a pre-made template file containing at least one template variable rule; compiling the file into template objects composed of regular expressions according to the template variable rule; scanning the text data to be mined and performing data matching according to the template objects; sequentially extracting the matched original information from the text data according to the regular expression; and parsing the extracted original information into a data value with the specified data name and data type according to the template variable rule. With the present invention, text data of different types can be analyzed simply by modifying the template file, without developing program code or using expensive commercial data mining tools. The complexity and cost of the communication network management system are thereby reduced.
PCT/CN2005/000894 2005-06-22 2005-06-22 Text data mining method WO2006136055A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN2005800493417A CN101151843B (zh) 2005-06-22 2005-06-22 Text data mining method
PCT/CN2005/000894 WO2006136055A1 (fr) 2005-06-22 2005-06-22 Text data mining method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2005/000894 WO2006136055A1 (fr) 2005-06-22 2005-06-22 Text data mining method

Publications (1)

Publication Number Publication Date
WO2006136055A1 (fr) 2006-12-28

Family

ID=37570080

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2005/000894 WO2006136055A1 (fr) 2005-06-22 2005-06-22 Text data mining method

Country Status (2)

Country Link
CN (1) CN101151843B (fr)
WO (1) WO2006136055A1 (fr)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
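CN101609984B (zh) * 2008-06-16 2012-08-29 上海申瑞电力科技股份有限公司 Rapid auxiliary modeling method for supervisory control and data acquisition systems
CN104731555A (zh) * 2013-12-23 2015-06-24 中兴通讯股份有限公司 Method and device for avoiding register conflicts
CN105739947A (zh) * 2014-12-10 2016-07-06 中兴通讯股份有限公司 Method and device for detecting register conflicts
CN108279883B (zh) * 2016-12-30 2021-11-26 北京京东尚科信息技术有限公司 Configurable feature calculation method and system
CN112580298B (zh) * 2019-09-29 2024-05-07 大众问问(北京)信息科技有限公司 Method, apparatus and device for acquiring annotation data
CN111880838B (zh) * 2020-08-03 2024-04-12 北京神舟航天软件技术有限公司 Data parsing method based on template matching technology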

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002025564A1 (fr) * 2000-09-25 2002-03-28 Kent Ridge Digital Labs System, method and interface using templates for building biological databases
CN1492336A (zh) * 2003-09-04 2004-04-28 上海格尔软件股份有限公司 Information security audit method based on a data warehouse
US20050027710A1 (en) * 2003-07-30 2005-02-03 International Business Machines Corporation Methods and apparatus for mining attribute associations

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5692107A (en) * 1994-03-15 1997-11-25 Lockheed Missiles & Space Company, Inc. Method for generating predictive models in a computer system

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
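CN106095745A (zh) * 2016-05-27 2016-11-09 厦门市美亚柏科信息股份有限公司 Transaction record extraction method and system based on communication records
CN109726284A (zh) * 2018-12-07 2019-05-07 成都品果科技有限公司 Highly versatile data analysis method
CN111291547A (zh) * 2020-01-20 2020-06-16 腾讯科技(深圳)有限公司 Template generation method, apparatus, device and medium
CN111291547B (zh) * 2020-01-20 2024-04-26 腾讯科技(深圳)有限公司 Template generation method, apparatus, device and medium
CN111569427A (zh) * 2020-06-10 2020-08-25 网易(杭州)网络有限公司 Resource processing method and apparatus, storage medium and electronic device
CN111569427B (zh) * 2020-06-10 2023-04-25 网易(杭州)网络有限公司 Resource processing method and apparatus, storage medium and electronic device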
US11714849B2 (en) 2021-08-31 2023-08-01 Alibaba Damo (Hangzhou) Technology Co., Ltd. Image generation system and method

Also Published As

Publication number Publication date
CN101151843B (zh) 2010-05-12
CN101151843A (zh) 2008-03-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 200580049341.7

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

122 Ep: pct application non-entry in european phase

Ref document number: 05754937

Country of ref document: EP

Kind code of ref document: A1