WO2016074434A1 - Data extraction method and apparatus - Google Patents

Data extraction method and apparatus

Info

Publication number
WO2016074434A1
Authority
WO
WIPO (PCT)
Prior art keywords
extraction
data
target data
message
message data
Prior art date
Application number
PCT/CN2015/076587
Other languages
English (en)
French (fr)
Inventor
陈娟
吴明
陈伟
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2016074434A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • The present invention relates to the field of communications, and in particular to a data extraction method and apparatus.
  • Metadata extraction helps operators understand which websites and service applications users frequently log in to and what content they exchange with servers. Based on the extraction results, an operator can track and analyze user behavior and user experience, and collect statistics on popular websites and on the delay and traffic users experience on those websites. The operator can then better optimize the wireless network and improve network quality, so that its products deliver more value.
  • After receiving and interpreting the request message, the server returns a response message.
  • The problem is how to accurately extract the required data from a massive volume of message content.
  • Existing methods generally extract by matching regular expressions directly. Because the metadata transmitted over the network is complex, sometimes no plaintext feature can be found and the regular expression cannot be configured well; sometimes a message contains multiple extraction targets but not all of them are extracted, or only one target is needed but a lot of unwanted, erroneous content is extracted.
  • Embodiments of the invention provide a data extraction method and apparatus to solve at least the problem in the related art that target data is extracted inaccurately.
  • A data extraction method includes: determining the target data to be extracted according to a data message; matching the content of the message data against a predetermined regular expression; and, where at least two pieces of target data exist in the message data, extracting the at least two pieces of target data.
  • Matching the content of the message data against a predetermined regular expression includes: where the message data has a character string feature, matching the content of the message data against a predetermined character regular expression.
  • Matching the content of the message data against a predetermined regular expression includes: where the message data has no character feature, parsing the message data with a predetermined parsing function and decoding it to obtain the target data.
  • Extracting the at least two pieces of target data includes: where the at least two pieces of target data are extracted from different pieces of message data, extracting the target data using a pre-configured extraction count, which records successful extractions, and/or a pre-configured attempted-extraction count, which records failed extractions.
  • Extracting the at least two pieces of target data includes: where one piece of message data has two extraction targets, matching the content of the message data multiple times and then extracting the two pieces of target data; and/or, where different pieces of message data have two extraction targets, extracting the two pieces of target data using the pre-configured extraction count and/or attempted-extraction count.
  • Before the target data is extracted using the pre-configured extraction count and/or attempted-extraction count, the method further includes: configuring a dynamic setting interface, where the dynamic setting interface receives the different extraction counts and attempted-extraction counts set for different extraction types.
  • A data extraction apparatus includes: a determining module configured to determine the target data to be extracted according to a data message; a matching module configured to match the content of the message data against a predetermined regular expression; and an extraction module configured to extract at least two pieces of target data where at least two pieces of target data exist in the message data.
  • The matching module includes: a matching unit configured to match the content of the message data against a predetermined character regular expression where the message data has a character string feature.
  • The matching module includes: a parsing unit configured to parse the message data with a predetermined parsing function and decode it to obtain the target data where the message data has no character feature.
  • The extraction module includes: an extraction unit configured to, where the at least two pieces of target data are extracted from different pieces of message data, extract the target data using a pre-configured extraction count, which records successful extractions, and/or a pre-configured attempted-extraction count, which records failed extractions.
  • The extraction module includes: a second extraction unit configured to, where one piece of message data has two extraction targets, match the content of the message data multiple times and then extract the two pieces of target data; and/or a third extraction unit configured to, where different pieces of message data have two extraction targets, extract the two pieces of target data using the pre-configured extraction count and/or attempted-extraction count.
  • The apparatus further includes: a configuration unit configured to configure a dynamic setting interface, where the dynamic setting interface receives the different extraction counts and attempted-extraction counts set for different extraction types.
  • By determining the target data to be extracted according to a data message, matching the content of the message data against a predetermined regular expression, and extracting at least two pieces of target data where at least two pieces exist in the message data, the embodiments solve the problem in the related art that target data is extracted inaccurately, so that target data can be extracted accurately.
  • FIG. 1 is a flowchart of a data extraction method according to an embodiment of the present invention.
  • FIG. 2 is a block diagram of a data extraction apparatus according to an embodiment of the present invention.
  • FIG. 3 is block diagram 1 of a data extraction apparatus according to a preferred embodiment of the present invention.
  • FIG. 4 is block diagram 2 of a data extraction apparatus according to a preferred embodiment of the present invention.
  • FIG. 5 is block diagram 3 of a data extraction apparatus according to a preferred embodiment of the present invention.
  • FIG. 6 is block diagram 4 of a data extraction apparatus according to a preferred embodiment of the present invention.
  • FIG. 7 is flowchart 1 of a data extraction method according to a preferred embodiment of the present invention.
  • FIG. 8 is flowchart 2 of a data extraction method according to a preferred embodiment of the present invention.
  • FIG. 9 is flowchart 3 of a data extraction method according to a preferred embodiment of the present invention.
  • FIG. 10 is flowchart 4 of a data extraction method according to a preferred embodiment of the present invention.
  • FIG. 11 is flowchart 5 of a data extraction method according to a preferred embodiment of the present invention.
  • FIG. 12 is flowchart 6 of a data extraction method according to a preferred embodiment of the present invention.
  • FIG. 13 is flowchart 7 of a data extraction method according to a preferred embodiment of the present invention.
  • FIG. 1 is a flowchart of a data extraction method according to an embodiment of the present invention. As shown in FIG. 1, the process includes the following steps:
  • Step S102: determine the target data to be extracted according to a data message;
  • Step S104: match the content of the message data against a predetermined regular expression;
  • Step S106: where at least two pieces of target data exist in the message data, extract the at least two pieces of target data.
  • Through the above steps, the target data to be extracted is determined according to a data message, the content of the message data is matched against a predetermined regular expression, and, where at least two pieces of target data exist in the message data, the at least two pieces of target data are extracted. This solves the problem in the related art that target data is extracted inaccurately, so that target data can be extracted accurately.
  • Matching the content of the message data against the predetermined regular expression may include: where the message data has a character string feature, matching the content of the message data against a predetermined character regular expression; and/or, where the message data has no character feature, parsing the message data with a predetermined parsing function and decoding it to obtain the target data.
  • In an optional implementation, extracting the at least two pieces of target data may include: where the at least two pieces of target data are extracted from different pieces of message data, extracting the target data using a pre-configured extraction count, which records successful extractions, and/or a pre-configured attempted-extraction count, which records failed extractions.
  • Extracting the at least two pieces of target data includes: where one piece of message data has two extraction targets, matching the content of the message data multiple times and then extracting the two pieces of target data; and/or, where different pieces of message data have two extraction targets, extracting the two pieces of target data using the pre-configured extraction count and/or attempted-extraction count.
  • In a preferred implementation, a dynamic setting interface is configured before the target data is extracted using the pre-configured extraction count and/or attempted-extraction count, where the dynamic setting interface receives the different extraction counts and attempted-extraction counts set for different extraction types.
  • An embodiment of the present invention further provides a data extraction apparatus, which implements the above embodiments and preferred implementations; what has already been described is not repeated.
  • As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function.
  • Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
  • FIG. 2 is a block diagram of a data extraction apparatus according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes: a determining module 22, a matching module 24, and an extraction module 26. Each module is briefly described below.
  • The determining module 22 is configured to determine the target data to be extracted according to a data message;
  • The matching module 24 is connected to the determining module 22 and is configured to match the content of the message data against a predetermined regular expression;
  • The extraction module 26 is connected to the matching module 24 and is configured to extract at least two pieces of target data where at least two pieces of target data exist in the message data.
  • FIG. 3 is block diagram 1 of a data extraction apparatus according to a preferred embodiment of the present invention. As shown in FIG. 3, the matching module 24 includes:
  • The matching unit 32, configured to match the content of the message data against a predetermined character regular expression where the message data has a character string feature.
  • FIG. 4 is block diagram 2 of a data extraction apparatus according to a preferred embodiment of the present invention. As shown in FIG. 4, the matching module 24 includes:
  • The parsing unit 42, configured to parse the message data with a predetermined parsing function and decode it to obtain the target data where the message data has no character feature.
  • FIG. 5 is block diagram 3 of a data extraction apparatus according to a preferred embodiment of the present invention. As shown in FIG. 5, the extraction module 26 includes:
  • The extraction unit 52, configured to, where the at least two pieces of target data are extracted from different pieces of message data, extract the target data using a pre-configured extraction count, which records successful extractions, and/or a pre-configured attempted-extraction count, which records failed extractions.
  • The extraction module 26 may further include: a second extraction unit configured to, where one piece of message data has two extraction targets, match the content of the message data multiple times and then extract the two pieces of target data; and/or a third extraction unit configured to, where different pieces of message data have two extraction targets, extract the two pieces of target data using the pre-configured extraction count and/or attempted-extraction count.
  • FIG. 6 is block diagram 4 of a data extraction apparatus according to a preferred embodiment of the present invention. As shown in FIG. 6, the apparatus further includes:
  • The configuration unit 62, connected to the extraction unit 52 and configured to configure a dynamic setting interface, where the dynamic setting interface receives the different extraction counts and attempted-extraction counts set for different extraction types.
  • To improve network services, an embodiment of the present invention provides a metadata extraction method.
  • When the message content has a character string feature, the content of the message is matched against predefined regular expressions.
  • After a successful match, the target data is extracted. If one piece of message data carries multiple extraction targets that all need to be extracted, while a regular-expression rule generally matches only one result, the embodiment uses a multiple-matching extension: all targets can be extracted by configuring the multiple-matching extension attribute. For example, if hello occurs several times in a message and only the basic extraction configuration is set, only the first occurrence in the message can be extracted.
  • FIG. 7 is flowchart 1 of a data extraction method according to a preferred embodiment of the present invention. As shown in FIG. 7, the method includes the following steps:
  • Step S702: analysis finds that one message has multiple extraction targets;
  • Step S704: write the regular expression;
  • Step S706: match the message against the regular expression;
  • Step S708: the match succeeds and the first target is extracted;
  • Step S710: configure the multiple-matching extension attribute;
  • Step S712: continue matching from the end position of the previous match until extraction ends.
  • Where multiple extraction targets carried in different pieces of message data all need to be extracted, the embodiment provides a configurable extraction count and attempted-extraction count.
  • The user can specify any extraction count. The counter is incremented by 1 on each successful extraction, and once the configured extraction count is reached no further extraction is performed. In some cases an extraction rule is configured but the information to be extracted does not arrive for a long time, for example because the message is encrypted or the next target appears late. An attempted-extraction count can then be specified to avoid wasted processing.
  • The attempted-extraction counter accumulates as follows: it is incremented by 1 on each consecutive failed extraction and reset to zero on a successful extraction.
  • FIG. 8 is flowchart 2 of a data extraction method according to a preferred embodiment of the present invention. As shown in FIG. 8, the method includes the following steps:
  • Step S802: the extraction count and attempted-extraction count of an extraction type use default values;
  • Step S804: the user (product) calls the parameter configuration interface to modify them dynamically;
  • Step S806: metadata extraction is performed according to the new parameters.
  • FIG. 9 is flowchart 3 of a data extraction method according to a preferred embodiment of the present invention. As shown in FIG. 9, the method includes the following steps:
  • Step S902: a function parses the application-layer data;
  • Step S904: decoding yields the extraction target.
  • FIG. 10 is flowchart 4 of a data extraction method according to a preferred embodiment of the present invention. As shown in FIG. 10, the method includes the following steps:
  • Step S1002: define variables;
  • Step S1004: extract data from the message and assign it to the variables;
  • Step S1006: the variables take part in the expression calculation;
  • Step S1008: determine whether the expression holds; if yes, perform step S1010; if no, perform step S1012;
  • Step S1010: perform extraction;
  • Step S1012: do not extract; return.
  • In the related embodiments, one message with multiple extraction targets illustrates extraction with the multiple-matching extension attribute, and multiple messages with multiple extraction targets illustrate how the extraction count and attempted-extraction count are used.
  • QQ login and exit events illustrate expression-assisted metadata extraction, but the mechanism and method of metadata extraction are not limited to these cases.
  • def needs to be extracted from the message payload content abcdefghijkdeflmn.
  • With the regular expressions R1=abc and R2=ghi configured, after R1 is matched, the matched end position plus 1, or the matched start position plus 3, is the start position of the extraction target; after R2 is matched, the matched start position minus 1, or the matched end position minus 3, is the end position of the extraction target.
  • FIG. 11 is flowchart 5 of a data extraction method according to a preferred embodiment of the present invention. As shown in FIG. 11, the method includes the following steps:
  • Step S1102: def needs to be extracted from abcdefghijkdeflmn; the message contains two extraction targets def, so the multiple-matching extension attribute is configured for extraction;
  • Step S1104: configure the regular expressions R1=abc and R2=ghi; the message matches;
  • Step S1106: calculate the start position: the end position of R1 plus 1, or the start position of R1 plus 3;
  • Step S1108: calculate the end position: the start position of R2 minus 1, or the end position of R2 minus 3;
  • Step S1110: continue with the second match, starting from the end position i of the first match; configure the regular expressions R3=jk and R4=lmn; the message matches;
  • Step S1112: calculate the start position: the end position of R3 plus 1, or the start position of R3 plus 2;
  • Step S1114: calculate the end position: the start position of R4 minus 1, or the end position of R4 minus 3;
  • Step S1116: the multiple-matching extension attribute yields two results, and extraction ends.
  • FIG. 12 is flowchart 6 of a data extraction method according to a preferred embodiment of the present invention. As shown in FIG. 12, the method includes the following steps:
  • Step S1202: the product configures the extraction count and attempted-extraction count for the target extraction type;
  • Step S1204: the extraction count and attempted-extraction count are written into the extraction module in real time;
  • Step S1206: the user requests data from a server through an Internet service;
  • Step S1208: the data message is matched against the regular expression;
  • Step S1210: the match succeeds; the extraction counter is incremented by 1 and the attempted-extraction counter is reset to zero;
  • Step S1212: calculate the start position and the end position;
  • Step S1214: determine whether the extraction count has been reached; if not, return to step S1208; if so, perform step S1220 and end extraction;
  • Step S1216: the match fails; the attempted-extraction counter is incremented by 1;
  • Step S1218: determine whether the attempted-extraction count has been reached; if not, return to step S1208; if so, perform step S1220 and end extraction;
  • Step S1220: extraction ends.
  • Analysis of the QQ login message: the Flag field is the one byte at the start of the message payload and has the value 0x02; the Command field, which indicates login, is the two bytes at payload start +3 and has the value 0x62; the Data field is the one byte at payload start +11 and has the value 0x02.
  • Analysis of the QQ exit message: the Flag field is 0x02, the Command field, which indicates exit, is 0x01, and the Data field is 0x02.
  • The two-byte Command field at payload start +3 needs to be extracted, but a rule based on these bytes alone is too short and too simple. An expression is therefore used to check that Flag is 2, that Data equals 2, and that Command is 0x62 or 0x01; extraction is performed only when the expression holds, that is, only when the message is identified as a login or exit message.
  • FIG. 13 is flowchart 7 of a data extraction method according to a preferred embodiment of the present invention. As shown in FIG. 13, the method includes the following steps:
  • Step S1302: define three variables, Flag, Command, and Data;
  • Step S1304: assign Flag the 1-byte value at the start of the message payload; assign Command the 2-byte value at payload start +3; assign Data the 1-byte value at payload start +11;
  • Step S1306: evaluate whether the expression (Flag==2)&&(Data==2)&&((Command==0x62)||(Command==0x01)) is true;
  • Step S1308: extract the two-byte value at payload start +3;
  • Step S1310: extraction ends.
  • The modules or steps of the embodiments of the present invention can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases the steps shown or described may be performed in an order different from the one given here, or the modules or steps may each be fabricated as an individual integrated circuit module, or several of them may be fabricated as a single integrated circuit module. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

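The present invention discloses a data extraction method and apparatus. The method includes: determining the target data to be extracted according to a data message; matching the content of the message data against a predetermined regular expression; and, where at least two pieces of target data exist in the message data, extracting the at least two pieces of target data. The invention solves the problem in the related art that target data is extracted inaccurately, so that target data can be extracted accurately.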

Description

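Data Extraction Method and Apparatus
Technical Field
The present invention relates to the field of communications, and in particular to a data extraction method and apparatus.
Background
With the development of mobile communication technology, exchanging information over the Internet has become increasingly convenient. Operators keep optimizing their networks, raising speeds, upgrading bandwidth, and lowering costs, in step with the times. To better promote products and improve user experience, operators urgently need to understand users' needs and preferences. Metadata extraction helps operators understand which websites and service applications users frequently log in to and what content they exchange with servers. Based on the extraction results, an operator can track and analyze user behavior and user experience, and collect statistics on popular websites and on the delay and traffic users experience on those websites. The operator can then better optimize the wireless network and improve network quality, so that its products deliver more value.
Put simply, a user requests a resource from a server through an Internet terminal device; after receiving and interpreting the request message, the server returns a response message. The problem is how to accurately extract the required data from a massive volume of message content. Existing methods generally extract by matching regular expressions directly. Because the metadata transmitted over the network is complex, sometimes no plaintext feature can be found and the regular expression cannot be configured well; sometimes a message contains multiple extraction targets but not all of them are extracted, or only one target is needed but a lot of unwanted, erroneous content is extracted.
No effective solution has yet been proposed for the problem in the related art that target data is extracted inaccurately.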
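Summary
Embodiments of the present invention provide a data extraction method and apparatus to solve at least the problem in the related art that target data is extracted inaccurately.
According to one aspect of the embodiments of the present invention, a data extraction method is provided, including: determining the target data to be extracted according to a data message; matching the content of the message data against a predetermined regular expression; and, where at least two pieces of target data exist in the message data, extracting the at least two pieces of target data.
In an embodiment of the present invention, matching the content of the message data against a predetermined regular expression includes: where the message data has a character string feature, matching the content of the message data against a predetermined character regular expression.
In an embodiment of the present invention, matching the content of the message data against a predetermined regular expression includes: where the message data has no character feature, parsing the message data with a predetermined parsing function and decoding it to obtain the target data.
In an embodiment of the present invention, extracting the at least two pieces of target data includes: where the at least two pieces of target data are extracted from different pieces of message data, extracting the target data using a pre-configured extraction count, which records successful extractions, and/or a pre-configured attempted-extraction count, which records failed extractions.
In an embodiment of the present invention, extracting the at least two pieces of target data includes: where one piece of message data has two extraction targets, matching the content of the message data multiple times and then extracting the two pieces of target data; and/or, where different pieces of message data have two extraction targets, extracting the two pieces of target data using the pre-configured extraction count and/or attempted-extraction count.
In an embodiment of the present invention, before the target data is extracted using the pre-configured extraction count and/or attempted-extraction count, the method further includes: configuring a dynamic setting interface, where the dynamic setting interface receives the different extraction counts and attempted-extraction counts set for different extraction types.
According to another aspect of the embodiments of the present invention, a data extraction apparatus is provided, including: a determining module configured to determine the target data to be extracted according to a data message; a matching module configured to match the content of the message data against a predetermined regular expression; and an extraction module configured to extract at least two pieces of target data where at least two pieces of target data exist in the message data.
In an embodiment of the present invention, the matching module includes: a matching unit configured to match the content of the message data against a predetermined character regular expression where the message data has a character string feature.
In an embodiment of the present invention, the matching module includes: a parsing unit configured to parse the message data with a predetermined parsing function and decode it to obtain the target data where the message data has no character feature.
In an embodiment of the present invention, the extraction module includes: an extraction unit configured to, where the at least two pieces of target data are extracted from different pieces of message data, extract the target data using the pre-configured extraction count and/or attempted-extraction count.
In an embodiment of the present invention, the extraction module includes: a second extraction unit configured to, where one piece of message data has two extraction targets, match the content of the message data multiple times and then extract the two pieces of target data; and/or a third extraction unit configured to, where different pieces of message data have two extraction targets, extract the two pieces of target data using the pre-configured extraction count and/or attempted-extraction count.
In an embodiment of the present invention, the apparatus further includes: a configuration unit configured to configure a dynamic setting interface, where the dynamic setting interface receives the different extraction counts and attempted-extraction counts set for different extraction types.
By determining the target data to be extracted according to a data message, matching the content of the message data against a predetermined regular expression, and extracting at least two pieces of target data where at least two pieces exist in the message data, the embodiments of the present invention solve the problem in the related art that target data is extracted inaccurately, so that target data can be extracted accurately.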
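Brief Description of the Drawings
The drawings described here provide a further understanding of the embodiments of the present invention and form part of this application. The illustrative embodiments and their descriptions explain the embodiments of the present invention and do not unduly limit them. In the drawings:
FIG. 1 is a flowchart of a data extraction method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a data extraction apparatus according to an embodiment of the present invention;
FIG. 3 is block diagram 1 of a data extraction apparatus according to a preferred embodiment of the present invention;
FIG. 4 is block diagram 2 of a data extraction apparatus according to a preferred embodiment of the present invention;
FIG. 5 is block diagram 3 of a data extraction apparatus according to a preferred embodiment of the present invention;
FIG. 6 is block diagram 4 of a data extraction apparatus according to a preferred embodiment of the present invention;
FIG. 7 is flowchart 1 of a data extraction method according to a preferred embodiment of the present invention;
FIG. 8 is flowchart 2 of a data extraction method according to a preferred embodiment of the present invention;
FIG. 9 is flowchart 3 of a data extraction method according to a preferred embodiment of the present invention;
FIG. 10 is flowchart 4 of a data extraction method according to a preferred embodiment of the present invention;
FIG. 11 is flowchart 5 of a data extraction method according to a preferred embodiment of the present invention;
FIG. 12 is flowchart 6 of a data extraction method according to a preferred embodiment of the present invention;
FIG. 13 is flowchart 7 of a data extraction method according to a preferred embodiment of the present invention.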
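Detailed Description
The embodiments of the present invention are described in detail below with reference to the drawings. Note that, where no conflict arises, the embodiments in this application and the features in those embodiments may be combined with one another.
This embodiment provides a data extraction method. FIG. 1 is a flowchart of a data extraction method according to an embodiment of the present invention. As shown in FIG. 1, the process includes the following steps:
Step S102: determine the target data to be extracted according to a data message;
Step S104: match the content of the message data against a predetermined regular expression;
Step S106: where at least two pieces of target data exist in the message data, extract the at least two pieces of target data.
Through the above steps, the target data to be extracted is determined according to a data message, the content of the message data is matched against a predetermined regular expression, and, where at least two pieces of target data exist in the message data, the at least two pieces of target data are extracted. This solves the problem in the related art that target data is extracted inaccurately, so that target data can be extracted accurately.
In this embodiment, matching the content of the message data against the predetermined regular expression may include: where the message data has a character string feature, matching the content of the message data against a predetermined character regular expression; and/or, where the message data has no character feature, parsing the message data with a predetermined parsing function and decoding it to obtain the target data.
In an optional implementation, extracting the at least two pieces of target data may include: where the at least two pieces of target data are extracted from different pieces of message data, extracting the target data using a pre-configured extraction count, which records successful extractions, and/or a pre-configured attempted-extraction count, which records failed extractions.
In an embodiment of the present invention, extracting the at least two pieces of target data includes: where one piece of message data has two extraction targets, matching the content of the message data multiple times and then extracting the two pieces of target data; and/or, where different pieces of message data have two extraction targets, extracting the two pieces of target data using the pre-configured extraction count and/or attempted-extraction count.
In a preferred implementation, a dynamic setting interface is configured before the target data is extracted using the pre-configured extraction count and/or attempted-extraction count, where the dynamic setting interface receives the different extraction counts and attempted-extraction counts set for different extraction types.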
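An embodiment of the present invention further provides a data extraction apparatus, which implements the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
FIG. 2 is a block diagram of a data extraction apparatus according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes a determining module 22, a matching module 24, and an extraction module 26. Each module is briefly described below.
The determining module 22 is configured to determine the target data to be extracted according to a data message;
The matching module 24 is connected to the determining module 22 and is configured to match the content of the message data against a predetermined regular expression;
The extraction module 26 is connected to the matching module 24 and is configured to extract at least two pieces of target data where at least two pieces of target data exist in the message data.
FIG. 3 is block diagram 1 of a data extraction apparatus according to a preferred embodiment of the present invention. As shown in FIG. 3, the matching module 24 includes:
The matching unit 32, configured to match the content of the message data against a predetermined character regular expression where the message data has a character string feature.
FIG. 4 is block diagram 2 of a data extraction apparatus according to a preferred embodiment of the present invention. As shown in FIG. 4, the matching module 24 includes:
The parsing unit 42, configured to parse the message data with a predetermined parsing function and decode it to obtain the target data where the message data has no character feature.
FIG. 5 is block diagram 3 of a data extraction apparatus according to a preferred embodiment of the present invention. As shown in FIG. 5, the extraction module 26 includes:
The extraction unit 52, configured to, where the at least two pieces of target data are extracted from different pieces of message data, extract the target data using a pre-configured extraction count, which records successful extractions, and/or a pre-configured attempted-extraction count, which records failed extractions.
In an embodiment of the present invention, the extraction module 26 may further include: a second extraction unit configured to, where one piece of message data has two extraction targets, match the content of the message data multiple times and then extract the two pieces of target data; and/or a third extraction unit configured to, where different pieces of message data have two extraction targets, extract the two pieces of target data using the pre-configured extraction count and/or attempted-extraction count.
FIG. 6 is block diagram 4 of a data extraction apparatus according to a preferred embodiment of the present invention. As shown in FIG. 6, the apparatus further includes:
The configuration unit 62, connected to the extraction unit 52 and configured to configure a dynamic setting interface, where the dynamic setting interface receives the different extraction counts and attempted-extraction counts set for different extraction types.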
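The embodiments of the present invention are further described below with reference to optional implementations.
To improve network services, an embodiment of the present invention provides a metadata extraction method. The content of the message is first analyzed to find the required target data. When the message content has a character string feature, the content is matched against a predefined regular expression, and the target data is extracted after a successful match. If one piece of message data carries multiple extraction targets that all need to be extracted, while a regular-expression rule generally matches only one result, the embodiment uses a multiple-matching extension: all targets can be extracted by configuring the multiple-matching extension attribute. For example, if hello occurs several times in a message and only the basic extraction configuration is set, only the first occurrence in the message can be extracted. To extract all occurrences, the multiple-extraction extension is additionally configured; each further match starts at the end position of the first match (further matches are attempted only if the first match succeeds). FIG. 7 is flowchart 1 of a data extraction method according to a preferred embodiment of the present invention. As shown in FIG. 7, the method includes the following steps:
Step S702: analysis finds that one message has multiple extraction targets;
Step S704: write the regular expression;
Step S706: match the message against the regular expression;
Step S708: the match succeeds and the first target is extracted;
Step S710: configure the multiple-matching extension attribute;
Step S712: continue matching from the end position of the previous match until extraction ends.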
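The multiple-matching loop of steps S702 to S712 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `extract_all` and the `multi_match` flag are hypothetical stand-ins for the "multiple-matching extension attribute", and the patent does not specify a regular-expression engine.

```python
import re
from typing import List


def extract_all(payload: str, pattern: str, multi_match: bool = True) -> List[str]:
    """Return the text matched by `pattern` in `payload`.

    With multi_match=False only the first occurrence is returned (the basic
    extraction configuration). With multi_match=True each further search
    resumes at the end position of the previous match.
    """
    regex = re.compile(pattern)
    results: List[str] = []
    pos = 0
    while pos <= len(payload):
        match = regex.search(payload, pos)
        if match is None:
            break  # no further target: extraction ends
        results.append(match.group(0))
        if not multi_match:
            break  # basic configuration stops after the first match
        # Resume after the last match; step one character forward on an
        # empty match so the loop always makes progress.
        pos = match.end() if match.end() > match.start() else match.end() + 1
    return results


print(extract_all("hello a hello b hello", "hello"))
print(extract_all("hello a hello b hello", "hello", multi_match=False))
```

Python's own `re.finditer` gives the same result for this simple case; the explicit loop is written out only to mirror the "continue from the end position of the previous match" step.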
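Where multiple extraction targets carried in different pieces of message data all need to be extracted, the embodiment provides a configurable extraction count and attempted-extraction count. The user can specify any extraction count. The counter is incremented by 1 on each successful extraction, and once the configured extraction count is reached no further extraction is performed. In some cases an extraction regular expression is configured but the information to be extracted does not arrive for a long time, for example because the message is encrypted or the next target appears late. An attempted-extraction count can then be specified to avoid wasted processing. The attempted-extraction counter accumulates as follows: it is incremented by 1 on each consecutive failed extraction and reset to zero on a successful extraction.
Different extraction types have different configuration needs for the extraction count and attempted-extraction count, so metadata extraction provides a dynamic setting interface that accepts parameter changes from the user. The user can set different extraction counts and attempted-extraction counts for different extraction types and modify them dynamically in real time. FIG. 8 is flowchart 2 of a data extraction method according to a preferred embodiment of the present invention. As shown in FIG. 8, the method includes the following steps:
Step S802: the extraction count and attempted-extraction count of an extraction type use default values;
Step S804: the user (product) calls the parameter configuration interface to modify them dynamically;
Step S806: metadata extraction is performed according to the new parameters.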
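A dynamic setting interface of the kind described in steps S802 to S806 might look like the following minimal sketch. The class and method names (`ExtractionConfig`, `set_limits`, `get_limits`), the default values, and the extraction type names are all assumptions made for illustration; the patent does not define an API.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Limits:
    extractions: int  # configured extraction count
    attempts: int     # configured attempted-extraction count


class ExtractionConfig:
    """Per-extraction-type limits with defaults, modifiable at run time."""

    DEFAULT = Limits(extractions=1, attempts=10)  # assumed default values

    def __init__(self) -> None:
        self._by_type: Dict[str, Limits] = {}

    def set_limits(self, extraction_type: str, extractions: int, attempts: int) -> None:
        """Dynamic setting interface: overrides the limits for one extraction type."""
        if extractions < 1 or attempts < 1:
            raise ValueError("counts must be positive")
        self._by_type[extraction_type] = Limits(extractions, attempts)

    def get_limits(self, extraction_type: str) -> Limits:
        """Return the configured limits, or the defaults if none were set."""
        return self._by_type.get(extraction_type, self.DEFAULT)


config = ExtractionConfig()
print(config.get_limits("url"))   # defaults apply (step S802)
config.set_limits("url", 3, 5)    # dynamic modification (step S804)
print(config.get_limits("url"))   # new parameters are used from now on (step S806)
```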
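When no characteristic string can be found in the message, the application-layer data is analyzed with a parsing function, and the extraction target is obtained directly by decoding. FIG. 9 is flowchart 3 of a data extraction method according to a preferred embodiment of the present invention. As shown in FIG. 9, the method includes the following steps:
Step S902: a function parses the application-layer data;
Step S904: decoding yields the extraction target.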
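In some cases extraction should proceed only when the message satisfies a feature showing that it is a specific kind of message data (an entry rule is defined); in other cases performance suffers because the regular-expression feature is weak or a lot of unwanted content is extracted (an exclusion rule is defined). In such cases expression-assisted extraction can be used. Variables are defined, and data from the message content is assigned to them for use in expression evaluation. An expression looks like ((a+6)>=b&&(c!=d||e>>2<8)); logical expressions, arithmetic expressions, and combinations of the two are supported. Extraction is performed only when the expression is true. FIG. 10 is flowchart 4 of a data extraction method according to a preferred embodiment of the present invention. As shown in FIG. 10, the method includes the following steps:
Step S1002: define variables;
Step S1004: extract data from the message and assign it to the variables;
Step S1006: the variables take part in the expression calculation;
Step S1008: determine whether the expression holds; if yes, perform step S1010; if no, perform step S1012;
Step S1010: perform extraction;
Step S1012: do not extract; return.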
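The flow of steps S1002 to S1012 can be sketched as a gate in front of the extraction step. Everything named here is hypothetical: `read_vars`, `extract_if`, the `(offset, size)` layout convention, and big-endian byte order are assumptions for illustration, and the predicate is written as a Python callable rather than in the patent's expression syntax.

```python
from typing import Callable, Dict, Optional, Tuple

Layout = Dict[str, Tuple[int, int]]  # variable name -> (offset, size in bytes)


def read_vars(payload: bytes, layout: Layout) -> Optional[Dict[str, int]]:
    """Assign message bytes to named variables (steps S1002 and S1004)."""
    values: Dict[str, int] = {}
    for name, (offset, size) in layout.items():
        if offset + size > len(payload):
            return None  # payload too short to hold this variable
        values[name] = int.from_bytes(payload[offset:offset + size], "big")
    return values


def extract_if(payload: bytes, layout: Layout,
               predicate: Callable[[Dict[str, int]], bool],
               extract: Callable[[bytes], bytes]) -> Optional[bytes]:
    """Extract only when the expression holds (steps S1006 to S1012)."""
    values = read_vars(payload, layout)
    if values is None or not predicate(values):
        return None  # expression false: do not extract, return
    return extract(payload)


payload = bytes([10, 3, 0xAA, 0xBB])
layout = {"a": (0, 1), "b": (1, 1)}
# Entry rule in the spirit of the (a+6)>=b part of the example expression.
result = extract_if(payload, layout,
                    lambda v: (v["a"] + 6) >= v["b"],
                    lambda p: p[2:4])
print(result)
```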
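In the related embodiments, one message with multiple extraction targets illustrates extraction with the multiple-matching extension attribute, and multiple messages with multiple extraction targets illustrate how the extraction count and attempted-extraction count are used. QQ login and exit events are used below to illustrate expression-assisted metadata extraction, but the mechanism and method of metadata extraction are not limited to these cases.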
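Functional description of extraction with the multiple-matching extension attribute. Using an embodiment of the present invention, def needs to be extracted from the message payload content abcdefghijkdeflmn. The message contains two extraction targets def, so the multiple-matching extension attribute is configured for extraction. The regular expressions R1=abc and R2=ghi are configured. After R1 is matched, the matched end position plus 1, or the matched start position plus 3, is the start position of the extraction target; after R2 is matched, the matched start position minus 1, or the matched end position minus 3, is the end position of the extraction target. The second match then proceeds from the end position i of the first match, with the regular expressions R3=jk and R4=lmn configured. After R3 is matched, the matched end position plus 1, or the matched start position plus 2, is the start position of the extraction target; after R4 is matched, the matched start position minus 1, or the matched end position minus 3, is the end position of the extraction target. The multiple-matching extension attribute yields two results, and extraction ends. FIG. 11 is flowchart 5 of a data extraction method according to a preferred embodiment of the present invention. As shown in FIG. 11, the method includes the following steps:
Step S1102: def needs to be extracted from abcdefghijkdeflmn; the message contains two extraction targets def, so the multiple-matching extension attribute is configured for extraction;
Step S1104: configure the regular expressions R1=abc and R2=ghi; the message matches;
Step S1106: calculate the start position: the end position of R1 plus 1, or the start position of R1 plus 3;
Step S1108: calculate the end position: the start position of R2 minus 1, or the end position of R2 minus 3;
Step S1110: continue with the second match, starting from the end position i of the first match; configure the regular expressions R3=jk and R4=lmn; the message matches;
Step S1112: calculate the start position: the end position of R3 plus 1, or the start position of R3 plus 2;
Step S1114: calculate the end position: the start position of R4 minus 1, or the end position of R4 minus 3;
Step S1116: the multiple-matching extension attribute yields two results, and extraction ends.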
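The position arithmetic of steps S1102 to S1116 can be checked with a short Python sketch. The helper names `extract_between` and `extract_with_rules` are hypothetical. One detail differs in notation only: the patent counts inclusive end positions, whereas Python's `Match.end()` is exclusive, so "end position plus 1" is simply `end()` and "start position minus 1" (inclusive) corresponds to a slice that stops at `start()`. The sketch resumes just after the right-hand delimiter rather than at its last character i, which gives the same result for this payload.

```python
import re
from typing import List, Optional, Tuple


def extract_between(payload: str, left: str, right: str,
                    pos: int = 0) -> Optional[Tuple[str, int]]:
    """Extract the text between the `left` and `right` patterns, searching from `pos`.

    Returns (target, next_search_position), or None if either pattern is missing.
    """
    left_match = re.compile(left).search(payload, pos)
    if left_match is None:
        return None
    right_match = re.compile(right).search(payload, left_match.end())
    if right_match is None:
        return None
    start = left_match.end()    # inclusive end of the left match, plus 1
    stop = right_match.start()  # exclusive; inclusive end is start of right, minus 1
    return payload[start:stop], right_match.end()


def extract_with_rules(payload: str, rules: List[Tuple[str, str]]) -> List[str]:
    """Apply each (left, right) rule in turn, resuming after the previous match."""
    results: List[str] = []
    pos = 0
    for left, right in rules:
        found = extract_between(payload, left, right, pos)
        if found is None:
            break
        target, pos = found
        results.append(target)
    return results


# R1=abc, R2=ghi for the first target; R3=jk, R4=lmn for the second.
print(extract_with_rules("abcdefghijkdeflmn", [("abc", "ghi"), ("jk", "lmn")]))
# prints ['def', 'def']
```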
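Functional description of the configurable extraction count and attempted-extraction count. The user configures the extraction count and attempted-extraction count for the target extraction type, and these parameters are written into the metadata extraction module in real time. The user requests data from a server through an Internet service, and the data messages enter the extraction module for regular-expression matching. When a match succeeds, the extraction counter is incremented by 1, the attempted-extraction counter is reset to zero, and the start and end positions are calculated; the extraction counter is then compared with the configured value, and if the value has not been reached, messages continue to enter the module for matching, otherwise the extraction process ends. When a match fails, the attempted-extraction counter is incremented by 1 and compared with the configured value; if the value has not been reached, messages continue to enter the module for matching, otherwise the extraction process ends. FIG. 12 is flowchart 6 of a data extraction method according to a preferred embodiment of the present invention. As shown in FIG. 12, the method includes the following steps:
Step S1202: the product configures the extraction count and attempted-extraction count for the target extraction type;
Step S1204: the extraction count and attempted-extraction count are written into the extraction module in real time;
Step S1206: the user requests data from a server through an Internet service;
Step S1208: the data message is matched against the regular expression;
Step S1210: the match succeeds; the extraction counter is incremented by 1 and the attempted-extraction counter is reset to zero;
Step S1212: calculate the start position and the end position;
Step S1214: determine whether the extraction count has been reached; if not, return to step S1208; if so, perform step S1220 and end extraction;
Step S1216: the match fails; the attempted-extraction counter is incremented by 1;
Step S1218: determine whether the attempted-extraction count has been reached; if not, return to step S1208; if so, perform step S1220 and end extraction;
Step S1220: extraction ends.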
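The two counters of steps S1208 to S1220 can be sketched as a loop over successive messages. The names `ExtractionLimits` and `run_extraction` are hypothetical, and the sketch assumes one regular-expression search per message; the patent leaves the matching details to the configured rules.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ExtractionLimits:
    max_extractions: int  # configured extraction count
    max_attempts: int     # configured attempted-extraction count


def run_extraction(messages: Iterable[str], pattern: str,
                   limits: ExtractionLimits) -> List[str]:
    """Match each message in turn until either configured limit is reached."""
    regex = re.compile(pattern)
    extracted: List[str] = []
    extractions = 0  # successful extractions so far
    attempts = 0     # consecutive failed attempts
    for message in messages:
        match = regex.search(message)
        if match is not None:
            extracted.append(match.group(0))
            extractions += 1
            attempts = 0  # a success resets the attempt counter
            if extractions >= limits.max_extractions:
                break     # extraction count reached: extraction ends
        else:
            attempts += 1
            if attempts >= limits.max_attempts:
                break     # too many consecutive failures: extraction ends
    return extracted


messages = ["x", "id=1", "x", "x", "id=2", "id=3"]
print(run_extraction(messages, r"id=\d", ExtractionLimits(2, 3)))
# prints ['id=1', 'id=2']
```

Resetting the attempt counter on every success is what makes it count consecutive failures only, so a slow but steady stream of targets is not cut off early.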
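Functional description of expression-assisted extraction of QQ login and exit events. Analysis of the QQ login message: the Flag field is the one byte at the start of the message payload and has the value 0x02; the Command field, which indicates login, is the two bytes at payload start +3 and has the value 0x62; the Data field is the one byte at payload start +11 and has the value 0x02. Analysis of the QQ exit message: the Flag field is 0x02, the Command field, which indicates exit, is 0x01, and the Data field is 0x02. The two-byte Command field at payload start +3 needs to be extracted, but a rule based on these bytes alone is too short and too simple. An expression is therefore used to check that Flag is 2, that Data equals 2, and that the QQ command word Command is 0x62 or 0x01; extraction is performed only when the expression holds, that is, only when the message is identified as a login or exit message. The two-byte value at payload start +3 is extracted, and extraction ends. FIG. 13 is flowchart 7 of a data extraction method according to a preferred embodiment of the present invention. As shown in FIG. 13, the method includes the following steps:
Step S1302: define three variables, Flag, Command, and Data;
Step S1304: assign Flag the 1-byte value at the start of the message payload; assign Command the 2-byte value at payload start +3; assign Data the 1-byte value at payload start +11;
Step S1306: evaluate whether the expression (Flag==2)&&(Data==2)&&((Command==0x62)||(Command==0x01)) is true;
Step S1308: extract the two-byte value at payload start +3;
Step S1310: extraction ends.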
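Steps S1302 to S1310 translate almost directly into code. The sketch below is an illustration only: the function name `extract_qq_command` is hypothetical, and reading the two-byte Command field as a big-endian integer is an assumption, since the text does not state the byte order.

```python
from typing import Optional

LOGIN_COMMAND = 0x62  # Command value that indicates login
EXIT_COMMAND = 0x01   # Command value that indicates exit


def extract_qq_command(payload: bytes) -> Optional[int]:
    """Return the Command field if the payload looks like a login or exit message."""
    if len(payload) < 12:
        return None  # too short to contain the Data byte at offset 11
    flag = payload[0]                              # 1 byte at payload start
    command = int.from_bytes(payload[3:5], "big")  # 2 bytes at start+3 (byte order assumed)
    data = payload[11]                             # 1 byte at start+11
    # Expression from step S1306; extract only when it holds.
    if flag == 2 and data == 2 and command in (LOGIN_COMMAND, EXIT_COMMAND):
        return command
    return None


# Build a synthetic 12-byte payload that satisfies the login conditions.
sample = bytearray(12)
sample[0] = 0x02
sample[3:5] = (0x62).to_bytes(2, "big")
sample[11] = 0x02
print(hex(extract_qq_command(bytes(sample))))
# prints 0x62
```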
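Those skilled in the art should understand that the modules or steps of the embodiments of the present invention described above can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases the steps shown or described may be performed in an order different from the one given here, or the modules or steps may each be fabricated as an individual integrated circuit module, or several of them may be fabricated as a single integrated circuit module. Thus, the embodiments of the present invention are not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit it; those skilled in the art may make various modifications and changes to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Industrial Applicability
As described above, the foregoing embodiments and preferred implementations solve the problem in the related art that target data is extracted inaccurately, so that target data can be extracted accurately.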

Claims (12)

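  1. A data extraction method, comprising:
    determining the target data to be extracted according to a data message;
    matching the content of the message data against a predetermined regular expression;
    where at least two pieces of target data exist in the message data, extracting the at least two pieces of target data.
  2. The method according to claim 1, wherein matching the content of the message data against a predetermined regular expression comprises:
    where the message data has a character string feature, matching the content of the message data against a predetermined character regular expression.
  3. The method according to claim 2, wherein matching the content of the message data against a predetermined regular expression comprises:
    where the message data has no character feature, parsing the message data with a predetermined parsing function and decoding it to obtain the target data.
  4. The method according to claim 1, wherein extracting the at least two pieces of target data comprises:
    where the at least two pieces of target data are extracted from different pieces of message data, extracting the target data using a pre-configured extraction count, which records successful extractions, and/or a pre-configured attempted-extraction count, which records failed extractions.
  5. The method according to claim 4, wherein extracting the at least two pieces of target data comprises:
    where one piece of message data has two extraction targets, matching the content of the message data multiple times and then extracting the two pieces of target data; and/or
    where different pieces of message data have two extraction targets, extracting the two pieces of target data using the pre-configured extraction count, which records successful extractions, and/or the pre-configured attempted-extraction count, which records failed extractions.
  6. The method according to claim 4 or 5, wherein, before the target data is extracted using the pre-configured extraction count, which records successful extractions, and/or the pre-configured attempted-extraction count, which records failed extractions, the method further comprises:
    configuring a dynamic setting interface, wherein the dynamic setting interface receives the different extraction counts and attempted-extraction counts set for different extraction types.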
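  7. A data extraction apparatus, comprising:
    a determining module, configured to determine, according to a data message, target data to be extracted;
    a matching module, configured to match content in the message data according to a predetermined regular expression; and
    an extraction module, configured to extract at least two pieces of target data in a case where the at least two pieces of target data exist in the message data.
  8. The apparatus according to claim 7, wherein the matching module comprises:
    a matching unit, configured to match content in the message data according to a predetermined character regular expression in a case where the message data has a character string feature.
  9. The apparatus according to claim 8, wherein the matching module comprises:
    a parsing unit, configured to parse the message data by means of predetermined function parsing, and decode it to obtain the target data, in a case where the message data has no character feature.
  10. The apparatus according to claim 7, wherein the extraction module comprises:
    a first extraction unit, configured to, in a case where the at least two pieces of target data are extracted from different message data, extract the target data by means of a preconfigured extraction count for recording successful extractions and/or a preconfigured attempt count for recording failed extractions.
  11. The apparatus according to claim 7, wherein the extraction module comprises:
    a second extraction unit, configured to, in a case where one piece of message data has two extraction targets, extract the two pieces of target data after matching the content in the message data multiple times; and/or
    a third extraction unit, configured to, in a case where different message data have two extraction targets, extract the two pieces of target data by means of a preconfigured extraction count for recording successful extractions and/or a preconfigured attempt count for recording failed extractions.
  12. The apparatus according to claim 10 or 11, wherein the apparatus further comprises:
    a configuration unit, configured to configure a dynamic setting interface, wherein the dynamic setting interface is used to receive different extraction counts and attempt counts set for different extraction types.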
PCT/CN2015/076587 2014-11-12 2015-04-14 Data extraction method and apparatus WO2016074434A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410638204.8A CN105653531B (zh) 2014-11-12 2014-11-12 Data extraction method and apparatus
CN201410638204.8 2014-11-12

Publications (1)

Publication Number Publication Date
WO2016074434A1 true WO2016074434A1 (zh) 2016-05-19

Family

ID=55953676

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/076587 WO2016074434A1 (zh) 2014-11-12 2015-04-14 Data extraction method and apparatus

Country Status (2)

Country Link
CN (1) CN105653531B (zh)
WO (1) WO2016074434A1 (zh)

Also Published As

Publication number Publication date
CN105653531A (zh) 2016-06-08
CN105653531B (zh) 2020-02-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15858356

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15858356

Country of ref document: EP

Kind code of ref document: A1