CN110851687A - Data identification method, terminal equipment and storage medium - Google Patents

Data identification method, terminal equipment and storage medium

Info

Publication number
CN110851687A
Authority
CN
China
Prior art keywords
data
identified
layer
matched
entering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911092398.5A
Other languages
Chinese (zh)
Inventor
鄢小征
林晖文
吴鸿伟
陈良彬
毕永辉
陈志飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Meiya Pico Information Co Ltd
Original Assignee
Xiamen Meiya Pico Information Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-11-11
Filing date
2019-11-11
Publication date
2020-02-28
Application filed by Xiamen Meiya Pico Information Co Ltd
Priority to CN201911092398.5A
Publication of CN110851687A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to a data identification method, terminal equipment and a storage medium. The method comprises constructing a feature template from the features of the data to be identified, matching each piece of data to be identified against the feature template, and taking the data that can be matched as the identified data. The feature template comprises a plurality of layers; matching is performed on each layer separately, and the data to be identified is matched only when all layers can be matched. Compared with a full-scale acquisition method, the method can effectively reduce the data storage space; compared with a precise identification method, no identification scheme needs to be developed for each kind of data, so the method is highly general, and data that newly appears on the network and matches an existing template can be found through the template.

Description

Data identification method, terminal equipment and storage medium
Technical Field
The present invention relates to the field of data classification, and in particular, to a data identification method, a terminal device, and a storage medium.
Background
At present, there are two main ways to identify required data within a large amount of data. One is precise identification, in which the data must conform to a specified format and contain specified field information. The other is full-scale acquisition, in which all data is collected and the required information is then searched for manually. Precise identification requires the format and field information of the data to be known in advance, and each scheme can identify only a small amount of data: for example, one precise identification scheme is needed to identify the data sent by website A, and another is needed to identify the data sent by website B. Full-scale acquisition requires all data to be collected and stored, which occupies a large storage space, and retrieving the data of website A or website B from the stored data is inefficient.
Disclosure of Invention
In order to solve the above problems, the present invention provides a data identification method, a terminal device, and a storage medium.
The specific scheme is as follows:
A data identification method, in which a feature template of the data to be identified is constructed according to the features of the data to be identified, each piece of data to be identified is matched against the feature template, and the data that can be matched is taken as the identified data.
Further, the feature template includes a plurality of layers, the matching is performed on each layer, and when all the layers can be matched, the data to be identified can be matched.
Further, the feature template comprises five layers: the first layer is a communication protocol; the second layer is a website domain name or address; the third layer is a network port number; the fourth layer is interaction; the fifth layer is data format and field characteristics.
Further, the interaction comprises a sequence number of each data packet with the interaction characteristics in the data to be identified and the interaction characteristics corresponding to the sequence number.
Further, the specific matching method of each data to be identified comprises the following steps:
S1: initializing the sequence number i = 1, and obtaining the total number I of data packets from the data to be identified;
S2: receiving the ith data packet of the data to be identified;
S3: judging whether the ith data packet can be matched with the first layer of the feature template; if so, entering S4; otherwise, entering S11;
S4: judging whether the ith data packet can be matched with the second layer of the feature template; if so, entering S5; otherwise, entering S11;
S5: judging whether the ith data packet can be matched with the third layer of the feature template; if so, entering S6; otherwise, entering S11;
S6: judging whether the data packet sequence numbers configured in the fourth layer of the feature template contain i; if so, entering S7; otherwise, entering S8;
S7: judging whether the ith data packet can be matched with the interaction feature configured in the fourth layer for sequence number i; if so, entering S8; otherwise, entering S11;
S8: judging whether the ith data packet can be matched with the fifth layer of the feature template; if so, entering S9; otherwise, entering S11;
S9: judging whether i = I holds; if so, entering S10; otherwise, letting i = i + 1 and returning to S2;
S10: the data to be identified is the required data, and the process ends;
S11: the data to be identified is not the required data, and the process ends.
A data recognition terminal device comprises a processor, a memory and a computer program stored in the memory and operable on the processor, wherein the processor executes the computer program to implement the steps of the method of the embodiment of the present invention.
A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to an embodiment of the invention as described above.
The invention adopts the technical scheme and has the beneficial effects that:
1. Compared with a full-scale acquisition method, the method can effectively reduce the data storage space.
2. Compared with a precise identification method, no identification scheme needs to be developed for each kind of data, so the method is highly general, and data that newly appears on the network and matches an existing template can be found through the template.
3. The volume of data on the network is large, and much of it shares common features or common templates. Much new data sharing features with existing data will appear in the future, and the method can find such new data more efficiently.
Drawings
Fig. 1 is a flowchart illustrating a first embodiment of the present invention.
Detailed Description
To further illustrate the various embodiments, the invention provides the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the embodiments. Those skilled in the art will appreciate still other possible embodiments and advantages of the present invention with reference to these figures.
The invention will now be further described with reference to the accompanying drawings and detailed description.
Embodiment one:
The embodiment of the invention provides a data identification method, in which a feature template of the data to be identified is constructed according to the features of the data to be identified, each piece of data to be identified is matched against the feature template, and the data that can be matched is taken as the identified data.
In this embodiment, the feature template includes a plurality of layers, the matching is performed on each layer, and when all the layers can be matched, the data to be identified can be matched.
The elements of each layer can be set and adjusted by those skilled in the art according to the characteristics of the data, and the feature template in this embodiment includes five layers of elements, respectively:
The first layer element is a communication protocol, such as TCP, UDP, or FTP.
The second layer element is a website domain name or address (URL), such as: com/news/.
The third layer element is a network port number, such as 80 or 443.
The fourth layer element is interaction, which comprises the sequence number of each data packet in the data to be identified that has an interaction feature, together with the interaction feature corresponding to that sequence number.
In a data switching network, a single piece of data is often divided into multiple data packets that are transmitted separately, and the interaction features of the different data packets within one piece of data often differ. For example, the first data packet of a piece of data may be an uplink packet whose interaction feature is a login request, and the second data packet may be a downlink packet whose interaction feature is a request to place a bet. The interaction features therefore correspond one-to-one with the sequence numbers of the data packets: to judge whether a data packet conforms to the interaction features, it is only necessary to look up the interaction feature configured for that packet's sequence number and match against it.
The fifth layer element is data format and field characteristics, such as: the 3 rd and 4 th bytes of the packet represent the total length of the packet, and the last 2 bytes of the packet are 0xFF, etc.
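A fifth-layer check on the two example features above might look like the following minimal sketch. The function name `matches_format`, the big-endian byte order of the length field, and the reading of "the last 2 bytes are 0xFF" as both bytes being equal to 0xFF are assumptions made for illustration; the source does not specify them.

```python
import struct


def matches_format(packet: bytes) -> bool:
    """Check the two example fifth-layer features (assumed semantics)."""
    # A 4-byte header plus 2 trailing bytes is the minimum size this check needs.
    if len(packet) < 6:
        return False
    # Bytes 3 and 4 (counting from 1) hold the total packet length.
    # Big-endian order is an assumption, not stated in the source.
    (declared_length,) = struct.unpack_from(">H", packet, 2)
    if declared_length != len(packet):
        return False
    # Assumed reading: each of the last two bytes equals 0xFF.
    return packet[-2:] == b"\xff\xff"
```

In practice the fifth layer would hold whatever format and field rules the template author configures; these two rules are only the examples the text gives.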
Each layer of elements is configured according to the format of the feature template. Depending on the actual situation, a layer may be left empty or given multiple values, for example: the first-layer communication protocol is configured as TCP and UDP, the second layer is empty, the third-layer network port number is configured as port 80, and so on.
When a layer of elements in the feature template is empty, the feature of that layer is not a required feature of the data to be identified.
The feature template is general and is suitable for identifying most network data.
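The five-layer template and the empty-layer rule can be sketched as a small data structure. This is only an illustrative sketch: the class name `FeatureTemplate`, its field names, and the helper `layer_matches` are hypothetical and do not come from the source. The sample configuration mirrors the example above (TCP and UDP, empty second layer, port 80).

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class FeatureTemplate:
    """Hypothetical container for the five layers of a feature template."""
    protocols: FrozenSet[str] = frozenset()                     # layer 1: communication protocol
    hosts: FrozenSet[str] = frozenset()                         # layer 2: domain name or address
    ports: FrozenSet[int] = frozenset()                         # layer 3: network port number
    interactions: Dict[int, str] = field(default_factory=dict)  # layer 4: sequence number -> feature
    format_rules: Tuple[str, ...] = ()                          # layer 5: data format and field features


def layer_matches(configured, value) -> bool:
    """An empty layer is not a required feature, so it matches any value."""
    return not configured or value in configured


# Configuration from the example: TCP and UDP, empty second layer, port 80.
template = FeatureTemplate(protocols=frozenset({"TCP", "UDP"}), ports=frozenset({80}))
```

Treating an empty layer as "matches anything" keeps one template usable for data that differs in the unconfigured layers.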
As shown in fig. 1, the specific matching method for each piece of data to be identified in this embodiment includes the following steps:
S1: initializing the sequence number i = 1, and obtaining the total number I of data packets from the data to be identified.
S2: receiving the ith data packet of the data to be identified.
S3: judging whether the ith data packet can be matched with the first layer of the feature template; if so, entering S4; otherwise, entering S11.
It should be noted that if a layer element is empty, the data is considered to conform to that layer element by default.
S4: judging whether the ith data packet can be matched with the second layer of the feature template; if so, entering S5; otherwise, entering S11.
S5: judging whether the ith data packet can be matched with the third layer of the feature template; if so, entering S6; otherwise, entering S11.
S6: judging whether the data packet sequence numbers configured in the fourth layer of the feature template contain i; if so, entering S7; otherwise, entering S8.
S7: judging whether the ith data packet can be matched with the interaction feature configured in the fourth layer for sequence number i; if so, entering S8; otherwise, entering S11.
S8: judging whether the ith data packet can be matched with the fifth layer of the feature template; if so, entering S9; otherwise, entering S11.
S9: judging whether i = I holds; if so, entering S10; otherwise, letting i = i + 1 and returning to S2.
S10: storing the data to be identified as the required data, and ending.
S11: the data to be identified is not the required data, and the process ends.
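Steps S1 to S11 can be sketched as a single loop over the data packets. The following is a minimal illustration under stated assumptions, not the patent's implementation: the `Packet` and `Template` classes, their field names, and the use of a callable for the fifth layer are all hypothetical, and treating data with no packets as "not required" is an assumption the source does not address.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Packet:
    protocol: str
    host: str
    port: int
    interaction: str
    payload: bytes


@dataclass
class Template:
    protocols: Set[str] = field(default_factory=set)             # layer 1
    hosts: Set[str] = field(default_factory=set)                 # layer 2
    ports: Set[int] = field(default_factory=set)                 # layer 3
    interactions: Dict[int, str] = field(default_factory=dict)   # layer 4
    payload_check: Optional[Callable[[bytes], bool]] = None      # layer 5


def is_required_data(packets: List[Packet], t: Template) -> bool:
    """Return True if every packet passes all configured layers (S10), else False (S11)."""
    if not packets:  # assumption: data with no packets is not the required data
        return False
    # S1, S2, S9: i runs from 1 to I = len(packets); the loop ends once i = I.
    for i, pkt in enumerate(packets, start=1):
        if t.protocols and pkt.protocol not in t.protocols:      # S3 (empty layer matches)
            return False
        if t.hosts and pkt.host not in t.hosts:                  # S4
            return False
        if t.ports and pkt.port not in t.ports:                  # S5
            return False
        # S6: is sequence number i configured? S7: if so, its feature must match.
        if i in t.interactions and pkt.interaction != t.interactions[i]:
            return False
        if t.payload_check is not None and not t.payload_check(pkt.payload):  # S8
            return False
    return True                                                  # S10
```

Because any failed layer returns immediately, non-matching data is rejected on the first offending packet rather than after the whole flow has been read.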
The method of the present embodiment differs from the precise identification method as follows:
A:/lizishuoming/xxx&uid=890&num=2
B:/lizishuoming/xxx&mail=&num=3
C:/lizishuoming/xxx&tel=059&num=4
With precise identification, the data found can only come from A; anything else counts as a misidentification. With the method described in this embodiment, the data found may come from A, B, C, and so on, provided that A, B, and C all conform to the same feature template (for example, A, B, and C all have the features lizishuoming/xxx and num). For example, many online gambling sites have similar characteristics, that is, they conform to the same feature template. With the identification method of this embodiment, a new gambling website that appears later can be identified effectively as long as it conforms to the feature template.
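The contrast between the two approaches on A, B, and C can be illustrated with a short sketch. Both functions are hypothetical: `matches_exact` stands in for a precise identification scheme written for A's exact field list, and `matches_template` stands in for a template that requires only the shared path and the `num` field. Splitting the strings on `&` is an assumption about how these example strings are structured.

```python
def _split(request: str):
    """Split an example string into its path and the list of field names."""
    path, *params = request.split("&")
    return path, [p.split("=", 1)[0] for p in params]


def matches_exact(request: str) -> bool:
    """Precise identification: the path and exactly the fields of A."""
    path, keys = _split(request)
    return path == "/lizishuoming/xxx" and keys == ["uid", "num"]


def matches_template(request: str) -> bool:
    """Template identification: only the shared path and the num field are required."""
    path, keys = _split(request)
    return path == "/lizishuoming/xxx" and "num" in keys


A = "/lizishuoming/xxx&uid=890&num=2"
B = "/lizishuoming/xxx&mail=&num=3"
C = "/lizishuoming/xxx&tel=059&num=4"
```

Under these assumptions the exact scheme accepts only A, while the template accepts A, B, and C, which is the generality the embodiment describes.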
The embodiment of the invention provides a feature template and a data identification method based on the feature template, which can identify, classify, and store data of the same kind at the source, and which have the following beneficial effects:
1. Compared with a full-scale acquisition method, the method can effectively reduce the data storage space.
2. Compared with a precise identification method, no identification scheme needs to be developed for each kind of data, so the method is highly general, and data that newly appears on the network and matches an existing template can be found through the template.
3. The volume of data on the network is large, and much of it shares common features or common templates. Much new data sharing features with existing data will appear in the future, and the method can find such new data more efficiently.
Embodiment two:
the invention further provides a data identification terminal device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the computer program to realize the steps of the method embodiment of the first embodiment of the invention.
Further, as an executable scheme, the data identification terminal device may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The data identification terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the above structure is only an example of the data identification terminal device and does not limit it; the device may include more or fewer components than those listed, combine some components, or use different components. For example, the data identification terminal device may further include an input/output device, a network access device, a bus, and the like, which is not limited by the embodiment of the present invention.
Further, as an executable scheme, the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the data identification terminal device and connects the various parts of the entire device through various interfaces and lines.
The memory can be used to store the computer program and/or modules, and the processor implements the various functions of the data identification terminal device by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly comprise a program storage area and a data storage area, wherein the program storage area may store an operating system and the application program required by at least one function, and the data storage area may store data created through use of the terminal device, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage device.
The invention also provides a computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the above-mentioned method of an embodiment of the invention.
If the modules/units integrated in the data identification terminal device are implemented in the form of software functional units and sold or used as separate products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), a software distribution medium, and the like.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A data recognition method, characterized by: constructing a feature template of the data to be identified according to the features of the data to be identified, matching each piece of data to be identified against the feature template, and taking the data to be identified which can be matched as the identified data.
2. The data recognition method of claim 1, wherein: the feature template comprises a plurality of layers, wherein the matching is to match each layer respectively, and when all the layers can be matched, the data to be identified can be matched.
3. The data recognition method of claim 1, wherein: the characteristic template comprises five layers: the first layer is a communication protocol; the second layer is a website domain name or address; the third layer is a network port number; the fourth layer is interaction; the fifth layer is data format and field characteristics.
4. The data recognition method of claim 3, wherein: the interaction comprises the serial number of each data packet with the interaction characteristics in the data to be identified and the interaction characteristics corresponding to the serial number.
5. The data recognition method of claim 4, wherein: the specific matching method of each data to be identified comprises the following steps:
S1: initializing the sequence number i = 1, and obtaining the total number I of data packets from the data to be identified;
S2: receiving the ith data packet of the data to be identified;
S3: judging whether the ith data packet can be matched with the first layer of the feature template; if so, entering S4; otherwise, entering S11;
S4: judging whether the ith data packet can be matched with the second layer of the feature template; if so, entering S5; otherwise, entering S11;
S5: judging whether the ith data packet can be matched with the third layer of the feature template; if so, entering S6; otherwise, entering S11;
S6: judging whether the data packet sequence numbers configured in the fourth layer of the feature template contain i; if so, entering S7; otherwise, entering S8;
S7: judging whether the ith data packet can be matched with the interaction feature configured in the fourth layer for sequence number i; if so, entering S8; otherwise, entering S11;
S8: judging whether the ith data packet can be matched with the fifth layer of the feature template; if so, entering S9; otherwise, entering S11;
S9: judging whether i = I holds; if so, entering S10; otherwise, letting i = i + 1 and returning to S2;
S10: the data to be identified is the required data, and the process ends;
S11: the data to be identified is not the required data, and the process ends.
6. A data recognition terminal device characterized by: comprising a processor, a memory and a computer program stored in the memory and running on the processor, the processor implementing the steps of the method according to any of claims 1 to 5 when executing the computer program.
7. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.
CN201911092398.5A 2019-11-11 2019-11-11 Data identification method, terminal equipment and storage medium Pending CN110851687A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911092398.5A CN110851687A (en) 2019-11-11 2019-11-11 Data identification method, terminal equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911092398.5A CN110851687A (en) 2019-11-11 2019-11-11 Data identification method, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110851687A true CN110851687A (en) 2020-02-28

Family

ID=69600922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911092398.5A Pending CN110851687A (en) 2019-11-11 2019-11-11 Data identification method, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110851687A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609445A (en) * 2010-12-10 2012-07-25 Microsoft Corporation Matching queries to data operations using query templates
CN105488210A (en) * 2015-12-11 2016-04-13 Kingdee Software (China) Co., Ltd. Batch data matching method and device
CN106384282A (en) * 2016-06-14 2017-02-08 Ping An Technology (Shenzhen) Co., Ltd. Method and device for building decision-making model
CN108449231A (en) * 2018-03-15 2018-08-24 Huaqing Rongtian (Beijing) Technology Co., Ltd. A kind of filter method of transaction data, device and realization device
CN109491990A (en) * 2018-09-17 2019-03-19 Wuhan Dameng Database Co., Ltd. A kind of method of detection data quality and the device of detection data quality

Similar Documents

Publication Publication Date Title
CN108427731B (en) Page code processing method and device, terminal equipment and medium
CN111062013B (en) Account filtering method and device, electronic equipment and machine-readable storage medium
US20200204688A1 (en) Picture book sharing method and apparatus and system using the same
CN107360261A (en) A kind of HTTP request processing method, device and electronic equipment
CN109815112B (en) Data debugging method and device based on functional test and terminal equipment
CN104219230A (en) Method and device for identifying malicious websites
CN114429401A (en) Transaction data processing method and device, terminal device and readable storage medium
CN111273891A (en) Business decision method and device based on rule engine and terminal equipment
CN113704307A (en) Data query method, device, server and computer readable storage medium
CN107748772B (en) Trademark identification method and device
CN108897592A (en) A kind of software methods of exhibiting and relevant device
CN111177243B (en) Data export method and device, storage medium and electronic device
CN108717449A (en) A kind of information processing method and system
CN111859069B (en) Network malicious crawler identification method, system, terminal and storage medium
CN107506407B (en) File classification and calling method and device
CN109669678A (en) Template engine integration method, device, electronic equipment and storage medium
CN117093619A (en) Rule engine processing method and device, electronic equipment and storage medium
CN110851687A (en) Data identification method, terminal equipment and storage medium
CN111212153A (en) IP address checking method, device, terminal equipment and storage medium
CN111209325A (en) Service system interface identification method, device and storage medium
CN111159226A (en) Index query method and system
CN114124883B (en) Data access method and device based on cloud storage address, computer equipment and medium
CN110222286A (en) Information acquisition method, device, terminal and computer readable storage medium
WO2019001333A1 (en) Application interface display method, apparatus and electronic device
CN110795405B (en) Fragment data restoration method, terminal device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200228)