CN108734149B - Text data scanning method and device - Google Patents


Info

Publication number
CN108734149B
Authority
CN
China
Prior art keywords
byte
data
text data
scanned
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810531245.5A
Other languages
Chinese (zh)
Other versions
CN108734149A (en
Inventor
温悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Advanced New Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced New Technologies Co Ltd filed Critical Advanced New Technologies Co Ltd
Priority to CN201810531245.5A priority Critical patent/CN108734149B/en
Publication of CN108734149A publication Critical patent/CN108734149A/en
Application granted granted Critical
Publication of CN108734149B publication Critical patent/CN108734149B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Document Processing Apparatus (AREA)

Abstract

Embodiments of this specification provide a text data scanning method and device for scanning text data that is stored in byte form. The text data to be scanned is read as a byte stream and matched byte by byte by a pre-configured byte stream regular matching tool. The tool contains a byte matching rule that meets the scanning requirement for the text data to be scanned, and the rule is written in advance according to the byte encoding format of that text data. Text data that is successfully matched is determined to be the scanned key data.

Description

Text data scanning method and device
Technical Field
The present disclosure relates to the field of information security, and in particular, to a method and an apparatus for scanning text data.
Background
Scanning large volumes of text with regular expressions is common practice in big data security. For example, a regular expression that describes a privacy scanning rule can be configured in a regular matching engine, which then scans and matches the text data to determine whether customer privacy data is present in the text.
In the prior art, text data is usually stored on storage media as binary files. To scan it, the usual approach is to read the text data as a byte stream, decode the bytes into meaningful, readable character data, and then scan and match the resulting character stream with a regular expression matching tool, thereby finding the key data that satisfies the regular expression matching rule within a large amount of text data. However, decoding a large amount of text data up front takes a long time and lowers the data scanning rate, so a faster and more efficient data scanning scheme is needed.
Disclosure of Invention
In view of the above technical problem, embodiments of this specification provide a text data scanning method and apparatus. The technical solution is as follows:
According to a first aspect of the embodiments of this specification, a text data scanning method is provided, applied to text data to be scanned that is stored in byte form, the method comprising:
reading the text data to be scanned as a byte stream;
matching the text data to be scanned byte by byte with a pre-configured byte stream regular matching tool, wherein the byte stream regular matching tool contains a byte matching rule that meets the scanning requirement for the text data to be scanned, and the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned; and
determining the successfully matched text data to be the scanned key data, wherein the scanned key data is text data in the form of a byte sequence.
According to a second aspect of embodiments herein, there is provided a text data scanning apparatus applied to text data to be scanned stored in a byte form, the apparatus including:
a data reading module, configured to read the text data to be scanned as a byte stream;
a data matching module, configured to match the text data to be scanned byte by byte with a pre-configured byte stream regular matching tool, wherein the byte stream regular matching tool contains a byte matching rule that meets the scanning requirement for the text data to be scanned, and the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned; and
a data extraction module, configured to determine the successfully matched text data to be the scanned key data, wherein the scanned key data is text data in the form of a byte sequence.
According to a third aspect of embodiments herein, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements a text data scanning method applied to text data to be scanned stored in byte form, the method comprising:
reading text data to be scanned in a byte stream mode;
matching the text data to be scanned according to bytes by a pre-configured byte stream regular matching tool, wherein the byte stream regular matching tool comprises a byte matching rule meeting the scanning requirement of the text data to be scanned, and the byte matching rule is pre-written according to the byte coding format of the text data to be scanned;
and determining the successfully matched text data as the scanned key data, wherein the scanned key data is the text data in the form of byte sequence.
With the technical solution provided by the embodiments of this specification, regular matching is performed directly on text data in byte stream form during scanning, and the matching result is obtained as a byte sequence. The intermediate conversion from byte stream to character stream is avoided, which improves data scanning efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of embodiments of the invention.
In addition, any one of the embodiments in the present specification is not required to achieve all of the effects described above.
Drawings
To illustrate the embodiments of this specification or the prior-art technical solutions more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some of the embodiments of this specification, and those skilled in the art can derive other drawings from them.
FIG. 1 is a schematic diagram of a prior art text data scanning method shown in an exemplary embodiment of the present description;
FIG. 2 is a flow chart of a method of scanning text data as shown in an exemplary embodiment of the present description;
FIG. 3 is a flow chart illustrating configuring a byte stream regular matching tool in accordance with an exemplary embodiment of the present specification;
FIG. 4 is another flow diagram of a text data scanning method shown in an exemplary embodiment of the present description;
FIG. 5 is a diagram illustrating a method of scanning text data in accordance with an exemplary embodiment of the present description;
FIG. 6 is a schematic diagram of a text data scanning apparatus shown in an exemplary embodiment of the present description;
FIG. 7 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the specification, as detailed in the appended claims.
The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various kinds of information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of the present specification. The word "if" as used herein may be interpreted as "at the time of", "when", or "in response to a determination", depending on the context.
Scanning large volumes of text with regular expressions is common practice in big data security. For example, text data may be scanned and matched with a regular expression that describes a privacy scanning rule to determine whether customer privacy data is present in the text.
FIG. 1 is a schematic diagram of scanning private data in the prior art. In the prior art, text data is usually stored on storage media as binary files. To scan it, the usual approach is to read the text data as a byte stream, decode the bytes into meaningful, readable character data, and then scan and match the resulting character stream with a regular expression matching tool, thereby finding the key data that satisfies the regular expression matching rule within a large amount of text data.
In practical applications, however, usually only the key data that satisfies the condition within the large amount of text is of interest, and sometimes even its specific content is irrelevant: it is enough to confirm whether the text contains key data that satisfies the condition. Taking private-data scanning as an example again, in the usual case the private data in the text to be scanned must be found and its specific content learned, and then only the scanned private data needs to be decoded. In the other case, it is only necessary to confirm whether the text data contains private data, without knowing its specific content, and then scanning alone suffices and the text data need not be decoded at all. In the prior art, in either case, all of the text data to be scanned must first be decoded and converted before scanning and matching. Decoding all of a large amount of text data up front incurs extra performance overhead and lowers the data scanning rate, so a faster and more efficient data scanning scheme is needed.
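The contrast between the two flows can be sketched in Python as follows. This is an illustrative sketch only: the function names, the sample data, and the 19-digit rule are assumptions made for the example and do not come from the specification. Python's standard `re` module accepts `bytes` patterns, which stands in here for a byte stream regular matching tool.

```python
import re
from typing import List

# Hypothetical sample: mostly irrelevant text, with one 19-digit number near the end.
data = ("订单备注 " * 1000 + "卡号 6222020200112233445").encode("utf-8")

def prior_art_scan(raw: bytes) -> List[str]:
    """Prior-art flow: decode all of the data first, then match on characters."""
    text = raw.decode("utf-8")
    return re.findall(r"[0-9]{19}", text)

def byte_stream_scan(raw: bytes) -> List[str]:
    """Byte-level flow: match on the raw bytes, then decode only the hits."""
    hits = re.findall(rb"[0-9]{19}", raw)
    return [h.decode("utf-8") for h in hits]

same_result = prior_art_scan(data) == byte_stream_scan(data)
```

Both functions find the same key data; the second one decodes only the matched bytes rather than the whole input. The explicit class `[0-9]` is used in both patterns because `\d` has different meanings for `str` and `bytes` patterns in Python.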
In view of the above problem, embodiments of this specification provide a text data scanning method and a text data scanning apparatus for executing the method. The method is described in detail below. Referring to FIG. 2, the text data scanning method is applied to text data to be scanned that is stored in byte form and may include the following steps:
S201, reading the text data to be scanned as a byte stream;
A byte stream is one way for a computer to read data from a data source (a disk, a network, and so on); byte stream data takes the form of a byte sequence. A byte sequence can be converted into a meaningful, readable character sequence by applying the corresponding decoding rule.
At present, text data is usually stored in storage systems as byte files, so reading the text data to be scanned as a byte stream yields an ordered sequence of bytes.
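Step S201 might look like the following minimal Python sketch. The helper name `read_byte_stream`, the chunk size, and the in-memory sample are assumptions for illustration; a real data source would typically be a file opened in binary mode.

```python
import io
from typing import BinaryIO, Iterator

def read_byte_stream(source: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield successive chunks of raw bytes from a binary source, without decoding."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        yield chunk

# An in-memory stream stands in for a file opened with open(path, "rb").
stream = io.BytesIO("用户邮箱: alice@example.com".encode("utf-8"))
chunks = list(read_byte_stream(stream, chunk_size=8))
data = b"".join(chunks)
```

Note that a chunk boundary may fall in the middle of a multi-byte character or of a potential match, so a scanner built on top of this either joins the chunks, as here, or uses a matcher that keeps its state across chunks.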
S202, matching the text data to be scanned according to bytes by a pre-configured byte stream regular matching tool, wherein the byte stream regular matching tool comprises a byte matching rule meeting the scanning requirement of the text data to be scanned, and the byte matching rule is pre-written according to the byte coding format of the text data to be scanned;
As described in step S201, once read, the text data to be scanned exists as a byte sequence. The pre-configured byte stream regular matching tool receives the bytes of that sequence in order and determines whether one or more segments of the sequence conform to the tool's matching rule. When a segment conforms to the pre-configured byte matching rule, that segment is judged to be successfully matched.
For example, to scan a user's mailbox (e-mail) information out of a large amount of text data, the matching rule can be set to match text data that contains the keyword "mailbox" or that has a combination of letters and digits conforming to the mailbox naming rule.
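A mailbox rule of this kind could be sketched as a bytes pattern in Python. The keyword "邮箱" (mailbox), the simplified e-mail shape, and the name `MAILBOX_RULE` are assumptions for illustration; the specification does not give a concrete expression, and the data is assumed to be UTF-8 encoded.

```python
import re

# Hypothetical rule: the UTF-8 bytes of the keyword "邮箱" (mailbox),
# or a simplified e-mail-shaped token made of ASCII letters, digits, and punctuation.
KEYWORD = re.escape("邮箱".encode("utf-8"))
EMAIL = rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
MAILBOX_RULE = re.compile(KEYWORD + b"|" + EMAIL)

data = "联系邮箱: bob42@example.org".encode("utf-8")
matches = [m.group(0) for m in MAILBOX_RULE.finditer(data)]
```

Each element of `matches` is a `bytes` object, that is, key data in byte-sequence form that has not been decoded.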
The byte stream regular matching tool is explained below using a byte stream regular matching engine as an example. Such an engine typically matches the text data to be scanned by means of a finite automaton. Finite automata are divided into NFAs (non-deterministic finite automata) and DFAs (deterministic finite automata). A finite automaton changes state each time it receives a new byte. When a segment of the byte sequence drives the automaton through successive state changes from its initial state to a final state, that segment is judged to conform to the byte matching rule in the engine, and the match succeeds.
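The state-change behaviour can be illustrated with a small hand-written DFA in Python. This is a sketch under assumptions: the rule (lowercase letters, then `@`, then lowercase letters), the state names, and the `scan` strategy of keeping the longest accepted segment are chosen for the example; a production engine would compile the automaton from a regular expression rather than hard-code it.

```python
from typing import List, Optional

# States of a hypothetical DFA for the byte rule [a-z]+@[a-z]+
START, LOCAL, AT, DOMAIN = range(4)
FINAL_STATES = {DOMAIN}

def step(state: int, byte: int) -> Optional[int]:
    """Return the next state for one input byte, or None if the DFA rejects it."""
    is_lower = 0x61 <= byte <= 0x7A          # bytes for 'a'..'z'
    if state in (START, LOCAL):
        if is_lower:
            return LOCAL
        if state == LOCAL and byte == 0x40:  # byte for '@'
            return AT
    elif state in (AT, DOMAIN):
        if is_lower:
            return DOMAIN
    return None

def scan(data: bytes) -> List[bytes]:
    """Collect the longest accepted byte segment found from each scan position."""
    results, pos = [], 0
    while pos < len(data):
        state, end = START, None
        for i in range(pos, len(data)):
            nxt = step(state, data[i])
            if nxt is None:
                break
            state = nxt
            if state in FINAL_STATES:
                end = i + 1                  # remember the last final state reached
        if end is None:
            pos += 1                         # no match starts here; advance one byte
        else:
            results.append(data[pos:end])
            pos = end
    return results

found = scan("邮箱 alice@example 和 bob@test".encode("utf-8"))
```

The automaton consumes integers in the range 0 to 255, so the multi-byte UTF-8 characters in the input are simply rejected byte by byte and never decoded. Restarting from every position keeps the sketch short but can be slow on long near-matches.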
The finite automaton is compiled from a regular expression, and that regular expression must be written in advance for the byte encoding format of the text data to be scanned. For example, if the text data is encoded in UTF-8, the matching rule described by the regular expression must be written for UTF-8, so that the finite automaton compiled from it can accept each byte of the byte sequence in turn and change state according to the byte accepted. Referring to FIG. 5, with a byte stream regular matching engine as the regular matching tool, regular matching can be performed directly on text data in byte stream form during scanning, and the matching result is obtained as a byte sequence. The intermediate conversion from byte stream to character stream is avoided, which improves data scanning efficiency.
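Why the rule must follow the encoding can be seen in a short Python sketch. The keyword, the sample text, and the choice of GBK as the second encoding are assumptions for illustration: the same characters produce different byte sequences under different encodings, so a rule written for one encoding does not match data stored in another.

```python
import re

keyword = "邮箱"  # "mailbox"

# One byte rule per encoding, built from the same character keyword.
utf8_rule = re.compile(re.escape(keyword.encode("utf-8")))
gbk_rule = re.compile(re.escape(keyword.encode("gbk")))

utf8_data = "邮箱: dave@example.com".encode("utf-8")
gbk_data = "邮箱: dave@example.com".encode("gbk")

utf8_hit = utf8_rule.search(utf8_data) is not None   # rule and data agree
gbk_hit = gbk_rule.search(gbk_data) is not None      # rule and data agree
cross_hit = utf8_rule.search(gbk_data) is not None   # UTF-8 rule on GBK data
```

Here `utf8_hit` and `gbk_hit` are true while `cross_hit` is false, which is why the byte matching rule has to be prepared for the specific byte encoding format of the stored text.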
After scanning, if successfully matched text data exists, it is determined to be the scanned key data; if none exists, the text to be scanned does not contain the key data.
In practical applications, the subsequent processing can be chosen according to need. For example, in some cases the specific content of the key data does not need to be known, and it is only necessary to confirm whether the text to be scanned contains key data. In that case it suffices to finish scanning and confirm whether any key data was found; no decoding is needed. In other cases the specific content of the key data is needed, so after scanning, the key data is decoded to convert it from byte-sequence form into character-string form. Decoding and conversion between character data and byte data is prior art and is not described further here. Whichever processing mode is used, at most the scanned key data needs to be decoded, which avoids the time spent decoding all of the data to be scanned and improves scanning efficiency.
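The two processing modes can be sketched in Python as follows. The rule (any run of six ASCII digits) and the function names are hypothetical and chosen only for the example.

```python
import re
from typing import List

ID_RULE = re.compile(rb"[0-9]{6}")  # hypothetical rule: six consecutive ASCII digits

def contains_key_data(data: bytes) -> bool:
    """Mode 1: only confirm whether key data exists; nothing is decoded."""
    return ID_RULE.search(data) is not None

def extract_key_data(data: bytes, encoding: str = "utf-8") -> List[str]:
    """Mode 2: decode only the matched byte sequences into character strings."""
    return [m.group(0).decode(encoding) for m in ID_RULE.finditer(data)]

sample = "验证码 123456 已发送".encode("utf-8")
present = contains_key_data(sample)
values = extract_key_data(sample)
```

In neither mode is the surrounding text decoded; in the second mode the decoding cost is proportional to the size of the matches rather than the size of the input.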
The byte stream regular matching tool pre-configured in step S202 corresponds to a byte stream regular expression that describes the byte data matching rule; that is, the tool is built from a byte stream regular expression describing the byte matching rule for the text data to be scanned. On this basis, an embodiment of this specification provides a method for obtaining the byte stream regular matching tool. Referring to FIG. 3, the method may include the following steps:
S301, obtaining a character stream regular expression, wherein the character stream regular expression describes a character data matching rule that meets the scanning requirement for the text data to be scanned;
S302, replacing the character data in the character data matching rule with the corresponding byte data to obtain the byte stream regular expression corresponding to the character stream regular expression, wherein the character data and the corresponding byte data can be converted into each other by applying a specified encoding rule;
For example, suppose the character data matching rule is to retrieve every occurrence of the characters "ABC" in the text data to be scanned, and the text data in byte stream form is encoded with encoding X. Every occurrence of "ABC" in the original character data matching rule is then replaced with the byte data obtained by encoding "ABC" with encoding X. The regular expression describing the resulting rule can then perform byte matching on text data to be scanned whose encoding is X.
S303, configuring a corresponding byte stream regular matching tool according to the byte stream regular expression.
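Steps S301 to S303 can be sketched in Python as below. The function `to_byte_regex` is a hypothetical helper, not part of any library or of the specification. It assumes an ASCII-compatible encoding such as UTF-8 or GBK, and it only handles non-ASCII characters that appear as literals; non-ASCII characters inside a character class such as `[...]` would need separate treatment.

```python
import re

def to_byte_regex(char_pattern: str, encoding: str = "utf-8") -> bytes:
    """Replace each non-ASCII literal character with its encoded bytes (S302).

    ASCII characters, including regex metacharacters, are copied unchanged,
    which is only valid for ASCII-compatible encodings.
    """
    out = bytearray()
    for ch in char_pattern:
        if ord(ch) < 0x80:
            out += ch.encode("ascii")
        else:
            encoded = ch.encode(encoding)
            # Group the bytes so that a following quantifier applies to the whole character.
            out += b"(?:" + b"".join(b"\\x%02x" % b for b in encoded) + b")"
    return bytes(out)

# S301: a character stream regular expression (hypothetical mailbox rule).
char_regex = r"邮箱:\s*[a-z0-9]+@[a-z0-9.]+"
# S302: the corresponding byte stream regular expression for UTF-8 data.
byte_regex = to_byte_regex(char_regex, "utf-8")
# S303: configure the matching tool from the byte stream regular expression.
tool = re.compile(byte_regex)

data = "我的邮箱: carol7@mail.example".encode("utf-8")
match = tool.search(data)
```

Passing a different `encoding` argument yields a different byte stream regular expression from the same character rule, which mirrors the "encoding X" example in the text.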
Referring to fig. 4, the present specification also provides another text data scanning method, which is applied to text data to be scanned stored in a byte form, and may include the following steps:
S401, reading the text data to be scanned as a byte stream;
S402, accepting, by the pre-configured byte stream regular matching tool, the bytes of the text data to be scanned one by one through a finite automaton, and judging a byte sequence to be successfully matched text data when it drives the finite automaton to a final state, wherein the byte stream regular matching tool contains a byte matching rule that meets the scanning requirement for the text data to be scanned, and the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned;
S403, determining whether successfully matched text data exists; if so, executing step S404; if not, ending the process;
S404, determining the successfully matched text data to be the scanned key data, wherein the scanned key data is text data in the form of a byte sequence;
S405, decoding the key data to convert it from byte-sequence form into character-string form.
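The flow of steps S401 to S405 can be sketched end to end in Python. The function name `scan_text_data`, the 11-digit rule, and the sample text are assumptions for illustration, and Python's `re` module with a `bytes` pattern stands in for the finite-automaton-based matching tool.

```python
import io
import re

def scan_text_data(source, rule, encoding="utf-8", decode=True):
    """Run the S401-S405 flow on a binary source with a compiled bytes pattern."""
    data = source.read()                                  # S401: raw bytes, no decoding
    matched = [m.group(0) for m in rule.finditer(data)]   # S402: byte-wise matching
    if not matched:                                       # S403: nothing matched, end
        return []
    key_data = matched                                    # S404: key data as byte sequences
    if decode:                                            # S405: decode only the key data
        return [item.decode(encoding) for item in key_data]
    return key_data

# Hypothetical rule: an 11-digit number starting with 1.
phone_rule = re.compile(rb"1[0-9]{10}")
text = "电话 13800138000 和 13912345678"

decoded = scan_text_data(io.BytesIO(text.encode("utf-8")), phone_rule)
raw = scan_text_data(io.BytesIO(text.encode("utf-8")), phone_rule, decode=False)
```

For brevity the sketch reads the whole source at once; scanning very large inputs chunk by chunk would need a matcher that carries its state across chunk boundaries.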
Corresponding to the above method embodiment, an embodiment of this specification further provides a text data scanning apparatus. Referring to FIG. 6, the apparatus is used to scan text data to be scanned that is stored in byte form and may include a data reading module 610, a data matching module 620, and a data extraction module 630.
The data reading module 610 is configured to read the text data to be scanned as a byte stream.
The data matching module 620 is configured to match the text data to be scanned byte by byte with a pre-configured byte stream regular matching tool, wherein the byte stream regular matching tool contains a byte matching rule that meets the scanning requirement for the text data to be scanned, and the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned.
The data extraction module 630 is configured to determine the successfully matched text data to be the scanned key data, wherein the scanned key data is text data in the form of a byte sequence.
An embodiment of the present specification further provides an electronic device, which at least includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the foregoing text data scanning method when executing the program, and the method is applied to scan text data to be scanned, which is stored in a byte form, and the method at least includes:
reading text data to be scanned in a byte stream mode;
matching the text data to be scanned according to bytes by a pre-configured byte stream regular matching tool, wherein the byte stream regular matching tool comprises a byte matching rule meeting the scanning requirement of the text data to be scanned, and the byte matching rule is pre-written according to the byte coding format of the text data to be scanned;
and determining the successfully matched text data as the scanned key data, wherein the scanned key data is the text data in the form of byte sequence.
FIG. 7 is a diagram illustrating a more specific hardware configuration of a computing device provided by an embodiment of the present disclosure, where the device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used to connect an input/output module for inputting and outputting information. The input/output module may be built into the device as a component (not shown in the drawings) or attached externally to provide the corresponding functions. Input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The present specification also provides a computer readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the foregoing text data scanning method, the method is applied to scan text data to be scanned, which is stored in a byte form, and the method at least includes:
reading text data to be scanned in a byte stream mode;
matching the text data to be scanned according to bytes by a pre-configured byte stream regular matching tool, wherein the byte stream regular matching tool comprises a byte matching rule meeting the scanning requirement of the text data to be scanned, and the byte matching rule is pre-written according to the byte coding format of the text data to be scanned;
and determining the successfully matched text data as the scanned key data, wherein the scanned key data is the text data in the form of byte sequence.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
For the implementation of the functions and roles of each unit in the above apparatus, see the implementation of the corresponding steps in the above method; it is not described again here.
Since the apparatus embodiments substantially correspond to the method embodiments, reference may be made to the relevant parts of the method embodiments. The apparatus embodiments described above are merely illustrative: units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in this specification. One of ordinary skill in the art can understand and implement it without inventive effort.
From the above description of the embodiments, it is clear to those skilled in the art that the embodiments of the present disclosure can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the embodiments of the present specification may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments of the present specification.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
The embodiments in this specification are described in a progressive manner; for parts that are the same or similar among the embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the others. In particular, because the apparatus embodiment is substantially similar to the method embodiment, it is described relatively briefly, and reference may be made to the relevant parts of the method embodiment. The apparatus embodiments described above are merely illustrative: modules described as separate components may or may not be physically separate, and when the embodiments of this specification are implemented, the functions of the modules may be implemented in one or more pieces of software and/or hardware. Some or all of the modules can also be selected according to actual needs to achieve the purpose of the solution of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The foregoing is only a specific embodiment of the embodiments of the present disclosure, and it should be noted that, for those skilled in the art, a plurality of modifications and decorations can be made without departing from the principle of the embodiments of the present disclosure, and these modifications and decorations should also be regarded as the protection scope of the embodiments of the present disclosure.
The above description is only a preferred embodiment of the present disclosure, and should not be taken as limiting the present disclosure, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (7)

1. A text data scanning method, applied to text data to be scanned that is stored in byte form, the method comprising:
reading the text data to be scanned as a byte stream;
matching the text data to be scanned byte by byte using a pre-configured byte stream regular matching tool, wherein the byte stream regular matching tool comprises a byte matching rule that meets the scanning requirement of the text data to be scanned, the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned, the pre-configured byte stream regular matching tool corresponds to a byte stream regular expression describing the byte data matching rule, and the byte stream regular expression is obtained by: acquiring a character stream regular expression, wherein the character stream regular expression describes a character data matching rule that meets the scanning requirement of the text data to be scanned; and replacing the character data in the character data matching rule with the corresponding byte data, wherein the character data and the corresponding byte data are interconvertible under a specified encoding rule; and
determining the successfully matched text data as the scanned key data, wherein the scanned key data is text data in the form of a byte sequence.
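The acquisition step in claim 1 (character stream regular expression in, byte stream regular expression out) can be illustrated with a minimal Python sketch. This is not the patented implementation: the helper names `to_byte_pattern` and `scan`, the sample rule, and the sample data are all hypothetical, and the sketch assumes an ASCII-compatible encoding such as UTF-8 so that ASCII regex syntax can be kept unchanged. It also does not handle non-ASCII characters inside bracketed character classes.

```python
import re

def to_byte_pattern(char_pattern, encoding="utf-8"):
    """Hypothetical helper: replace each non-ASCII character in a character
    rule with a group that holds its encoded bytes as \\xNN escapes."""
    parts = []
    for ch in char_pattern:
        if ord(ch) < 128:
            # ASCII characters (including regex metacharacters) are kept as-is;
            # this assumes an ASCII-compatible encoding.
            parts.append(ch)
        else:
            # Group the bytes so a following quantifier applies to the whole character.
            escaped = "".join(f"\\x{b:02x}" for b in ch.encode(encoding))
            parts.append(f"(?:{escaped})")
    return "".join(parts).encode("ascii")

def scan(data, byte_pattern):
    """Return every matched byte sequence (the key data); the input is never decoded."""
    return [m.group() for m in re.finditer(byte_pattern, data)]

# Character rule: the word for "phone", a colon, then 11 digits (illustrative only).
char_rule = "电话:\\d{11}"
byte_rule = to_byte_pattern(char_rule)

# Stands in for text data stored in byte form.
raw = "客户电话:13800000000,已登记".encode("utf-8")
key_data = scan(raw, byte_rule)
```

Here `key_data` holds byte sequences rather than strings, which corresponds to the last step of the claim. Python's `re` module accepts `bytes` patterns directly, so the converted rule can be applied to the raw bytes without a decoding pass.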
2. The method of claim 1, wherein determining the successfully matched text data as the scanned key data further comprises:
decoding the key data to convert the key data from byte sequence form into text data in character string form.
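The decoding step of claim 2 can be sketched as follows, assuming the stored text is UTF-8. The function name `decode_key_data`, the constant `ENCODING`, and the sample rule are hypothetical. The point of the sketch is that only the matched byte sequences are decoded, not the whole input.

```python
import re

ENCODING = "utf-8"  # assumption: must equal the encoding the text was stored with

def decode_key_data(matches, encoding=ENCODING):
    """Convert matched byte sequences into character strings."""
    return [m.decode(encoding) for m in matches]

# Stands in for stored byte-form text; the second field means "note=urgent".
raw = "id=A1029;备注=加急;".encode(ENCODING)

# Byte rule: the encoded field name followed by any bytes up to the next ';'.
byte_rule = "备注=".encode(ENCODING) + rb"[^;]+"

matches = re.findall(byte_rule, raw)   # key data as byte sequences
decoded = decode_key_data(matches)     # key data as character strings
```

If a rule could stop in the middle of a multi-byte character, `bytes.decode` would raise `UnicodeDecodeError`, so a byte rule should be written to end on a character boundary.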
3. The method of claim 1, wherein matching the text data to be scanned byte by byte using the pre-configured byte stream regular matching tool comprises:
accepting, by the pre-configured byte stream regular matching tool through a finite automaton, the bytes of the text data to be scanned one by one, and when a byte sequence drives the finite automaton into a final state, determining that byte sequence to be successfully matched text data.
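A hand-written deterministic finite automaton shows the byte-at-a-time acceptance that claim 3 describes. The rule used here (`id=` followed by one or more ASCII digits), the state numbering, and the leftmost-longest restart policy in `scan` are assumptions made for illustration; the patent does not specify how the automaton is built or how scanning resumes after a match.

```python
DEAD = -1      # no continuation can match
FINAL = {4}    # states in which the bytes read so far form a match

def step(state, byte):
    """Transition function for the illustrative rule id=[0-9]+ ."""
    if state == 0 and byte == ord("i"):
        return 1
    if state == 1 and byte == ord("d"):
        return 2
    if state == 2 and byte == ord("="):
        return 3
    if state in (3, 4) and ord("0") <= byte <= ord("9"):
        return 4
    return DEAD

def scan(data):
    """Feed bytes to the automaton one by one and collect the longest match
    found from each starting position."""
    matches = []
    pos = 0
    while pos < len(data):
        state, end = 0, -1
        for i in range(pos, len(data)):
            state = step(state, data[i])  # accept a single byte
            if state == DEAD:
                break
            if state in FINAL:
                end = i + 1               # remember the longest accepted prefix
        if end != -1:
            matches.append(data[pos:end])
            pos = end                     # resume after the match
        else:
            pos += 1                      # no match here, try the next byte
    return matches

# Multi-byte UTF-8 characters are simply bytes the automaton rejects.
data = "编号 id=42,iid=7 id=".encode("utf-8")
matches = scan(data)
```

Because the automaton consumes raw byte values, the surrounding Chinese characters never need to be decoded; their bytes just send the automaton to the dead state.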
4. A text data scanning apparatus, applied to text data to be scanned that is stored in byte form, the apparatus comprising:
a data reading module, configured to read the text data to be scanned as a byte stream;
a data matching module, configured to match the text data to be scanned byte by byte using a pre-configured byte stream regular matching tool, wherein the byte stream regular matching tool comprises a byte matching rule that meets the scanning requirement of the text data to be scanned, the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned, the pre-configured byte stream regular matching tool corresponds to a byte stream regular expression describing the byte data matching rule, and the byte stream regular expression is obtained by: acquiring a character stream regular expression, wherein the character stream regular expression describes a character data matching rule that meets the scanning requirement of the text data to be scanned; and replacing the character data in the character data matching rule with the corresponding byte data, wherein the character data and the corresponding byte data are interconvertible under a specified encoding rule; and
a data extraction module, configured to determine the successfully matched text data as the scanned key data, wherein the scanned key data is text data in the form of a byte sequence.
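The three modules of claim 4 can be pictured as three small cooperating objects. The class names below mirror the claim wording, but the method names, the line-by-line reading strategy, and the sample rule are assumptions for illustration only. Reading line by line means a match that spans a line break would be missed.

```python
import io
import re

class DataReadingModule:
    def read(self, stream):
        """Yield the stream line by line as bytes; nothing is decoded."""
        yield from stream

class DataMatchingModule:
    def __init__(self, byte_rule):
        # byte_rule is a bytes regular expression prepared in advance
        self._regex = re.compile(byte_rule)

    def match(self, chunk):
        """Return an iterator of match objects over one chunk of bytes."""
        return self._regex.finditer(chunk)

class DataExtractionModule:
    def extract(self, matches):
        """Return the matched byte sequences (the key data)."""
        return [m.group() for m in matches]

def scan_stream(stream, byte_rule):
    """Wire the three modules together over a binary stream."""
    reader = DataReadingModule()
    matcher = DataMatchingModule(byte_rule)
    extractor = DataExtractionModule()
    key_data = []
    for chunk in reader.read(stream):
        key_data.extend(extractor.extract(matcher.match(chunk)))
    return key_data

# io.BytesIO stands in for a file opened in binary mode.
sample = io.BytesIO(b"user=alice token=abc123\nuser=bob\ntoken=zz9\n")
result = scan_stream(sample, rb"token=\w+")
```

The same `scan_stream` call would work on a file object opened with `open(path, "rb")`, since both yield `bytes` lines when iterated.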
5. The apparatus of claim 4, wherein the text data scanning apparatus further comprises:
a decoding module, configured to decode the key data to convert the key data from byte sequence form into text data in character string form.
6. The apparatus of claim 4, wherein the data matching module is specifically configured such that:
the pre-configured byte stream regular matching tool accepts the bytes of the text data to be scanned one by one through a finite automaton, and when a byte sequence drives the finite automaton into a final state, that byte sequence is determined to be successfully matched text data.
7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of claim 1 when executing the program.
CN201810531245.5A 2018-05-29 2018-05-29 Text data scanning method and device Active CN108734149B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810531245.5A CN108734149B (en) 2018-05-29 2018-05-29 Text data scanning method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810531245.5A CN108734149B (en) 2018-05-29 2018-05-29 Text data scanning method and device

Publications (2)

Publication Number Publication Date
CN108734149A CN108734149A (en) 2018-11-02
CN108734149B true CN108734149B (en) 2022-01-18

Family

ID=63936591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810531245.5A Active CN108734149B (en) 2018-05-29 2018-05-29 Text data scanning method and device

Country Status (1)

Country Link
CN (1) CN108734149B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111309683B (en) * 2020-02-07 2023-04-14 北京明朝万达科技股份有限公司 Method and device for scanning full disk data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102685098A (en) * 2012-02-24 2012-09-19 华南理工大学 Recombination-free multi-mode matching method for out-of-order data package flow
CN106407475A (en) * 2016-11-18 2017-02-15 广州爱九游信息技术有限公司 Content screening method, device and server

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8724496B2 (en) * 2011-11-30 2014-05-13 Broadcom Corporation System and method for integrating line-rate application recognition in a switch ASIC
CN104361097A (en) * 2014-11-21 2015-02-18 国家电网公司 Real-time detection method for electric power sensitive mail based on multimode matching

Also Published As

Publication number Publication date
CN108734149A (en) 2018-11-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20201022

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201022

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant