Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present specification. Rather, they are merely examples of apparatuses and methods consistent with certain aspects of the specification, as detailed in the appended claims.
The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present specification. Depending on the context, the word "if" as used herein may be interpreted as "at the time of", "when", or "in response to a determination".
In the field of big data security, scanning large amounts of text with regular expressions is common industry practice. For example, text data may be scanned and matched against a regular expression that describes a privacy scanning rule, to determine whether customer privacy data is present in the text.
FIG. 1 is a diagram of scanning private data in the prior art. In the prior art, text data is usually stored on various storage media as binary files. To scan such text data, the general approach is to read it as a byte stream, decode the bytes into meaningful, readable character data, and then scan and match the resulting character stream with a regular expression matching tool, thereby finding the key data that satisfies the regular expression matching rule within a large amount of text data.
In practical applications, however, usually only the content of the key data that satisfies the condition matters, and sometimes even that content is not needed: it is enough to confirm whether the text contains key data that satisfies the condition. Taking the scanning of private data as an example again, in the general case the private data in the text must be found and its specific content known, so only the private data that has been found needs to be decoded. In the other case, it is only necessary to confirm whether a large amount of text data contains private data, without knowing its specific content, so scanning alone suffices and no text data needs to be decoded. In the prior art, in either case, all of the text data to be scanned must first be decoded and converted before scanning and matching are performed. Decoding and converting a large amount of text data in full incurs additional performance overhead and reduces the data scanning rate, so a faster and more efficient data scanning scheme is needed.
In view of the above problems, an embodiment of the present specification provides a text data scanning method and a text data scanning apparatus for executing the method. The text data scanning method of this embodiment is described in detail below. Referring to FIG. 2, the method is applied to text data to be scanned that is stored in byte form, and may include the following steps:
S201, reading text data to be scanned in the form of a byte stream;
A byte stream is a way for a computer to read data from a data source (disk, network, etc.); byte stream data takes the form of a sequence of bytes. The byte sequence can be converted into a meaningful, readable character sequence through the corresponding decoding rule.
At present, text data is usually stored in various storage systems as byte files, and reading the text data to be scanned as a byte stream yields an ordered sequence of bytes.
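As an illustration of step S201, the following minimal Python sketch reads a source as raw bytes without decoding it. The function name `read_byte_chunks`, the chunk size, and the sample text are assumptions made for this example and do not come from the specification.

```python
import io
from typing import BinaryIO, Iterator


def read_byte_chunks(stream: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield successive chunks of raw bytes; no decoding is performed."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        yield chunk


# An in-memory stream stands in for a file opened in binary mode, e.g. open(path, "rb").
source = io.BytesIO("联系邮箱: alice@example.com".encode("utf-8"))
data = b"".join(read_byte_chunks(source, chunk_size=8))
```

The result `data` is an ordered byte sequence; at this point nothing has been converted to characters.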
S202, matching the text data to be scanned byte by byte using a pre-configured byte stream regular matching tool, wherein the byte stream regular matching tool comprises a byte matching rule that meets the scanning requirement of the text data to be scanned, and the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned;
As described in step S201, once read, the text data to be scanned exists as a byte sequence. The pre-configured byte stream regular matching tool receives the bytes of this sequence one after another, so as to determine whether one or more segments of the byte sequence conform to its matching rule; when a segment conforms to the pre-configured byte matching rule, that segment is judged to be successfully matched.
For example, if users' mailbox (email address) information needs to be found in a large amount of text data, the matching rule can be set to match text data that contains the keyword "mailbox" or a combination of letters and digits that conforms to mailbox naming rules.
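The mailbox example can be sketched with Python's standard `re` module, which accepts `bytes` patterns and then matches directly on `bytes` input. This is only one possible stand-in for a byte stream regular matching tool; the simplified address pattern `EMAIL_RULE` and the sample text are assumptions for illustration.

```python
import re

# Hypothetical, deliberately simplified mailbox rule written as a bytes pattern.
# In a bytes pattern, classes such as [A-Za-z] match single ASCII byte values.
EMAIL_RULE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# The text is scanned in its stored UTF-8 byte form and never decoded as a whole.
raw = "用户邮箱: bob@example.org, 备用: carol@test.cn".encode("utf-8")

# Each match is itself a byte sequence.
matches = [m.group(0) for m in EMAIL_RULE.finditer(raw)]
```

Because UTF-8 encodes every non-ASCII character using only bytes of value 0x80 or higher, the ASCII-only rule cannot accidentally match inside the surrounding Chinese text.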
The byte stream regular matching tool is explained below using a byte stream regular matching engine as an example. The engine typically matches the text data to be scanned by means of a finite automaton. Finite automata are divided into NFAs (non-deterministic finite automata) and DFAs (deterministic finite automata). A finite automaton changes state each time it receives a new byte; when a byte sequence drives the automaton through successive state changes from its initial state to a final state, that byte sequence is judged to conform to the byte matching rule of the engine, and the match succeeds.
The finite automaton is compiled from a regular expression, and the regular expression must be written in advance for the byte encoding format of the text data to be scanned. If the text data is encoded in UTF-8, the matching rule described by the regular expression must be written for UTF-8, so that the finite automaton compiled from the regular expression can accept each byte of the byte sequence in turn and change state according to the byte received. Referring to FIG. 5, when a byte stream regular matching engine is used as the regular matching tool, regular matching can be performed directly on text data in byte stream form during data scanning, yielding a matching result in the form of a byte sequence. The intermediate conversion from byte stream to character stream is avoided, which improves data scanning efficiency.
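The dependence of the byte matching rule on the encoding can be shown with a short sketch. The keyword, the sample sentence, and the choice of GBK as the second encoding are assumptions for illustration; Python's `re` module is used here only as a convenient byte-level matcher.

```python
import re

KEYWORD = "邮箱"  # illustrative keyword ("mailbox")

# The same character rule yields a different byte rule for each encoding.
utf8_rule = re.compile(re.escape(KEYWORD.encode("utf-8")))
gbk_rule = re.compile(re.escape(KEYWORD.encode("gbk")))

text = "我的邮箱是 a@b.com"
utf8_data = text.encode("utf-8")
gbk_data = text.encode("gbk")

utf8_hit_on_utf8 = utf8_rule.search(utf8_data) is not None  # rule and data agree
utf8_hit_on_gbk = utf8_rule.search(gbk_data) is not None    # rule written for the wrong encoding
gbk_hit_on_gbk = gbk_rule.search(gbk_data) is not None      # rule and data agree
```

A rule written for UTF-8 finds the keyword only in UTF-8 data, which is why the rule has to be prepared for the encoding in which the text is actually stored.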
After scanning, any successfully matched text data is determined to be the scanned key data; if no text data is successfully matched, the text to be scanned does not contain the key data.
In practical applications, the subsequent data processing can be chosen as required. For example, in some cases the specific content of the key data is not needed and it is only necessary to confirm whether the text to be scanned contains key data; here it suffices to finish scanning and confirm whether any key data was found, and no decoding is needed. In other cases the specific content of the key data is needed; after the key data is found, it is decoded to convert it from a byte sequence into text data in character string form. Decoding and conversion between character data and byte data is prior art and is not described here. In either processing mode, at most the scanned key data needs to be decoded, which avoids the time wasted on decoding all of the data to be scanned and improves scanning efficiency.
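The two processing modes can be sketched as two small functions. The 11-digit number rule `PHONE_RULE`, the function names, and the sample sentence are hypothetical and serve only to show where decoding does and does not happen.

```python
import re
from typing import List

# Hypothetical rule: an 11-digit number starting with 1 and a digit 3-9,
# not adjacent to other ASCII digits.
PHONE_RULE = re.compile(rb"(?<![0-9])1[3-9][0-9]{9}(?![0-9])")


def contains_key_data(raw: bytes) -> bool:
    """Mode 1: only confirm presence; nothing is decoded."""
    return PHONE_RULE.search(raw) is not None


def extract_key_data(raw: bytes, encoding: str = "utf-8") -> List[str]:
    """Mode 2: decode only the matched byte sequences, not the whole text."""
    return [m.group(0).decode(encoding) for m in PHONE_RULE.finditer(raw)]


raw = "张三的手机号是13800138000,请保密。".encode("utf-8")
present = contains_key_data(raw)
numbers = extract_key_data(raw)
```

In the second mode the amount of decoding work grows with the size of the matches rather than with the size of the scanned text.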
The byte stream regular matching tool pre-configured in step S202 corresponds to a byte stream regular expression that describes the byte data matching rule; that is, the tool is configured according to a byte stream regular expression describing the byte matching rule for the text data to be scanned. On this basis, an embodiment of the present specification provides a method for obtaining a byte stream regular matching tool. As shown in FIG. 3, the method may include the following steps:
S301, obtaining a character stream regular expression, wherein the character stream regular expression describes a character data matching rule that meets the scanning requirement of the text data to be scanned;
S302, replacing the character data in the character data matching rule with the corresponding byte data to obtain a byte stream regular expression corresponding to the character stream regular expression, wherein the character data and the corresponding byte data can be converted into each other by applying a specified encoding rule;
For example, if the character data matching rule is to retrieve every occurrence of the characters "ABC" in the text data to be scanned, and the text data in byte stream form uses encoding X, then every occurrence of "ABC" in the original character data matching rule is replaced with the byte data obtained by encoding "ABC" with encoding X. The regular expression describing the resulting rule can then perform byte matching on text data to be scanned whose encoding is X.
S303, configuring a corresponding byte stream regular matching tool according to the byte stream regular expression.
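Steps S301 to S303 can be sketched as follows, under clearly limited assumptions: the helper `to_byte_regex` is hypothetical, it handles only non-ASCII literal characters that appear outside character classes, and it assumes an ASCII-compatible encoding such as UTF-8 so that ASCII regular expression syntax can be passed through unchanged. It is not presented as the conversion procedure of the specification.

```python
import re


def to_byte_regex(char_pattern: str, encoding: str) -> bytes:
    """S302 sketch: replace each non-ASCII literal with a group of its encoded bytes."""
    parts = []
    for ch in char_pattern:
        if ord(ch) < 128:
            # ASCII syntax and literals are kept as they are.
            parts.append(ch.encode("ascii"))
        else:
            # Grouping keeps a following quantifier applied to the whole character.
            parts.append(b"(?:" + re.escape(ch.encode(encoding)) + b")")
    return b"".join(parts)


# S301: a character stream regular expression (illustrative rule).
char_regex = r"邮箱:\s*[a-z]+@[a-z.]+"

# S302: the corresponding byte stream regular expression for UTF-8 text.
byte_regex = to_byte_regex(char_regex, "utf-8")

# S303: configure a matching tool from the byte stream regular expression.
tool = re.compile(byte_regex)

data = "联系邮箱: alice@example.com;".encode("utf-8")
hit = tool.search(data)
```

Here `hit.group(0)` is a byte sequence; decoding it with UTF-8 gives the matched text. Character classes that contain non-ASCII characters would need a more careful translation than this sketch provides.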
Referring to FIG. 4, the present specification also provides another text data scanning method, which is applied to text data to be scanned that is stored in byte form, and may include the following steps:
S401, reading text data to be scanned in the form of a byte stream;
S402, accepting, by a pre-configured byte stream regular matching tool, the bytes of the text data to be scanned one by one through a finite automaton, and when a byte sequence drives the finite automaton to a final state, judging that byte sequence to be successfully matched text data, wherein the byte stream regular matching tool comprises a byte matching rule that meets the scanning requirement of the text data to be scanned, and the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned;
S403, determining whether successfully matched text data exists; if so, step S404 is executed, and if not, the process ends.
S404, determining the successfully matched text data as the scanned key data, wherein the scanned key data is the text data existing in the form of byte sequence.
S405, decoding the key data, and converting the key data in the form of byte sequence into text data in the form of character string.
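The flow of steps S401 to S405 can be sketched end to end as below. The function name `scan_text`, the simplified rule, and the sample inputs are assumptions; for brevity the whole stream is read at once, so matches that would span separately read chunks are not a concern in this sketch.

```python
import io
import re
from typing import BinaryIO, List


def scan_text(stream: BinaryIO, rule: "re.Pattern[bytes]", encoding: str) -> List[str]:
    raw = stream.read()                                  # S401: read as bytes
    matched = [m.group(0) for m in rule.finditer(raw)]   # S402: match byte by byte
    if not matched:                                      # S403: nothing matched, end
        return []
    key_data = matched                                   # S404: key data as byte sequences
    return [item.decode(encoding) for item in key_data]  # S405: decode key data only


rule = re.compile(rb"[a-z]+@[a-z.]+")  # hypothetical, simplified mailbox rule
result = scan_text(io.BytesIO("邮箱 alice@example.com".encode("utf-8")), rule, "utf-8")
empty = scan_text(io.BytesIO("没有匹配".encode("utf-8")), rule, "utf-8")
```

Only the byte sequences selected in S404 ever reach the decoder in S405.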
Corresponding to the above method embodiment, an embodiment of the present specification further provides a text data scanning apparatus. As shown in FIG. 6, the apparatus is applied to scanning text data to be scanned that is stored in byte form, and may include a data reading module 610, a data matching module 620 and a data extraction module 630.
The data reading module 610 is configured to read text data to be scanned in the form of a byte stream;
The data matching module 620 is configured to match the text data to be scanned byte by byte using a pre-configured byte stream regular matching tool, wherein the byte stream regular matching tool comprises a byte matching rule that meets the scanning requirement of the text data to be scanned, and the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned;
The data extraction module 630 is configured to determine the successfully matched text data as the scanned key data, wherein the scanned key data is text data in the form of a byte sequence.
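One way to picture the three modules in software is as three small classes, shown here as a sketch. The class and method names are hypothetical, the modules could equally be realized in other ways, and Python's `re` module again stands in for the byte stream regular matching tool.

```python
import io
import re
from typing import BinaryIO, List


class DataReadingModule:
    """Counterpart of module 610: reads the text to be scanned as bytes."""

    def read(self, stream: BinaryIO) -> bytes:
        return stream.read()


class DataMatchingModule:
    """Counterpart of module 620: matches bytes against a pre-written byte rule."""

    def __init__(self, byte_rule: bytes) -> None:
        self._tool = re.compile(byte_rule)

    def match(self, raw: bytes) -> List[bytes]:
        return [m.group(0) for m in self._tool.finditer(raw)]


class DataExtractionModule:
    """Counterpart of module 630: treats matched byte sequences as key data."""

    def extract(self, matched: List[bytes]) -> List[bytes]:
        return list(matched)


reader = DataReadingModule()
matcher = DataMatchingModule(rb"[a-z]+@[a-z.]+")  # hypothetical rule
extractor = DataExtractionModule()

raw = reader.read(io.BytesIO("邮箱 alice@example.com".encode("utf-8")))
key_data = extractor.extract(matcher.match(raw))
```

The key data produced by the extraction step remains in byte sequence form, matching the description of module 630.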
An embodiment of the present specification further provides an electronic device, which at least includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the foregoing text data scanning method when executing the program, and the method is applied to scan text data to be scanned, which is stored in a byte form, and the method at least includes:
reading text data to be scanned in a byte stream mode;
matching the text data to be scanned according to bytes by a pre-configured byte stream regular matching tool, wherein the byte stream regular matching tool comprises a byte matching rule meeting the scanning requirement of the text data to be scanned, and the byte matching rule is pre-written according to the byte coding format of the text data to be scanned;
and determining the successfully matched text data as the scanned key data, wherein the scanned key data is the text data in the form of byte sequence.
FIG. 7 is a diagram illustrating a more specific hardware configuration of a computing device provided by an embodiment of the present disclosure, where the device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The input/output module may be configured as a component within the device (not shown in the drawings) or may be external to the device to provide the corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present device and other devices. The communication module can communicate in a wired manner (such as USB or a network cable) or in a wireless manner (such as a mobile network, Wi-Fi or Bluetooth).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The present specification also provides a computer readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the foregoing text data scanning method, the method is applied to scan text data to be scanned, which is stored in a byte form, and the method at least includes:
reading text data to be scanned in a byte stream mode;
matching the text data to be scanned according to bytes by a pre-configured byte stream regular matching tool, wherein the byte stream regular matching tool comprises a byte matching rule meeting the scanning requirement of the text data to be scanned, and the byte matching rule is pre-written according to the byte coding format of the text data to be scanned;
and determining the successfully matched text data as the scanned key data, wherein the scanned key data is the text data in the form of byte sequence.
Computer-readable media, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in the specification. One of ordinary skill in the art can understand and implement it without inventive effort.
From the above description of the embodiments, it is clear to those skilled in the art that the embodiments of the present disclosure can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the embodiments of the present specification may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments of the present specification.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. The above-described apparatus embodiments are merely illustrative, and the modules described as separate components may or may not be physically separate, and the functions of the modules may be implemented in one or more software and/or hardware when implementing the embodiments of the present disclosure. And part or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The foregoing is only a specific implementation of the embodiments of the present disclosure. It should be noted that those skilled in the art can make a number of improvements and refinements without departing from the principle of the embodiments of the present disclosure, and these improvements and refinements should also be regarded as falling within the protection scope of the embodiments of the present disclosure.
The above description is only a preferred embodiment of the present disclosure, and should not be taken as limiting the present disclosure, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.