Detailed Description of Embodiments
Example embodiments are described in detail here, with examples illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following example embodiments do not represent all implementations consistent with this specification. Rather, they are merely examples of devices and methods consistent with some aspects of this specification as detailed in the appended claims.
The terms used in this specification are for the purpose of describing particular embodiments only and are not intended to limit this specification. The singular forms "a", "said" and "the" used in this specification and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this specification, first information may also be referred to as second information and, similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon" or "in response to determining".
In the field of big data security, it is common industry practice to scan large volumes of text with regular expressions. For example, a regular expression describing a privacy scanning rule can be matched against text data to determine whether the text contains customer privacy data.
FIG. 1 is a schematic diagram of scanning for private data in the prior art. In the prior art, text data is usually stored in binary files on various storage media. To scan it, the usual approach is to first read the text data as a byte stream, then decode the bytes into meaningful, readable character data, and then apply a regular-expression matching tool to the character stream, so that the key data satisfying the regular-expression matching rule is found in the large volume of text data.
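The prior-art flow described above can be sketched roughly as follows. This is only an illustration: the function name `scan_prior_art` and the e-mail-like pattern are assumptions made for the example and do not come from this specification.

```python
import re


def scan_prior_art(raw: bytes, encoding: str = "utf-8") -> list:
    """Prior-art style scan: decode everything first, then match on characters."""
    # Every byte of the input is decoded, whether or not it belongs to key data.
    text = raw.decode(encoding)
    # Hypothetical privacy rule: a simple e-mail-like character pattern.
    pattern = re.compile(r"[A-Za-z0-9._]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
    return pattern.findall(text)
```

The full `decode` call is the step whose cost grows with the total size of the input rather than with the amount of key data found.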
In practice, however, usually only the key data that satisfies the condition is of interest, and sometimes even its specific content is not: it is enough to confirm whether the text contains any key data that satisfies the condition. Taking private-data scanning again as the example: in the usual case, the private data must be found in the text to be scanned and its specific content must be known, in which case only the private data that was found actually needs to be decoded. In another case, the specific content of the private data is not needed and it is only necessary to confirm whether the text data contains private data, in which case scanning alone suffices and nothing needs to be decoded. In both cases, however, the prior art first decodes all of the text data to be scanned and only then performs matching. Decoding all of a large volume of text data up front incurs additional performance overhead and lowers the scanning rate, so a faster and more efficient data scanning scheme is needed.
In view of the above problems, the embodiments of this specification provide a text data scanning method and a text data scanning device for executing the method. The text data scanning method of this embodiment is described in detail below. Referring to FIG. 2, the method is applied to text data to be scanned that is stored in byte form, and may include the following steps:
S201: read the text data to be scanned in the form of a byte stream;
A byte stream is one way for a computer to read data from a data source (disk, network, etc.). Byte-stream data takes the form of an ordered sequence of bytes, which can be converted by the corresponding decoding rule into a character string that is meaningful and readable.
At present, text data is typically stored as byte files in various storage systems. Reading the text data to be scanned as a byte stream therefore yields one long, ordered sequence of bytes.
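As a minimal sketch of step S201, the text can be read from any binary source in fixed-size chunks without decoding. The helper name `read_byte_stream` and the default `chunk_size` are assumptions chosen for illustration.

```python
import io
from typing import BinaryIO, Iterator


def read_byte_stream(source: BinaryIO, chunk_size: int = 65536) -> Iterator[bytes]:
    """Yield the content of a binary source as successive chunks of raw bytes."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means the source is exhausted
            break
        yield chunk


# Usage: any binary file object works, e.g. open(path, "rb") or an in-memory buffer.
stream = io.BytesIO("用户邮箱".encode("utf-8"))
raw = b"".join(read_byte_stream(stream, chunk_size=4))
```

Note that a chunk boundary may fall in the middle of a multi-byte character, which is harmless here because nothing is decoded at this stage.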
S202: match the text data to be scanned byte by byte using a preconfigured byte-stream regular-expression matching tool, where the tool contains a byte matching rule that meets the scanning requirement for the text data to be scanned, and the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned;
As step S201 shows, once read, the text data to be scanned exists as a byte sequence. The preconfigured byte-stream regular-expression matching tool receives the bytes of that sequence in order and determines whether any segment or segments of the byte sequence satisfy its matching rule. When a segment of the byte sequence satisfies the preconfigured byte matching rule, that segment is judged to be a successful match.
The byte matching rule is configured in advance according to the scanning requirement for the text to be scanned. For example, if the requirement is to find users' e-mail information in a large volume of text data, the matching rule can be set to match text data that contains the keyword "mailbox", or a combination of letters and digits that conforms to e-mail address naming rules.
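One way such a rule might look when written directly against bytes is sketched below using Python's `re` module with a bytes pattern. The keyword is assumed to be the Chinese word "邮箱" (mailbox) stored as UTF-8, and the address pattern is a deliberately simplified stand-in for real e-mail naming rules; both are illustrative assumptions.

```python
import re

# Assumed keyword, encoded with the same encoding as the text to be scanned.
KEYWORD = "邮箱".encode("utf-8")

# Byte matching rule: the encoded keyword, or a simplified e-mail-like address.
EMAIL_RULE = re.compile(
    re.escape(KEYWORD) + rb"|[A-Za-z0-9._]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
)


def find_matches(raw: bytes) -> list:
    """Return every matching segment as a byte sequence, without decoding."""
    return EMAIL_RULE.findall(raw)
```

Because every byte of a multi-byte UTF-8 character is 0x80 or above, the ASCII character classes in the rule cannot accidentally match part of a Chinese character.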
The byte-stream regular-expression matching tool is illustrated below using a byte-stream regular-expression matching engine as an example. Such an engine usually matches the text data to be scanned with a finite automaton. Finite automata are divided into NFAs (nondeterministic finite automata) and DFAs (deterministic finite automata). Each time a new byte is received, the finite automaton makes a state transition. When a segment of the byte sequence drives the automaton from its initial state to a final state, that segment is judged to satisfy the byte matching rule in the engine, and the match succeeds.
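The state-transition behaviour can be illustrated with a small hand-built DFA. The sketch below is a simplification: it handles only a fixed byte literal (built with the classic Knuth-Morris-Pratt automaton construction) rather than a full regular expression, and the names `build_dfa` and `find_first` are hypothetical.

```python
from typing import Dict, Iterable, List


def build_dfa(pattern: bytes) -> List[Dict[int, int]]:
    """Build a DFA for a non-empty byte literal; missing transitions go to state 0."""
    dfa: List[Dict[int, int]] = [dict() for _ in pattern]
    dfa[0][pattern[0]] = 1
    restart = 0  # state the automaton would be in after the longest proper suffix
    for j in range(1, len(pattern)):
        dfa[j] = dict(dfa[restart])      # on mismatch, behave like the restart state
        dfa[j][pattern[j]] = j + 1       # on match, advance one state
        restart = dfa[restart].get(pattern[j], 0)
    return dfa


def find_first(data: Iterable[int], dfa: List[Dict[int, int]]) -> int:
    """Feed bytes one at a time; return the start offset of the first match, or -1."""
    state, final = 0, len(dfa)
    for offset, byte in enumerate(data):
        state = dfa[state].get(byte, 0)  # one state transition per received byte
        if state == final:               # final state reached: match succeeded
            return offset - final + 1
    return -1
```

A production engine would compile an arbitrary regular expression into an NFA or DFA, but the per-byte transition loop has the same shape.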
The finite automaton is compiled from a regular expression, and that regular expression must be written in advance for the byte encoding format of the text data to be scanned. If the text data is encoded in UTF-8, for example, the matching rule described by the regular expression must be written against UTF-8, so that the finite automaton compiled from it can receive each byte of the byte sequence in order and make the corresponding state transition for each byte received. Referring to FIG. 5, using a byte-stream regular-expression matching engine as the matching tool allows regular-expression matching to be performed directly on text data in byte-stream form during data scanning, with the matching result obtained as a byte sequence. This avoids the intermediate conversion from byte stream to character stream and improves data scanning efficiency.
After scanning, any text data that matched successfully is determined to be the key data found by the scan; if no text data matched, the text to be scanned contains no key data.
In practice, further data processing can be chosen as required. For example, in some cases the specific content of the key data is not needed and it is only necessary to confirm whether the text to be scanned contains key data. In that case it is enough to finish scanning the data and confirm whether any key data was found, with no decoding at all. In other cases the specific content of the key data is needed, so after the key data is found it must also be decoded, converting the key data from a byte sequence into text data in character-string form. Decoding between byte data and character data is prior art and is not described further here. Whichever processing is used, at most the key data that was found needs to be decoded, which avoids the time spent decoding all of the data to be scanned and improves scanning efficiency.
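The two processing modes can be sketched as two small functions. The rule used here, an 11-digit number standing in for a phone number, and the function names are assumptions for illustration only.

```python
import re
from typing import List

# Hypothetical byte matching rule: exactly 11 consecutive ASCII digits.
RULE = re.compile(rb"(?<![0-9])[0-9]{11}(?![0-9])")


def has_key_data(raw: bytes) -> bool:
    """Mode 1: only confirm whether key data is present; nothing is decoded."""
    return RULE.search(raw) is not None


def extract_key_data(raw: bytes, encoding: str = "utf-8") -> List[str]:
    """Mode 2: decode only the matched byte sequences, not the whole input."""
    return [m.group(0).decode(encoding) for m in RULE.finditer(raw)]
```

In the second mode the number of bytes decoded is bounded by the total length of the matches, however large the scanned input is.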
The byte-stream regular-expression matching tool preconfigured in step S202 corresponds to a byte-stream regular expression that describes the byte data matching rule. That is, the tool is built from a byte-stream regular expression, and that expression describes the byte matching rule for the text data to be scanned. On this basis, the embodiments of this specification provide a method for obtaining the byte-stream regular-expression matching tool. Referring to FIG. 3, the method may include the following steps:
S301: obtain a character-stream regular expression, which describes a character data matching rule that meets the scanning requirement for the text data to be scanned;
S302: replace the character data in the character data matching rule with the corresponding byte data to obtain the byte-stream regular expression corresponding to the character-stream regular expression, where the character data and the corresponding byte data can be converted into each other using a specified encoding rule;
For example, if the character data matching rule is to find every occurrence of the characters "ABC" in the text data to be scanned, and the text data to be scanned in byte-stream form uses encoding X, then every "ABC" part of the original character data matching rule is replaced with the byte data obtained by encoding the characters "ABC" with encoding X. The regular expression describing the rule can then perform byte matching against text data to be scanned that is encoded with X.
S303: configure the corresponding byte-stream regular-expression matching tool according to the byte-stream regular expression.
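Steps S301 to S303 can be sketched for a restricted case. The hypothetical helper `to_byte_regex` below assumes an ASCII-compatible encoding (such as UTF-8 or GBK) and assumes that non-ASCII characters appear in the pattern only as literals outside character classes; a general converter would need a real regular-expression parser.

```python
import re


def to_byte_regex(char_pattern: str, encoding: str) -> bytes:
    """S302 (simplified): replace each non-ASCII literal with its encoded bytes."""
    parts = []
    for ch in char_pattern:
        if ord(ch) < 128:
            # ASCII literals and regex syntax are unchanged in ASCII-compatible encodings.
            parts.append(ch.encode("ascii"))
        else:
            encoded = ch.encode(encoding)
            # Escape in case an encoded byte is a regex metacharacter, and group the
            # bytes so a following quantifier applies to the whole character.
            parts.append(b"(?:" + re.escape(encoded) + b")")
    return b"".join(parts)


# S301: character-stream regular expression (illustrative rule).
char_rule = "邮箱: ?[a-z]+"
# S302 + S303: derive the byte-stream expression and configure a matcher for UTF-8 data.
utf8_matcher = re.compile(to_byte_regex(char_rule, "utf-8"))
```

The same character rule yields a different byte-stream expression for each encoding, which is why the rule has to be prepared for the encoding of the data actually being scanned.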
Referring to FIG. 4, this specification also provides another text data scanning method, which is applied to text data to be scanned that is stored in byte form and may include the following steps:
S401: read the text data to be scanned in the form of a byte stream;
S402: the preconfigured byte-stream regular-expression matching tool receives the bytes of the text data to be scanned one by one through a finite automaton; when a byte sequence drives the finite automaton to a final state, that byte sequence is judged to be successfully matched text data, where the tool contains a byte matching rule that meets the scanning requirement for the text data to be scanned, and the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned;
S403: determine whether any text data was successfully matched; if so, execute step S404; if not, end the flow;
S404: determine the successfully matched text data to be the key data found by the scan, where the key data is text data existing in the form of a byte sequence;
S405: decode the key data, converting the key data from byte-sequence form into text data in character-string form.
Corresponding to the above method embodiments, the embodiments of this specification also provide a text data scanning device. Referring to FIG. 6, the device is applied to scanning text data to be scanned that is stored in byte form, and may include a data reading module 610, a data matching module 620 and a data extraction module 630.
Data reading module 610: configured to read the text data to be scanned in the form of a byte stream;
Data matching module 620: configured to match the text data to be scanned byte by byte using a preconfigured byte-stream regular-expression matching tool, where the tool contains a byte matching rule that meets the scanning requirement for the text data to be scanned, and the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned;
Data extraction module 630: configured to determine the successfully matched text data to be the key data found by the scan, where the key data is text data existing in the form of a byte sequence.
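The division into three modules might be mirrored in software roughly as follows. The class and method names are hypothetical, Python's `re` module with a bytes pattern stands in for the byte-stream regular-expression matching tool, and the whole source is read at once for brevity.

```python
import io
import re
from typing import BinaryIO, List


class DataReadModule:
    """Counterpart of module 610: reads the text to be scanned as raw bytes."""

    def read(self, source: BinaryIO) -> bytes:
        return source.read()


class DataMatchModule:
    """Counterpart of module 620: matches bytes against a preconfigured byte rule."""

    def __init__(self, byte_rule: bytes) -> None:
        self._rule = re.compile(byte_rule)

    def match(self, raw: bytes) -> List["re.Match[bytes]"]:
        return list(self._rule.finditer(raw))


class DataExtractModule:
    """Counterpart of module 630: returns the key data, still as byte sequences."""

    def extract(self, matches: List["re.Match[bytes]"]) -> List[bytes]:
        return [m.group(0) for m in matches]


class TextScanDevice:
    """Wires the three modules together in the order read, match, extract."""

    def __init__(self, byte_rule: bytes) -> None:
        self.reader = DataReadModule()
        self.matcher = DataMatchModule(byte_rule)
        self.extractor = DataExtractModule()

    def scan(self, source: BinaryIO) -> List[bytes]:
        raw = self.reader.read(source)
        return self.extractor.extract(self.matcher.match(raw))
```

A caller that needs readable text can decode the returned byte sequences afterwards; the device itself never decodes the input.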
The embodiments of this specification also provide an electronic device comprising at least a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the foregoing text data scanning method when executing the program. The method is applied to scanning text data to be scanned that is stored in byte form, and includes at least:
reading the text data to be scanned in the form of a byte stream;
matching the text data to be scanned byte by byte using a preconfigured byte-stream regular-expression matching tool, where the tool contains a byte matching rule that meets the scanning requirement for the text data to be scanned, and the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned;
determining the successfully matched text data to be the key data found by the scan, where the key data is text data existing in the form of a byte sequence.
FIG. 7 is a schematic diagram of a more specific hardware structure of a computing device provided by the embodiments of this specification. The device may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040 and a bus 1050. The processor 1010, the memory 1020, the input/output interface 1030 and the communication interface 1040 are communicatively connected to one another inside the device via the bus 1050.
The processor 1010 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and executes the relevant programs to implement the technical solutions provided by the embodiments of this specification.
The memory 1020 may be implemented as ROM (Read-Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 can store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented in software or firmware, the relevant program code is stored in the memory 1020 and is called and executed by the processor 1010.
The input/output interface 1030 is used to connect an input/output module to implement information input and output. The input/output module may be configured in the device as a component (not shown) or may be external to the device to provide the corresponding functions. Input devices may include a keyboard, mouse, touch screen, microphone and various sensors; output devices may include a display, loudspeaker, vibrator and indicator lights.
The communication interface 1040 is used to connect a communication module (not shown) to enable the device to communicate with other devices. The communication module may communicate in a wired manner (such as USB or a network cable) or wirelessly (such as a mobile network, Wi-Fi or Bluetooth).
The bus 1050 includes a path that transfers information between the components of the device (such as the processor 1010, the memory 1020, the input/output interface 1030 and the communication interface 1040).
It should be noted that although only the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050 are shown for the above device, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the device may include only the components necessary to implement the solutions of the embodiments of this specification, and need not include all the components shown in the figure.
The embodiments of this specification also provide a computer-readable storage medium on which a computer program is stored. When executed by a processor, the program implements the foregoing text data scanning method, which is applied to scanning text data to be scanned that is stored in byte form and includes at least:
reading the text data to be scanned in the form of a byte stream;
matching the text data to be scanned byte by byte using a preconfigured byte-stream regular-expression matching tool, where the tool contains a byte matching rule that meets the scanning requirement for the text data to be scanned, and the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned;
determining the successfully matched text data to be the key data found by the scan, where the key data is text data existing in the form of a byte sequence.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
For the implementation of the functions and roles of each unit in the above device, refer to the implementation of the corresponding steps in the above method; details are not repeated here.
Since the device embodiments substantially correspond to the method embodiments, refer to the description of the method embodiments for the relevant parts. The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solutions of this specification. Those of ordinary skill in the art can understand and implement them without creative effort.
From the above description of the embodiments, those skilled in the art can clearly understand that the embodiments of this specification can be implemented by software together with a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the embodiments of this specification, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments, or in certain parts of the embodiments, of this specification.
The systems, devices, modules or units described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, e-mail device, game console, tablet computer, wearable device, or a combination of any of these devices.
The embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. The device embodiments in particular are described relatively briefly because they are substantially similar to the method embodiments; refer to the description of the method embodiments for the relevant parts. The device embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate, and when the solutions of the embodiments of this specification are implemented, the functions of the modules may be realized in one or more pieces of software and/or hardware. Some or all of the modules may also be selected according to actual needs to achieve the purpose of the solution of an embodiment. Those of ordinary skill in the art can understand and implement them without creative effort.
The above are only specific implementations of the embodiments of this specification. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the embodiments of this specification, and these improvements and refinements shall also be regarded as falling within the scope of protection of the embodiments of this specification.
The foregoing are merely preferred embodiments of this specification and are not intended to limit this specification. Any modification, equivalent replacement or improvement made within the spirit and principles of this specification shall fall within the scope of protection of this specification.