CN108734149A - Text data scanning method and device - Google Patents

Text data scanning method and device

Publication number
CN108734149A
Authority
CN
China
Prior art keywords
data
byte
text data
scanned
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810531245.5A
Other languages
Chinese (zh)
Other versions
CN108734149B (en)
Inventor
温悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201810531245.5A priority Critical patent/CN108734149B/en
Publication of CN108734149A publication Critical patent/CN108734149A/en
Application granted granted Critical
Publication of CN108734149B publication Critical patent/CN108734149B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40: Document-oriented image-based pattern recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Document Processing Apparatus (AREA)

Abstract

Embodiments of this specification provide a text data scanning method and device for scanning text data that is stored in byte form. The text data to be scanned is first read as a byte stream and is then matched byte by byte by a preconfigured byte stream regular expression matching tool. The tool contains a byte matching rule that meets the scanning requirement for the text data, and the rule is written in advance according to the byte encoding format of the text data to be scanned. Finally, the successfully matched text data is determined as the scanned critical data.

Description

Text data scanning method and device
Technical field
This specification relates to the field of information security, and in particular to a text data scanning method and device.
Background
Scanning large volumes of text with regular expressions is common practice in big data security. For example, a regular expression describing a privacy scanning rule can be configured in a regular expression matching engine, which then scans and matches the text data to determine whether the text contains customer privacy data.
In the prior art, text data is usually stored on various storage media as binary files. To scan it, the general approach is to first read the text data as a byte stream, then decode the bytes into meaningful, readable character data, and finally apply a regular expression matching tool to the character stream, so that critical data satisfying the regular expression matching rule is found within the large volume of text data. However, decoding a large volume of text data up front takes a long time and slows down scanning, so a faster and more efficient data scanning scheme is needed.
Summary of the invention
In view of the above technical problem, embodiments of this specification provide a text data scanning method and device. The technical solution is as follows:
According to a first aspect of the embodiments of this specification, a text data scanning method is provided. The method is applied to text data to be scanned that is stored in byte form, and includes:
reading the text data to be scanned in the form of a byte stream;
matching the text data to be scanned byte by byte using a preconfigured byte stream regular expression matching tool, where the tool contains a byte matching rule that meets the scanning requirement for the text data to be scanned, and the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned;
determining the successfully matched text data as the scanned critical data, where the scanned critical data is text data in the form of a byte sequence.
According to a second aspect of the embodiments of this specification, a text data scanning device is provided. The device is applied to text data to be scanned that is stored in byte form, and includes:
a data reading module, configured to read the text data to be scanned in the form of a byte stream;
a data matching module, configured to match the text data to be scanned byte by byte using a preconfigured byte stream regular expression matching tool, where the tool contains a byte matching rule that meets the scanning requirement for the text data to be scanned, and the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned;
a data extraction module, configured to determine the successfully matched text data as the scanned critical data, where the scanned critical data is text data in the form of a byte sequence.
According to a third aspect of the embodiments of this specification, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the program, the processor implements a text data scanning method applied to text data to be scanned that is stored in byte form. The method includes:
reading the text data to be scanned in the form of a byte stream;
matching the text data to be scanned byte by byte using a preconfigured byte stream regular expression matching tool, where the tool contains a byte matching rule that meets the scanning requirement for the text data to be scanned, and the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned;
determining the successfully matched text data as the scanned critical data, where the scanned critical data is text data in the form of a byte sequence.
With the technical solution provided by the embodiments of this specification, regular expression matching is performed directly on text data in byte stream form during scanning, and the matching result is obtained as a byte sequence. This avoids the intermediate conversion from byte stream to character stream and improves data scanning efficiency.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the embodiments of this specification.
In addition, no single embodiment of this specification needs to achieve all of the above effects.
Description of the drawings
To explain the embodiments of this specification or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some of the embodiments recorded in this specification; those of ordinary skill in the art can derive other drawings from them.
Fig. 1 is a schematic diagram of a prior-art text data scanning method according to an exemplary embodiment of this specification;
Fig. 2 is a flowchart of a text data scanning method according to an exemplary embodiment of this specification;
Fig. 3 is a flowchart of configuring a byte stream regular expression matching tool according to an exemplary embodiment of this specification;
Fig. 4 is another flowchart of the text data scanning method according to an exemplary embodiment of this specification;
Fig. 5 is a schematic diagram of a text data scanning method according to an exemplary embodiment of this specification;
Fig. 6 is a schematic diagram of a text data scanning device according to an exemplary embodiment of this specification;
Fig. 7 is a schematic structural diagram of an electronic device according to an exemplary embodiment of this specification.
Detailed description
Exemplary embodiments are described in detail here, with examples shown in the drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this specification. Rather, they are merely examples of devices and methods consistent with some aspects of this specification as detailed in the appended claims.
The terms used in this specification are for the purpose of describing particular embodiments only and are not intended to limit this specification. The singular forms "a", "said", and "the" used in this specification and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, and so on may be used in this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this specification, first information may also be called second information, and similarly second information may be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
Scanning large volumes of text with regular expressions is common practice in big data security. For example, a regular expression describing a privacy scanning rule can be used to scan and match text data to determine whether the text contains customer privacy data.
Fig. 1 is a schematic diagram of scanning for private data in the prior art. In the prior art, text data is usually stored on various storage media as binary files. To scan it, the general approach is to first read the text data as a byte stream, then decode the bytes into meaningful, readable character data, and finally apply a regular expression matching tool to the character stream, so that critical data satisfying the regular expression matching rule is found within the large volume of text data.
In practice, however, usually only the critical data that satisfies the condition matters within a large volume of text, and sometimes even the specific content of the critical data does not matter: it is enough to confirm whether the text contains critical data that satisfies the condition. Taking private data scanning as the example again: in the usual case, the private data must be found in the text to be scanned and its specific content must be known, in which case only the private data that was found needs to be decoded. In the other case, the specific content of the private data is not needed and it is enough to confirm whether the text data contains private data, in which case only scanning is required and no text data needs to be decoded. The prior art, in either case, first decodes all of the text data to be scanned and only then scans and matches it. Decoding all of a large volume of text data up front incurs extra performance overhead and slows down scanning, so a faster and more efficient data scanning scheme is needed.
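As a purely illustrative sketch (not part of the original disclosure), the difference between the two flows can be shown with Python's standard `re` module, which accepts both `str` and `bytes` patterns. The ID-number rule, the sample text, and all function names are assumptions made for this example.

```python
import re
from typing import List

# Hypothetical privacy rule: an 18-character ID number
# (17 digits followed by a digit or X), not embedded in a longer digit run.
ID_CHARS = r"(?<![0-9])[0-9]{17}[0-9Xx](?![0-9])"
CHAR_RULE = re.compile(ID_CHARS)                  # matches decoded characters
BYTE_RULE = re.compile(ID_CHARS.encode("ascii"))  # matches raw bytes


def scan_prior_art(data: bytes) -> List[str]:
    """Prior-art flow: decode ALL of the text first, then match characters."""
    return CHAR_RULE.findall(data.decode("utf-8"))


def scan_byte_stream(data: bytes) -> List[str]:
    """Byte-stream flow: match the raw bytes, then decode only the hits."""
    return [hit.decode("utf-8") for hit in BYTE_RULE.findall(data)]


stored = "身份证号 11010519491231002X 已登记".encode("utf-8")
prior_hits = scan_prior_art(stored)
byte_hits = scan_byte_stream(stored)
```

Both functions return the same hits for this input; the second simply never decodes the text that did not match.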
In view of the above problem, embodiments of this specification provide a text data scanning method and a text data scanning device for executing the method. The text data scanning method of this embodiment is described in detail below. Referring to Fig. 2, the method is applied to text data to be scanned that is stored in byte form, and may include the following steps:
S201: read the text data to be scanned in the form of a byte stream;
A byte stream is one way a computer reads data from a data source (disk, network, and so on). Byte stream data takes the form of an ordered sequence of bytes, and that byte sequence can be converted by the corresponding decoding rule into a character string that has specific meaning and is readable.
At present, text data is usually stored in storage systems as byte files. After the text data to be scanned is read as a byte stream, a long ordered sequence of bytes is obtained.
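A minimal sketch of step S201 in Python, assuming the stored text is reachable through a binary file-like object. The helper name `read_byte_stream`, the chunk size, and the sample content are hypothetical; an in-memory `io.BytesIO` stands in for a file opened with mode `"rb"`.

```python
import io
from typing import BinaryIO, Iterator


def read_byte_stream(source: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield the text to be scanned as successive chunks of raw, undecoded bytes."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:          # an empty read means the stream is exhausted
            break
        yield chunk


# In-memory stand-in for a stored text file.
stored = io.BytesIO("用户邮箱: alice@example.com\n".encode("utf-8"))
chunks = list(read_byte_stream(stored, chunk_size=8))
data = b"".join(chunks)        # the ordered byte sequence handed to step S202
```

Note that a chunk boundary can fall in the middle of a multi-byte character, which is harmless here because nothing is decoded at this stage.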
S202: match the text data to be scanned byte by byte using a preconfigured byte stream regular expression matching tool, where the tool contains a byte matching rule that meets the scanning requirement for the text data to be scanned, and the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned;
As step S201 shows, the text data to be scanned exists as a byte sequence once it has been read. The preconfigured byte stream regular expression matching tool receives the bytes of that sequence in order and determines whether one or more segments of the sequence satisfy its matching rule. When a segment of bytes satisfies the preconfigured byte matching rule, that segment is judged to be a successful match.
The byte matching rule is preconfigured according to the scanning requirement for the text to be scanned. For example, if the requirement is to find users' email information in a large volume of text data, the matching rule can be set to match text data that carries the keyword "mailbox", or text data whose combination of letters and digits conforms to email address naming rules.
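The email example can be sketched with a `bytes` pattern in Python's `re` module. The rule below is a deliberately simplified, hypothetical email pattern, not the rule used in the patent; it relies on the assumption that the text is stored in an ASCII-compatible encoding such as UTF-8, so that ASCII letters and digits occupy one byte each.

```python
import re
from typing import List

# Hypothetical byte matching rule for email addresses (simplified).
EMAIL_RULE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def scan_emails(data: bytes) -> List[bytes]:
    """Return every matching segment, still in byte-sequence form."""
    return EMAIL_RULE.findall(data)


raw = "联系人: bob@example.org, 备用 carol@mail.example.com。".encode("utf-8")
hits = scan_emails(raw)
```

The surrounding Chinese text is never decoded: its UTF-8 bytes all lie above 0x7F and so cannot be mistaken for part of an address.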
A byte stream regular expression matching engine is used here as an example of the byte stream regular expression matching tool. Such an engine usually matches the text data to be scanned using a finite automaton. Finite automata are divided into NFAs (nondeterministic finite automata) and DFAs (deterministic finite automata). Each time a new byte is received, the finite automaton undergoes a state transition. When a byte sequence drives the automaton from its initial state through successive transitions to a final state, that byte sequence is judged to satisfy the byte matching rule in the engine, and the match succeeds.
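To make the state-transition idea concrete, the following sketch builds a small DFA over bytes by hand. It is a simplification: the automaton recognizes a single literal keyword (using the classic Knuth-Morris-Pratt construction) rather than a full regular expression, and the function names and sample keyword are assumptions for illustration only.

```python
from typing import List


def build_dfa(pattern: bytes) -> List[List[int]]:
    """Build a transition table dfa[state][byte]; state len(pattern) is final."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    size = len(pattern)
    dfa = [[0] * 256 for _ in range(size)]
    dfa[0][pattern[0]] = 1
    restart = 0                                 # state to fall back to on mismatch
    for state in range(1, size):
        dfa[state] = list(dfa[restart])         # mismatch: behave like restart state
        dfa[state][pattern[state]] = state + 1  # match: advance one state
        restart = dfa[restart][pattern[state]]
    return dfa


def find_first(data: bytes, pattern: bytes) -> int:
    """Feed bytes one by one; return the start offset of the first match, or -1."""
    dfa = build_dfa(pattern)
    final_state = len(pattern)
    state = 0
    for offset, byte in enumerate(data):
        state = dfa[state][byte]                # one state transition per byte
        if state == final_state:
            return offset - final_state + 1
    return -1


keyword = "邮箱".encode("utf-8")                # keyword expressed as UTF-8 bytes
data = "请填写邮箱地址".encode("utf-8")
position = find_first(data, keyword)
```

The returned position is a byte offset, not a character offset, which is consistent with the match result existing in byte-sequence form.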
The finite automaton is compiled from a regular expression, and that regular expression must be written in advance for the byte encoding format of the text data to be scanned. If the text data is encoded in UTF-8, the matching rule described by the regular expression must be written against UTF-8, so that the finite automaton compiled from it can receive each byte of the byte sequence in turn and make the corresponding state transition for each byte received. Referring to Fig. 5, with a byte stream regular expression matching engine as the matching tool, regular expression matching can be performed directly on text data in byte stream form during scanning, and the matching result is obtained as a byte sequence. This avoids the intermediate conversion from byte stream to character stream and improves data scanning efficiency.
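As an illustration of what "written against UTF-8" can mean, the sketch below expresses a character class as UTF-8 byte ranges. The range chosen (U+4E00 to U+9FFF, common CJK ideographs), the "姓名" (name) label rule, and the sample data are assumptions for this example; a rule for GBK or UTF-16 data would need different byte ranges.

```python
import re

# One CJK ideograph in U+4E00..U+9FFF, written as UTF-8 byte ranges:
#   E4 B8..BF 80..BF   covers U+4E00..U+4FFF
#   E5..E8 80..BF x2   covers U+5000..U+8FFF
#   E9 80..BF x2       covers U+9000..U+9FFF
CJK_UTF8 = rb"(?:\xe4[\xb8-\xbf][\x80-\xbf]|[\xe5-\xe8][\x80-\xbf]{2}|\xe9[\x80-\xbf]{2})"

# Hypothetical rule: the label "姓名", optional ASCII colon/spaces,
# then a name of two to four ideographs.
NAME_RULE = re.compile(
    re.escape("姓名".encode("utf-8")) + rb"[: ]*(" + CJK_UTF8 + rb"{2,4})"
)

data = "姓名: 张三丰, 电话 123".encode("utf-8")
match = NAME_RULE.search(data)
name_bytes = match.group(1) if match else b""
```

Each ideograph is three bytes in UTF-8, so the quantifier `{2,4}` is applied to the whole three-byte group rather than to individual bytes.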
After scanning is complete, if there is successfully matched text data, that text data is determined as the scanned critical data; if there is none, the text to be scanned contains no critical data.
In practice, further processing can be chosen according to the requirement. In some cases the specific content of the critical data is not needed and it is enough to confirm whether the text to be scanned contains critical data; here it suffices to finish scanning the data and confirm whether any critical data was found, with no decoding at all. In other cases the specific content of the critical data is needed, so after the critical data has been found it must also be decoded, converting the critical data from byte-sequence form into text data in character-string form. Decoding conversion between character data and byte data is existing technology and is not described further here. Whichever processing mode is used, at most the scanned critical data needs to be decoded, which avoids the time wasted on decoding all of the data to be scanned and improves scanning efficiency.
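The two processing modes can be sketched as two small functions sharing one byte rule. The phone-number rule, the function names, and the sample data are hypothetical and serve only to show that decoding is either skipped entirely or limited to the matched bytes.

```python
import re
from typing import List

# Hypothetical byte rule: an 11-digit mobile number not embedded in a longer digit run.
PHONE_RULE = re.compile(rb"(?<![0-9])1[3-9][0-9]{9}(?![0-9])")


def contains_critical_data(data: bytes) -> bool:
    """Mode 1: only confirm presence; nothing is decoded."""
    return PHONE_RULE.search(data) is not None


def extract_critical_data(data: bytes, encoding: str = "utf-8") -> List[str]:
    """Mode 2: decode only the matched byte sequences into character strings."""
    return [m.group().decode(encoding) for m in PHONE_RULE.finditer(data)]


data = "客户电话13812345678，订单号202405010000123".encode("utf-8")
present = contains_critical_data(data)
phones = extract_critical_data(data)
```

In the second mode the amount of decoding is proportional to the size of the hits rather than to the size of the whole text.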
The preconfigured byte stream regular expression matching tool in step S202 corresponds to a byte stream regular expression that describes the byte data matching rule. That is, the tool is built from a byte stream regular expression, and that expression describes the byte matching rule for the text data to be scanned. On this basis, embodiments of this specification provide a method for obtaining the byte stream regular expression matching tool. Referring to Fig. 3, the method may include the following steps:
S301: obtain a character stream regular expression, where the character stream regular expression describes a character data matching rule that meets the scanning requirement for the text data to be scanned;
S302: replace the character data in the character data matching rule with the corresponding byte data to obtain the byte stream regular expression corresponding to the character stream regular expression, where the character data and the corresponding byte data can be converted into each other by applying a specified encoding rule;
For example, if the character data matching rule is to find every occurrence of the characters "ABC" in the text data to be scanned, and the text data in byte stream form uses encoding X, then every "ABC" in the original character data matching rule is replaced with the byte data obtained by encoding "ABC" with encoding X. The regular expression describing the rule can then perform byte matching on text data to be scanned that is encoded with X.
S303: configure the corresponding byte stream regular expression matching tool according to the byte stream regular expression.
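Steps S301 to S303 can be sketched as follows. This is a simplified illustration under stated assumptions, not the patent's conversion procedure: the hypothetical helper `to_byte_regex` only rewrites non-ASCII literal characters that appear outside character classes, assumes an ASCII-compatible target encoding, and leaves ASCII regular expression syntax untouched.

```python
import re


def to_byte_regex(char_pattern: str, encoding: str = "utf-8") -> bytes:
    """S302 (simplified): replace each non-ASCII literal with its encoded bytes."""
    out = bytearray()
    for ch in char_pattern:
        if ord(ch) < 128:
            out += ch.encode("ascii")        # regex syntax and ASCII literals as-is
        else:
            # Escape the encoded bytes (a trail byte may collide with a regex
            # metacharacter in some encodings) and group them so that a
            # following quantifier applies to the whole character.
            out += b"(?:" + re.escape(ch.encode(encoding)) + b")"
    return bytes(out)


# S301: character stream regular expression (hypothetical rule).
char_regex = "邮箱: ?[a-z0-9.]+@[a-z0-9.]+"
# S302 + S303: derive the byte stream expression and configure the matcher.
byte_rule = re.compile(to_byte_regex(char_regex, "utf-8"))

data = "用户邮箱: dave@example.net;".encode("utf-8")
hit = byte_rule.search(data)
```

Passing a different `encoding` argument yields a different byte stream expression for the same character rule, which mirrors the "encoding X" example above.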
Referring to Fig. 4, this specification also provides another text data scanning method. The method is applied to text data to be scanned that is stored in byte form, and may include the following steps:
S401: read the text data to be scanned in the form of a byte stream;
S402: the preconfigured byte stream regular expression matching tool receives the bytes of the text data to be scanned one by one through a finite automaton; when a byte sequence drives the finite automaton to a final state, that byte sequence is judged to be successfully matched text data. The tool contains a byte matching rule that meets the scanning requirement for the text data to be scanned, and the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned;
S403: determine whether there is successfully matched text data. If there is, execute step S404; if there is not, end the flow.
S404: determine the successfully matched text data as the scanned critical data, where the scanned critical data is text data in the form of a byte sequence.
S405: decode the critical data, converting the critical data in the form of a byte sequence into text data in the form of a character string.
Corresponding to the above method embodiment, embodiments of this specification also provide a text data scanning device. Referring to Fig. 6, the device is used to scan text data to be scanned that is stored in byte form, and may include a data reading module 610, a data matching module 620, and a data extraction module 630.
Data reading module 610: configured to read the text data to be scanned in the form of a byte stream;
Data matching module 620: configured to match the text data to be scanned byte by byte using a preconfigured byte stream regular expression matching tool, where the tool contains a byte matching rule that meets the scanning requirement for the text data to be scanned, and the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned;
Data extraction module 630: configured to determine the successfully matched text data as the scanned critical data, where the scanned critical data is text data in the form of a byte sequence.
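One possible way to map the three modules onto software is sketched below. The class names mirror the module names in the text, but the interfaces, the wrapper class `TextDataScanner`, and the sample rule are assumptions for illustration; the patent does not prescribe a particular software structure.

```python
import io
import re
from typing import BinaryIO, List


class DataReadModule:
    """Module 610: reads the text to be scanned as raw bytes."""

    def read(self, source: BinaryIO) -> bytes:
        return source.read()


class DataMatchModule:
    """Module 620: matches bytes against a preconfigured byte rule."""

    def __init__(self, byte_rule: bytes) -> None:
        self._rule = re.compile(byte_rule)

    def match(self, data: bytes) -> List[bytes]:
        return [m.group() for m in self._rule.finditer(data)]


class DataExtractModule:
    """Module 630: reports matched segments as critical data, still as bytes."""

    def extract(self, matches: List[bytes]) -> List[bytes]:
        return list(matches)


class TextDataScanner:
    """Hypothetical wrapper that wires the three modules together."""

    def __init__(self, byte_rule: bytes) -> None:
        self.reader = DataReadModule()
        self.matcher = DataMatchModule(byte_rule)
        self.extractor = DataExtractModule()

    def scan(self, source: BinaryIO) -> List[bytes]:
        data = self.reader.read(source)
        return self.extractor.extract(self.matcher.match(data))


scanner = TextDataScanner(rb"[a-z0-9.]+@[a-z0-9.]+")   # simplified, hypothetical rule
critical = scanner.scan(io.BytesIO("备注 erin@example.com 已确认".encode("utf-8")))
```

The result stays in byte-sequence form; a caller that needs readable content would decode only the returned items.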
Embodiments of this specification also provide an electronic device that includes at least a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the program, the processor implements the aforementioned text data scanning method. The method is used to scan text data to be scanned that is stored in byte form, and includes at least:
reading the text data to be scanned in the form of a byte stream;
matching the text data to be scanned byte by byte using a preconfigured byte stream regular expression matching tool, where the tool contains a byte matching rule that meets the scanning requirement for the text data to be scanned, and the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned;
determining the successfully matched text data as the scanned critical data, where the scanned critical data is text data in the form of a byte sequence.
Fig. 7 shows a more specific schematic diagram of the hardware structure of a computing device provided by the embodiments of this specification. The device may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. The processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040 are communicatively connected to one another inside the device through the bus 1050.
The processor 1010 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and executes the relevant programs to implement the technical solution provided by the embodiments of this specification.
The memory 1020 may be implemented as ROM (Read-Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, and so on. The memory 1020 can store the operating system and other application programs. When the technical solution provided by the embodiments of this specification is implemented in software or firmware, the relevant program code is stored in the memory 1020 and is called and executed by the processor 1010.
The input/output interface 1030 is used to connect an input/output module to implement information input and output. The input/output module may be configured in the device as a component (not shown in the figure) or may be externally connected to the device to provide the corresponding functions. Input devices may include a keyboard, mouse, touch screen, microphone, various sensors, and so on; output devices may include a display, speaker, vibrator, indicator light, and so on.
The communication interface 1040 is used to connect a communication module (not shown in the figure) to implement communication between this device and other devices. The communication module may communicate in a wired manner (such as USB or a network cable) or wirelessly (such as a mobile network, Wi-Fi, or Bluetooth).
The bus 1050 includes a path that transfers information between the components of the device (such as the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040).
It should be noted that although only the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040, and the bus 1050 are shown for the above device, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the device may include only the components necessary to implement the solution of the embodiments of this specification, without including all of the components shown in the figure.
Embodiments of this specification also provide a computer-readable storage medium on which a computer program is stored. When executed by a processor, the program implements the aforementioned text data scanning method. The method is used to scan text data to be scanned that is stored in byte form, and includes at least:
reading the text data to be scanned in the form of a byte stream;
matching the text data to be scanned byte by byte using a preconfigured byte stream regular expression matching tool, where the tool contains a byte matching rule that meets the scanning requirement for the text data to be scanned, and the byte matching rule is written in advance according to the byte encoding format of the text data to be scanned;
determining the successfully matched text data as the scanned critical data, where the scanned critical data is text data in the form of a byte sequence.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
The function of each unit and the realization process of effect specifically refer to and correspond to step in the above method in above-mentioned apparatus Realization process, details are not described herein.
For device embodiments, since it corresponds essentially to embodiment of the method, so related place is referring to method reality Apply the part explanation of example.The apparatus embodiments described above are merely exemplary, wherein described be used as separating component The unit of explanation may or may not be physically separated, and the component shown as unit can be or can also It is not physical unit, you can be located at a place, or may be distributed over multiple network units.It can be according to actual It needs that some or all of module therein is selected to realize the purpose of this specification scheme.Those of ordinary skill in the art are not In the case of making the creative labor, you can to understand and implement.
As seen through the above description of the embodiments, those skilled in the art can be understood that this specification Embodiment can add the mode of required general hardware platform to realize by software.Based on this understanding, this specification is implemented Substantially the part that contributes to existing technology can be expressed in the form of software products the technical solution of example in other words, The computer software product can be stored in a storage medium, such as ROM/RAM, magnetic disc, CD, including some instructions are making It is each to obtain computer equipment (can be personal computer, server or the network equipment etc.) execution this specification embodiment Method described in certain parts of a embodiment or embodiment.
System, device, module or the unit that above-described embodiment illustrates can specifically realize by computer chip or entity, Or it is realized by the product with certain function.A kind of typically to realize that equipment is computer, the concrete form of computer can To be personal computer, laptop computer, cellular phone, camera phone, smart phone, personal digital assistant, media play In device, navigation equipment, E-mail receiver/send equipment, game console, tablet computer, wearable device or these equipment The combination of arbitrary several equipment.
Each embodiment in this specification is described in a progressive manner, identical similar portion between each embodiment Point just to refer each other, and each embodiment focuses on the differences from other embodiments.Especially for device reality For applying example, since it is substantially similar to the method embodiment, so describing fairly simple, related place is referring to embodiment of the method Part explanation.The apparatus embodiments described above are merely exemplary, wherein described be used as separating component explanation Module may or may not be physically separated, can be each module when implementing this specification example scheme Function realize in the same or multiple software and or hardware.Can also select according to the actual needs part therein or Person's whole module achieves the purpose of the solution of this embodiment.Those of ordinary skill in the art are not the case where making the creative labor Under, you can to understand and implement.
The above are only specific implementations of the embodiments of this specification. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the embodiments of this specification, and these improvements and refinements shall also fall within the protection scope of the embodiments of this specification.
The foregoing are merely preferred embodiments of this specification and are not intended to limit this specification. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of this specification shall fall within the protection scope of this specification.

Claims (11)

1. A text data scanning method, applied to text data to be scanned that is stored in byte form, characterized in that the method comprises:
reading the text data to be scanned in the form of a byte stream;
matching the text data to be scanned byte by byte using a preconfigured byte-stream regular-expression matching tool, wherein the byte-stream regular-expression matching tool contains a byte matching rule that meets the scanning requirement for the text data to be scanned, the byte matching rule being written in advance according to the byte encoding format of the text data to be scanned;
determining the successfully matched text data as the scanned critical data, wherein the scanned critical data is text data in the form of a byte sequence.
2. The method of claim 1, characterized in that, after the successfully matched text data is determined as the scanned critical data, the method further comprises:
decoding the critical data, so as to convert the critical data in the form of a byte sequence into text data in the form of a character string.
3. The method of claim 1, characterized in that the preconfigured byte-stream regular-expression matching tool corresponds to a byte-stream regular expression that describes a byte data matching rule.
4. The method of claim 3, characterized in that the step of obtaining the byte-stream regular expression comprises:
obtaining a character-stream regular expression, the character-stream regular expression describing a character data matching rule that meets the scanning requirement for the text data to be scanned;
replacing the character data in the character data matching rule with the corresponding byte data, wherein the character data and the corresponding byte data can be converted into each other using a specified encoding rule.
5. The method of claim 1, characterized in that matching the text data to be scanned byte by byte using the preconfigured byte-stream regular-expression matching tool comprises:
the preconfigured byte-stream regular-expression matching tool accepting the bytes in the text data to be scanned one by one by means of a finite automaton and, when a byte sequence causes the state of the finite automaton to reach a final state, determining that byte sequence to be successfully matched text data.
6. A text data scanning device, applied to text data to be scanned that is stored in byte form, characterized in that the device comprises:
a data reading module, configured to read the text data to be scanned in the form of a byte stream;
a data matching module, configured to match the text data to be scanned byte by byte using a preconfigured byte-stream regular-expression matching tool, wherein the byte-stream regular-expression matching tool contains a byte matching rule that meets the scanning requirement for the text data to be scanned, the byte matching rule being written in advance according to the byte encoding format of the text data to be scanned;
a data extraction module, configured to determine the successfully matched text data as the scanned critical data, wherein the scanned critical data is text data in the form of a byte sequence.
7. The device of claim 6, characterized in that the text data scanning device further comprises:
a decoding module, configured to decode the critical data, so as to convert the critical data in the form of a byte sequence into text data in the form of a character string.
8. The device of claim 6, characterized in that the preconfigured byte-stream regular-expression matching tool corresponds to a byte-stream regular expression that describes a byte data matching rule.
9. The device of claim 8, characterized in that the matching module is specifically configured to:
obtain a character-stream regular expression, the character-stream regular expression describing a character data matching rule that meets the scanning requirement for the text data to be scanned;
replace the character data in the character data matching rule with the corresponding byte data, wherein the character data and the corresponding byte data can be converted into each other using a specified encoding rule.
10. The device of claim 6, characterized in that the matching module is specifically configured such that:
the preconfigured byte-stream regular-expression matching tool accepts the bytes in the text data to be scanned one by one by means of a finite automaton and, when a byte sequence causes the state of the finite automaton to reach a final state, determines that byte sequence to be successfully matched text data.
11. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of claim 1 when executing the program.
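
As an illustration of the scanning, decoding, and rule-conversion steps recited in claims 1, 2, and 4, the following is a minimal Python sketch, not the implementation disclosed in this patent. It assumes the stored text is UTF-8 encoded and uses a bytes pattern from Python's standard `re` module as a stand-in for the byte stream matching tool. It further assumes the character-stream expression contains only literal non-ASCII text plus ASCII metacharacters, so that encoding the expression yields a valid byte-level expression; quantifiers or character classes applied to multi-byte characters would need extra handling. The names `to_byte_pattern` and `scan_bytes` and the sample record are hypothetical.

```python
import io
import re
from typing import Iterator

ENCODING = "utf-8"  # assumption: encoding of the stored text to be scanned


def to_byte_pattern(char_pattern: str, encoding: str = ENCODING) -> "re.Pattern[bytes]":
    """Turn a character-stream regular expression into a byte-stream one.

    The characters of the rule are replaced by their encoded bytes. This is
    only safe under the assumption stated above (literal non-ASCII text and
    ASCII-only metacharacters).
    """
    return re.compile(char_pattern.encode(encoding))


def scan_bytes(stream: io.BufferedIOBase, pattern: "re.Pattern[bytes]") -> Iterator[bytes]:
    """Read the text as raw bytes and yield every matching byte sequence.

    The data is never decoded as a whole; matching runs directly on bytes.
    For simplicity the whole stream is read into memory at once.
    """
    data = stream.read()
    for match in pattern.finditer(data):
        yield match.group(0)


# Hypothetical sample record, stored in byte form.
raw = "姓名:张三 手机号:13800138000 备注:无".encode(ENCODING)

# Character-stream rule written by a person, then converted to a byte rule.
byte_pattern = to_byte_pattern(r"手机号:\d{11}")

# Scan the byte stream; each hit is still a byte sequence.
hits = list(scan_bytes(io.BytesIO(raw), byte_pattern))

# Only the matched byte sequences are decoded into character strings.
decoded = [hit.decode(ENCODING) for hit in hits]
print(decoded)
```

In this sketch only the matched fragments are decoded, which reflects the idea in the claims that the full text does not have to be converted to characters before matching.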
CN201810531245.5A 2018-05-29 2018-05-29 Text data scanning method and device Active CN108734149B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810531245.5A CN108734149B (en) 2018-05-29 2018-05-29 Text data scanning method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810531245.5A CN108734149B (en) 2018-05-29 2018-05-29 Text data scanning method and device

Publications (2)

Publication Number Publication Date
CN108734149A true CN108734149A (en) 2018-11-02
CN108734149B CN108734149B (en) 2022-01-18

Family

ID=63936591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810531245.5A Active CN108734149B (en) 2018-05-29 2018-05-29 Text data scanning method and device

Country Status (1)

Country Link
CN (1) CN108734149B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103139072A (en) * 2011-11-30 2013-06-05 美国博通公司 System and method for integrating line-rate application recognition in a switch ASIC
CN102685098A (en) * 2012-02-24 2012-09-19 华南理工大学 Recombination-free multi-mode matching method for out-of-order data package flow
CN104361097A (en) * 2014-11-21 2015-02-18 国家电网公司 Real-time detection method for electric power sensitive mail based on multimode matching
CN106407475A (en) * 2016-11-18 2017-02-15 广州爱九游信息技术有限公司 Content screening method, device and server

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111309683A (en) * 2020-02-07 2020-06-19 北京明朝万达科技股份有限公司 Method and device for scanning full disk data
CN111309683B (en) * 2020-02-07 2023-04-14 北京明朝万达科技股份有限公司 Method and device for scanning full disk data

Also Published As

Publication number Publication date
CN108734149B (en) 2022-01-18

Similar Documents

Publication Publication Date Title
US20170295263A1 (en) System and method for applying an efficient data compression scheme to url parameters
CN104301207B (en) Web information processing method and device
KR20160138424A (en) Flexible schema for language model customization
CN111191255B (en) Information encryption processing method, server, terminal, device and storage medium
CN108733317B (en) Data storage method and device
CN107919943A (en) Coding, coding/decoding method and the device of binary data
CN111314388B (en) Method and apparatus for detecting SQL injection
CN103999082B (en) Method, computer program and computer for detecting the community in social media
CN110008740B (en) Method, device, medium and electronic equipment for processing document access authority
CN110704833A (en) Data permission configuration method, device, electronic device and storage medium
CN106330846A (en) Cross-platform object recommendation method and device
US10050915B2 (en) Adding images to a text based electronic message
CN108734149A (en) A kind of text data scan method and device
Cho Fixed point theorems in complete cone metric spaces over Banach algebras
CN103853421A (en) Method and equipment for character and picture mixed input
US20160378774A1 (en) Predicting Geolocation Of Users On Social Networks
CN113873450B (en) Short message configuration method, device, computer equipment and storage medium
CN112487765B (en) Method and device for generating notification text
CN106031296B (en) Message processing method and electronic device supporting same
CN109712011B (en) Community discovery method and device
CN107026841A (en) The method and apparatus for issuing works in a network
Bogoya et al. Systems with local and nonlocal diffusions, mixed boundary conditions, and reaction terms
CN111967001A (en) Decoding and coding safety isolation method based on double containers
Hsu Multiple Positive Solutions for a Quasilinear Elliptic System Involving Concave‐Convex Nonlinearities and Sign‐Changing Weight Functions
CN111178010A (en) Method and system for displaying digital signature, data editing method and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201022

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201022

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant