CN102169485B - Method and system for searching a plurality of strings - Google Patents
- Publication number: CN102169485B
- Application number: CN201010116709.XA
- Authority: CN (China)
- Prior art keywords: pattern, word, substring, string, memory device
- Legal status: Expired - Fee Related (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a method and system for searching a text for a plurality of strings. The system comprises a first storage device for storing pattern strings, each of which starts with a first pattern word, and a second storage device for storing combinations of the first pattern word of those pattern strings and the corresponding pattern lengths. The system further comprises a search engine that iteratively identifies a word in the text that matches one of the first pattern words and sets it as the current word, iteratively extracts a substring that starts with the current word and has a length equal to one of the pattern lengths, and iteratively compares the substring with each pattern string that has the same first pattern word and the same length as the substring. The system further comprises a third storage device for storing any substring that matches one of the pattern strings.
Description
Technical field
The present invention relates generally to the technical field of information processing and, more specifically, to methods and systems for information search.
Background
With the development of computer and network technologies, more and more people perform digital searches to identify or find specific content in digital documents that meets their needs. For example, a person (such as a parent) or an authority may try to find restricted content that is unsuitable for children (such as strings, expressions or words) in digital documents available to children, and then keep children away from that content. In many cases, however, because of the size and/or the large number of such digital documents, identifying or finding the restricted content is a time-consuming task. An improved search method is therefore needed to perform searches efficiently and reduce search time.
Summary of the invention
An object of the present application is to provide a system and method that search for multiple strings efficiently, based on combinations of the first word and the length of each string, in order to reduce search time.
According to a first aspect of the application, a system is provided for searching a text for multiple strings by using one or more processors to execute instructions resident in a computer. The system comprises: a first memory device for storing pattern strings, each of which starts with a first pattern word; and a second memory device for storing combinations of the first pattern word of the pattern strings and the corresponding pattern lengths. The system further comprises: a search engine for iteratively identifying a word in the text that matches one of the first pattern words and setting it as the current word; an extractor for iteratively extracting a substring that starts with the current word and has a substring length equal to one of the pattern lengths; and a comparator for iteratively comparing the substring with each pattern string whose first pattern word and string length are the same as the first word and length of the substring. The system may further comprise a third memory device for storing information relating to the substring when the substring matches one of the pattern strings.
According to another aspect of the application, a method is provided for searching a text for multiple strings by using one or more processors to execute instructions resident in a computer. The method comprises: storing pattern strings, each of which starts with a first pattern word, in a first memory device; and storing combinations of the first pattern word of the pattern strings and the corresponding pattern lengths in a second memory device. The method further comprises: using a search engine to iteratively identify a word in the text that matches one of the first pattern words and set it as the current word; using the search engine to iteratively extract a substring that starts with the current word and has a substring length equal to one of the pattern lengths; using the search engine to iteratively compare the substring with each pattern string whose first pattern word and string length are the same as the first word and length of the substring; and, if the substring matches one of the pattern strings, storing information relating to the substring in a third memory device.
The system and method of the application can efficiently find and locate, in a text, a series of predefined pattern strings (such as Chinese words) written in a language that uses a large character set (charset), such as Chinese. The technique takes the characteristics of large-character-set languages into account and can achieve linear running time and reduced search time. It can be used, for example, to block text that contains one or more predefined pattern words in Bulletin Board System (BBS) threads.
Brief description of the drawings
Embodiments of the invention are illustrated by way of example, and not by way of limitation, in the accompanying drawings, in which like reference numerals indicate similar elements:
Fig. 1 is a block diagram illustrating a system for searching a text for multiple target pattern strings, according to an exemplary embodiment;
Fig. 2 is a flowchart illustrating a method for searching a text for multiple target pattern strings, according to an exemplary embodiment; and
Fig. 3 is a block diagram illustrating a machine in the exemplary form of a computer system, according to an exemplary embodiment.
Detailed description
Systems and methods for searching a text for multiple target pattern strings are described below. In the following description, numerous specific details are set forth for purposes of explanation, in order to provide a thorough understanding of the invention. Those skilled in the art will appreciate, however, that the invention may be practiced without these specific details.
In many cases, people search digital documents to find and locate specific content. For example, parents or authorities may search the digital documents that are open to children to determine whether those documents contain harmful content written in an Asian language that uses a large character set (for example, Chinese). The harmful content may be multiple predetermined target pattern strings (such as Chinese words or characters) that are unsuitable for children, such as "色情" (pornography), "色情网" (pornographic website), "色情图片" (pornographic pictures), "暴力" (violence) and "暴力电视剧" (violent TV drama). Each target pattern string (for example, "色情网") has a first word (for example, "色") and a pattern string length (for example, 3).
Fig. 1 is a block diagram illustrating a system 100 for searching a text for multiple target pattern strings, according to an exemplary embodiment. In some embodiments, system 100 may comprise a first memory device 10, a second memory device 20, a third memory device 30 and a search engine 40. System 100 may also comprise one or more processors 50 for executing instructions resident in a computer in order to operate the other components.
In some embodiments, the first memory device 10 may store predefined target pattern strings (for example, "色情" (pornography), "色情网" (pornographic website), "色情图片" (pornographic pictures), "暴力" (violence) and "暴力电视剧" (violent TV drama)), each of which has a first word (for example, "色" or "暴") and a pattern string length (for example, 2, 3, 4 or 5). The first memory device 10 may comprise a HashSet. HashSet is a concrete implementation of the Set interface; it creates a collection that uses a hash table for storage. A hash table stores information by means of a mechanism called hashing. In some embodiments, system 100 may also comprise a user interface 60, which can be used to receive the target pattern strings to be stored in the first memory device 10.
In some embodiments, the second memory device 20 may store the unique combinations (for example, <"色", (2, 3, 4)> and <"暴", (2, 5)>) of the first word (for example, "色" or "暴") and the pattern string lengths (for example, 2, 3, 4 or 5) of the target pattern strings (for example, "色情" (pornography), "色情网" (pornographic website), "色情剧" (pornographic drama), "色情图片" (pornographic pictures), "暴力" (violence) and "暴力电视剧" (violent TV drama)). For the pattern strings that start with a given first pattern word (for example, "色情", "色情网", "色情剧" and "色情图片", which start with "色"), the combination of that first pattern word and the pattern lengths (for example, 2, 3, 3 and 4) contains no repeated length (for example, <"色", (2, 3, 4)>). The second memory device 20 may comprise at least one HashMap. A HashMap is a data structure that uses a hash function to map identifiers or keys (for example, a person's name) efficiently to associated values (for example, that person's telephone number). The hash function converts the key into the index of the array element (slot or bucket) in which the corresponding value is to be found.
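The two stores described above can be sketched in Python, with a `set` and a `dict` standing in for the HashSet and HashMap named in the text. This is only a minimal illustration under stated assumptions: the function name `build_stores` and the neutral place-name patterns are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def build_stores(patterns):
    """Return (pattern_set, lengths_by_first_word) for the given pattern strings."""
    pattern_set = set(patterns)                 # plays the role of the HashSet
    lengths_by_first_word = defaultdict(set)    # plays the role of the HashMap
    for pattern in pattern_set:
        if pattern:                             # an empty string has no first word
            # A set of lengths per key keeps each (first word, length) pair unique.
            lengths_by_first_word[pattern[0]].add(len(pattern))
    # Sort each length set so that later lookups try the shortest candidate first.
    return pattern_set, {first: sorted(lengths)
                         for first, lengths in lengths_by_first_word.items()}


# Hypothetical, neutral stand-in patterns (place names), not the patent's examples.
patterns = ["北京", "北京人", "北京站", "北京大学", "上海", "上海图书馆"]
pattern_set, lengths_by_first_word = build_stores(patterns)
print(lengths_by_first_word)
```

Two of the stand-in patterns that start with "北" have length 3, yet the entry for "北" holds the lengths 2, 3 and 4 only once each, which mirrors the non-repeating combinations described above.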
In some embodiments, starting from the beginning of the text, the search engine 40 may iteratively identify a word that matches one of the first pattern words (for example, "色") and set the matching word as the current word in the text.
In some embodiments, the search engine 40 may iteratively extract a substring that starts with the current word (for example, "色") and has a substring length equal to one of the target pattern string lengths (for example, 2).
In some embodiments, the search engine 40 may iteratively compare the substring with each target pattern string (for example, "色情" (pornography)) whose first pattern word (for example, "色") and string length (for example, 2) are the same as the first word and length of the substring. If the extracted substring matches one of the target pattern strings (for example, "色情"), information relating to the substring may be stored in the third memory device 30. This information may include the position of the substring.
This process continues until the end of the text is reached. In some embodiments, the located strings are highlighted to alert the user. In some embodiments, system 100 may comprise a display 70 for showing the information relating to the substrings stored in the third memory device 30.
System 100 can thus efficiently find and locate, in the text, substrings that match one or more of the predefined target pattern strings, which reduces search time.
Fig. 2 is a flowchart illustrating a method 200 for searching a text for multiple target pattern strings, according to an exemplary embodiment.
In some embodiments, at operation 202, multiple target pattern strings (for example, "色情" (pornography), "色情网" (pornographic website), "色情剧" (pornographic drama), "色情图片" (pornographic pictures), "暴力" (violence) and "暴力电视剧" (violent TV drama)) are stored in the first memory device 10. Each of these target pattern strings starts with a first pattern word (for example, "色" or "暴").
At operation 204, the unique combinations (for example, <"色", (2, 3, 4)> and <"暴", (2, 5)>) of the first pattern word (for example, "色" or "暴") and the corresponding pattern lengths (for example, 2, 3, 4 or 5) of the target pattern strings are stored in the second memory device 20.
At operation 206, the search engine 40 is used to iteratively identify a word in the text that matches one of the first pattern words (for example, "色") and to set it as the current word.
At operation 208, the search engine 40 is used to iteratively extract a substring that starts with the current word (for example, "色") and has a substring length equal to one of the target pattern string lengths (for example, 2).
At operation 210, the search engine 40 is used to iteratively compare the substring with each target pattern string (for example, "色情") whose first pattern word (for example, "色") and string length (for example, 2) are the same as the first word and length of the substring.
If the substring matches one of the target pattern strings (for example, "色情"), then at operation 212 information relating to the substring is stored in the third memory device 30. This information may include the position of the substring. Operations 206 to 212 are repeated until the end of the text is reached.
At operation 214, the display 70 is used to show the information relating to all the substrings stored in the third memory device 30.
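One way the stored positions could be presented to a user is to mark the matched spans in the text. The sketch below is an assumption, not the patent's display logic: the function `highlight`, the bracket markers and the `(position, length)` record format are all hypothetical, and overlapping matches are merged into a single highlighted span.

```python
def highlight(text, matches, open_mark="[", close_mark="]"):
    """Wrap every matched span of `text` in markers; overlapping spans are merged."""
    # Convert (position, length) records into (start, end) spans and sort them.
    spans = sorted((pos, pos + length) for pos, length in matches)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous span: extend that span instead.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    pieces, cursor = [], 0
    for start, end in merged:
        pieces.append(text[cursor:start])                        # unmarked text
        pieces.append(open_mark + text[start:end] + close_mark)  # marked match
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hypothetical stored records: "北京" and "北京大学" both start at position 2.
text = "我在北京大学读书"
matches = [(2, 2), (2, 4)]
print(highlight(text, matches))  # prints: 我在[北京大学]读书
```

Merging matters here because patterns that share a first word can match at the same position with different lengths.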
The operation of an embodiment can be performed in two phases: initialization and processing. In the initialization phase, the target pattern words (the words to be searched for) are placed in a HashSet, and the first word and the length of each target pattern word are placed in a HashMap. Because a HashMap in Java does not allow duplicate keys, the different lengths of the pattern words that share a first word are stored together under that single key.
In the processing phase, the text is processed iteratively from the beginning, and each word of the text is examined. If the current word is found in the HashMap, the substring at the current position may be a pattern word. The possible lengths are obtained from the HashMap and, for each possible length, the substring of that length is extracted and checked against the HashSet. If there is a hit, a target pattern word has been found in the text: its start position is the current position and its length is the current length. Otherwise, processing moves on to the next possible length. Once all possible lengths have been processed, processing moves on to the next word of the text.
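The two phases can be put together in a short Python sketch, again with `set` and `dict` standing in for the HashSet and HashMap. The function names `initialize` and `search`, the sample patterns and the sample text are hypothetical. The sketch also assumes that checking continues with the remaining lengths after a hit, so that every pattern starting at a position is reported; the description leaves this point open.

```python
def initialize(patterns):
    """Initialization phase: build the pattern set and the first-word -> lengths map."""
    pattern_set = set(patterns)
    lengths_by_first_word = {}
    for pattern in pattern_set:
        if pattern:
            lengths_by_first_word.setdefault(pattern[0], set()).add(len(pattern))
    return pattern_set, lengths_by_first_word


def search(text, pattern_set, lengths_by_first_word):
    """Processing phase: return (position, length) for every pattern found in text."""
    hits = []
    for position, word in enumerate(text):
        # Words that start no pattern have no entry in the map and cost one lookup.
        for length in sorted(lengths_by_first_word.get(word, ())):
            candidate = text[position:position + length]
            # Near the end of the text the slice can be too short; skip it then.
            if len(candidate) == length and candidate in pattern_set:
                hits.append((position, length))
    return hits


# Hypothetical stand-in patterns and text, chosen only for illustration.
pattern_set, lengths_by_first_word = initialize(["北京", "北京大学", "上海", "上海图书馆"])
text = "从上海到北京大学"
for position, length in search(text, pattern_set, lengths_by_first_word):
    print(position, length, text[position:position + length])
```

Only positions 1 and 4 of the sample text trigger any substring extraction, because only "上" and "北" appear as keys in the map; every other position is dismissed after a single lookup.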
Suppose that the text length is M and that there are N target pattern strings. The time required in the initialization phase is A*N, where A is a constant covering the time needed to store a word in the HashSet, to extract the first word and the length of the word, and to store them in the HashMap. The time required in the processing phase is B*M, where B is a constant covering the time needed to look up a word in the HashMap and, when a hit is found, to extract a substring and look it up in the HashSet. The overall time complexity of the algorithm is therefore a function of (A*N+B*M), that is, linear in N and M.
Fig. 3 is a block diagram illustrating a machine in the exemplary form of a computer system 300, within which a set of instructions for causing the machine to perform any one of the methods discussed herein may be executed. In alternative embodiments, the machine may be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing a set of instructions that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term "machine" shall also be taken to include any collection of machines that individually or jointly execute a set of instructions to perform any one or more of the methods discussed herein.
The exemplary computer system 300 comprises a processor 302 (for example, a central processing unit (CPU), a graphics processing unit (GPU) or both), a main memory 304 and a static memory 306, which communicate with each other via a bus 308. The computer system 300 may also comprise a video display unit 310 (for example, a liquid crystal display (LCD) or a cathode ray tube (CRT)). The computer system 300 also comprises an alphanumeric input device 312 (for example, a keyboard), a cursor control device 314 (for example, a mouse), a disk drive unit 316, a signal generation device 328 (for example, a speaker) and a network interface device 320.
The disk drive unit 316 comprises a machine-readable medium 322 on which are stored one or more sets of instructions (for example, software 324) embodying any one or more of the methods or functions described herein. During execution by the computer system 300, the software 324 may also reside, completely or at least partially, within the main memory 304 and/or the processor 302; the main memory 304 and the processor 302 also constitute machine-readable media.
The software 324 may also be transmitted or received over a network 326 via the network interface device 320. While the machine-readable medium 322 is shown in the exemplary embodiment as a single medium, the term "machine-readable medium" should be taken to include a single medium or multiple media (for example, a centralized or distributed database and/or associated caches and servers) that store the one or more sets of instructions. The term "machine-readable medium" shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that causes the machine to perform any one or more of the method operations of the embodiments of the invention. The term "machine-readable medium" shall accordingly be taken to include, but not be limited to, solid-state memories, optical and magnetic media, and carrier wave signals.
Thus, methods and systems for searching a text for multiple target pattern strings have been described. Although the invention has been described with reference to specific exemplary embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the invention. Accordingly, the specification and drawings are to be regarded as illustrative rather than restrictive.
Claims (9)
1. A method for searching a text for a series of predefined pattern strings written in a language that uses a large character set, the method comprising:
storing pattern strings, each of which starts with a first pattern word, in a first memory device;
storing combinations of the first pattern word of the pattern strings and the corresponding pattern string lengths in a second memory device, wherein the combinations of the first pattern word and the pattern string lengths of the pattern strings starting with that first pattern word are non-repeating;
using a search engine to iteratively identify a word in the text that matches one of the first pattern words and to set it as the current word;
using the search engine to iteratively extract a substring that starts with the current word and has a substring length equal to one of the pattern string lengths;
using the search engine to iteratively compare the substring with each pattern string that has the same first pattern word and string length as the substring; and
if the substring matches one of the pattern strings, storing information relating to the substring in a third memory device.
2. The method of claim 1, further comprising receiving the pattern strings from a user interface.
3. The method of claim 1, further comprising using a display to show the information relating to the substring stored in the third memory device.
4. The method of claim 1, wherein the first memory device comprises at least one HashSet, and wherein the second memory device comprises at least one HashMap.
5. A system for searching a text for a series of predefined pattern strings written in a language that uses a large character set, the system comprising:
means for storing pattern strings, each of which starts with a first pattern word, in a first memory device;
means for storing combinations of the first pattern word of the pattern strings and the corresponding pattern string lengths in a second memory device, wherein the combinations of the first pattern word and the pattern string lengths of the pattern strings starting with that first pattern word are non-repeating;
means for using a search engine to iteratively identify a word in the text that matches one of the first pattern words and to set it as the current word;
means for using the search engine to iteratively extract a substring that starts with the current word and has a substring length equal to one of the pattern string lengths;
means for using the search engine to iteratively compare the substring with each pattern string that has the same first pattern word and string length as the substring; and
means for storing information relating to the substring in a third memory device when the substring matches one of the pattern strings.
6. The system of claim 5, further comprising means for receiving the pattern strings from a user interface.
7. The system of claim 5, further comprising means for using a display to show the information relating to the substring stored in the third memory device.
8. The system of claim 5, wherein the information relating to the substring stored in the third memory device comprises the position of the substring in the text.
9. The system of claim 5, wherein the search engine highlights the substring if the substring matches one of the pattern strings.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410757944.3A (CN104484381B) | 2010-02-26 | 2010-02-26 | Method and system for searching multiple strings |
CN201010116709.XA (CN102169485B) | 2010-02-26 | 2010-02-26 | Method and system for searching a plurality of strings |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201010116709.XA (CN102169485B) | 2010-02-26 | 2010-02-26 | Method and system for searching a plurality of strings |
Related Child Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410757944.3A (division, CN104484381B) | 2010-02-26 | 2010-02-26 | Method and system for searching multiple strings |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102169485A | 2011-08-31 |
CN102169485B | 2015-01-07 |
Family
ID=44490648
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
C14 | Grant of patent or utility model | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 2015-01-07. Termination date: 2021-02-26 |