CN102169485B - Method and system for searching a plurality of strings - Google Patents

Method and system for searching a plurality of strings

Info

Publication number
CN102169485B
Authority
CN
China
Prior art keywords
pattern
word
substring
string
memory device
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201010116709.XA
Other languages
Chinese (zh)
Other versions
CN102169485A (en)
Inventor
张�林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
eBay Inc
Original Assignee
eBay Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by eBay Inc filed Critical eBay Inc
Priority to CN201010116709.XA priority Critical patent/CN102169485B/en
Priority to CN201410757944.3A priority patent/CN104484381B/en
Publication of CN102169485A publication Critical patent/CN102169485A/en
Application granted granted Critical
Publication of CN102169485B publication Critical patent/CN102169485B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention relates to a method and system for searching a plurality of strings. The system comprises a first storage device for storing pattern strings, each of which starts with a first pattern word, and a second storage device for storing combinations of the first pattern word of the pattern strings that start with that first pattern word and the corresponding pattern lengths. The system further comprises a search engine for iteratively identifying a word in a text that matches one of the first pattern words as a current word. The search engine iteratively extracts a substring that starts with the current word and has a substring length equal to one of the pattern lengths, and iteratively compares the substring with each pattern string whose first pattern word and string length equal the first word and the length of the substring. The system further comprises a third storage device for storing any substring that matches one of the pattern strings.

Description

Method and system for searching a plurality of strings
Technical field
The present invention relates generally to the technical field of information processing and, more specifically, to methods and systems for information search.
Background
With the development of computer network technologies, more and more people perform digital searches to identify or find specific content in digital documents that meets their needs. For example, people (such as parents) or authorities may try to find, in digital documents available to children, restricted content that is unsuitable for children (such as strings, expressions, or words), and then keep children away from that content. In many cases, however, because of the size and/or large number of such digital documents, identifying or finding this restricted content is a time-consuming task. An improved search method is therefore needed to perform searches efficiently and reduce search time.
Summary of the invention
The object of the present application is to provide a system and method that search for multiple strings efficiently, based on combinations of the first word and the length of those strings, so as to reduce search time.
According to a first aspect of the application, a system is provided for searching a text for multiple strings by using one or more processors to execute instructions residing in a computer. The system comprises: a first memory device for storing pattern strings, each starting with a first pattern word; and a second memory device for storing combinations of the first pattern word of the pattern strings that start with that first pattern word and the corresponding pattern lengths. The system also comprises: a search engine for iteratively identifying a word in the text that matches one of the first pattern words and setting it as the current word; an extractor for iteratively extracting a substring that starts with the current word and has a substring length equal to one of the pattern lengths; and a comparator for iteratively comparing the substring with each pattern string that has the same first pattern word and string length as the first word and length of the substring. The system may also comprise a third memory device for storing information about the substring when the substring matches one of the pattern strings.
According to another aspect of the application, a method is provided for searching a text for multiple strings by using one or more processors to execute instructions residing in a computer. The method comprises: storing pattern strings, each starting with a first pattern word, in a first memory device; and storing, in a second memory device, combinations of the first pattern word of the pattern strings that start with that first pattern word and the corresponding pattern lengths. The method also comprises: using a search engine to iteratively identify a word in the text that matches one of the first pattern words and set it as the current word; using the search engine to iteratively extract a substring that starts with the current word and has a substring length equal to one of the pattern lengths; using the search engine to iteratively compare the substring with each pattern string that has the same first pattern word and string length as the first word and length of the substring; and, if the substring matches one of the pattern strings, storing information about the substring in a third memory device.
The system and method of the application can efficiently find and locate, in a text, a series of predefined pattern strings (such as Chinese words) written in a language that uses a large character set (charset), such as Chinese. The technique adopted by the application takes into account the characteristics of languages that use large character sets, and can achieve linear running time and reduced search time. The technique can be used, for example, to block text containing one or more predefined pattern words in bulletin board system (BBS) threads.
Brief description of the drawings
Embodiments of the invention are illustrated in the accompanying drawings by way of example and not limitation, with like reference numerals indicating like elements, in which:
Fig. 1 is a block diagram illustrating a system for searching a text for multiple target pattern strings, according to an exemplary embodiment;
Fig. 2 is a flowchart illustrating a method for searching a text for multiple target pattern strings, according to an exemplary embodiment; and
Fig. 3 is a block diagram illustrating a machine in the exemplary form of a computer system, according to an exemplary embodiment.
Detailed description
Systems and methods for searching a text for multiple target pattern strings are described below. In the following description, numerous specific details are set forth for purposes of explanation, in order to provide a thorough understanding of the invention. Those skilled in the art will understand, however, that the invention may be practiced without these specific details.
In many cases, people try to search digital documents to find and locate specific content. For example, some parents or authorities may search digital documents that are open to children to determine whether those documents contain harmful content written in an Asian language that uses a large character set (for example, Chinese). Such harmful content may be multiple predetermined target pattern strings (such as Chinese characters or words) that are unsuitable for children, such as "pornographic", "pornographic net", "porny", "violence" and "violence TV play". Each target pattern string (e.g., "pornographic net") has a first word (e.g., "look") and a pattern string length (e.g., 3). The quoted examples are renderings of Chinese words, so the first word corresponds to the first character of the word and the length to its number of characters.
Fig. 1 is a block diagram illustrating a system 100 for searching a text for multiple target pattern strings, according to an exemplary embodiment. In some embodiments, system 100 may comprise a first memory device 10, a second memory device 20, a third memory device 30, and a search engine 40. System 100 may also comprise one or more processors 50 for executing instructions residing in a computer to operate the other components.
In some embodiments, the first memory device 10 may store predefined target pattern strings (e.g., "pornographic", "pornographic net", "porny", "violence" and "violence TV play"), each of which has a first word (e.g., "look" or "cruelly") and a pattern string length (e.g., 2, 3, 4 or 5). The first memory device 10 may comprise a HashSet. HashSet is a concrete implementation of the Set interface; it creates a collection that uses a hash table for storage. A hash table stores information using a mechanism called hashing. In some embodiments, system 100 may also comprise a user interface 60, which can be used to receive the target pattern strings to be stored in the first memory device 10.
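As a rough illustration of the first memory device, the sketch below stores a handful of target pattern strings in a `java.util.HashSet` and tests membership. The class name `PatternStore`, its methods, and the short ASCII strings are hypothetical stand-ins (each letter plays the role of one character of a pattern word); none of them come from the patent.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the first memory device: a HashSet of pattern strings.
final class PatternStore {
    private final Set<String> patterns = new HashSet<>();

    // Adds one target pattern string; returns false if it was already stored.
    boolean add(String pattern) {
        if (pattern == null || pattern.isEmpty()) {
            throw new IllegalArgumentException("pattern must be non-empty");
        }
        return patterns.add(pattern);
    }

    // Membership test backed by hashing.
    boolean contains(String candidate) {
        return patterns.contains(candidate);
    }

    int size() {
        return patterns.size();
    }

    public static void main(String[] args) {
        PatternStore store = new PatternStore();
        // Assumed sample data: ASCII stand-ins for the pattern words.
        for (String p : new String[] {"ab", "abc", "abd", "abcd", "xy", "xyzzy"}) {
            store.add(p);
        }
        System.out.println(store.contains("abc"));
        System.out.println(store.contains("abx"));
    }
}
```

Because a set ignores repeated insertions, patterns received more than once from the user interface would be stored only once under this sketch.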
In some embodiments, the second memory device 20 may store unique combinations (e.g., <"look", (2, 3, 4)> and <"cruelly", (2, 5)>) of the first word (e.g., "look" or "cruelly") and the pattern string lengths (e.g., 2, 3, 4 or 5) of the target pattern strings (e.g., "pornographic", "pornographic net", "pornographic acute", "porny", "violence" and "violence TV play"). For the pattern strings that start with a given first pattern word (e.g., "pornographic", "pornographic net", "pornographic acute" and "porny", which start with "look"), the combination (e.g., <"look", (2, 3, 4)>) of the first pattern word and the pattern lengths (e.g., 2, 3, 3, 4) contains no duplicates. The second memory device 20 may comprise at least one HashMap. A HashMap is a data structure that uses a hash function to map identifiers, or keys (e.g., a person's name), efficiently to associated values (e.g., their telephone number). The hash function converts a key into the index of the array element (slot or bucket) in which the corresponding value is to be found.
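The first-word-to-lengths combinations can be sketched as a `HashMap` whose values are sets of lengths, so that a repeated length (such as the two patterns of length 3 above) is recorded only once. The class `FirstWordIndex`, the method `build`, and the ASCII sample patterns are assumptions made for illustration; the sketch also assumes every character fits in one Java `char`, which is not true for characters outside the Basic Multilingual Plane.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch of the second memory device: first character -> distinct lengths.
final class FirstWordIndex {
    // Builds the map; each key is the first character of at least one pattern.
    static Map<Character, SortedSet<Integer>> build(List<String> patterns) {
        Map<Character, SortedSet<Integer>> index = new HashMap<>();
        for (String p : patterns) {
            if (p.isEmpty()) {
                continue; // an empty pattern has no first word
            }
            // A set value silently drops lengths that are already recorded.
            index.computeIfAbsent(p.charAt(0), k -> new TreeSet<>()).add(p.length());
        }
        return index;
    }

    public static void main(String[] args) {
        // Assumed sample data: "abc" and "abd" share both first character and length.
        List<String> patterns = Arrays.asList("ab", "abc", "abd", "abcd", "xy", "xyzzy");
        Map<Character, SortedSet<Integer>> index = build(patterns);
        System.out.println(index.get('a')); // lengths 2, 3 and 4
        System.out.println(index.get('x')); // lengths 2 and 5
    }
}
```

With these stand-ins the two entries have the same shape as the <"look", (2, 3, 4)> and <"cruelly", (2, 5)> combinations in the text.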
In some embodiments, search engine 40 may, starting from the beginning of the text, iteratively identify a word that matches one of the first pattern words (e.g., "look") and set the matching word as the current word in the text.
In some embodiments, search engine 40 may iteratively extract a substring that starts with the current word (e.g., "look") and has a substring length equal to one of the target pattern string lengths (e.g., 2).
In some embodiments, search engine 40 may iteratively compare the substring with each target pattern string (e.g., "pornographic") whose first pattern word (e.g., "look") and string length (e.g., 2) are the same as the first word and length of the substring. If the extracted substring matches one of the target pattern strings (e.g., "pornographic"), information about the substring may be stored in the third memory device 30. This information may include the position of the substring.
This process moves forward until the end of the text is reached. In some embodiments, the located strings are highlighted to alert the user. In some embodiments, system 100 may comprise a display 70 for showing the information about the substrings stored in the third memory device 30.
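The patent does not say how located strings are highlighted. One possible approach, sketched below under that caveat, takes the stored positions and lengths and wraps each matched region in `[[` and `]]` markers. The class `Highlighter`, the `{start, length}` array layout, and the marker strings are all hypothetical choices.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: highlight matched regions given their start positions and lengths.
final class Highlighter {
    // Each match is an int array {start, length}; overlapping matches merge into one region.
    static String highlight(String text, List<int[]> matches) {
        boolean[] marked = new boolean[text.length()];
        for (int[] m : matches) {
            // Marking the same position twice is harmless, which handles overlaps.
            Arrays.fill(marked, m[0], m[0] + m[1], true);
        }
        StringBuilder out = new StringBuilder();
        boolean open = false;
        for (int i = 0; i < text.length(); i++) {
            if (marked[i] && !open) {
                out.append("[[");
                open = true;
            } else if (!marked[i] && open) {
                out.append("]]");
                open = false;
            }
            out.append(text.charAt(i));
        }
        if (open) {
            out.append("]]"); // close a region that runs to the end of the text
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed sample data: two overlapping hits at position 2 and one at position 8.
        List<int[]> matches = Arrays.asList(new int[] {2, 2}, new int[] {2, 4}, new int[] {8, 2});
        System.out.println(highlight("zzabcdzzxy", matches)); // prints zz[[abcd]]zz[[xy]]
    }
}
```

Marking positions first, rather than inserting markers per match, keeps the output well formed when a shorter pattern is a prefix of a longer one found at the same position.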
System 100 can efficiently find and locate, in the text, one or more substrings that match the predefined target pattern strings, thereby reducing search time.
Fig. 2 is a flowchart illustrating a method 200 for searching a text for multiple target pattern strings, according to an exemplary embodiment.
In some embodiments, at operation 202, multiple target pattern strings (e.g., "pornographic", "pornographic net", "pornographic acute", "porny", "violence" and "violence TV play") are each stored in the first memory device 10. Each of these target pattern strings starts with a first pattern word (e.g., "look" or "cruelly").
At operation 204, unique combinations (e.g., <"look", (2, 3, 4)> and <"cruelly", (2, 5)>) of the first pattern word (e.g., "look" or "cruelly") and the corresponding pattern lengths (e.g., 2, 3, 4 or 5) of the target pattern strings (e.g., "pornographic", "pornographic net", "pornographic acute", "porny", "violence" and "violence TV play") that start with that first pattern word are stored in the second memory device 20.
At operation 206, search engine 40 is used to iteratively identify a word in the text that matches one of the first pattern words (e.g., "look") and set it as the current word.
At operation 208, search engine 40 is used to iteratively extract a substring that starts with the current word (e.g., "look") and has a substring length equal to one of the target pattern string lengths (e.g., 2).
At operation 210, search engine 40 is used to iteratively compare the substring with each target pattern string (e.g., "pornographic") whose first pattern word (e.g., "look") and string length (e.g., 2) are the same as the first word and length of the substring.
If the substring matches one of the target pattern strings (e.g., "pornographic"), then at operation 212 information about the substring is stored in the third memory device 30. This information may include the position of the substring. Operations 206 to 212 are repeated until the end of the text is reached.
At operation 214, display 70 is used to show the information about all the substrings stored in the third memory device 30.
The operation of the embodiments can be performed in two stages: initialization and processing. In the initialization stage, the target pattern words (the words to be searched for) are placed in a HashSet, and the first word and length of each target pattern word are placed in a HashMap. Because a HashMap in Java does not allow duplicate keys, the different lengths of pattern words that share the same first word are stored together under a single key in the HashMap.
In the processing stage, the text is processed iteratively from the beginning, and each word of the text is examined. If the current word is found in the HashMap, the substring at the current position may be a pattern word. The possible lengths are obtained from the HashMap and, for each possible length, the substring of that length is extracted and checked against the HashSet. If there is a hit, a target pattern word has been found in the text; its starting position is the current position and its length is the current length. Otherwise, processing continues with the next possible length. Once all possible lengths have been processed, processing continues with the next word of the text.
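The two stages described above can be sketched together as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `MultiPatternSearch` and `Match` are hypothetical, each character is assumed to fit in one Java `char`, the sample patterns are ASCII stand-ins, and the sketch keeps checking the remaining lengths after a hit so that every matching pattern at a position is recorded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the initialization and processing stages.
final class MultiPatternSearch {
    // One hit: start position in the text and length of the matched pattern.
    static final class Match {
        final int start;
        final int length;

        Match(int start, int length) {
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return start + ":" + length;
        }
    }

    private final Set<String> patterns = new HashSet<>();
    private final Map<Character, Set<Integer>> lengthsByFirst = new HashMap<>();

    // Initialization stage: fill the HashSet and the first-character -> lengths HashMap.
    MultiPatternSearch(List<String> targets) {
        for (String t : targets) {
            if (t.isEmpty()) {
                continue;
            }
            patterns.add(t);
            lengthsByFirst.computeIfAbsent(t.charAt(0), k -> new TreeSet<>()).add(t.length());
        }
    }

    // Processing stage: scan the text once, probing only lengths known for the current character.
    List<Match> search(String text) {
        List<Match> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            Set<Integer> lengths = lengthsByFirst.get(text.charAt(i));
            if (lengths == null) {
                continue; // current character starts no pattern
            }
            for (int len : lengths) {
                if (i + len > text.length()) {
                    break; // lengths are sorted ascending, so longer ones cannot fit either
                }
                if (patterns.contains(text.substring(i, i + len))) {
                    hits.add(new Match(i, len));
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        MultiPatternSearch searcher =
                new MultiPatternSearch(Arrays.asList("ab", "abc", "abd", "abcd", "xy", "xyzzy"));
        System.out.println(searcher.search("zzabcdzzxy")); // prints [2:2, 2:3, 2:4, 8:2]
    }
}
```

Only positions whose character is a key in the map trigger any substring extraction, which is where the approach saves work on text written with a large character set.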
Suppose the text has length M and there are N target pattern strings. The initialization stage takes time A*N, where A is a constant covering the time to store a word in the HashSet, extract the first word of that word, and store the first word and the length in the HashMap. The processing stage takes time B*M, where B is a constant covering the time to look up a word in the HashMap and, when a hit is found, to extract a substring and look it up in the HashSet. The total time complexity of the algorithm is therefore a function of (A*N + B*M), that is, linear in N and M. Treating B as a constant assumes that the number of distinct pattern lengths stored for any one first word, and the lengths themselves, are bounded.
Fig. 3 is a block diagram illustrating a machine in the exemplary form of a computer system 300, within which a set of instructions may be executed to cause the machine to perform any one of the methods discussed herein. In alternative embodiments, the machine may be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a web appliance, a network router, switch or bridge, or any machine capable of executing a set of instructions that specify actions to be taken by that machine. Further, although only a single machine is illustrated, the term "machine" also includes any collection of machines that individually or jointly execute a set of instructions to perform any one or more of the methods discussed herein.
The exemplary computer system 300 includes a processor 302 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory 304, and a static memory 306, which communicate with one another via a bus 308. Computer system 300 may also include a video display unit 310 (e.g., a liquid crystal display (LCD) or a cathode-ray tube (CRT)). Computer system 300 also includes an alphanumeric input device 312 (e.g., a keyboard), a cursor control device 314 (e.g., a mouse), a disk drive unit 316, a signal generation device 328 (e.g., a speaker), and a network interface device 320.
The disk drive unit 316 includes a machine-readable medium 322 on which are stored one or more sets of instructions (e.g., software 324) embodying any one or more of the methods or functions described herein. The software 324 may also reside, completely or at least partially, within the main memory 304 and/or within the processor 302 during execution by computer system 300; the main memory 304 and the processor 302 also constitute machine-readable media.
The software 324 may also be transmitted or received over a network 326 via the network interface device 320. Although the machine-readable medium 322 is shown in the exemplary embodiment as a single medium, the term "machine-readable medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term "machine-readable medium" shall also be taken to include any medium that is capable of storing, encoding, or carrying a set of instructions for execution by the machine and that causes the machine to perform any one or more of the methods of the embodiments of the invention. The term "machine-readable medium" shall accordingly be taken to include, but not be limited to, solid-state memories, optical and magnetic media, and carrier wave signals.
Thus, a method and system for searching a text for multiple target pattern strings have been described. Although the invention has been described with reference to specific exemplary embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the invention. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

Claims (9)

1. A method for searching a text for a series of predefined pattern strings written in a language that uses a large character set, the method comprising:
storing pattern strings, each starting with a first pattern word, in a first memory device;
storing, in a second memory device, combinations of the first pattern word of the pattern strings that start with that first pattern word and the corresponding pattern string lengths, wherein the combination of the first pattern word and the pattern string lengths of the pattern strings that start with that first pattern word contains no duplicates;
using a search engine to iteratively identify a word in the text that matches one of the first pattern words and set it as a current word;
using the search engine to iteratively extract a substring that starts with the current word and has a substring length equal to one of the pattern string lengths;
using the search engine to iteratively compare the substring with each pattern string that has the same first pattern word and string length as the substring; and
if the substring matches one of the pattern strings, storing information about the substring in a third memory device.
2. The method of claim 1, further comprising receiving the pattern strings from a user interface.
3. The method of claim 1, further comprising using a display to show the information about the substring stored in the third memory device.
4. The method of claim 1, wherein the first memory device comprises at least one HashSet, and wherein the second memory device comprises at least one HashMap.
5. A system for searching a text for a series of predefined pattern strings written in a language that uses a large character set, the system comprising:
means for storing pattern strings, each starting with a first pattern word, in a first memory device;
means for storing, in a second memory device, combinations of the first pattern word of the pattern strings that start with that first pattern word and the corresponding pattern string lengths, wherein the combination of the first pattern word and the pattern string lengths of the pattern strings that start with that first pattern word contains no duplicates;
means for using a search engine to iteratively identify a word in the text that matches one of the first pattern words and set it as a current word;
means for using the search engine to iteratively extract a substring that starts with the current word and has a substring length equal to one of the pattern string lengths;
means for using the search engine to iteratively compare the substring with each pattern string that has the same first pattern word and string length as the substring; and
means for storing information about the substring in a third memory device when the substring matches one of the pattern strings.
6. The system of claim 5, further comprising means for receiving the pattern strings from a user interface.
7. The system of claim 5, further comprising means for using a display to show the information about the substring stored in the third memory device.
8. The system of claim 5, wherein the information about the substring stored in the third memory device comprises the position of the substring in the text.
9. The system of claim 5, wherein the search engine highlights the substring if the substring matches one of the pattern strings.
CN201010116709.XA 2010-02-26 2010-02-26 Method and system for searching a plurality of strings Expired - Fee Related CN102169485B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201010116709.XA CN102169485B (en) 2010-02-26 2010-02-26 Method and system for searching a plurality of strings
CN201410757944.3A CN104484381B (en) 2010-02-26 2010-02-26 Method and system for searching multiple strings

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010116709.XA CN102169485B (en) 2010-02-26 2010-02-26 Method and system for searching a plurality of strings

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201410757944.3A Division CN104484381B (en) 2010-02-26 2010-02-26 Method and system for searching multiple strings

Publications (2)

Publication Number Publication Date
CN102169485A CN102169485A (en) 2011-08-31
CN102169485B true CN102169485B (en) 2015-01-07

Family

ID=44490648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010116709.XA Expired - Fee Related CN102169485B (en) 2010-02-26 2010-02-26 Method and system for searching a plurality of strings

Country Status (1)

Country Link
CN (1) CN102169485B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1496522A (en) * 2000-03-29 2004-05-12 Koninklijke Philips Electronics N.V. Data search user interface with ergonomic mechanism for user profile definition and manipulation
CN1794236A (en) * 2004-12-21 2006-06-28 英特尔公司 Efficient CAM-based techniques to perform string searches in packet payloads

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080267504A1 (en) * 2007-04-24 2008-10-30 Nokia Corporation Method, device and computer program product for integrating code-based and optical character recognition technologies into a mobile visual search

Also Published As

Publication number Publication date
CN102169485A (en) 2011-08-31

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150107

Termination date: 20210226