CN102169485A - Method and system for searching a plurality of strings - Google Patents

Method and system for searching a plurality of strings

Publication number: CN102169485A
Authority: CN (China)
Prior art keywords: pattern, word, substring, string, length
Legal status: Granted
Application number: CN201010116709XA
Other languages: Chinese (zh)
Other versions: CN102169485B (en)
Inventor: 张�林
Current Assignee: eBay Inc
Original Assignee: eBay Inc
Priority date: 2010-02-26
Filing date: 2010-02-26
Publication date: 2011-08-31
Application filed by eBay Inc
Priority to CN201410757944.3A (CN104484381B)
Priority to CN201010116709.XA (CN102169485B)
Publication of CN102169485A
Application granted
Publication of CN102169485B
Expired - Fee Related

Abstract

The invention relates to a method and system for searching a plurality of strings. The system comprises a first storage device for storing pattern strings, each starting with a first pattern word, and a second storage device for storing combinations of the first pattern word of those pattern strings and the corresponding pattern lengths. The system further comprises a search engine for iteratively identifying a word in a text that matches one of the first pattern words as a current word, wherein the search engine iteratively extracts a substring that starts from the current word and has a length equal to one of the pattern lengths, and iteratively compares the substring with each pattern string whose first pattern word and string length equal the first word and the length of the substring. The system further comprises a third storage device for storing any substring that matches one of the pattern strings.

Description

Method and system for searching a plurality of strings
Technical field
The present invention relates generally to the technical field of information processing and, more specifically, to methods and systems for information search.
Background
With the development of computer and network technology, more and more people search digital documents to identify or find specific content that meets their needs. For example, individuals (such as parents) or authorities may try to find restricted content that is unsuitable for children (such as strings, expressions or words) in digital documents available to children, and then keep children away from that content. In many cases, however, identifying or finding such restricted content is time-consuming, for example because the size and/or number of the digital documents is huge. Improved search methods are therefore needed to search efficiently and reduce search time.
Summary of the invention
The purpose of the present application is to provide systems and methods that search for a plurality of strings efficiently, based on combinations of the first word and the length of each string, so as to reduce search time.
According to a first aspect of the application, a system is provided for searching a text for a plurality of strings by using one or more processors to execute instructions residing in a computer. The system comprises: a first storage device for storing pattern strings, each beginning with a first pattern word; and a second storage device for storing combinations of the first pattern word of the pattern strings beginning with that word and the corresponding pattern lengths. The system further comprises: a search engine for iteratively identifying, in the text, a word that matches one of the first pattern words and setting it as the current word; an extractor for iteratively extracting a substring that begins with the current word and has a substring length equal to one of the pattern lengths; and a comparator for iteratively comparing the substring with each pattern string whose first pattern word and string length are the same as the first word and the length of the substring. The system may further comprise a third storage device for storing information related to the substring when the substring matches one of the pattern strings.
According to another aspect of the application, a method is provided for searching a text for a plurality of strings by using one or more processors to execute instructions residing in a computer. The method comprises: storing pattern strings, each beginning with a first pattern word, in a first storage device; and storing combinations of the first pattern word of the pattern strings beginning with that word and the corresponding pattern lengths in a second storage device. The method further comprises: using a search engine to iteratively identify, in the text, a word that matches one of the first pattern words and set it as the current word; using the search engine to iteratively extract a substring that begins with the current word and has a substring length equal to one of the pattern lengths; using the search engine to iteratively compare the substring with each pattern string whose first pattern word and string length are the same as the first word and the length of the substring; and, if the substring matches one of the pattern strings, storing information related to the substring in a third storage device.
The system and method of the present application can efficiently find and locate, in a text, a series of predefined pattern strings written in a language (for example Chinese) that uses a large character set (charset) (for example Chinese characters). The technique adopted in the application takes into account the characteristics of languages with large character sets, and can achieve linear running time and reduced search time. The technique can be used, for example, to block text containing one or more predefined pattern words in Bulletin Board System (BBS) threads.
Brief description of the drawings
Embodiments of the invention are illustrated by way of example, and not by way of limitation, in the accompanying drawings, in which like reference numerals indicate similar elements, and in which:
Fig. 1 is a block diagram illustrating a system for searching a text for a plurality of target pattern strings according to an exemplary embodiment;
Fig. 2 is a flowchart illustrating a method for searching a text for a plurality of target pattern strings according to an exemplary embodiment; and
Fig. 3 is a block diagram illustrating a machine in the exemplary form of a computer system according to an exemplary embodiment.
Detailed description
Systems and methods for searching a text for a plurality of target pattern strings are described. In the following description, numerous details are set forth for purposes of explanation, in order to provide a thorough understanding of the invention. It will be apparent to those skilled in the art, however, that the invention may be practised without these details.
In many situations, people search digital documents to find and locate specific content. For example, parents or authorities may search the digital documents open to children to determine whether they contain harmful content written in an Asian language that uses a large character set (for example, Chinese). Such harmful content may be a plurality of predefined target pattern strings (for example Chinese characters or words) that are unsuitable for children, such as "pornographic", "pornographic net", "porny", "violence" and "violence TV play". Each target pattern string (for example "pornographic net") has one first word (for example "look") and one pattern string length (for example 3). The quoted examples are English renderings of Chinese strings: a "word" here is a single character, and lengths are counted in characters of the original string.
Fig. 1 is a block diagram illustrating a system 100 for searching a text for a plurality of target pattern strings according to an exemplary embodiment. In some embodiments, system 100 may include a first storage device 10, a second storage device 20, a third storage device 30 and a search engine 40. System 100 may also include one or more processors 50 for executing instructions residing in a computer and operating the other components.
In some embodiments, the first storage device 10 may store predefined target pattern strings (for example, "pornographic", "pornographic net", "porny", "violence" and "violence TV play"), each of which has a first word (for example, "look" or "cruelly") and a pattern string length (for example, 2, 3, 4 or 5). The first storage device 10 may comprise a HashSet. A HashSet is a concrete implementation of the Set interface. It creates a collection that uses a hash table for storage. A hash table stores information using a mechanism called hashing. In some embodiments, system 100 may also include a user interface 60, which can be used to receive the target pattern strings to be stored in the first storage device 10.
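As a minimal sketch of the first storage device, assuming Java's `java.util.HashSet`, the pattern strings can be loaded into a set and queried by exact match. The class name `PatternStore` and the ASCII pattern strings are hypothetical stand-ins for illustration; they are not taken from the patent.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the "first storage device": a hash set of pattern strings.
final class PatternStore {
    // Builds the set; duplicate patterns in the input collapse into one entry.
    static Set<String> build(List<String> patterns) {
        return new HashSet<>(patterns);
    }

    public static void main(String[] args) {
        // Assumed ASCII stand-ins with lengths 2, 3, 4, 2 and 5, mirroring the lengths in the text.
        Set<String> store = build(List.of("ab", "abc", "abcd", "xy", "xyzuv"));
        System.out.println(store.contains("abc")); // prints true
        System.out.println(store.contains("abz")); // prints false
    }
}
```

A hash set answers "is this exact string a pattern?" in expected constant time for a given string, which is the only query the search needs from this structure.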
In some embodiments, the second storage device 20 may store the unique combinations of the first word (for example, "look" or "cruelly") and the pattern string lengths (for example, 2, 3, 4 or 5) of the target pattern strings (for example, "pornographic", "pornographic net", "pornographic acute", "porny", "violence" and "violence TV play"), for example <"look", (2, 3, 4)> and <"cruelly", (2, 5)>. For the pattern strings that begin with a given first pattern word (for example "pornographic", "pornographic net", "pornographic acute" and "porny", which begin with "look" and have lengths 2, 3, 3 and 4), the combination of the first pattern word and the pattern lengths contains no duplicates (for example <"look", (2, 3, 4)>). The second storage device 20 may comprise at least one HashMap. A HashMap is a data structure that uses a hash function to map identifiers, or keys, efficiently to associated values (for example, a person's name to that person's telephone number). The hash function converts the key into the index of the array element (the slot or bucket) in which the corresponding value is to be looked up.
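The deduplication of lengths described above can be sketched as follows, assuming a `HashMap` keyed by the first character with a sorted set of lengths as its value. The class name `FirstWordIndex`, the use of `TreeSet`, and the ASCII patterns are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch of the "second storage device":
// first character -> set of distinct pattern lengths.
final class FirstWordIndex {
    static Map<Character, SortedSet<Integer>> build(List<String> patterns) {
        Map<Character, SortedSet<Integer>> index = new HashMap<>();
        for (String p : patterns) {
            if (p.isEmpty()) continue; // an empty pattern has no first character
            // A set keeps each length once, so two length-3 patterns contribute a single 3.
            index.computeIfAbsent(p.charAt(0), k -> new TreeSet<>()).add(p.length());
        }
        return index;
    }

    public static void main(String[] args) {
        // Assumed stand-ins: lengths 2, 3, 3, 4 under 'a' and 2, 5 under 'x'.
        System.out.println(build(List.of("ab", "abc", "abd", "abcd", "xy", "xyzuv")));
    }
}
```

With these inputs, the entry for 'a' holds the lengths 2, 3 and 4, and the entry for 'x' holds 2 and 5, matching the shape of the combinations in the text. Note that a Java `char` is a UTF-16 code unit, so characters outside the Basic Multilingual Plane would need code-point handling that this sketch omits.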
In some embodiments, the search engine 40 may, starting from the beginning of the text, iteratively identify a word that matches one of the first pattern words (for example "look"), and set the matched word as the current word in the text.
In some embodiments, the search engine 40 may iteratively extract a substring that begins with the current word (for example "look") and has a substring length equal to one of the target pattern string lengths (for example 2).
In some embodiments, the search engine 40 may iteratively compare the substring with each target pattern string (for example "pornographic") whose first pattern word (for example "look") and string length (for example 2) are the same as the first word and the length of the substring, respectively. If the extracted substring matches one of the target pattern strings (for example "pornographic"), information related to the substring may be stored in the third storage device 30. The information related to the substring may include the position of the substring.
This process moves forward until the end of the text is reached. In some embodiments, the located strings are highlighted to alert the user. In some embodiments, system 100 may include a display 70 for displaying the information related to the substrings stored in the third storage device 30.
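The text does not say how located strings are highlighted. As one possible sketch, assuming each stored match is a (start, length) pair and that square brackets serve as the highlight marker, overlapping matches can be merged into a single marked run. The class name `Highlighter` and the bracket notation are hypothetical.

```java
import java.util.List;

// Hypothetical sketch: mark located substrings in the text for display.
final class Highlighter {
    // Each match is {start, length} and is assumed to lie within the text.
    static String highlight(String text, List<int[]> matches) {
        boolean[] marked = new boolean[text.length()];
        for (int[] m : matches) {
            for (int i = m[0]; i < m[0] + m[1]; i++) marked[i] = true;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            // Open a bracket where a marked run starts, close it where the run ends.
            if (marked[i] && (i == 0 || !marked[i - 1])) out.append('[');
            out.append(text.charAt(i));
            if (marked[i] && (i + 1 == text.length() || !marked[i + 1])) out.append(']');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints zz[abc]zz[xy]
        System.out.println(highlight("zzabczzxy", List.of(new int[] {2, 3}, new int[] {7, 2})));
    }
}
```

Marking positions rather than wrapping each match separately avoids nested brackets when several patterns of different lengths start at the same position.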
System 100 can efficiently find and locate, in the text, one or more substrings that match the predefined target pattern strings, thereby reducing search time.
Fig. 2 is a flowchart illustrating a method 200 for searching a text for a plurality of target pattern strings according to an exemplary embodiment.
In some embodiments, at operation 202, a plurality of target pattern strings (for example, "pornographic", "pornographic net", "pornographic acute", "porny", "violence" and "violence TV play") are stored in the first storage device 10. Each of these target pattern strings begins with a first pattern word (for example "look" or "cruelly").
At operation 204, the unique combinations of the first pattern word (for example "look" or "cruelly") and the corresponding pattern lengths (for example 2, 3, 4 or 5) of the target pattern strings beginning with that first pattern word (for example, "pornographic", "pornographic net", "pornographic acute", "porny", "violence" and "violence TV play"), for example <"look", (2, 3, 4)> and <"cruelly", (2, 5)>, are stored in the second storage device 20.
At operation 206, the search engine 40 is used to iteratively identify, in the text, a word that matches one of the first pattern words (for example "look") and to set it as the current word.
At operation 208, the search engine 40 is used to iteratively extract a substring that begins with the current word (for example "look") and has a substring length equal to one of the target pattern string lengths (for example 2).
At operation 210, the search engine 40 is used to iteratively compare the substring with each target pattern string (for example "pornographic") whose first pattern word (for example "look") and target string length (for example 2) are the same as the first word and the length of the substring, respectively.
If the substring matches one of the target pattern strings (for example "pornographic"), then at operation 212 information related to the substring is stored in the third storage device 30. The information related to the substring may include the position of the substring. Operations 206 to 212 are repeated until the end of the text is reached.
At operation 214, the display 70 is used to display the information related to all the substrings stored in the third storage device 30.
The operation of an embodiment can be carried out in two phases: initialization and processing. In the initialization phase, the target pattern words (the words being searched for) are placed in a HashSet, and the first word and the length of each target pattern word are placed in a HashMap. Because a HashMap in Java does not allow duplicate keys, the different lengths of the pattern words that share the same first word are placed in the HashMap under that single key.
In the processing phase, the text is processed iteratively from its beginning. Every word of the text is examined. If the current word is found in the HashMap, the substring at the current position may be a pattern word. The possible lengths are obtained from the HashMap and, for each possible length, a substring of that length is extracted and checked against the HashSet. If a hit is obtained, a target pattern word has been found in the text; its starting position is the current position and its length is the current length. Otherwise the process advances to the next possible length. Once all possible lengths have been processed, the process advances to the next word of the text.
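The two phases can be put together in a short Java sketch. This is an illustration under stated assumptions rather than the patented implementation: the class name `MultiPatternSearch`, the (start, length) result format and the ASCII patterns are hypothetical, and the sketch keeps checking the remaining lengths after a hit so that every matching pattern at a position is reported, a point the text leaves open.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical end-to-end sketch of the initialization and processing phases.
final class MultiPatternSearch {
    private final Set<String> patterns = new HashSet<>();                     // pattern words
    private final Map<Character, TreeSet<Integer>> lengths = new HashMap<>(); // first character -> lengths

    // Initialization phase: fill both structures from the pattern list.
    MultiPatternSearch(List<String> patternList) {
        for (String p : patternList) {
            if (p.isEmpty()) continue;
            patterns.add(p);
            lengths.computeIfAbsent(p.charAt(0), k -> new TreeSet<>()).add(p.length());
        }
    }

    // Processing phase: returns {start, length} for every hit, scanning left to right.
    List<int[]> search(String text) {
        List<int[]> hits = new ArrayList<>();
        for (int pos = 0; pos < text.length(); pos++) {
            TreeSet<Integer> candidates = lengths.get(text.charAt(pos));
            if (candidates == null) continue;          // current character starts no pattern
            for (int len : candidates) {               // TreeSet iterates in ascending order
                if (pos + len > text.length()) break;  // longer candidates cannot fit either
                if (patterns.contains(text.substring(pos, pos + len))) {
                    hits.add(new int[] {pos, len});    // record start position and length
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        MultiPatternSearch s =
            new MultiPatternSearch(List.of("ab", "abc", "abd", "abcd", "xy", "xyzuv"));
        for (int[] h : s.search("zzabcdzxyz")) {
            System.out.println("start=" + h[0] + " length=" + h[1]);
        }
    }
}
```

For the sample text, the hits are (2, 2), (2, 3), (2, 4) and (7, 2). Positions whose character is not a key in the map cost a single lookup, which is where the approach saves work on texts in which few characters start a pattern.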
Suppose the text length is M and there are N target pattern strings. The time required in the initialization phase is A*N, where A is a constant covering the time to store a word in the HashSet, extract its first word, and store the first word and the length in the HashMap. The time required in the processing phase is B*M, where B is a constant covering the time to look up a word in the HashMap and, when a hit is found, to extract a substring and look it up in the HashSet. The total time complexity of the algorithm is a function of (A*N+B*M).

Fig. 3 is a block diagram illustrating a machine in the exemplary form of a computer system 300, within which a set of instructions may be executed to cause the machine to perform any one of the methods discussed herein. In alternative embodiments, the machine may be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a network appliance, a network router, switch or bridge, or any machine capable of executing a set of instructions that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term "machine" also includes any collection of machines that individually or jointly execute a set of instructions to perform any one or more of the methods discussed herein.
The exemplary computer system 300 includes a processor 302 (for example a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory 304 and a static memory 306, which communicate with each other via a bus 308. The computer system 300 may further include a video display unit 310 (for example a liquid crystal display (LCD) or a cathode ray tube (CRT)). The computer system 300 also includes an alphanumeric input device 312 (for example a keyboard), a cursor control device 314 (for example a mouse), a disk drive unit 316, a signal generation device 328 (for example a speaker) and a network interface device 320.
The disk drive unit 316 includes a machine-readable medium 322 on which are stored one or more sets of instructions (for example software 324) embodying any one or more of the methods or functions described herein. The software 324 may also reside, completely or at least partially, within the main memory 304 and/or within the processor 302 during its execution by the computer system 300, the main memory 304 and the processor 302 also constituting machine-readable media.
The software 324 may also be transmitted or received over a network 326 via the network interface device 320. While the machine-readable medium 322 is shown in the exemplary embodiment as a single medium, the term "machine-readable medium" should be taken to include a single medium or multiple media (for example, a centralized or distributed database and/or associated caches and servers) that store the one or more sets of instructions. The term "machine-readable medium" shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions that, when executed by the machine, cause the machine to perform any one or more of the operations of the methods of embodiments of the invention. The term "machine-readable medium" shall accordingly be taken to include, but not be limited to, solid-state memories, optical and magnetic media, and carrier wave signals.
Thus, a method and system for searching a text for a plurality of target pattern strings have been described. Although the invention has been described with reference to specific exemplary embodiments, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader spirit and scope of the invention. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.

Claims (10)

1. A method for searching a text for a plurality of strings by using one or more processors to execute instructions residing in a computer, the method comprising:
storing pattern strings, each beginning with a first pattern word, in a first storage device;
storing combinations of the first pattern word of the pattern strings beginning with that first pattern word and the corresponding pattern lengths in a second storage device;
iteratively identifying in the text, using a search engine, a word that matches one of the first pattern words, and setting it as a current word;
iteratively extracting, using the search engine, a substring that begins with the current word and has a substring length equal to one of the pattern lengths;
iteratively comparing, using the search engine, the substring with each pattern string whose first pattern word and string length are the same as the first word and the length of the substring; and
if the substring matches one of the pattern strings, storing information related to the substring in a third storage device.
2. The method of claim 1, further comprising receiving the pattern strings from a user interface.
3. The method of claim 1, further comprising displaying, using a display, the information related to the substrings stored in the third storage device.
4. The method of claim 1, wherein the combination of the first pattern word and the pattern lengths of the pattern strings beginning with that first pattern word contains no duplicates.
5. The method of claim 1, wherein the first storage device comprises at least one HashSet, and wherein the second storage device comprises at least one HashMap.
6. A system for searching a text for a plurality of strings by using one or more processors to execute instructions residing in a computer, the system comprising:
a first storage device for storing pattern strings, each beginning with a first pattern word;
a second storage device for storing combinations of the first pattern word of the pattern strings beginning with that first pattern word and the corresponding pattern lengths;
a search engine for iteratively identifying, in the text, a word that matches one of the first pattern words and setting it as a current word, wherein the search engine iteratively extracts a substring that begins with the current word and has a substring length equal to one of the pattern lengths, and wherein the search engine iteratively compares the substring with each pattern string whose first pattern word and string length are the same as the first word and the length of the substring; and
a third storage device for storing information related to the substring when the substring matches one of the pattern strings.
7. The system of claim 6, further comprising a user interface for receiving the pattern strings.
8. The system of claim 6, further comprising a display for displaying the information related to the substrings stored in the third storage device.
9. The system of claim 6, wherein the information related to the substrings stored in the third storage device comprises the position of each substring in the text.
10. The system of claim 6, wherein, if the substring matches one of the pattern strings, the search engine highlights the substring.
CN201010116709.XA 2010-02-26 2010-02-26 Method and system for searching a plurality of strings Expired - Fee Related CN102169485B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201410757944.3A CN104484381B (en) 2010-02-26 2010-02-26 For searching for the method and system of multiple strings
CN201010116709.XA CN102169485B (en) 2010-02-26 2010-02-26 Method and system for searching a plurality of strings

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010116709.XA CN102169485B (en) 2010-02-26 2010-02-26 Method and system for searching a plurality of strings

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201410757944.3A Division CN104484381B (en) 2010-02-26 2010-02-26 For searching for the method and system of multiple strings

Publications (2)

Publication Number Publication Date
CN102169485A (en) 2011-08-31
CN102169485B (en) 2015-01-07

Family

ID=44490648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010116709.XA Expired - Fee Related CN102169485B (en) 2010-02-26 2010-02-26 Method and system for searching a plurality of strings

Country Status (1)

Country Link
CN (1) CN102169485B (en)

Also Published As

Publication number Publication date
CN102169485B (en) 2015-01-07

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150107

Termination date: 20210226