WO2005069199A1 - Methods and systems for text segmentation - Google Patents
Methods and systems for text segmentation
- Publication number
- WO2005069199A1 (PCT/US2003/041609)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- characters
- token
- tokens
- long
- group
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Definitions
- the present invention relates generally to text segmentation and, more particularly, to segmenting strings of characters.
- a domain name is a domain name.
- the domain can locate an organization or other entity on the Internet. For example, the domain
- Embodiments of the present invention comprise systems and methods for text segmentation. Embodiments identify tokens in strings of text. One aspect of an
- embodiment of the present invention comprises accessing a string of characters
- present invention are directed to computer systems and to computer-readable media
- FIG. 1 illustrates a block diagram of a system in accordance with one
- FIG. 2 illustrates a flow diagram of a method in accordance with one
- FIG. 3 illustrates a subroutine of the method shown in FIG. 2.
- the present invention comprises methods and systems for text
- segmentation including methods and systems for identifying tokens in a string of
- FIG. 1 is a diagram illustrating an exemplary system in which exemplary
- embodiments of the present invention may operate.
- the system 100 shown in FIG. 1 includes multiple client devices 102a-n,
- the network 106 shown includes the Internet. In other embodiments, other networks, such as an intranet may be used. Moreover,
- the processor 110 executes a set of computer-executable program
- processors may include a microprocessor, an
- processors include, or may be in communication with,
- media, for example computer-readable media, which stores instructions that, when
- Embodiments of computer-readable media include, but are not limited to,
- processor such as the processor in communication with a touch-sensitive
- ROM read only memory
- RAM random access memory
- ASIC application specific integrated circuit
- a computer including a router, private or public network, or other
- the instructions may
- Client devices 102a-n may also include a number of external or internal
- devices such as a mouse, a CD-ROM, a keyboard, a display, or other input or output
- client devices 102a-n are personal computers, digital assistants,
- a client device 102a-n may be any type of processor-based platform connected
- to a network 106 and that interacts with one or more application programs.
- devices 102a-n shown include personal computers executing a browser application
- a server device 104 is also coupled to the network
- the server device 104 shown includes a server executing a segmentation
- the server device 104 shown includes a processor 116 coupled to a computer
- Server device 104, depicted as a single computer system, may
- be implemented as a network of computer processors. Examples of a server device 104
- are servers, mainframe computers, networked computers, a processor-based device and
- processors 116 can be any of a number of well known computer processors, such as processors
- the server device 104 may also be connected to a database 126.
- the server device 104 can access the network 106 to
- Characters can include, for example, marks or symbols used in a writing system
- the segmentation engine 120 segments a string of
- a token can comprise a word, a
- the segmentation engine 120 includes a long token processor 122 and a token processor 124. In the embodiment shown, each comprises computer
- the long token processor 122 matches known long tokens in a string of characters and pins down contiguous characters contained in the
- the token processor 124 utilizes the pinned down characters to determine a
- the token processor 124 determines a probability for each combination in
- token processor 124 are further described below.
- Server device 104 also provides access to other storage elements, such as
- a token storage element, in the example shown a token database 126.
- Data storage elements may include any one or
- the present invention may comprise systems
- the long token processor 122 may not be
- segmentation engine 120 may not be located on the same server device.
- the system 100 shown in FIG. 1 is merely exemplary, and is used to explain the
- One exemplary method according to the present invention comprises accessing a
- the exemplary method can also further comprise determining a likelihood or a
- the long token is greater than seven characters.
- determining tokens comprises accessing adjacent
- Determining tokens can also use the pinned down contiguous characters for a long token.
- revised group matches with a second token, then storing the second token if it is determined that the second token contains none of the pinned down characters or if it is
- the second group of adjacent characters can
- tokens are repeated until a plurality of combinations of tokens have been determined.
- FIG. 2 illustrates an exemplary method according to an embodiment of the present invention. This exemplary method is provided by way of example, as there
- method 200 shown in FIG. 2 can be executed or otherwise performed by any of various systems. The method 200 is described below as carried out by the system 100 shown in
- FIG. 1 by way of example, and various elements of the system 100 are referenced in
- the method 200 shown provides a method for identifying tokens from a string of characters.
- Each block shown in FIG. 2 represents one or more steps carried
- Block 202 is followed by block 204, in which a string of characters is obtained by the segmentation engine 120.
- the string of characters can be
- Block 204 is followed by block 206, in which the long token processor
- a long token can be, for example, any token equal to
- the long tokens can be received from token database 126, from a device connected to network 106, or from another device.
- the long token processor 122 may start with the first character of the
- processor 122 can move down the string starting with the second character and the next
- the long token processor 122 can start with the character "t" and the adjacent seven
- the long token processor 122 can determine that the characters
- "transfor" can potentially be matched to some long tokens, such as "transform" and "transformation".
- the long token processor 122 continues reading characters and
- the long token processor 122 receives the character "m"
- the processor would match the characters with the long token "transform"
- processor 122 continues receiving characters attempting to determine if the characters
- characters should be matched with the token "transformation" and proceed processing.
- the long token processor 122 would be unable to match the next two sets of eight
- the long token processor 122 can then determine that the next group of characters "probabil" can be matched with a long token and would continue receiving characters.
- the long token processor 122 can then match the long token "probability" with the remaining characters in the
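The long-token scan described for block 206 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the lexicon `LONG_TOKENS`, the function name `find_long_tokens`, the input string, and the greedy longest-match rule are all assumptions made for the example. The eight-character window mirrors the "first character and the adjacent seven" wording above.

```python
# Hypothetical long-token lexicon; a real system would load this from a token database.
LONG_TOKENS = {"transform", "transformation", "probability"}
WINDOW = 8  # one character plus the adjacent seven


def find_long_tokens(text, long_tokens=LONG_TOKENS, window=WINDOW):
    """Return (start, end, token) for each long token found, scanning left to right."""
    matches = []
    i = 0
    while i + window <= len(text):
        prefix = text[i:i + window]
        # Long tokens that the current window could still grow into.
        candidates = [t for t in long_tokens if t.startswith(prefix)]
        # Keep reading characters: take the longest candidate that fully matches here
        # (greedy longest match is an assumption of this sketch).
        best = max(
            (t for t in candidates if text.startswith(t, i)),
            key=len,
            default=None,
        )
        if best is None:
            i += 1  # no long token starts here; move down the string by one character
        else:
            matches.append((i, i + len(best), best))
            i += len(best)  # resume scanning after the matched token
    return matches


print(find_long_tokens("transformationofprobability"))
```

With the assumed input `"transformationofprobability"`, the scan matches "transformation", fails on the next two eight-character windows, and then matches "probability".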
- Block 206 is followed by block 208, in which the long token processor
- the long token processor 122 signifies that the characters should be used together in a
- the long token processor 122 pins down contiguous characters in
- the long token processor can pin
- the long token processor 122 can leave two or three characters unpinned
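One way to picture the pinning step of block 208 is as a boolean mask over the string. The sketch below assumes that the unpinned characters are left at both ends of each matched long token and that the margin is a fixed parameter; the text only says two or three characters can be left unpinned, so both choices are assumptions, as is the function name `pin_characters`.

```python
def pin_characters(length, matches, margin=2):
    """Return a list of booleans, True where a character is pinned down.

    `matches` is a list of (start, end, token) spans for matched long tokens.
    `margin` characters at each end of a match are left unpinned (assumed placement).
    """
    pinned = [False] * length
    for start, end, _token in matches:
        for i in range(start + margin, end - margin):
            pinned[i] = True
    return pinned


text = "transform"
mask = pin_characters(len(text), [(0, len(text), "transform")])
# Show pinned characters in upper case for inspection.
print("".join(c.upper() if p else c for c, p in zip(text, mask)))
```

Leaving the ends unpinned keeps open the possibility that a neighbouring token borrows the first or last characters of the long token.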
- Block 208 is followed by block 210, in which the potential combinations
- the token processor 124 matches the string of characters to multiple tokens while considering any pinned down characters.
- the subroutine 210 continues until all
- the token processor 124 can "cut" one or more adjacent characters that it cannot match to a token in a particular
- the cut characters are removed from consideration for matching with a token for a particular combination of
- tokens may comprise: "for the heart", "fort heart", "fort he art", and "for x heart".
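The enumeration of token combinations, including "cut" characters, can be sketched with a small recursive search. Everything named here is hypothetical: the lexicon `LEXICON`, the input string "fortheart", the `max_cuts` limit, and the bracket notation for a cut character (the text above writes a cut as "x"). The sketch ignores pinned characters for simplicity.

```python
# Hypothetical token list for illustration only.
LEXICON = {"for", "fort", "the", "he", "art", "heart"}


def combinations(text, lexicon=LEXICON, max_cuts=1):
    """Enumerate segmentations of `text`, allowing up to `max_cuts` cut characters."""
    results = []

    def walk(pos, parts, cuts):
        if pos == len(text):
            results.append(" ".join(parts))
            return
        # Try every lexicon token that starts at the current position.
        for end in range(pos + 1, len(text) + 1):
            piece = text[pos:end]
            if piece in lexicon:
                walk(end, parts + [piece], cuts)
        # Alternatively cut one character, removing it from matching in this combination.
        if cuts < max_cuts:
            walk(pos + 1, parts + ["[" + text[pos] + "]"], cuts + 1)

    walk(0, [], 0)
    return results


for combo in combinations("fortheart"):
    print(combo)
```

Bounding the number of cuts is a design choice of this sketch; without some limit, every character could be cut and the list of combinations would grow very quickly.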
- FIG. 3 illustrates a subroutine 210 for carrying out the method 200 shown in FIG. 2.
- the subroutine 210 determines potential combinations of tokens for
- the subroutine begins at block 300.
- in block 300 a group of adjacent characters
- At least two characters are received. Alternatively, one character can be received. In one
- the token processor 124 begins at one end of the string of characters and
- Block 300 is followed by block 302, in which the token processor 124 determines if the received characters match any token paths.
- the tokens are stored in a tree-like structure where, for example, each character is at the top
- the token processor 124 can determine that at least one token
- the token processor 124 attempts to match these two characters
- the token processor returns to block 300 and receives a new group of adjacent characters
- Subroutine 210 can be, for example, the group starting with "2d".
- token processor 124 determines if the group of characters match to a token. For
- the token processor determines whether the characters "tra" have been received.
- the token processor 124 then can determine if the revised group matches any token paths in block 302. For the example,
- the token processor 124 determines that all of the adjacent pinned down
- the token processor 124 determines that not all of the adjacent pinned down characters
- the number of combinations of tokens for the character string can be greatly reduced.
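The pinned-character test can be expressed as a check on candidate token spans: a token is kept only if it contains all, or none, of each run of adjacent pinned characters. The sketch below is one possible reading of that rule; the helper names `pinned_runs` and `span_allowed` and the example mask are assumptions.

```python
def pinned_runs(pinned):
    """Convert a boolean mask into a list of (start, end) runs of pinned characters."""
    runs, start = [], None
    for i, is_pinned in enumerate(pinned):
        if is_pinned and start is None:
            start = i
        elif not is_pinned and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(pinned)))
    return runs


def span_allowed(start, end, runs):
    """True if the token span [start, end) holds all or none of every pinned run."""
    for run_start, run_end in runs:
        overlap = min(end, run_end) - max(start, run_start)
        if 0 < overlap < run_end - run_start:
            return False  # the span splits a pinned run, so the token is rejected
    return True


# "transform" with two unpinned characters at each end (assumed mask).
mask = [False, False, True, True, True, True, True, False, False]
runs = pinned_runs(mask)
print(span_allowed(0, 9, runs))  # whole word: contains the full pinned run
print(span_allowed(0, 5, runs))  # "trans": splits the pinned run
```

Rejecting any span that splits a pinned run is what prunes the search: tokens such as "trans" or "form" are never stored for a string in which "transform" has been pinned.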
- the token processor 124 stores the token in block 312. The token processor then receives the next
- processor 124 receives the next adjacent character in block 306 and continues the
- next adjacent character is received. For example, if the characters "transform" are received, the token processor can note the potential token break.
- the token processor 124 receives a match with tokens.
- the new group of adjacent characters. In one embodiment, for example, the new group of
- block 210 is followed by block 212, in which the likelihood or probability for each combination in the list is determined by the token
- processor 124. In one embodiment, the likelihood of each combination is based on the
- the frequency of tokens in the combination can be
- Cut characters can be given a low likelihood.
- a plurality of the top combinations based on likelihood can be
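A frequency-based likelihood of the kind described for block 212 could look like the sketch below. The counts in `FREQ`, the constant `CUT_PROB`, and the choice to multiply independent token frequencies (summed here as logarithms) are all assumptions; the text only states that likelihood is based on token frequency and that cut characters receive a low likelihood.

```python
import heapq
import math

# Made-up token counts for illustration only.
FREQ = {"for": 900, "fort": 20, "the": 1000, "he": 400, "art": 80, "heart": 100}
TOTAL = sum(FREQ.values())
CUT_PROB = 1e-6  # assumed low likelihood for each cut character


def log_likelihood(combination):
    """Sum of log relative frequencies; parts written as "[c]" are cut characters."""
    score = 0.0
    for part in combination:
        if part.startswith("["):
            score += math.log(CUT_PROB)
        else:
            score += math.log(FREQ[part] / TOTAL)
    return score


def top_combinations(combos, k=2):
    """Return the k most likely combinations, best first."""
    return heapq.nlargest(k, combos, key=log_likelihood)


combos = [
    ["for", "the", "art"],
    ["fort", "he", "art"],
    ["fort", "heart"],
    ["for", "[t]", "heart"],
]
print(top_combinations(combos))
```

Working in log space avoids underflow when combinations contain many tokens, and the small `CUT_PROB` pushes any combination with cut characters toward the bottom of the ranking.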
- the present invention can be used in a variety of applications where the
- segmentation of text such as a domain name
- the segmentation engine can be used
- to segment the domain name entered, so that a website or advertisement relevant to the
- the present invention can also be used when a user desires to purchase a domain name, but the domain name is unavailable.
- segmentation engine can segment the entered domain name and this information can be
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Character Input (AREA)
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP11163970.4A EP2345970A3 (en) | 2003-12-30 | 2003-12-30 | Method and system for text segmentation |
AU2003300437A AU2003300437A1 (en) | 2003-12-30 | 2003-12-30 | Methods and systems for text segmentation |
EP03819295.1A EP1700250B1 (en) | 2003-12-30 | 2003-12-30 | Method and system for text segmentation |
PCT/US2003/041609 WO2005069199A1 (en) | 2003-12-30 | 2003-12-30 | Methods and systems for text segmentation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2003/041609 WO2005069199A1 (en) | 2003-12-30 | 2003-12-30 | Methods and systems for text segmentation |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2005069199A1 (en) | 2005-07-28 |
Family
ID=34793608
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2003/041609 WO2005069199A1 (en) | 2003-12-30 | 2003-12-30 | Methods and systems for text segmentation |
Country Status (3)
Country | Link |
---|---|
EP (2) | EP2345970A3 (en) |
AU (1) | AU2003300437A1 (en) |
WO (1) | WO2005069199A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11334717B2 (en) | 2013-01-15 | 2022-05-17 | Google Llc | Touch keyboard using a trained model |
US11379663B2 (en) | 2012-10-16 | 2022-07-05 | Google Llc | Multi-gesture text input prediction |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6269189B1 (en) | 1998-12-29 | 2001-07-31 | Xerox Corporation | Finding selected character strings in text and providing information relating to the selected character strings |
-
2003
- 2003-12-30 EP EP11163970.4A patent/EP2345970A3/en not_active Withdrawn
- 2003-12-30 EP EP03819295.1A patent/EP1700250B1/en not_active Expired - Lifetime
- 2003-12-30 AU AU2003300437A patent/AU2003300437A1/en not_active Abandoned
- 2003-12-30 WO PCT/US2003/041609 patent/WO2005069199A1/en active Application Filing
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6269189B1 (en) | 1998-12-29 | 2001-07-31 | Xerox Corporation | Finding selected character strings in text and providing information relating to the selected character strings |
Non-Patent Citations (2)
Title |
---|
See also references of EP1700250A4 * |
THANARUK THEERAMUNKONG; SASIPORN USANAVASIN: "Non dictionary based Thai word segmentation using decision trees", HUMAN LANGUAGE TECHNOLOGY CONFERENCE, PROCEEDINGS OF THE FIRST INTERNATIONAL CONFERENCE ON HUMAN LANGUAGE TECHNOLOGY RESEARCH, 18 March 2001 (2001-03-18) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11379663B2 (en) | 2012-10-16 | 2022-07-05 | Google Llc | Multi-gesture text input prediction |
US11334717B2 (en) | 2013-01-15 | 2022-05-17 | Google Llc | Touch keyboard using a trained model |
US11727212B2 (en) | 2013-01-15 | 2023-08-15 | Google Llc | Touch keyboard using a trained model |
Also Published As
Publication number | Publication date |
---|---|
EP2345970A2 (en) | 2011-07-20 |
EP1700250A1 (en) | 2006-09-13 |
EP2345970A3 (en) | 2013-05-29 |
AU2003300437A1 (en) | 2005-08-03 |
EP1700250B1 (en) | 2015-02-18 |
EP1700250A4 (en) | 2009-01-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8078633B2 (en) | Methods and systems for improving text segmentation | |
US8489387B2 (en) | Methods and systems for selecting a language for text segmentation | |
US20120278339A1 (en) | Query parsing for map search | |
US9183287B2 (en) | Social media analysis system | |
US20110282903A1 (en) | Dictionary Word and Phrase Determination | |
US9652529B1 (en) | Methods and systems for augmenting a token lexicon | |
CN108228710B (en) | Word segmentation method and device for URL | |
CN112347767B (en) | Text processing method, device and equipment | |
US7801898B1 (en) | Methods and systems for compressing indices | |
CN116992052B (en) | Long text abstracting method and device for threat information field and electronic equipment | |
EP1700250B1 (en) | Method and system for text segmentation | |
WO2022019275A1 (en) | Document search device, document search system, document search program, and document search method | |
RU2348071C2 (en) | Text segmentation methods and systems | |
US7302645B1 (en) | Methods and systems for identifying manipulated articles | |
JP6902131B2 (en) | Query processing method, query processing device and computer readable medium | |
CN118114660A (en) | Text detection method, system and computer readable storage medium | |
KR100738519B1 (en) | System for providing search information and storage medium for saving program of executing the same | |
CN116910331A (en) | Request identification method, apparatus, device and storage medium | |
CN117743527A (en) | Method, system and storage medium for extracting user search word path | |
KR20080010262A (en) | System for providing search information and storage medium for saving program of executing the same |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AK | Designated states | Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
| | AL | Designated countries for regional patents | Kind code of ref document: A1 Designated state(s): BW GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
| | WWE | Wipo information: entry into national phase | Ref document number: 1402/KOLNP/2006 Country of ref document: IN |
| | WWE | Wipo information: entry into national phase | Ref document number: 2003819295 Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | WWW | Wipo information: withdrawn in national office | Ref document number: DE |
| | WWE | Wipo information: entry into national phase | Ref document number: 2006127425 Country of ref document: RU |
| | WWP | Wipo information: published in national office | Ref document number: 2003819295 Country of ref document: EP |
| | NENP | Non-entry into the national phase | Ref country code: JP |