CN117396878A - Word segmentation algorithm with offset mapping


Info

Publication number: CN117396878A
Application number: CN202280038305.4A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: character, string, index value, target, original string
Legal status: Pending
Inventors: M. Gupta, K. Motlani
Current Assignee: Microsoft Technology Licensing LLC
Original Assignee: Microsoft Technology Licensing LLC
Application filed by Microsoft Technology Licensing LLC
Priority claimed from US17/444,347 (US11899698B2)
Priority claimed from PCT/IB2022/000257 (WO2022248933A1)
Publication of CN117396878A


Abstract

There is provided a computer system comprising a processor coupled to a mass storage device storing instructions that, when executed by the processor, cause the processor to store an original string composed of a plurality of characters, perform a word segmentation algorithm on the original string, and tokenize the original string to generate a processed string comprising a plurality of word tokens separated by spaces. The processor is further configured to generate an offset map between locations within the word tokens in the processed string and corresponding locations in the original string, and to classify a portion of the processed string as a target. The processor is further configured to identify a target character corresponding to the target in the original string using the offset map and to perform a predetermined action on the target character in the original string.

Description

Word segmentation algorithm with offset mapping
Background
Word segmentation algorithms are used in a variety of contexts in computing. One particular application using a word segmentation algorithm is Data Loss Prevention (DLP). DLP systems are intended to prevent data loss threats, such as theft and accidental disclosure, when sensitive data is stored on or transmitted by computers and computer networks. For example, these DLP systems may use word classification programs to monitor and detect sensitive information contained in electronic communications (e.g., emails and messages) to prevent the sensitive information from being sent outside of a corporate network. DLP technology supports multi-byte characters contained in languages such as Chinese, Korean, and Japanese. For strings of these multi-byte characters, a word segmentation algorithm may be used to break the original string into discrete words, typically separated by spaces, and to generate a processed string that includes the tokenized words.
However, for those multi-byte languages, in many cases the processed strings generated by the word segmentation algorithm may have different lengths than the original strings; some characters present in the original strings, such as commas and punctuation marks, may be lost; and spaces, tabs, and other whitespace characters may differ from those in the original strings. When there is an inconsistency between the original string and the processed string, word classification programs in the DLP system may fail to correctly identify sensitive information in the original string, which may lead to theft or accidental disclosure of key sensitive information.
Disclosure of Invention
According to one aspect of the present disclosure, a computer system is provided that includes a processor coupled to a mass storage device storing instructions. The instructions, when executed by the processor, cause the processor to store an original string of a plurality of characters, perform a word segmentation algorithm on the original string, and tokenize the original string to generate a processed string comprising a plurality of word tokens separated by spaces. The processor may be further configured to generate an offset map between a location within the word token in the processed string and a corresponding location in the original string, classify a portion of the processed string as a target, identify a target character in the original string that corresponds to the target using the offset map, and perform a predetermined action on the target character in the original string.
One potential advantage of this configuration is that the word segmentation algorithm can correctly identify target characters in an original string composed of multi-byte language characters based on the processed string, even if there is an inconsistency between the original string and the processed string. As a result, word classification programs employing such algorithms can correctly identify sensitive information and inhibit theft or accidental disclosure of information.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.
Drawings
FIG. 1 shows a schematic diagram of a computing system including a computing device configured to execute a word segmentation algorithm to generate processed strings that are searchable by a search program, and to identify target characters in an original string corresponding to matching targets in the processed strings using an offset map when a match is found by the search program.
FIG. 2 shows a schematic diagram of another configuration of the computing system of FIG. 1, including a server system including a first computing device configured with a compliance and security program that sets policies and sensitive data definitions, and second and third computing devices, each of which executes a search program according to the definitions and policies set by the first computing device.
FIG. 3 illustrates a schematic diagram of a plurality of data structures manipulated by the computing system of FIG. 1 in executing a word segmentation algorithm with offset mapping.
FIG. 4 illustrates a schematic diagram of an example GUI for setting the policies of the compliance and security program of FIG. 2.
FIGS. 5A to 5D show four different examples of GUIs utilized when performing a predetermined action on a target character in an original string.
FIG. 6 illustrates a schematic diagram of a plurality of data structures manipulated by the computing system of FIG. 1 when executing a word segmentation algorithm with offset mapping on different examples of an original string.
FIG. 7 illustrates a flow chart of an example method according to one implementation of the present disclosure.
FIG. 8 illustrates a schematic diagram of an example computing environment in which the method of FIG. 1 may be implemented.
Detailed Description
As described above, word segmentation algorithms are useful for a variety of software applications, including DLP systems for monitoring and detecting sensitive information in a variety of languages. Word segmentation algorithms are also used in search programs, such as search engines and indexing programs running on modern personal computers and servers. In these search programs, a word segmentation algorithm is used to process a data set of strings into a processed string data set that includes tokenized words. Once the search program finds a match, the match in the data set may be displayed for viewing by the user in highlighted form, may be extracted from the data set, or may be obscured, for example. To take such an action, the word segmentation algorithm must determine, from the location of the word token that the search program identified in the tokenized processed string, the location of the characters in the original string that correspond to the matched word token, even though the word token may differ slightly from the characters in the original string. This can be a challenging task, as described below.
Languages such as English, French, and Spanish use spaces between words as word boundaries. For these languages, the word boundaries of the processed string output by the word segmentation algorithm agree with the word boundaries of the original string. Thus, the search program searches a data set of strings containing spaces separating words, and it is relatively easy to identify the location of the corresponding word in the original string.
However, multi-byte encoded languages such as Chinese, Korean, and Japanese do not have word boundaries represented by spaces. If a conventional word segmentation algorithm is applied to an original string in such a multi-byte encoded language, it will produce a processed string with word boundaries. However, because spaces are added to separate the word tokens, the processed string generated by the word segmentation algorithm may in many cases have a different length than the original string. The processed string may also be missing some characters present in the original string, such as commas and punctuation marks, and may have different spaces, tabs, and other whitespace characters than the original string. When there is an inconsistency between the original string and the processed string, the search program may not be able to correctly identify the characters in the original string that correspond to the matched word tokens in the processed string.
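To make the problem concrete, the following minimal Python sketch (the strings and tokenization are hypothetical stand-ins, not taken from the patent) shows how an offset into the space-separated processed string no longer points at the same characters in the original string:

```python
# Hypothetical illustration: word segmentation adds spaces and may drop
# punctuation, so offsets into the processed string no longer line up with
# the original string.
original = "これは、私のパスポート番号です"        # no spaces; contains the mark "、"
processed = "これ は 私 の パスポート 番号 です"   # tokenized: spaces added, "、" dropped

print(len(original), len(processed))   # 15 and 20 -- the lengths differ
start = processed.find("パスポート")    # offset 9 in the processed string
print(original[start:start + 5])       # "ート番号で" -- the same offset in the original points elsewhere
```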
To address these problems, in accordance with one aspect of the present disclosure, a computer system and a computerized method for use therewith are disclosed. FIG. 1 illustrates a computing system 10 that includes a computing device 12 configured to execute a word segmentation algorithm program 42 to generate processed strings 36 that are searchable by a search program 54, and, when the search program 54 finds a match, to utilize an offset map 48 to identify target characters 74 in an original string 34 that correspond to the matching target in the processed string 36. In the depicted configuration, computing device 12 may include a processor 14, a memory 16, and a mass storage device 18, which may be operatively coupled to each other by a communication bus. Each of the processor 14, memory 16, and mass storage device 18 may be configured as one or more physical components, such as one or more processor cores and/or one or more physical memory modules. The mass storage device 18 stores instructions for the various software components described herein that are executed by the processor 14, and also stores a data set 30 used by those software components. Computing device 12 may also include an input device 26, which may be a keyboard, a mouse, a touch screen, a touch pad, an accelerometer, a microphone, or some other suitable type of input device. In addition, computing device 12 may also include an output device 22, which may be a display, a speaker, or some other suitable type of output device. Computing device 12 may be any of a variety of types of computing devices, such as a laptop computer, desktop computer, or server. Computing device 12 may also be a mobile computing device, such as a handheld tablet or smartphone device.
The instructions stored in the mass storage device 18, when executed by the processor 14, cause the processor 14 to store an original string 34 composed of a plurality of characters. This may occur, for example, when an indexing program indexes files on a user's hard drive, when a server crawls a network and stores a collection of index files collected from the network, when a software program stores records in a database, or when a software program stores working data for communications such as e-mail or chat messages or for documents. In the illustrative example below, as shown in FIG. 2, a user is communicating with a travel agency about the user's business travel passport number, and the communication is flagged by security policies set by the user's organization. It will be appreciated that the original string 34 may be one of tens, hundreds, thousands, or even millions of original strings stored on the mass storage device 18, depending on the application. In the configuration depicted in FIG. 1, the stored original string 34 is composed of Japanese characters that translate to "This is my passport … number AA1XXXXXX7." The original string 34 written in Japanese has no spaces between words because Japanese does not use spaces to delimit words. The original string 34 may be extracted from or contained in an electronic document or electronic message, such as an email or word processing document that the user is sending to the travel agency in the illustrated example. The mass storage device 18 may also store sensitive data 38 that forms part of the original string 34. The definition of sensitive data 38 and the policies for processing sensitive data 38 may be defined by a user or administrator, as described below with reference to FIG. 2. In the example depicted in FIG. 1, the sensitive data 38 includes the Japanese term for "passport number". It will be appreciated that the original string 34 may be composed in other multi-byte encoded languages, such as Chinese, Korean, and Thai.
After the original string 34 is stored, the process flow generally follows (1) through (9) as shown in FIG. 1. At (1) and (2), the word segmentation algorithm program 42 of the system 10 receives as input the stored original string 34 and outputs the processed string 36. Processor 14 is configured to execute a word segmentation algorithm via the word segmentation algorithm program 42. The word segmentation algorithm program 42 may include a processed string generator 40 having a tokenization module 44 configured to perform the word segmentation algorithm on the original string 34 and tokenize the original string 34 to generate the processed string 36, the processed string 36 including a plurality of word tokens separated by spaces. It will be appreciated that the tokenization of the original string 34 by the tokenization module 44 generates the processed string 36 as its output; thus, the generation of the processed string occurs through tokenization. Differences between the original string 34 and the processed string 36 are described in detail below. In this example, the original string 34 is written in Japanese, and the English translation of the original string is "This is my passport … number AA1XXXXXX7." In Japanese, this original string 34 has twenty-four characters. In the depicted example, the processed string 36 produced by the processed string generator 40 of the word segmentation algorithm program 42 includes word tokens separated by seven spaces, and the two dot characters at the 10th and 11th positions in the original string 34 are removed by the word segmentation algorithm and thus omitted from the processed string 36. As a result, the processed string 36 has twenty-nine characters, including spaces, as depicted in FIG. 1.
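As a rough sketch of this tokenization step, the following Python fragment uses a hypothetical segment() helper as a stand-in for tokenization module 44 (the patent does not specify a particular segmenter); the processed string is simply the word tokens joined by single spaces:

```python
from typing import List

def segment(original: str) -> List[str]:
    # Hypothetical stand-in for the word segmentation algorithm; a real
    # implementation would use a dictionary- or model-based segmenter for
    # Japanese, Chinese, Korean, etc.  A fixed tokenization is returned here
    # purely for illustration.
    return ["これ", "は", "私", "の", "パスポート", "番号", "です"]

def generate_processed_string(original: str) -> str:
    """Tokenize the original string and join the word tokens with single spaces,
    mirroring the processed string produced by the processed string generator."""
    return " ".join(segment(original))

processed = generate_processed_string("これは、私のパスポート番号です")
print(processed)  # これ は 私 の パスポート 番号 です
```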
As described above, the processed string 36 is in a form compatible with the search program 54 and thus can be searched efficiently. For example, the processed string 36 may be index data or the like in a searchable index created from one of the sources of the data set 30 described above. However, while the processed string 36 is easy to search, it is not in a form natural for presentation to the user. To this end, an offset map 48 is generated so that targets 56 found in the processed string 36 can be displayed to the user using the corresponding portions of the original string 34.
Specifically, as shown at (3) and (4), an offset map 48 between locations within the word tokens in the processed string 36 and corresponding locations in the original string 34 may be generated via an offset map generator 46. The offset map 48 may include a first data structure 50 and a second data structure 52 created based on the original string 34 and the processed string 36. The first data structure 50 includes a start character offset index value and a character length for each word token in the original string 34, each word token having been detected in the original string 34 during the word segmentation algorithm. The second data structure 52 includes an end character offset index value for each token in the processed string 36. Each of the first and second data structures has the same number of elements, i.e., each stores entries for the same number of token index values 57, although the data stored per entry differs (a start character offset index value and a character length in the first data structure, and an end character offset index value in the second). As further described below with reference to FIG. 3, offset map generator 46 obtains from the original string 34 the first data structure 50 of [0] (0, 2), [1] (2, 1), [2] (3, 1) … [6] (14, 2) and [7] (16, 9), and obtains from the processed string 36 the second data structure 52 of [0] (3), [1] (5), [2] (7) … [6] (21) and [7] (31).
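One way these two data structures could be computed is sketched below in Python; the function name and the use of str.index() to locate each token in the original string are illustrative assumptions (they presume each token appears verbatim in the original, which a real segmenter does not guarantee), and the end offsets follow the first formula discussed with FIG. 3 (previous end + token length + one added space):

```python
from typing import List, Tuple

def build_offset_map(original: str, tokens: List[str]) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Build the two data structures of the offset map.

    first_ds  -- (start character offset, character length) of each word token
                 as it occurs in the original string (first data structure).
    second_ds -- end character offset of each token in the processed string
                 (second data structure), computed as
                 previous end + token length + 1 for the added space.
    """
    first_ds: List[Tuple[int, int]] = []
    second_ds: List[int] = []
    search_from = 0
    prev_end = 0
    for token in tokens:
        start = original.index(token, search_from)   # assumes the token occurs verbatim
        first_ds.append((start, len(token)))
        search_from = start + len(token)
        prev_end = prev_end + len(token) + 1          # first formula
        second_ds.append(prev_end)
    return first_ds, second_ds
```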
At (5) and (6), the processor 14 may also be configured to detect and classify a portion of the processed string 36 as a target 56 via a search program 54. The target 56 is sensitive information of a predetermined sensitive information data type defined by a user or administrator, as described below with respect to FIG. 2. The search program 54 searches the processed string 36 for the sensitive data 38 and locates a target 56 that matches the sensitive data definitions 94 (see FIG. 2) set by an administrator or user, the target 56 corresponding to characters contained in the sensitive data 38. In the depicted example, the Japanese term for "passport number" is contained in the sensitive data 38, and the Japanese target 56 "passport number" is located in the processed string 36 via the search program 54. The search program 54 obtains and transmits the target word token 58 of the target 56 and/or the associated start character offset index value 56A and end character offset index value 56B to the original string target locator 60.
In the depicted example, for the "passport number" in the processed string 36, the target word token 58 ("passport number") and the start character offset index value 56A ("9" in FIG. 1) and end character offset index value 56B ("16" in FIG. 1) are sent to the original string target locator 60.
At (7) and (8), the processor may be further configured to identify the target character 74 in the original string 34 that corresponds to the target 56, using the offset map 48 together with the target word token 58 provided by the search program 54 and/or the associated start character offset index value 56A and end character offset index value 56B. The target character 74 in the original string 34 may be located by identifying a start character offset index value 64A and an end character offset index value 64B of the target character 74. As further described below with reference to FIG. 3, the offset conversion module 62 of the original string target locator 60 of the word segmentation algorithm program 42 converts the target word token 58 and the associated start character offset index value 56A and end character offset index value 56B ("9" and "16" in the figures) into the start character offset index value 64A and end character offset index value 64B of the target character 74 in the original string 34, using the first data structure 50 and the second data structure 52 of the offset map 48. The original string target locator 60 then sends the start character offset index value 64A and end character offset index value 64B of the target character 74 in the original string 34 to the action module 66 of the word segmentation algorithm program 42. In the depicted example, the original string target locator 60 uses the offset map 48 to convert the start character offset index value 56A ("9") and end character offset index value 56B ("16") of the target word token to (5, 13) as the start character offset index value 64A and end character offset index value 64B in the original string 34, enabling the action module 66 to identify the target character 74 "passport … number" (English translation of the Japanese target characters) in the original string 34.
At (9), the processor 14 may also be configured to perform a predetermined action on the target character 74 in the original string 34 via the action module 66 of the word segmentation algorithm program 42. Action module 66 identifies the target character 74 of the original string 34 based on the start character offset index value 64A and end character offset index value 64B of the target character 74 and performs a predetermined action 68 on the target character 74. In this example, the action module 66 identifies and highlights the target character 74 "passport … number" on the GUI 70. Turning briefly to FIGS. 5A, 5B, 5C, and 5D, different actions may be programmed by a user or administrator and performed on the target character 74. In a first configuration, shown in FIG. 5A, the target character 74 may be emphasized, for example by highlighting, underlining, bold characters, italic characters, a changed color, or a larger or smaller font size. In a second configuration, shown in FIG. 5B, the target character 74 may be obscured or deleted. In a third configuration, shown in FIG. 5C, a warning message regarding the target character 74 may be displayed. In a fourth configuration, shown in FIG. 5D, the target character 74 may be extracted. As a result of these actions performed by the action module 66 of the word segmentation algorithm program 42, in the illustrative example, the user may be prevented from inadvertently sharing the user's sensitive data 38.
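A minimal sketch of how such predetermined actions might be applied once the inclusive start and end character offset index values of the target character in the original string are known; the markup and masking characters chosen here are illustrative, not prescribed by the patent:

```python
def apply_action(original: str, start: int, end: int, action: str) -> str:
    """Apply a predetermined action to original[start:end + 1] (inclusive offsets)."""
    target = original[start:end + 1]
    if action == "highlight":
        # e.g. wrap the target so a GUI can render it emphasized
        return original[:start] + "**" + target + "**" + original[end + 1:]
    if action == "obscure":
        return original[:start] + "*" * len(target) + original[end + 1:]
    if action == "warn":
        print(f"Warning: sensitive data found at offsets {start}-{end}")
        return original
    if action == "extract":
        return target
    return original
```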
FIG. 2 shows a schematic diagram of another configuration of computing system 10 of FIG. 1, including a server system 11 and a plurality of computing devices 12A-12C. Computing devices 12A-12C are configured similarly to computing device 12 described above, except as described differently below. It will be appreciated that computing devices 12B and 12A are shown in the same server system 11, e.g., in the same data center, and thus connected for high-speed communication via a local area network, while computing device 12C communicates with server system 11 via WAN 98, such as the internet. In this way, administrative users can control security policies of computers within the data center and across the Internet.
The server system 11 includes a first computing device 12A that acts in a management server role, the first computing device 12A being configured with a compliance and security program that sets sensitive data definitions 94 and policies 96. The server system 11 also includes a second computing device 12B and a third computing device 12C, each of which executes the search program 54 according to the definitions 94 and policies 96 set by the first computing device 12A. Each of computing devices 12B and 12C of the server system 11 is configured to execute a respective instance of the search program 54, each instance configured to receive as input the sensitive data definition 94 and one or more policies 96 from the computing device 12A acting in the management server role, and to search a data set comprising a plurality of original strings for sensitive data according to the sensitive data definition stored on the respective computing device 12B or 12C. Further, although not shown, it will be appreciated that computing device 12A can also execute a search program on a data set accessible to computing device 12A.
To achieve this, a user or administrator may run the compliance and security program 92 on computing device 12A to set the sensitive data definitions 94 and the policies 96 for processing sensitive data within an organization. Turning briefly to FIG. 4, a user or administrator may create and edit the sensitive data definitions 94 and policies 96 on the compliance and security program GUI 96 of the compliance and security program 92. In the depicted example, a user or administrator may add sensitive data types, such as passport numbers, social security numbers, and credit card numbers, to the active sensitive data definition 94 being applied according to the policy 96. In the policy 96, a user or administrator may set the locations to be searched by the search program 54, including mail servers, cloud storage, and file servers, and may indicate a data path for each resource. The user or administrator may also set the predetermined action to be performed on the target character 74 of the original string 34, as described above with reference to FIGS. 5A-5D.
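Expressed as data rather than a GUI, such a definition and policy might look like the following sketch; every field name and value here is an assumption for illustration only, not the patent's schema:

```python
# Illustrative only: field names and values are assumptions, not the patent's schema.
sensitive_data_definition = {
    "sensitive_types": ["passport number", "social security number", "credit card number"],
}

policy = {
    "search_locations": [
        {"resource": "mail server", "path": "<data path>"},
        {"resource": "cloud storage", "path": "<data path>"},
        {"resource": "file server", "path": "<data path>"},
    ],
    "predetermined_action": "highlight",  # or "obscure", "warn", "extract"
}
```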
FIG. 3 illustrates a schematic diagram of a plurality of data structures manipulated by the computing system of FIG. 1 in executing the word segmentation algorithm with offset mapping. In the example depicted in FIG. 1 above, offset map generator 46 obtains the first data structure 50 of [0] (0, 2), [1] (2, 1), [2] (3, 1) … [6] (14, 2) and [7] (16, 9) from the original string 34 and the second data structure 52 of [0] (3), [1] (5), [2] (7) … [6] (21) and [7] (31) from the processed string 36 to generate the offset map 48. The first data structure 50 stores a start character offset index value and a character length for each word token in the original string 34, each word token having been detected in the original string 34 during the word segmentation algorithm. In this example, since the first word starts at offset "0" and has two characters, the start character offset index value and character length for the first token [0] are (0, 2). Since the next word starts at offset "2" and has only one character, the start character offset index value and character length for the second token [1] are (2, 1). The start character offset index value and character length for the last token [7] are (16, 9). The start character offset index values and character lengths for the remaining word tokens are obtained in the same manner.
The second data structure 52, on the other hand, stores an end character offset index value for each token in the processed string 36, each of which may be obtained using the first formula 83 of FIG. 3. The end character offset index value of the first token [0] is "(3)" according to the first formula 83 ("0+2+1=3"), because the previous end character offset index value is "0", the length of the first word token of the original string is "2", and the length of the added space is 1. The end character offset index value of the second token [1] is then "(5)" according to the first formula 83 ("3+1+1=5"), because the previous end character offset index value is "3", the length of the second word token of the original string is "1", and the length of the added space is 1. The end character offset index value of the last token [7] is "(31)" according to the first formula 83 ("21+9+1=31"), because the previous end character offset index value is "21", the length of the last word token of the original string is "9", and the length of the added space is 1. The end character offset index values of the remaining tokens of the processed string 36 are obtained in the same manner. Here, each of the first data structure 50 and the second data structure 52 has the same number of elements, i.e., each stores entries for the same number of token index values 57, although the data stored per entry differs (a start character offset index value and a character length in the first data structure 50, and an end character offset index value in the second data structure 52).
The search program 54 of FIG. 1 identifies the target 56 in the processed string 36 and determines the target word tokens 58. The target word tokens 58 each have an associated token index value 57, indicated in brackets. The target word tokens 58 also have an associated start character offset index value 56A and end character offset index value 56B ("9" and "16") in the processed string 36. The target word token 58 and/or the associated start character offset index value 56A and end character offset index value 56B are sent to the original string target locator 60 of FIG. 1. Optionally, the start and end tokens 59A, 59B themselves may also be sent to the original string target locator 60 for use in determining the start character offset index value 56A and the end character offset index value 56B.
The offset conversion module 62 of the original string target locator 60 of the word segmentation algorithm program 42 converts the start character offset index value 56A and end character offset index value 56B of the target word tokens 58 ("9" and "16") into the start character offset index value 64A and end character offset index value 64B ("5" and "13" in FIG. 1) of the target character 74 in the original string 34, using the first data structure 50 and the second data structure 52 of the offset map 48. Two steps, using the second formula 84 and the third formula 86, may be used for this conversion to identify the start character offset index value 64A and the end character offset index value 64B of the target character 74 ("5" and "13" in FIG. 1).
First, the token index value 57 for each of the start character offset index value 56A of the start token 59A and the end character offset index value 56B of the end token 59B of the target in the processed string 36 ("9" and "16" in FIG. 1) is determined using the second formula 84. In the depicted example, the token index value 57 for the start character offset index value 56A ("9") is determined to be "4" using the second formula 84, because the start character offset index value 56A ("9") is greater than or equal to the end character offset index value "(9)" of the fourth token [3] of the processed string 36 and less than the end character offset index value "(15)" of the fifth token [4] of the processed string. Using the second formula 84, the token index value 57 for the end character offset index value 56B ("16") is determined to be "5", because the end character offset index value 56B ("16") is less than or equal to the end character offset index value "(18)" of the sixth token [5] of the processed string 36 and greater than the end character offset index value "(15)" of the fifth token [4]. Accordingly, the token index values 57 for the start character offset index value 56A and the end character offset index value 56B of the target 56 in the processed string 36 are "4" and "5", respectively.
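A Python sketch of this second-formula lookup is given below; the function name is illustrative, and a linear scan is used for clarity (a binary search over the end offsets would serve equally well):

```python
from typing import List

def token_index_for_offset(second_ds: List[int], offset: int, is_end: bool = False) -> int:
    """Find the index of the token in the processed string whose span contains
    the given character offset (second formula of FIG. 3).

    second_ds[i] is the end character offset index value of token i.  For a
    start offset the comparison is end[i-1] <= offset < end[i]; for an end
    offset it is end[i-1] < offset <= end[i], as in the worked example."""
    prev_end = 0
    for i, end in enumerate(second_ds):
        inside = (prev_end < offset <= end) if is_end else (prev_end <= offset < end)
        if inside:
            return i
        prev_end = end
    raise ValueError("offset lies outside the processed string")
```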
Finally, by applying the above token index values ("4" and "5") in the third formula 86, the start character offset index value 64A of the start token 59A' and the end character offset index value 64B of the end token 59B' in the original string 34 are determined using the values stored in the first data structure of the offset map 48. In the depicted example, because the start character offset index value of the fifth token [4] in the original string 34 is "5" in [4] (5, 5), the start character offset index value of the start token in the original string 34 is "5". On the other hand, as described in FIG. 3, since the start character offset index value of the sixth token [5] in the original string 34 is "12" and its length is "2", and "12" + "2" - "1" = "13", the end character offset index value of the end token is "13". As a result, the start character offset index value 64A and the end character offset index value 64B of the target character 74 are identified as (5, 13), and the corresponding target character "passport … number" (English translation) in the original string 34 is identified.
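The third-formula step can be sketched as follows, reusing the first data structure; the closing comment reproduces the FIG. 3 worked example:

```python
from typing import List, Tuple

def target_offsets_in_original(first_ds: List[Tuple[int, int]],
                               start_token: int, end_token: int) -> Tuple[int, int]:
    """Third formula of FIG. 3: map the start/end token indices found in the
    processed string to inclusive character offsets in the original string."""
    start_offset = first_ds[start_token][0]            # start of the start token
    end_start, end_len = first_ds[end_token]
    end_offset = end_start + end_len - 1               # last character of the end token
    return start_offset, end_offset

# In the FIG. 3 example the token indices are 4 and 5, first_ds[4] = (5, 5)
# and first_ds[5] = (12, 2), so the target character in the original string
# spans offsets (5, 12 + 2 - 1) = (5, 13).
```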
FIG. 6 shows a schematic diagram of a plurality of data structures operated on by the computing system of FIG. 1 when the word segmentation algorithm with offset mapping is performed on a different example of the original string 34. In this example, the original string 34 is written in Japanese, and the English translation of the original string is "purchase history will soon be available only when you log in and will be centrally managed within my number card". The target character "number card" of the original string 34 (English translation of the Japanese target characters) is identified using the offset map 48 of FIG. 1, following the same procedure described above with reference to FIG. 3. In this example, the term "number card" has been entered into the sensitive data definition by an administrator, and the corresponding word tokens of the target 56 in the processed string 36 have been identified by the search program 54.
In the example depicted in FIG. 6, the offset map generator 46 of FIG. 1 obtains the first data structure 50 of [0] (0, 2), [1] (2, 2), [2] (4, 1) … [20] (42, 1) and [21] (43, 2) from the original string 34 and obtains the second data structure 52 of [0] (3), [1] (6), [2] (8) … [20] (63) and [21] (66) from the processed string 36 using the first formula 83 to generate the offset map 48. Next, the search program 54 of FIG. 1 identifies the target 56 in the processed string 36 and determines the target word tokens 58. The target word tokens 58 also have an associated start character offset index value 56A and end character offset index value 56B ("42" and "52") in the processed string 36. Next, for each of the start character offset index value 56A and the end character offset index value 56B of the start token 59A and the end token 59B of the target 56 in the processed string 36, the token index value 57 ("14" and "16") is obtained using the second formula 84. Finally, by applying these token index values ("14" and "16") in the third formula 86, the start character offset index value of the start token ("28") and the end character offset index value of the end token ("36") in the original string 34 are obtained. As a result, the start character offset index value 64A and the end character offset index value 64B of (28, 36) in the original string 34 are identified, along with the corresponding target character "within number card" (English translation).
FIG. 7 illustrates a flow chart of a computerized method 300 according to one example implementation of the present disclosure. The computerized method 300 may include, at a first step, storing an original string composed of a plurality of characters. At step 304, the method may further include performing a word segmentation algorithm on the original string. At step 306, the method 300 may further include tokenizing the original string to generate a processed string including a plurality of word tokens separated by spaces. From step 306, the method branches into two parallel workflows. In the first branch of the parallel workflow, at step 308, the method 300 may further include generating an offset map 48 between locations within word tokens in the processed string and corresponding locations in the original string. At 310, the method may include storing word segmentation algorithm metadata from the original string and the processed string into a data structure. The metadata may indicate the location and length of the target in the original string and the processed string. For example, as shown at 312, the offset map may include a first data structure, and the method may include storing in the first data structure a start character offset index value and a character length for each word token in the original string, each word token detected in the original string during the word segmentation algorithm. Additionally or alternatively, as shown at 314, the offset map may include a second data structure, and the method may include storing an end character offset index value for each token in the processed string. Turning to the second branch of the parallel workflow, at step 340 the method may further include classifying, by the search program, a portion of the processed string as a target. At step 342, the method 300 may further include identifying a target word token and the associated start character offset index value and end character offset index value.
At step 318, the method 300 may further include identifying a target character corresponding to the target in the original string using the offset map together with the target word token and the associated start character offset index value and end character offset index value. To identify the target character, at 320, the method may determine a token index value for each of the start character offset index value and the end character offset index value of the target in the processed string. At 322, the method may further include determining a start character offset index value for the start token in the original string using the start character offset index values stored in the first data structure of the offset map. At 324, the method may include determining an end character offset index value for the end token using the end character offset index values stored in the second data structure of the offset map. Once the target character is identified, at step 326, the method 300 may further include performing a predetermined action 328 on the target character in the original string, such as highlighting the target word, obscuring the target word, and/or extracting the target word.
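Reusing the helper functions sketched earlier (segment, build_offset_map, token_index_for_offset, target_offsets_in_original, and apply_action, all of which are illustrative stand-ins rather than the patent's implementation), the overall flow of method 300 might be exercised as follows:

```python
original = "これは、私のパスポート番号です"               # hypothetical original string
tokens = segment(original)                                # steps 304/306: word segmentation
processed = " ".join(tokens)                              # processed string with added spaces
first_ds, second_ds = build_offset_map(original, tokens)  # steps 308-314: offset map

# Steps 340/342: the search program classifies part of the processed string as a
# target and reports its start/end character offsets in the processed string.
match_start = processed.find("パスポート")
match_end = match_start + len("パスポート 番号") - 1

# Steps 318-324: map the processed-string offsets back to the original string.
start_tok = token_index_for_offset(second_ds, match_start)
end_tok = token_index_for_offset(second_ds, match_end, is_end=True)
start, end = target_offsets_in_original(first_ds, start_tok, end_tok)

# Step 326: perform the predetermined action on the target character.
print(apply_action(original, start, end, "highlight"))    # これは、私の**パスポート番号**です
```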
It should be appreciated that the above-described systems and methods may be used with a dataset comprising strings of a multi-byte encoding language to search for target characters that match sensitive data definitions in processed strings that have been tokenized to optimize the search, and to identify corresponding characters in the original string from which the processed string was generated. By identifying the corresponding character in this manner, the appropriate action may be performed on the corresponding character in the original string.
In some embodiments, the methods and processes described herein may be tied to a computing system of one or more computing devices. In particular, such methods and processes may be implemented as a computer application program or service, an application programming interface (API), a library, and/or other computer program product.
FIG. 8 schematically illustrates a non-limiting embodiment of a computing system 400 that may implement one or more of the methods and processes described above. Computing system 400 is shown in simplified form. Computing system 400 may embody the computing device 12 described above and shown in FIG. 1, as well as the various computing devices shown in FIG. 2. Computing system 400 may take the form of one or more personal computers, server computers, tablet computers, home entertainment computers, network computing devices, gaming devices, mobile computing devices, mobile communication devices (e.g., smartphones), and/or other computing devices, as well as wearable computing devices such as smart watches and head-mounted augmented reality devices.
Computing system 400 includes a logic processor 402, volatile memory 404, and non-volatile storage 406. Computing system 400 may optionally include a display subsystem 408, an input subsystem 410, a communication subsystem 412, and/or other components not shown in FIG. 8.
Logical processor 402 includes one or more physical devices configured to execute instructions. For example, a logical processor may be configured to execute instructions that are part of one or more applications, programs, routines, libraries, objects, components, data structures, or other logical constructs. Such instructions may be implemented to perform a task, implement a data type, transform the state of one or more components, implement a technical effect, or otherwise achieve a desired result.
A logical processor may include one or more physical processors (hardware) configured to execute software instructions. Additionally or alternatively, the logic processor may include one or more hardware logic circuits or firmware devices configured to execute hardware-implemented logic or firmware instructions. The processors of logical processor 402 may be single-core or multi-core, and the instructions executed thereon may be configured for sequential, parallel, and/or distributed processing. The various components of the logical processor may optionally be distributed among two or more separate devices, which may be located remotely and/or configured to coordinate processing. Aspects of the logical processor may be virtualized and executed by remotely accessible, networked computing devices configured in a cloud computing configuration. In this case, it will be appreciated that these virtualized aspects run on different physical logical processors of the various different machines.
The non-volatile storage 406 includes one or more physical devices configured to hold instructions executable by a logical processor to implement the methods and processes described herein. When these methods and processes are implemented, the state of the non-volatile storage 406 may be transformed-e.g., to save different data.
Nonvolatile storage 406 may include removable and/or built-in physical devices. Nonvolatile storage 406 may include optical storage (e.g., CD, DVD, HD-DVD, blu-ray disc, etc.), semiconductor storage (e.g., ROM, EPROM, EEPROM, FLASH storage, etc.), and/or magnetic storage (e.g., hard disk drive, floppy disk drive, tape drive, MRAM, etc.), or other mass storage device technology. Nonvolatile storage 406 may include nonvolatile, dynamic, static, read/write, read-only, sequential access, location-addressable, file-addressable, and/or content-addressable devices. It is appreciated that the non-volatile storage device 406 is configured to retain instructions even when the non-volatile storage device 406 is powered down.
Volatile memory 404 may include physical devices including random access memory. Volatile memory 404 is typically used by logic processor 402 to temporarily store information during the processing of software instructions. It will be appreciated that when the volatile memory 404 is powered down, the volatile memory 404 will not typically continue to store instructions.
Aspects of the logic processor 402, the volatile memory 404, and the non-volatile storage 406 may be integrated together into one or more hardware logic components. For example, such hardware logic components may include Field Programmable Gate Arrays (FPGAs), program and application specific integrated circuits (PASICs/ASICs), program and application specific standard products (PSSPs/ASSPs), system On Chip (SOCs), and Complex Programmable Logic Devices (CPLDs).
The terms "module," "program," and "engine" may be used to describe an aspect of the computing system 400 that is typically implemented by a processor in software to perform certain functions using portions of volatile memory, which relate to the conversion process that configures the processor specifically to perform the function. Thus, a portion of the volatile memory 404 may be used to instantiate modules, programs, and/or engines via the logical processor 402 executing instructions held by the non-volatile storage 406. It should be appreciated that different modules, programs, and/or engines may be instantiated from the same application, service, code block, object, library, routine, API, function, etc. Likewise, the same module, program, and/or engine may be instantiated by different applications, services, code blocks, objects, routines, APIs, functions, etc. The terms "module," "program," and "engine" may include a single or a group of executable files, data files, libraries, drivers, scripts, database records, and the like.
When included, the display subsystem 408 may be used to present a visual representation of data held by the nonvolatile storage device 406. The visual representation may take the form of a Graphical User Interface (GUI). Because the herein described methods and processes change the data held by the non-volatile storage device, and thus transform the state of the non-volatile storage device, the state of the display subsystem 408 may likewise be transformed to visually represent changes in the underlying data. Display subsystem 408 may include one or more display devices utilizing virtually any type of technology. Such display devices may be incorporated in the shared housing with the logic processor 402, volatile memory 404, and/or non-volatile storage 406, or such display devices may be peripheral display devices.
When included, input subsystem 410 may include or interface with one or more user input devices, such as a keyboard, mouse, touch screen, or game controller. In some embodiments, the input subsystem may include or interface with a selected Natural User Input (NUI) component. Such components may be integrated or peripheral, and the transduction and/or processing of input actions may be on-board or off-board processing. Example NUI components may include microphones for speech and/or speech recognition; infrared, color, stereo, and/or depth cameras for machine vision and/or gesture recognition; head trackers, eye trackers, accelerometers and/or gyroscopes for motion detection and/or intent recognition; an electric field sensing assembly for assessing brain activity; and/or any other suitable sensor.
When included, communication subsystem 412 may be configured to communicatively couple the various computing devices described herein with each other and with other devices. Communication subsystem 412 may include wired and/or wireless communication devices compatible with one or more different communication protocols. As non-limiting examples, the communication subsystem may be configured to communicate via a wireless telephone network, or a wired or wireless local or wide area network, such as HDMI over Wi-Fi connection. In some embodiments, the communication subsystem may allow computing system 400 to send and/or receive messages to other devices via a network, such as the internet.
The following paragraphs discuss several aspects of the present disclosure. According to one aspect of the present disclosure, a computer system is provided. The computer system may include a processor coupled to a mass storage device storing instructions that, when executed by the processor, cause the processor to store an original string of a plurality of characters. The processor may also be configured to perform a word segmentation algorithm on the original string. The processor may be further configured to tokenize the original string to generate a processed string comprising a plurality of word tokens separated by spaces. The processor may be further configured to generate an offset map between a location within the word token in the processed string and a corresponding location in the original string. The processor may be further configured to classify a portion of the processed string as a target. The processor may be further configured to identify a target character corresponding to the target in the original string using the offset map. The processor may be further configured to perform a predetermined action on the target character in the original string.
According to this aspect, to perform the predetermined action, the processor may be further configured to highlight the target character, blur the target character, and/or extract the target character.
According to this aspect, to identify the target character in the original string, the processor may be further configured to identify a start character offset index value and a character length of the target character, and/or to identify a start character offset index value and an end character offset index value of the target character.
According to this aspect, to identify the target character in the original string, the processor may be further configured to determine a start character offset index value and an end character offset index value for the target in the processed string, determine a marker index value for each of the start character offset index value and the end character offset index value for the target in the processed string, determine a start character offset index value for the start marker in the original string using the start character offset index value stored in the first data structure of the offset map, and determine an end character offset index value for the end marker using the end character offset index value stored in the second data structure of the offset map.
According to this aspect, the target may be sensitive information of a predetermined sensitive information data type.
According to this aspect, the original string may include Japanese, Chinese, Korean, or Thai characters.
According to this aspect, the original string may be extracted from an electronic document or electronic message.
According to this aspect, the original string may include characters that are omitted from the processed string after the word segmentation algorithm is performed.
According to this aspect, the offset map may include a first data structure storing a start character offset index value and a character length of each marker word in the original string, each marker word detected in the original string during the word segmentation algorithm, and a second data structure storing an end character offset index value of each marker in the processed string, wherein each of the first data structure and the second data structure has the same number of elements.
According to another aspect of the present disclosure, a computerized method is provided. The computerized method may include storing an original string composed of a plurality of characters. The computerized method may also include performing a word segmentation algorithm on the original string and tokenizing the original string to generate a processed string comprising a plurality of word tokens separated by spaces. The computerized method may further include generating an offset map between a location within the word token in the processed string and a corresponding location in the original string. The computerized method may further include classifying a portion of the processed string as a target. The computerized method may further include identifying a target character corresponding to the target in the original string using the offset map. The computerized method may also include performing a predetermined action on the target character in the original string.
According to this aspect, performing the predetermined action may include one or more of: highlighting target characters, blurring target characters, and/or extracting target characters.
According to this aspect, identifying the target character in the original string may include one or more of: a start character offset index value and a character length identifying the target character and/or a start character offset index value and an end character offset index value identifying the target character.
According to this aspect, identifying the target character in the original string may be accomplished, at least in part, by: the method includes determining a start character offset index value and an end character offset index value of a target in a processed string, determining a marker index value for each of the start character offset index value and the end character offset index value of the target in the processed string, determining a start character offset index value of a start marker in the original string using the start character offset index value stored in a first data structure of the offset map, and determining an end character offset index value of an end marker using the end character offset index value stored in a second data structure of the offset map.
According to this aspect, the target may be sensitive information of a predetermined sensitive information data type.
According to this aspect, the original string may include Japanese, Chinese, Korean, or Thai characters.
According to this aspect, the original string may be extracted from an electronic document or electronic message.
According to this aspect, the original string may include characters that are omitted from the processed string after the word segmentation algorithm is performed.
According to this aspect, the offset map may include a first data structure storing a start character offset index value and a character length of each marker word in the original string, each marker word detected in the original string during the word segmentation algorithm, and a second data structure storing an end character offset index value of each marker in the processed string, wherein each of the first data structure and the second data structure has the same number of elements.
According to this aspect, the end character offset index value for each marker in the processed string may be calculated using the previous end character offset index value for each marker in the processed string and the length of the corresponding marker in the original string.
According to another aspect of the present disclosure, a computer system configured to classify words is provided. The computer system may include a server computing device configured to execute a search program configured to receive as input a sensitive data definition and one or more policies, and to search, according to the sensitive data definition, a data set comprising a plurality of original strings for sensitive data. The server computing device may be further configured to perform a word segmentation algorithm on an original string selected from the plurality of original strings. The server computing device may be further configured to tokenize the selected original string to generate a processed string comprising a plurality of word tokens separated by spaces. The server computing device may be further configured to generate an offset map between a location within the word token in the processed string and a corresponding location in the original string. The server computing device may also be configured to classify a portion of the processed string as a target. The server computing device may be further configured to identify a target character corresponding to the target in the original string using the offset map. The server computing device may also be configured to perform a predetermined action on the target character in the original string.
It is to be understood that the configurations and/or approaches described herein are exemplary in nature, and that these specific embodiments or examples are not to be considered in a limiting sense, because numerous variations are possible. The specific routines or methods described herein may represent one or more of any number of processing strategies. As such, various acts illustrated and/or described may be performed in the sequence illustrated and/or described, in other sequences, in parallel, or omitted. Also, the order of the above-described processes may be changed.
The subject matter of the present disclosure includes all novel and non-obvious combinations and subcombinations of the various processes, systems and configurations, as well as other features, functions, acts and/or properties disclosed herein, and any and all equivalents thereof.

Claims (15)

1. A computer system, comprising:
a processor coupled to a mass storage device storing instructions that, when executed by the processor, cause the processor to:
storing an original string composed of a plurality of characters;
executing a word segmentation algorithm on the original string;
tokenizing the original string to generate a processed string, the processed string comprising a plurality of word tokens separated by spaces;
generating an offset map between a location within the word token in the processed string and a corresponding location in the original string;
classifying a portion of the processed string as a target;
identifying a target character corresponding to the target in the original string using the offset map; and
performing a predetermined action on the target character in the original string.
2. The computer system of claim 1, wherein to perform the predetermined action, the processor is configured to:
highlight the target character;
blur the target character; and/or
extract the target character.
3. The computer system of claim 1, wherein to identify the target character in the original string, the processor is configured to:
identify a start character offset index value and a character length of the target character; and/or
identify a start character offset index value and an end character offset index value of the target character.
4. The computer system of claim 1, wherein to identify a target character in the original string, the processor is configured to:
determine a start character offset index value and an end character offset index value of the target in the processed string;
determine a word token index value for each of the start character offset index value and the end character offset index value of the target in the processed string;
determine a start character offset index value for the start word token in the original string using the start character offset index value stored in the first data structure of the offset map; and
determine an end character offset index value of the end word token using the end character offset index value stored in the second data structure of the offset map.
5. The computer system of claim 1, wherein the target is sensitive information of a predetermined sensitive information data type.
6. The computer system of claim 1, wherein the original string comprises Japanese, Chinese, Korean, or Thai characters.
7. The computer system of claim 1, wherein the original string is extracted from an electronic document or an electronic message.
8. The computer system of claim 1, wherein the original string comprises characters that are omitted in the processed string after execution of the word segmentation algorithm.
9. The computer system of claim 1, wherein the offset map comprises:
a first data structure storing a starting character offset index value and a character length of each word token in the original string, each word token being detected in the original string during the word segmentation algorithm; and
a second data structure storing an ending character offset index value for each word token in the processed string;
wherein each of the first data structure and the second data structure has the same number of elements.
10. A computerized method comprising:
storing an original string composed of a plurality of characters;
executing a word segmentation algorithm on the original string;
tokenizing the original string to generate a processed string, the processed string comprising a plurality of word tokens separated by spaces;
generating an offset map between a location within the word token in the processed string and a corresponding location in the original string;
classifying a portion of the processed string as a target;
identifying a target character corresponding to the target in the original string using the offset map; and
performing a predetermined action on the target character in the original string.
11. The computerized method of claim 10, wherein performing the predetermined action comprises one or more of:
highlighting the target character;
blurring the target character; and/or
extracting the target character.
12. The computerized method of claim 10, wherein identifying a target character in the original string comprises one or more of:
identifying a start character offset index value and a character length of the target character; and/or
identifying a start character offset index value and an end character offset index value of the target character.
13. The computerized method of claim 10, wherein identifying a target character in the original string is accomplished at least in part by:
determining a start character offset index value and an end character offset index value of the target in the processed string;
determining a word token index value for each of the start character offset index value and the end character offset index value of the target in the processed string;
determining a start character offset index value for the start word token in the original string using the start character offset index value stored in the first data structure of the offset map; and
determining an end character offset index value of the end word token using the end character offset index value stored in the second data structure of the offset map.
14. The computerized method of claim 10, wherein the original string is extracted from an electronic document or an electronic message.
15. The computerized method of claim 10, wherein the offset map comprises:
a first data structure storing a starting character offset index value and a character length of each word token in the original string, each word token being detected in the original string during the word segmentation algorithm; and
a second data structure storing an ending character offset index value for each word token in the processed string;
wherein each of the first data structure and the second data structure has the same number of elements.
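The following non-limiting sketch illustrates the offset-map lookup recited in claims 4 and 13, reusing the OffsetMap structures sketched in the description above; the bisect-based search for word token index values, the exclusive end offsets, and the derivation of the original end offset from the first data structure's start and length are choices made for the sketch rather than details taken from the claims.

import bisect
from typing import Tuple

def map_span_to_original(offset_map: OffsetMap, proc_start: int, proc_end: int) -> Tuple[int, int]:
    # Maps a (start, end) target span in the processed string back to the original string.
    # Word token index of the start offset: first token whose exclusive end offset in the
    # processed string (second data structure) exceeds proc_start.
    start_token = bisect.bisect_right(offset_map.processed_ends, proc_start)
    # Word token index of the end offset: first token whose end offset reaches proc_end.
    end_token = bisect.bisect_left(offset_map.processed_ends, proc_end)
    # Start character offset of the start token, read from the first data structure...
    orig_start = offset_map.original_spans[start_token].start
    # ...and the end character offset of the end token, from its start offset plus its length.
    end_span = offset_map.original_spans[end_token]
    orig_end = end_span.start + end_span.length
    return orig_start, orig_end

Continuing the example above, a target classified at offsets (4, 6) in the processed string "私 は 元気 です" (the word token "元気") maps back to offsets (2, 4) in the original string "私は元気です".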
CN202280038305.4A 2021-05-28 2022-05-05 Word segmentation algorithm with offset mapping Pending CN117396878A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
IN202141023933 2021-05-28
US17/444,347 2021-08-03
US17/444,347 US11899698B2 (en) 2021-05-28 2021-08-03 Wordbreak algorithm with offset mapping
PCT/IB2022/000257 WO2022248933A1 (en) 2021-05-28 2022-05-05 Wordbreak algorithm with offset mapping

Publications (1)

Publication Number Publication Date
CN117396878A true CN117396878A (en) 2024-01-12

Family

ID=89463575

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280038305.4A Pending CN117396878A (en) 2021-05-28 2022-05-05 Word segmentation algorithm with offset mapping

Country Status (1)

Country Link
CN (1) CN117396878A (en)

Similar Documents

Publication Publication Date Title
CN109804363B (en) Connection using format modification by way of example
CN108171073B (en) Private data identification method based on code layer semantic parsing drive
US11210327B2 (en) Syntactic profiling of alphanumeric strings
US20110162084A1 (en) Selecting portions of computer-accessible documents for post-selection processing
US20140189866A1 (en) Identification of obfuscated computer items using visual algorithms
US20140212040A1 (en) Document Alteration Based on Native Text Analysis and OCR
US20240054802A1 (en) System and method for spatial encoding and feature generators for enhancing information extraction
US20130159346A1 (en) Combinatorial document matching
CN108280197A (en) A kind of method and system of the homologous binary file of identification
US9898467B1 (en) System for data normalization
CN114418398A (en) Scene task development method, device, equipment and storage medium
US8639707B2 (en) Retrieval device, retrieval system, retrieval method, and computer program for retrieving a document file stored in a storage device
US9483535B1 (en) Systems and methods for expanding search results
EP2916238A1 (en) Corpus generating device, corpus generating method, and corpus generating program
FR3020883A1 (en) RELATIONAL FILE BASE AND GRAPHICAL INTERFACE FOR MANAGING SUCH A BASE
US20210064697A1 (en) List-based entity name detection
JP6194180B2 (en) Text mask device and text mask program
US9286349B2 (en) Dynamic search system
CN117396878A (en) Word segmentation algorithm with offset mapping
US11899698B2 (en) Wordbreak algorithm with offset mapping
KR101781597B1 (en) Apparatus and method for creating information on electronic publication
US9507947B1 (en) Similarity-based data loss prevention
WO2022248933A1 (en) Wordbreak algorithm with offset mapping
JP5916666B2 (en) Apparatus, method, and program for analyzing document including visual expression by text
US11132400B2 (en) Data classification using probabilistic data structures

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination