CN115983262A - Text sensitive information identification method and device, storage medium and electronic equipment - Google Patents
Text sensitive information identification method and device, storage medium and electronic equipment
- Publication number
- CN115983262A (application CN202211709838.9A)
- Authority
- CN
- China
- Prior art keywords
- word
- sensitive
- recognized
- words
- text
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Machine Translation (AREA)
Abstract
The disclosure provides a method and a device for identifying sensitive information of a text, a storage medium and electronic equipment, and relates to the technical field of data processing. The text sensitive information identification method comprises the following steps: acquiring a pre-established sensitive word data set; preprocessing a text to be processed to obtain a word to be recognized in the text to be processed; and comparing the word characteristics of the words to be recognized with the word characteristics of the sensitive words in the sensitive word data set to determine whether the words to be recognized are sensitive information. The method and the device can reduce the calculation amount of sensitive information identification in the text to a certain extent.
Description
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a method and an apparatus for identifying sensitive information of a text, a storage medium, and an electronic device.
Background
Identifying sensitive data is a fundamental problem in data security management. With the focus on data security, promptly discovering sensitive data and behaviors among enterprise employees and in user operating environments, and managing and controlling them effectively, is an important task that urgently needs to be solved.
In the related art, sensitive data in a text to be recognized is identified by comparing the text to be recognized with the sensitive data; as a result, when the amount of text data is large, the amount of calculation is large and the processing efficiency is low.
Disclosure of Invention
The disclosure provides a text sensitive information identification method, a text sensitive information identification device, a storage medium and an electronic device, so as to reduce the calculation amount of the text sensitive information identification to a certain extent.
According to a first aspect of the present disclosure, there is provided a method for identifying sensitive information of a text, including: acquiring a pre-established sensitive word data set; preprocessing a text to be processed to obtain a word to be recognized in the text to be processed; and determining whether the word to be recognized is sensitive information or not by comparing the word characteristics of the word to be recognized with the word characteristics of the sensitive words in the sensitive word data set.
In an embodiment, the preprocessing the text to be processed to obtain the word to be recognized in the text to be processed includes: performing word segmentation on the text to be processed to obtain a word set to be processed; deleting the words with the word length smaller than a word length preset threshold value in the word set to be processed to obtain the words to be recognized.
In one embodiment, deleting the word with the word length smaller than the preset word length threshold in the word set to be processed to obtain the word to be recognized includes: acquiring word length preset thresholds corresponding to different parts of speech; deleting the words of which the word length under each part of speech in the word set to be processed is smaller than a word length preset threshold value corresponding to the part of speech to obtain the words to be recognized.
In one embodiment, the determining whether the word to be recognized is sensitive information by comparing the word features of the word to be recognized with the word features of sensitive words in the sensitive word data set includes: generating a word vector of the word to be recognized according to the word features of the word to be recognized; calculating the similarity between the word vector of the word to be recognized and the word vector of the sensitive word in the sensitive word data set; the word vector of the sensitive word is obtained according to the word feature of the sensitive word; and determining whether the word to be recognized is sensitive information or not according to the similarity.
In one embodiment, the method further comprises: screening the sensitive word data set according to the words to be screened to obtain a screened sensitive word data set; the words to be screened are words in the text to be processed except the words to be identified;
the determining whether the word to be recognized is sensitive information by comparing the word features of the word to be recognized with the word features of the sensitive words in the sensitive word data set includes: and comparing the word characteristics of the word to be recognized with the word characteristics of the sensitive words in the screened sensitive word data set to determine whether the word to be recognized is sensitive information.
In an embodiment, the screening the sensitive word data set according to the word to be screened to obtain the screened sensitive word data set includes: and deleting words with the same parts of speech as the words to be screened from the sensitive word data set according to the parts of speech of the words to be screened to obtain the screened sensitive word data set.
In one embodiment, the word feature includes at least one of a part of speech, a word length, and a word frequency.
According to a second aspect of the present disclosure, there is provided a text sensitive information recognition apparatus, including:
the acquisition module is configured to acquire a pre-established sensitive word data set;
the preprocessing module is configured to preprocess a text to be processed to obtain a word to be recognized in the text to be processed;
the determining module is configured to determine whether the word to be recognized is sensitive information by comparing the word features of the word to be recognized with the word features of the sensitive words in the sensitive word data set.
In one embodiment, the preprocessing module is configured to: performing word segmentation on the text to be processed to obtain a word set to be processed; deleting the words with the word length smaller than a word length preset threshold value in the word set to be processed to obtain the words to be recognized.
In one embodiment, the preprocessing module is configured to: acquiring word length preset thresholds corresponding to different parts of speech; and deleting words of which the word length under each part of speech in the word set to be processed is smaller than a word length preset threshold corresponding to the part of speech to obtain the words to be recognized.
In one embodiment, the determining module is configured to: generating a word vector of the word to be recognized according to the word features of the word to be recognized; calculating the similarity between the word vector of the word to be recognized and the word vector of the sensitive word in the sensitive word data set; the word vector of the sensitive word is obtained according to the word feature of the sensitive word; and determining whether the word to be recognized is sensitive information or not according to the similarity.
In one embodiment, the apparatus for identifying sensitive information of text further includes a filtering module configured to: screening the sensitive word data set according to the words to be screened to obtain a screened sensitive word data set; the words to be screened are words in the text to be processed except the words to be identified;
correspondingly, the determining module is configured to: and comparing the word characteristics of the word to be recognized with the word characteristics of the sensitive words in the screened sensitive word data set to determine whether the word to be recognized is sensitive information.
In one embodiment, the screening module is configured to: and deleting words with the same part of speech as the word to be screened from the sensitive word data set according to the part of speech of the word to be screened to obtain the screened sensitive word data set.
In one embodiment, the word characteristics include at least one of a part of speech, a word length, and a word frequency.
According to a third aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, implements the method for recognizing sensitive information of a text of the first aspect and possible implementations thereof.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions for the processor; wherein the processor is configured to perform the textual sensitive information recognition method of the first aspect and possible implementations thereof described above via execution of executable instructions.
The technical scheme of the disclosure has the following beneficial effects:
In the above scheme, a pre-established sensitive word data set is acquired; a text to be processed is preprocessed to obtain the words to be recognized in the text to be processed; and whether a word to be recognized is sensitive information is determined by comparing the word features of the word to be recognized with the word features of the sensitive words in the sensitive word data set. Because preprocessing the text to be processed yields words to be recognized with a smaller data volume, and only these words are then compared with the sensitive words in the sensitive word data set, the amount of calculation for determining sensitive information can be reduced.
Drawings
Fig. 1 is a schematic diagram of a system architecture provided by an embodiment of the present disclosure;
fig. 2 is a schematic flow chart illustrating an implementation of a method for identifying sensitive information of a text according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of an implementation flow for preprocessing a text to be processed according to an embodiment of the present disclosure;
fig. 4 is a schematic flow chart illustrating an implementation of screening a word set to be processed according to an embodiment of the present disclosure;
fig. 5 is a schematic view of an implementation flow for determining whether a word to be recognized is sensitive information according to an embodiment of the present disclosure;
fig. 6 is a schematic flow chart illustrating an implementation of screening a sensitive word data set according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a device for recognizing sensitive information of a text according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described more fully hereinafter with reference to the accompanying drawings.
The drawings are schematic illustrations of the present disclosure and are not necessarily drawn to scale. Some of the block diagrams shown in the figures may be functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in hardware modules or integrated circuits, or in a network, processor or microcontroller. Embodiments may be embodied in many different forms and should not be construed as limited to the examples set forth herein. The described features, structures, or characteristics of the disclosure may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough explanation of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that one or more of the specific details can be omitted, or one or more of the specific details can be replaced with other methods, components, devices, steps, etc., in implementing the aspects of the disclosure.
Identifying sensitive data is a fundamental problem in data security management. With the focus on data security, promptly discovering sensitive data and behaviors among enterprise employees and in user operating environments, and managing and controlling them effectively, is an important task that urgently needs to be solved.
In the related art, sensitive data in a text to be recognized is identified by comparing the text to be recognized with the sensitive data; as a result, when the amount of text data is large, the amount of calculation is large and the processing efficiency is low.
In view of the above problems, exemplary embodiments of the present disclosure provide a method for recognizing sensitive information of a text, which can reduce the amount of computation for recognizing sensitive information in the text to some extent.
The system architecture and application scenario of an operating environment of the above method for identifying sensitive information of a text are exemplarily described below with reference to fig. 1.
Fig. 1 shows a schematic diagram of a system architecture, and the system architecture 100 may include a terminal 110 and a server 120. The terminal 110 may be a smart phone, a tablet computer, a personal computer, or the like, and the terminal 110 may receive a text to be processed input or designated by a user. Server 120 may generally refer to a background system (e.g., an intelligent voice service system) that provides sensitive information recognition related services, and may be a server or a cluster of servers. The terminal 110 and the server 120 may form a connection through a wired or wireless communication link for data interaction.
In one embodiment, the user inputs text to be processed into the terminal 110, for example, the user may call an intelligent robot on the terminal 110 and input a voice command, and the terminal 110 converts the voice command into a text format of a sentence to be recognized. Then, the terminal 110 may send the text to be processed to the server 120, and the server 120 obtains the sensitive information in the text to be processed by executing the above sensitive information identification method, and may return the sensitive information to the terminal 110.
In one embodiment, the present exemplary embodiment may also be implemented separately based on the terminal 110. For example, after acquiring a to-be-processed text input by a user, the terminal 110 obtains the sensitive information in the to-be-processed text by executing the above sensitive information identification method.
In one embodiment, the exemplary embodiment may also be implemented separately based on the server 120. For example, the server 120 may obtain the text to be processed from the background database, and obtain the sensitive information in the text to be processed by executing the above sensitive information identification method.
As can be seen from the above, in the present exemplary embodiment, the method for identifying sensitive information of a text may be executed by the terminal 110 or the server 120. The present disclosure is not limited thereto.
Fig. 2 is a schematic flow chart of an implementation of a text sensitive information identification method provided in an embodiment of the present disclosure, where the method includes the following steps S210 to S230:
step S210, acquiring a pre-established sensitive word data set;
step S220, preprocessing a text to be processed to obtain a word to be recognized in the text to be processed;
step S230, comparing the word features of the word to be recognized with the word features of the sensitive words in the sensitive word data set, and determining whether the word to be recognized is sensitive information.
In this method for identifying the sensitive information of a text, a pre-established sensitive word data set is acquired; a text to be processed is preprocessed to obtain the words to be recognized in the text to be processed; and the word features of each word to be recognized are compared with the word features of the sensitive words in the sensitive word data set to determine whether the word to be recognized is sensitive information. Because preprocessing the text to be processed yields words to be recognized with a smaller data volume, and only these words are then compared with the sensitive words in the sensitive word data set, the amount of calculation for determining sensitive information can be reduced.
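The effect of preprocessing on the amount of calculation can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the disclosed implementation: the function names, the toy sensitive word data set, the use of word length as the only word feature, and the threshold of 4 (chosen because the example words are English glosses). It counts the pairwise feature comparisons made with and without the preprocessing of step S220.

```python
def word_features(word):
    """Toy word feature: only the word length (an assumption for brevity)."""
    return len(word)


def identify_sensitive(words, sensitive_words, min_len=None):
    """Run steps S220 and S230; return (hits, number of feature comparisons).

    If min_len is None, preprocessing is skipped so both costs can be compared.
    """
    if min_len is not None:
        # Step S220: keep only words that reach the length threshold.
        words = [w for w in words if len(w) >= min_len]
    hits, comparisons = [], 0
    for w in words:  # Step S230: compare features pair by pair.
        for s in sensitive_words:
            comparisons += 1
            if word_features(w) == word_features(s):
                hits.append(w)
                break
    return hits, comparisons


# Step S210: a toy, hypothetical sensitive word data set.
sensitive_words = ["address", "salary"]
words = ["Xiaoming", "home", "of", "address", "you", "know", "do"]

print(identify_sensitive(words, sensitive_words))
print(identify_sensitive(words, sensitive_words, min_len=4))
```

In this toy run the same word is flagged in both cases, while the preprocessed run performs fewer comparisons because three short words never reach step S230.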
Each step in fig. 2 is explained in detail below.
Referring to fig. 2, in step S210, a pre-established sensitive word data set is acquired.
The sensitive word data set may be composed of the sensitive word data features found in the business systems on the internal and external networks of an industry.
Sensitive word data is generally divided into public sensitive word data, industry sensitive word data and enterprise sensitive word data; a sensitive word data set can be constructed from at least one of these three kinds of data. The sensitive word data features include at least one of word length, part of speech and word frequency.
With continued reference to fig. 2, in step S220, the text to be processed is preprocessed to obtain the words to be recognized in the text to be processed.
The preprocessing aims to reduce the data volume so as to obtain words to be recognized with a smaller data volume. Here, the preprocessing may include operations such as word segmentation and screening. In one implementation, word segmentation may first be performed on the text to be processed, and the segmented text is then screened to obtain the words to be recognized. For example, after word segmentation of a text to be processed that asks "Do you know the address of Xiaoming's home", the words "Xiaoming", "Home", "of", "Address", "you", "know" and "do" are obtained; because "do" is a modal auxiliary word, "of" is a structural auxiliary word and "you" is a personal pronoun, the words "of", "you" and "do" are deleted to obtain the words to be recognized, namely "Xiaoming", "Home", "Address" and "know".
With continued reference to fig. 2, in step S230, it is determined whether the word to be recognized is sensitive information by comparing the word features of the word to be recognized with the word features of the sensitive words in the sensitive word data set.
The word features include at least one of part of speech, word length and word frequency; accordingly, the words to be recognized and the sensitive words can each be represented by at least one of part of speech, word length and word frequency. To facilitate machine recognition and to better express words, a word vector constructed from part of speech, word length and word frequency can be used to represent a word; in one embodiment, the word vector may be represented using the following formula:
$$T_i = C_i\,(A_i, B_i) \tag{1}$$
where $T_i$ represents the word vector; $A_i$ represents the part of speech; $B_i$ represents the word length; and $C_i$ represents the word frequency.
The word length refers to the length of a word and can be expressed in bytes.
Part of speech refers to the type of a word, including verbs, adjectives, numerals, quantifiers, prepositions, conjunctions, auxiliary words, vocabularies, and the like.
The word frequency refers to how often a word is repeated in a document or a corpus, and can be calculated by the formula $\mathrm{TF} = \dfrac{\text{number of occurrences of the word}}{\text{total number of words}}$.
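As a minimal sketch of how the three word features could be turned into the word vector of formula (1), the Python below computes the word frequency and builds $T_i$. Several points are assumptions not stated in the source: the numeric codes in `POS_CODES`, the reading of $C_i\,(A_i, B_i)$ as the word frequency scaling the pair (part of speech, word length), and measuring word length in UTF-8 bytes. The function names are hypothetical.

```python
from collections import Counter

# Hypothetical numeric codes for parts of speech; the source does not say
# how a part of speech is turned into a number.
POS_CODES = {"noun": 1.0, "verb": 2.0, "adjective": 3.0}


def term_frequencies(words):
    """TF = occurrences of a word / total number of words."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


def word_vector(word, pos, tf):
    """Build T_i = C_i (A_i, B_i), read here as C_i scaling the pair (A_i, B_i)."""
    a = POS_CODES[pos]                      # A_i: part of speech
    b = float(len(word.encode("utf-8")))    # B_i: word length in bytes (assumed UTF-8)
    return (tf * a, tf * b)                 # C_i: word frequency as the scale


words = ["address", "know", "address", "home"]
tf = term_frequencies(words)
vector = word_vector("address", "noun", tf["address"])
print(tf["address"], vector)
```

Here "address" occurs twice among four words, so its word frequency is 0.5 and both components of the pair are halved.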
In an embodiment, in order to reduce the data amount of the text to be processed, the preprocessing in step S220 may consist of word segmentation followed by screening. Specifically, as shown in fig. 3, step S220 may include the following steps:
Step S310, performing word segmentation on the text to be processed to obtain a word set to be processed.
The word segmentation may be performed using, for example, the Chinese lexical analysis system of the Chinese Academy of Sciences, the jieba word segmentation tool, or word frequency grouping, which is not limited here.
The word set to be processed includes all the words obtained by segmenting the text to be processed. For example, segmenting a text to be processed that asks "Do you know the address of Xiaoming's home" yields the word set "Xiaoming", "Home", "of", "Address", "you", "know", "do".
Step S320, deleting the words whose word length is smaller than the word length preset threshold from the word set to be processed to obtain the words to be recognized.
The word length preset threshold is determined according to the word features of the sensitive words. For example, if big data on sensitive words indicates that the word length of sensitive words is greater than or equal to 2, the word length preset threshold is set to 2. In that case, when the word set to be processed includes the seven words "Xiaoming", "Home", "of", "Address", "you", "know" and "do", the words "of", "you" and "do" are deleted because their word length is less than 2, and the words to be recognized "Xiaoming", "Home", "Address" and "know" are obtained.
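Step S320 can be sketched in Python as a single filter over an already segmented word set. The function name is hypothetical, and the threshold of 4 is an assumption: the example words are English glosses whose lengths differ from those of the original words, so 4 is simply the value that removes the same three words as in the example above.

```python
def filter_by_length(words, min_length):
    """Step S320: drop words shorter than the word length preset threshold."""
    return [word for word in words if len(word) >= min_length]


# Word set to be processed, as it would come out of step S310.
word_set = ["Xiaoming", "home", "of", "address", "you", "know", "do"]

words_to_recognize = filter_by_length(word_set, min_length=4)
print(words_to_recognize)
```

The comparison uses `>=` because the step deletes only words whose length is strictly smaller than the threshold.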
In one embodiment, in order to further reduce the data amount of the text to be processed, the words to be processed may be screened according to a plurality of word features. As shown in fig. 4, step S320 may include the following steps:
Step S410, acquiring word length preset thresholds corresponding to different parts of speech.
The parts of speech include verb, adjective, numeral, quantifier, preposition, conjunction, auxiliary word, pseudonym, etc.
Different parts of speech correspond to different word length preset thresholds; here, the word length preset threshold may be understood as a set, and the threshold for each part of speech is determined from big data on words of that part of speech. For example, if big data on adjectives indicates that the word length of adjectives is greater than or equal to 3, the word length preset threshold for adjectives is set to 3; if big data on conjunctions indicates that the word length of conjunctions is greater than or equal to 2, the word length preset threshold for conjunctions is set to 2.
Step S420, deleting, from the word set to be processed, the words whose word length under each part of speech is smaller than the word length preset threshold corresponding to that part of speech, to obtain the words to be recognized.
In step S420, the words in the word set to be processed are screened more precisely using two word features, part of speech and word length, so that the data volume of the words to be processed can be further reduced. For example, if the word length preset threshold for adjectives is 3, adjectives with a word length smaller than 3 are deleted; if the word length preset threshold for conjunctions is 2, conjunctions with a word length smaller than 2 are deleted.
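A minimal Python sketch of steps S410 and S420 follows, using the adjective and conjunction thresholds from the example. The function name, the part-of-speech labels and the sample words are hypothetical, and keeping words whose part of speech has no configured threshold (via `default_threshold=1`) is an assumption, since the source does not say how such words are handled.

```python
def filter_by_pos_length(tagged_words, thresholds, default_threshold=1):
    """Step S420: drop words shorter than the threshold of their part of speech."""
    kept = []
    for word, pos in tagged_words:
        # Parts of speech without a configured threshold fall back to the default.
        if len(word) >= thresholds.get(pos, default_threshold):
            kept.append((word, pos))
    return kept


# Step S410: word length preset thresholds per part of speech.
thresholds = {"adjective": 3, "conjunction": 2}

tagged = [
    ("big", "adjective"),
    ("ok", "adjective"),
    ("and", "conjunction"),
    ("&", "conjunction"),
    ("address", "noun"),
]
kept = filter_by_pos_length(tagged, thresholds)
print(kept)
```

Storing the thresholds in a dictionary keyed by part of speech matches the description of the threshold as a set of values rather than a single number.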
In one embodiment, for facilitating machine recognition, the word to be recognized and the sensitive word exist in the form of a word vector, as shown in fig. 5, and the step S230 further includes the following steps:
step S510, generating a word vector of the word to be recognized according to the word features of the word to be recognized.
The word features include at least one of part of speech, word length and word frequency. To facilitate machine recognition and to better express words, a word vector constructed from part of speech, word length and word frequency can be used to represent a word; in one embodiment, the word vector may be represented using the following formula:
$$T_i = C_i\,(A_i, B_i) \tag{1}$$
where $T_i$ represents the word vector; $A_i$ represents the part of speech; $B_i$ represents the word length; and $C_i$ represents the word frequency.
Step S520, calculating the similarity between the word vector of the word to be recognized and the word vector of the sensitive word in the sensitive word data set.
And obtaining the word vector of the sensitive word according to the word characteristics of the sensitive word.
The similarity may be calculated using cosine similarity, Dice similarity, Jaccard similarity, Euclidean distance, and the like, which is not limited herein.
The word vector of a sensitive word is generated in the same way as the word vector of the word to be recognized: when the word vector of the word to be recognized is represented by the word length among the word features, the word vector of the sensitive word is also represented by the word length; when the word vector of the word to be recognized is represented by the word length, word frequency and part of speech, the word vector of the sensitive word is also represented by the word length, word frequency and part of speech. In this way, the similarity between the word vector of the word to be recognized and the word vector of the sensitive word can be calculated.
In one embodiment, the similarity may be calculated using the Euclidean distance, as shown in equation (2) [equation (2) is not reproduced in this text],
where $V_1$ is a word of the input text, i.e., $T_i = C_i\,(A_i, B_i)$; $V_2$ is a sensitive word in the sensitive word data set, i.e., $T_i' = (A_i', B_i')$; and $Z$ is a multiplication factor, i.e., $C_i$.
In one embodiment, the similarity may be calculated using cosine similarity, as shown in equation (3) [equation (3) is not reproduced in this text],
where $T_i(C_i)$ is the vector $T_i$ with $C_i$ removed; $\cdot$ is the standard vector dot product; and $\|T_i(C_i)\|\,\|T_i'\|$ is the product of the moduli of the vectors $T_i(C_i)$ and $T_i'$.
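For orientation, the textbook forms of the two measures named above can be sketched in Python. These are the standard Euclidean distance and cosine similarity, not necessarily the exact equations (2) and (3) of the disclosure; in particular, how the factor $Z$ enters equation (2) is not modeled here. The function names and sample vectors are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Standard Euclidean distance; a smaller distance means more similar vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector moduli."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    # A zero vector has no direction; returning 0.0 for it is an assumption.
    return dot / norm if norm else 0.0


# Hypothetical (part-of-speech code, word length) vectors.
word_vec = (1.0, 7.0)
sensitive_vec = (1.0, 3.0)
print(euclidean_distance(word_vec, sensitive_vec))
print(cosine_similarity(word_vec, sensitive_vec))
```

Note that the two measures point in opposite directions: a distance shrinks as vectors become more alike, so it would have to be converted or compared against an upper bound before being used as a similarity.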
And step S530, determining whether the word to be recognized is sensitive information according to the similarity.
The greater the similarity, the more likely the word to be recognized is a sensitive word; the smaller the similarity, the less likely it is. To make this judgment more convenient, a threshold can be used. In one implementation, a similarity threshold is set according to big data; the word to be recognized is determined to be sensitive information when its similarity to a sensitive word is greater than the similarity threshold, and is determined not to be sensitive information when the similarity is smaller than the similarity threshold.
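The threshold decision of step S530 can be sketched in Python as follows, taking precomputed similarity scores as input. The function names, the scores and the threshold of 0.9 are hypothetical. Treating a similarity exactly equal to the threshold as not sensitive is an assumption, because the text only covers the greater-than and smaller-than cases.

```python
def is_sensitive(similarities, threshold):
    """True if the similarity to any sensitive word exceeds the threshold."""
    return any(score > threshold for score in similarities)


def classify(similarity_table, threshold):
    """Map each word to be recognized to its sensitive / not sensitive decision."""
    return {
        word: is_sensitive(scores, threshold)
        for word, scores in similarity_table.items()
    }


# Hypothetical similarities of each word to two sensitive words.
similarity_table = {
    "address": [0.97, 0.40],
    "know": [0.55, 0.62],
}
result = classify(similarity_table, threshold=0.9)
print(result)
```

A word is flagged as soon as one sensitive word is similar enough, so the remaining scores for that word do not need to be inspected.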
In one embodiment, in order to further reduce the amount of computation, the sensitive word data set may also be filtered, as shown in fig. 6, and the method further includes the following steps:
Step S610, screening the sensitive word data set according to the words to be screened to obtain the screened sensitive word data set.
The words to be screened are the words in the text to be processed other than the words to be recognized.
The sensitive word data set may be screened using the word features of the words to be screened, or directly using the words to be screened themselves. For example, if the words deleted from the text to be processed are prepositions and auxiliary words, the words in the sensitive word data set whose part of speech is preposition or auxiliary word can likewise be deleted.
Step S620, determining whether the word to be recognized is sensitive information by comparing the word features of the word to be recognized with the word features of the sensitive words in the sensitive word data set includes:
comparing the word features of the word to be recognized with the word features of the sensitive words in the screened sensitive word data set to determine whether the word to be recognized is sensitive information.
The comparison between the word to be recognized and the sensitive word can be performed through one of the word characteristics, two of the word characteristics, or three of the word characteristics; in one embodiment, a word vector of the word to be recognized and a word vector of the sensitive word may be generated according to the word features, and a similarity between the word vector of the word to be recognized and the word vector of the sensitive word may be calculated to determine whether the word to be recognized is sensitive information.
In an embodiment, when the sensitive word data set is screened according to the words to be screened, the screening may be based on the part of speech among the word features of the words to be screened, and step S610 further includes the following step:
deleting, from the sensitive word data set, the words having the same part of speech as the words to be screened, according to the part of speech of the words to be screened, to obtain the screened sensitive word data set.
When the sensitive word data set is screened according to the word features of the words to be screened, it can be screened according to the part of speech of those words. For example, a sensitive word is unlikely to be a preposition, a modal auxiliary word or a quantifier, so the words in the sensitive word data set whose part of speech is preposition, modal auxiliary word or quantifier are deleted.
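The part-of-speech screening of step S610 can be sketched in Python as below. The function name, the part-of-speech labels and both word lists are hypothetical; the sketch assumes that every word already carries a part-of-speech tag.

```python
def screen_sensitive_words(sensitive_words, words_to_screen):
    """Step S610: drop sensitive words sharing a part of speech with screened-out words."""
    removed_pos = {pos for _, pos in words_to_screen}
    return [(word, pos) for word, pos in sensitive_words if pos not in removed_pos]


# Toy sensitive word data set as (word, part of speech) pairs.
sensitive_words = [
    ("address", "noun"),
    ("via", "preposition"),
    ("salary", "noun"),
    ("per", "preposition"),
]
# Words removed from the text to be processed during preprocessing.
words_to_screen = [("of", "auxiliary"), ("at", "preposition")]

screened = screen_sensitive_words(sensitive_words, words_to_screen)
print(screened)
```

Collecting the removed parts of speech into a set first means each sensitive word is checked once, regardless of how many words were screened out of the text.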
Exemplary embodiments of the present disclosure also provide a device 700 for recognizing sensitive information of a text; referring to fig. 7, the apparatus 700 for recognizing sensitive information of text may include:
an obtaining module 710 configured to obtain a pre-established sensitive word data set;
the preprocessing module 720 is configured to preprocess the text to be processed to obtain the words to be recognized in the text to be processed;
the determining module 730 is configured to determine whether the word to be recognized is the sensitive information by comparing the word features of the word to be recognized with the word features of the sensitive words in the sensitive word data set.
In one embodiment, the preprocessing module 720 is configured to: performing word segmentation on a text to be processed to obtain a word set to be processed; and deleting the words with the word length smaller than the preset word length threshold in the word set to be processed to obtain the words to be recognized.
In one embodiment, the preprocessing module 720 is configured to: acquiring word length preset thresholds corresponding to different parts of speech; and deleting the words with the word length under each part of speech in the word set to be processed, which is smaller than the word length preset threshold corresponding to the part of speech, to obtain the words to be recognized.
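The length-based preprocessing described above can be sketched as follows. The sketch assumes the text has already been segmented and tagged by some external segmenter, so the input is a list of `(word, part_of_speech)` pairs; the threshold values in `MIN_LENGTH_BY_POS` and `DEFAULT_MIN_LENGTH` are hypothetical and chosen only for illustration.

```python
# Hypothetical per-part-of-speech word length thresholds (not from the disclosure).
MIN_LENGTH_BY_POS = {"noun": 4, "verb": 3}
# Hypothetical fallback threshold for parts of speech without their own entry.
DEFAULT_MIN_LENGTH = 2

def preprocess(tagged_words,
               min_length_by_pos=MIN_LENGTH_BY_POS,
               default_min_length=DEFAULT_MIN_LENGTH):
    """Split segmented words into words to be recognized and words to be screened.

    A word is deleted from the word set to be processed (and becomes a word to be
    screened) when its length is below the threshold for its part of speech.
    """
    to_recognize, to_screen = [], []
    for word, pos in tagged_words:
        threshold = min_length_by_pos.get(pos, default_min_length)
        if len(word) < threshold:
            to_screen.append((word, pos))
        else:
            to_recognize.append((word, pos))
    return to_recognize, to_screen
```

Returning the deleted words as well is a design choice of this sketch: it keeps them available for the screening embodiment, in which the words to be screened are the words of the text other than the words to be recognized.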
In one embodiment, the determining module 730 is configured to: generating a word vector of the word to be recognized according to the word characteristics of the word to be recognized; calculating the similarity between the word vector of the word to be recognized and the word vector of the sensitive word in the sensitive word data set; the word vector of the sensitive word is obtained according to the word characteristics of the sensitive word; and determining whether the word to be recognized is sensitive information according to the similarity.
In one embodiment, the apparatus 700 for recognizing sensitive information of a text further includes a screening module 740, and the screening module 740 is configured to: screening the sensitive word data set according to the words to be screened to obtain a screened sensitive word data set; the words to be screened are the words in the text to be processed other than the words to be recognized;
correspondingly, the determining module 730 is configured to: and comparing the word characteristics of the word to be recognized with the word characteristics of the sensitive words in the screened sensitive word data set to determine whether the word to be recognized is sensitive information.
In one embodiment, the screening module 740 is configured to: and deleting words with the same parts of speech as the words to be screened from the sensitive word data set according to the parts of speech of the words to be screened to obtain the screened sensitive word data set.
In one embodiment, the word characteristics include at least one of a part of speech, a word length, and a word frequency.
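How the modules of the apparatus 700 might cooperate can be sketched as follows. The class names, method names, and data layout are hypothetical, and the comparison used here (exact equality of part of speech and word length) is a deliberately crude stand-in for the feature comparison; the disclosure leaves the concrete comparison open.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WordFeatures:
    """Two of the word features named in the disclosure."""
    part_of_speech: str
    length: int

class ObtainingModule:
    """Counterpart of obtaining module 710: supplies the sensitive word data set."""
    def __init__(self, sensitive_words):
        self._sensitive_words = dict(sensitive_words)  # word -> part of speech

    def obtain(self):
        return dict(self._sensitive_words)

class PreprocessingModule:
    """Counterpart of preprocessing module 720: drops words below a length threshold."""
    def __init__(self, min_length=2):
        self.min_length = min_length

    def preprocess(self, tagged_words):
        return [(w, pos) for w, pos in tagged_words if len(w) >= self.min_length]

class DeterminingModule:
    """Counterpart of determining module 730: compares word features."""
    def determine(self, tagged_word, sensitive_words):
        word, pos = tagged_word
        known = {WordFeatures(p, len(w)) for w, p in sensitive_words.items()}
        return WordFeatures(pos, len(word)) in known

def recognize(tagged_words, obtaining, preprocessing, determining):
    """Run the three modules in order and return the words flagged as sensitive."""
    sensitive_words = obtaining.obtain()
    return [
        word
        for word, pos in preprocessing.preprocess(tagged_words)
        if determining.determine((word, pos), sensitive_words)
    ]
```

Because exact feature equality flags any word that shares a part of speech and length with a sensitive word, a real system would presumably use a stricter comparison, such as the vector similarity of the earlier embodiment.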
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium, which may be implemented in the form of a program product, including program code for causing an electronic device to perform the steps according to various exemplary embodiments of the present disclosure described in the above-mentioned "exemplary method" section of this specification, when the program product is run on the electronic device. In an alternative embodiment, the program product may be embodied as a portable compact disc read only memory (CD-ROM) and include program code, and may be run on an electronic device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Exemplary embodiments of the present disclosure also provide an electronic device, which may include a processor and a memory. The memory stores instructions executable by the processor, which may be, for example, program code. The processor executes the executable instructions to perform the text sensitive information identification method in the exemplary embodiment, for example the method steps of fig. 2.
Referring now to FIG. 8, an electronic device in the form of a general purpose computing device is illustrated. It should be understood that the electronic device 800 shown in fig. 8 is merely an example and should not limit the functionality or scope of use of embodiments of the present disclosure.
As shown in fig. 8, the electronic device 800 may include: a processor 810, a memory 820, a bus 830, an I/O (input/output) interface 840, and a network adapter 850.
The memory 820 may include a volatile memory such as a RAM 821, a cache unit 822, and a nonvolatile memory such as a ROM 823. Memory 820 may also include one or more program modules 824, such program modules 824 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment. For example, program modules 824 may include modules in apparatus 700 that recognize sensitive information in text as described above.
The bus 830 is used to enable connection between various components of the electronic device 800, and may include a data bus, an address bus, and a control bus.
The electronic device 800 may communicate with one or more networks through the network adapter 850, for example, the network adapter 850 may provide a mobile communication solution such as 3G/4G/5G, or a wireless communication solution such as wireless local area network, bluetooth, near field communication, etc. The network adapter 850 may communicate with other modules of the electronic device 800 via the bus 830.
Although not shown in fig. 8, other hardware and/or software modules may also be provided in the electronic device 800, including but not limited to: displays, microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to exemplary embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system." Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the following claims.
Claims (10)
1. A method for recognizing sensitive information of a text is characterized by comprising the following steps:
acquiring a pre-established sensitive word data set;
preprocessing a text to be processed to obtain a word to be recognized in the text to be processed;
and determining whether the word to be recognized is sensitive information or not by comparing the word characteristics of the word to be recognized with the word characteristics of the sensitive words in the sensitive word data set.
2. The method for recognizing sensitive information of a text according to claim 1, wherein the preprocessing the text to be processed to obtain the words to be recognized in the text to be processed comprises:
performing word segmentation on the text to be processed to obtain a word set to be processed;
deleting the words with the word length smaller than a word length preset threshold value in the word set to be processed to obtain the words to be recognized.
3. The method for recognizing sensitive information of a text according to claim 2, wherein deleting the words with the word length smaller than a word length preset threshold in the word set to be processed to obtain the words to be recognized comprises:
acquiring word length preset thresholds corresponding to different parts of speech;
and deleting words of which the word length under each part of speech in the word set to be processed is smaller than a word length preset threshold corresponding to the part of speech to obtain the words to be recognized.
4. The method for recognizing sensitive information of a text according to claim 1, wherein the determining whether the word to be recognized is sensitive information by comparing the word feature of the word to be recognized with the word feature of a sensitive word in the sensitive word data set comprises:
generating a word vector of the word to be recognized according to the word features of the word to be recognized;
calculating the similarity between the word vector of the word to be recognized and the word vector of the sensitive word in the sensitive word data set; the word vector of the sensitive word is obtained according to the word feature of the sensitive word;
and determining whether the word to be recognized is sensitive information or not according to the similarity.
5. The method for identifying sensitive information of text according to claim 1, further comprising:
screening the sensitive word data set according to the words to be screened to obtain a screened sensitive word data set; the words to be screened are words in the text to be processed except the words to be identified;
the determining whether the word to be recognized is sensitive information by comparing the word features of the word to be recognized with the word features of the sensitive words in the sensitive word data set includes:
and comparing the word characteristics of the word to be recognized with the word characteristics of the sensitive words in the screened sensitive word data set to determine whether the word to be recognized is sensitive information.
6. The method for identifying sensitive information of a text according to claim 5, wherein the screening the sensitive word data set according to the word to be screened to obtain the screened sensitive word data set comprises:
and deleting words with the same part of speech as the word to be screened from the sensitive word data set according to the part of speech of the word to be screened to obtain the screened sensitive word data set.
7. The method of claim 1, wherein the word feature comprises at least one of a part of speech, a word length, and a word frequency.
8. An apparatus for recognizing sensitive information of a text, comprising:
the acquisition module is configured to acquire a pre-established sensitive word data set;
the preprocessing module is configured to preprocess a text to be processed to obtain a word to be recognized in the text to be processed;
the determining module is configured to determine whether the word to be recognized is sensitive information by comparing the word features of the word to be recognized with the word features of the sensitive words in the sensitive word data set.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 7.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1 to 7 via execution of the executable instructions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211709838.9A CN115983262A (en) | 2022-12-29 | 2022-12-29 | Text sensitive information identification method and device, storage medium and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211709838.9A CN115983262A (en) | 2022-12-29 | 2022-12-29 | Text sensitive information identification method and device, storage medium and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115983262A (en) | 2023-04-18 |
Family
ID=85966319
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211709838.9A Pending CN115983262A (en) | 2022-12-29 | 2022-12-29 | Text sensitive information identification method and device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115983262A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116701614A (en) * | 2023-08-02 | 2023-09-05 | 南京壹行科技有限公司 | Sensitive data model building method for intelligent text collection |
- 2022-12-29: CN application CN202211709838.9A (patent/CN115983262A/en), status: active, Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||