CN105183761B - Sensitive word replacing method and device - Google Patents


Info

Publication number
CN105183761B
Authority
CN
China
Prior art keywords
sensitive word
sensitive
word
replacement
dialect
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510446574.6A
Other languages
Chinese (zh)
Other versions
CN105183761A (en)
Inventor
张琦
刘锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Media Technology Beijing Co Ltd
Original Assignee
Netease Media Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Media Technology Beijing Co Ltd filed Critical Netease Media Technology Beijing Co Ltd
Priority to CN201510446574.6A priority Critical patent/CN105183761B/en
Publication of CN105183761A publication Critical patent/CN105183761A/en
Application granted granted Critical
Publication of CN105183761B publication Critical patent/CN105183761B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a sensitive word replacing method and device. The method comprises the following steps: receiving a target text; searching for sensitive words in the target text according to the sensitive word bank; determining a non-sensitive word corresponding to the sensitive word according to a sensitive word replacement rule, the non-sensitive word having a lower sensitivity than the sensitive word and being used to express a meaning that is the same as or similar to the sensitive word; and replacing the sensitive word with the non-sensitive word. Therefore, the method and the device of the invention ensure that when the user publishes the content to the Internet, even if the sensitive words are mixed in the text, the publishing enthusiasm of the user can be fully protected by properly processing the sensitive words, and the participation sense of the user is improved.

Description

Sensitive word replacing method and device
Technical Field
The embodiment of the invention relates to the technical field of communication, in particular to a sensitive word replacing method and device.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
The advent of the internet has greatly facilitated the distribution and dissemination of various information content among users. For example, network instant messaging systems are used by an increasing number of people because they make communication between clients convenient and fast. In addition, microblogs and forums are characterized by large user groups, convenient publishing and viewing of information, and wide influence. As a result, people often use various internet tools to send large quantities of text messages that include "sensitive words". For example, the sensitive words may include uncivil words, sensitive words related to national security, and the like.
At present, the sensitivity of a target text is mostly identified manually, or a sensitive vocabulary is established manually and a machine matches the target text against that vocabulary to determine its sensitivity. In this case, when a user publishes content to the internet and the text includes sensitive words, one of the following two situations generally occurs.
One situation is that the system directly prohibits the user from submitting the target text and informs the user that the target text contains sensitive words.
Another situation is that the system allows the user to submit the target text, but before the target text is actually displayed on the internet, a "manual review link" is entered to confirm whether the words judged to be sensitive words by the system really appear in the text in a manual review manner. If the target text is considered to contain the sensitive words by the manual review, the text is not allowed to be published to the Internet, and conversely, if the target text is considered to not contain the sensitive words by the manual review, the text is displayed on the Internet.
Disclosure of Invention
However, in the prior art, once the target text is judged to contain sensitive words, the user is flatly prohibited from submitting the text to the system or from publishing it to the internet. That is, the user has no way to express his or her own ideas, so the user's enthusiasm for publishing is dampened and the user's sense of participation is reduced.
Therefore, in the prior art, improving the user experience when publishing content remains a troublesome problem.
Therefore, an improved method and an improved device for replacing sensitive words are needed, so that when a user publishes content to the internet, even if the sensitive words are mixed in the text, the publishing enthusiasm of the user can be fully protected by properly processing the sensitive words, and the participation sense of the user is improved.
In this context, embodiments of the present invention are intended to provide a sensitive word replacement method and apparatus.
In a first aspect of embodiments of the present invention, there is provided a sensitive word replacement method, including: receiving a target text; searching for sensitive words in the target text according to the sensitive word bank; determining a non-sensitive word corresponding to the sensitive word according to a sensitive word replacement rule, the non-sensitive word having a lower sensitivity than the sensitive word and being used to express a meaning that is the same as or similar to the sensitive word; and replacing the sensitive word with the non-sensitive word.
In a second aspect of embodiments of the present invention, there is provided a sensitive word replacing apparatus, including: a target text receiving unit for receiving a target text; the sensitive word searching unit is used for searching a sensitive word in the target text according to the sensitive word bank; a non-sensitive word determination unit configured to determine a non-sensitive word corresponding to the sensitive word according to a sensitive word replacement rule, the non-sensitive word having a lower sensitivity than the sensitive word and being used to express a meaning identical or similar to the sensitive word; and a non-sensitive word replacing unit, configured to replace the sensitive word with the non-sensitive word.
In a third aspect of embodiments of the present invention, there is provided a sensitive word replacing apparatus, including: a storage unit and a processing unit, the storage unit having stored thereon computer instructions that, when executed by the processing unit, perform the steps of: receiving a target text; searching for sensitive words in the target text according to the sensitive word bank; determining a non-sensitive word corresponding to the sensitive word according to a sensitive word replacement rule, the non-sensitive word having a lower sensitivity than the sensitive word and being used to express a meaning that is the same as or similar to the sensitive word; and replacing the sensitive word with the non-sensitive word.
In a fourth aspect of embodiments of the present invention, there is provided a computer program product comprising: program code for performing the following steps when executed on one or more computing devices: receiving a target text; searching for sensitive words in the target text according to the sensitive word bank; determining a non-sensitive word corresponding to the sensitive word according to a sensitive word replacement rule, the non-sensitive word having a lower sensitivity than the sensitive word and being used to express a meaning that is the same as or similar to the sensitive word; and replacing the sensitive word with the non-sensitive word.
According to the sensitive word replacing method and device provided by the embodiments of the invention, sensitive words in a text can be deliberately processed so that the text is desensitized. The benefits are as follows: for users, negative sentiment is reduced, which is conducive to social harmony; for the system, the workload of manual review is reduced; in cultural terms, the software embodies humanistic care and social harmony. Therefore, the method of the invention ensures that when the user publishes content to the internet, even if sensitive words are mixed into the text, the user's enthusiasm for publishing can be fully protected by properly processing the sensitive words, and the user's sense of participation is improved.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 schematically illustrates a framework diagram of an exemplary application scenario of an embodiment of the present invention;
FIG. 2 schematically illustrates a flow diagram of one embodiment of a sensitive word replacement method in an embodiment of the present invention;
FIG. 3 schematically shows a flow chart of an embodiment of the step of determining non-sensitive words in an embodiment of the present invention;
FIG. 4 schematically shows a flow chart of a first example of the step of determining non-sensitive words in an embodiment of the invention;
FIG. 5 schematically shows a flow chart of a second example of the step of determining non-sensitive words in an embodiment of the present invention;
FIG. 6 schematically shows a flow chart of a third example of the step of determining non-sensitive words in an embodiment of the present invention;
FIG. 7 is a flow chart that schematically illustrates yet another example of the step of determining non-sensitive words in an embodiment of the present invention;
FIG. 8 schematically shows a schematic diagram of a sensitive word replacing apparatus according to an embodiment of the present invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to the embodiment of the invention, a sensitive word replacing method and device are provided.
In this document, it is to be understood that any number of elements in the figures are provided by way of illustration and not limitation, and any nomenclature is used for differentiation only and not in any limiting sense.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention.
Summary of The Invention
The inventor finds that in the prior art, once the target text is judged to contain the sensitive words, the user is absolutely prohibited from submitting the text to the system or from publishing the text to the internet, that is, the user has no method to express his own idea, obviously, the publishing enthusiasm of the user is destroyed, and the participation sense of the user is reduced.
Based on the analysis of the above findings of the present inventors, the basic design idea of the present invention is: after receiving the target text submitted by the user, once the sensitive word is detected to be included in the target text, the sensitive word can be replaced by a non-sensitive word with the same or similar meaning, and then content publishing is continued.
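The basic idea can be sketched end to end as follows. This is only a minimal illustration under stated assumptions: the function name `replace_sensitive_words`, the sample word bank, and the sample replacement rule (a plain dictionary) are hypothetical and are not taken from the embodiments.

```python
def replace_sensitive_words(target_text, sensitive_words, replacement_rule):
    """Replace each sensitive word found in target_text and return the text to publish.

    sensitive_words: words from a (hypothetical) sensitive word bank.
    replacement_rule: dict mapping a sensitive word to a milder word; it stands in
    for the "sensitive word replacement rule" and is an assumption of this sketch.
    """
    result = target_text
    # Try longer words first so that a short word does not split a longer match.
    for word in sorted(sensitive_words, key=len, reverse=True):
        if word in result and word in replacement_rule:
            result = result.replace(word, replacement_rule[word])
    return result


# Illustrative data only.
bank = {"idiot", "bastard"}
rule = {"idiot": "fool", "bastard": "rascal"}
published = replace_sensitive_words("That idiot took my seat", bank, rule)
print(published)
```

A text without any word from the bank is returned unchanged, which corresponds to publishing it directly.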
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Application scene overview
Fig. 1 schematically shows a framework diagram of an exemplary application scenario of an embodiment of the present invention.
Referring to fig. 1, the embodiment of the present invention may be applied to a content distribution system as shown in fig. 1, which includes a server 101, a client 102, and the like.
For example, a user may interact with the server 101 for content publication through a user interface interaction device (e.g., client 102) on the user device. Those skilled in the art will appreciate that the block diagram shown in FIG. 1 is merely one example in which embodiments of the present invention may be implemented. The scope of applicability of embodiments of the present invention is not limited in any way by this framework. For example, embodiments of the present invention may be equally applicable in a standalone application scenario, i.e., an application may be completed by relying only on the client 102 without interaction with the server 101.
It is noted that the user device herein can be any device, now existing, under development, or developed in the future, that is capable of interacting with the server 101 through any form of wired or wireless connection (e.g., Wi-Fi, LAN, coaxial cable, cellular network, etc.), including but not limited to desktop computers, laptop computers, mobile terminals (including smart phones, non-smart phones, and various tablet computers), and the like.
It should also be noted that the server 101 is only one example of an existing, developing, or future developing device capable of providing network published applications to users. The embodiments of the invention are not limited in any way in this respect.
It should be noted that the method according to the embodiment of the present invention may be executed by the client 102, may equally be executed by the server 101, or may be executed partly by the client 102 and partly by the server 101. The present invention places no limitation on the execution subject, as long as the method disclosed in the embodiment of the present invention is executed.
Exemplary method
A sensitive word replacement method according to an exemplary embodiment of the present invention is described below with reference to fig. 2 in conjunction with the application scenario of fig. 1. It should be noted that the above application scenarios are merely illustrated for the convenience of understanding the spirit and principles of the present invention, and the embodiments of the present invention are not limited in this respect. Rather, embodiments of the present invention may be applied to any scenario where applicable.
Fig. 2 schematically shows a flowchart of an embodiment of a sensitive word replacement method in an embodiment of the present invention.
As shown in fig. 2, the sensitive word replacing method of this embodiment may specifically include:
in step S210, a target text is received.
In one example, it may be assumed that the execution subject of the sensitive word replacement method of the present embodiment is the client 102 shown in fig. 1.
For example, the target text may be text content input by a user through an input unit (e.g., a keyboard, a mouse, a trackball, a touch pad, a touch screen, a microphone, etc.) provided on the user device. Next, the user may initiate a command to issue the target text to the internet, for example, through the same or a different input unit (e.g., by pressing the shortcut key Ctrl + Enter through a keyboard or clicking a "send" or "confirm" button, etc. through a mouse), so that the client 102 can receive the target text and start to execute the sensitive word replacement method of the present embodiment.
Further, in another example, it may also be assumed that the execution subject of the sensitive word replacement method of the present embodiment is the server 101 shown in fig. 1.
At this time, after the client 102 receives the target text input by the user, the client 102 may then interact with the server 101 through a wired or wireless connection, transmit the target text input by the user to the server 101, enable the server 101 to receive the target text, and start to execute the sensitive word replacement method of the present embodiment.
In step S220, a sensitive word is searched for in the target text according to the sensitive word bank.
For example, a sensitive word library may be preset in the execution body of the sensitive word replacement method of the present embodiment.
For this purpose, a sensitive word library may be formed by analyzing a large amount of information in advance to collect the sensitive words commonly used in that information, and the library may be stored in the client 102 or the server 101. For example, sensitive words may include uncivil words, sensitive words related to national security, and the like, as well as words intended for publicity, advertising, and so on.
Of course, presetting the sensitive word library as described above is only an example, and the embodiment of the present invention is not limited thereto. For example, the sensitive word library may also be preset in a cloud server and downloaded to the client 102 or the server 101 only at the time of use. In addition, the sensitive word library in the client 102 or the server 101 can be continuously updated through the cloud server. Updating through the cloud server allows the sensitive word library to be maintained dynamically, so that the sensitive words remain comprehensive, correct, and up to date. In addition, the sensitive word library may also have a self-learning capability, so that the ability to recognize sensitive words can be optimized.
Next, the execution subject of the method may extract the information content in the target text for examination. Then, the executive body can check whether the information content contains the sensitive words stored in the sensitive word bank or not by referring to the sensitive word bank.
Once it is determined that the sensitive word exists in the target text, the method proceeds to step S230 and continues to execute. Conversely, if it is determined that no sensitive words are present in the target text, the executing agent will allow the target text to be posted to the Internet, and the method ends.
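The search in step S220 and the branch that follows it can be sketched as below. The embodiments do not prescribe a matching algorithm; the left-to-right, longest-match-first scan, the function name `find_sensitive_words`, and the sample bank are assumptions made for illustration.

```python
def find_sensitive_words(target_text, sensitive_word_bank):
    """Return (start, end, word) for each non-overlapping match, scanning left to right."""
    # Longest words first, so "badword" wins over "bad" at the same position.
    words = sorted(sensitive_word_bank, key=len, reverse=True)
    matches = []
    i = 0
    while i < len(target_text):
        for word in words:
            if word and target_text.startswith(word, i):
                matches.append((i, i + len(word), word))
                i += len(word)  # skip past the match to avoid overlaps
                break
        else:
            i += 1  # no word starts here; move on by one character
    return matches


bank = {"bad", "badword"}  # illustrative placeholder words
hits = find_sensitive_words("a badword and a bad one", bank)
if hits:
    print("sensitive words found, continue with step S230:", hits)
else:
    print("no sensitive words, the text can be published as is")
```

For a large word bank, a trie or an Aho-Corasick automaton would be the usual choice; the nested loop above is kept only for readability.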
In step S230, a non-sensitive word corresponding to the sensitive word is determined according to a sensitive word replacement rule, the non-sensitive word having a lower sensitivity than the sensitive word and being used to express a meaning identical or similar to the sensitive word.
In response to determining that a sensitive word exists in the target text, the execution subject of the method may determine a non-sensitive word corresponding to the sensitive word according to a system default or user-defined sensitive word replacement rule.
FIG. 3 schematically shows a flowchart of an example of the step of determining non-sensitive words in an embodiment of the present invention.
As shown in fig. 3, the step S230 may include:
in step S310, a replacement vocabulary library is obtained according to the sensitive word replacement rule.
In an embodiment of the present invention, a plurality of sensitive word replacement rules may be provided to an execution subject of the method. For example, the sensitive word replacement rules may include semantic replacement rules based on semantic judgment, spelling replacement rules based on spelling processing, dialect replacement rules based on region matching, and the like.
Accordingly, different sensitive word replacement rules may correspond to different replacement vocabulary libraries. However, the embodiments of the present invention are not limited in any way in this respect. For example, one or more different replacement vocabulary libraries may also be defined for multiple sensitive word replacement rules. Furthermore, the processing such as duplicate removal can be further carried out in the replacement vocabulary library so as to save the storage space of the system.
Advantageously, in embodiments of the present invention, a suitable replacement vocabulary library may be provided to support each of the replacement rules described above (semantic judgment, spelling, and region matching). In addition, the replacement vocabulary library may also have a self-learning capability, so that the processing results for sensitive words can be further optimized.
In step S320, the non-sensitive word is searched in the replacement vocabulary library according to the sensitive word.
After the replacement vocabulary library is obtained, the replacement vocabulary library may be utilized to find non-sensitive words corresponding to the sensitive words.
The sensitive words and the non-sensitive words may be in one-to-one correspondence, or in one-to-many or many-to-one correspondence. When a sensitive word has several candidate non-sensitive words, the non-sensitive word to be used may be selected automatically according to the user's usage habits, selected at random, or chosen by the user from the full list of candidates.
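Steps S310 and S320, including the three ways of picking one candidate, might look like the following sketch. All names (`REPLACEMENT_LIBRARIES`, `get_replacement_library`, `choose_non_sensitive_word`), the placeholder words, and the representation of "usage habits" as a list of previously used words are assumptions of this illustration.

```python
import random
from collections import Counter

# Hypothetical replacement vocabulary libraries, one per replacement rule.
REPLACEMENT_LIBRARIES = {
    "semantic": {"harsh_word": ["mild_a", "mild_b"], "rude_word": ["polite_word"]},
    "spelling": {"harsh_word": ["hw"]},
}


def get_replacement_library(rule_name):
    """Step S310: obtain the library that corresponds to the replacement rule."""
    return REPLACEMENT_LIBRARIES[rule_name]


def choose_non_sensitive_word(sensitive_word, library, strategy="habit",
                              usage_history=None, ask_user=None, rng=random):
    """Step S320: look up the candidates for a sensitive word and pick one."""
    candidates = library.get(sensitive_word, [])
    if not candidates:
        return None  # no replacement known for this word
    if len(candidates) == 1:
        return candidates[0]
    if strategy == "habit":
        # Prefer the candidate the user has used most often so far.
        counts = Counter(usage_history or [])
        return max(candidates, key=lambda c: counts[c])
    if strategy == "random":
        return rng.choice(candidates)
    if strategy == "ask":
        return ask_user(candidates)  # callback that shows all candidates to the user
    raise ValueError(f"unknown strategy: {strategy}")


library = get_replacement_library("semantic")
picked = choose_non_sensitive_word("harsh_word", library, strategy="habit",
                                   usage_history=["mild_b", "mild_a", "mild_b"])
print(picked)
```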
In the following, the step of determining non-sensitive words in the sensitive word replacement method according to an embodiment of the present invention will be described in more detail in three different examples.
In a first example, it may be assumed that the sensitive word replacement rule is a semantic replacement rule based on semantic judgment.
Semantics is the meaning of written symbols. It can be simply regarded as the meaning of the concepts represented by the real-world objects that the written symbols correspond to, together with the relationships between those meanings; it is the interpretation and logical representation of the written symbols in a given field.
In this example, the sensitive word may be subjected to replacement processing by performing semantic analysis on the sensitive word to find a non-sensitive word having the same or similar semantic as the sensitive word.
Fig. 4 schematically shows a flow chart of a first example of the step of determining non-sensitive words in an embodiment of the invention.
As shown in fig. 4, the step S310 may include:
in step S410, in response to the sensitive word replacement rule being a semantic replacement rule, a semantic vocabulary library is obtained. The semantic vocabulary library defines the correspondence between sensitive words and non-sensitive words, wherein a sensitive word and its corresponding non-sensitive word serve as the same sentence component in a sentence.
For example, the correspondence between sensitive words and non-sensitive words is defined in a semantic vocabulary library, and in the correspondence of each pair of sensitive and non-sensitive words, both have semantically the same or similar meaning.
In step S420, the semantic vocabulary library is determined as the replacement vocabulary library.
With continued reference to fig. 4, this step S320 may include:
in step S430, semantic analysis is performed on the target text.
For example, a semantic model may be set or stored in advance in the execution subject, so that the execution subject can determine the semantics of the target text according to the semantic model. Specifically, the execution subject may learn and train the semantic model according to the application scenario of the current topic, and then store the semantic model locally and/or in the cloud. Then, after receiving the target text, the execution subject may retrieve the corresponding semantic model from local storage and/or the cloud, and determine the organization rules and the structural relationships among the written symbols in the target text according to the semantic model.
In step S440, the sentence component of the sensitive word itself is determined according to the result of the semantic analysis.
Next, the sentence component of each sensitive word in the target text can be determined according to the organization rules and the structural relationships among the written symbols in the target text. For example, the sentence components may include subjects, predicates, objects, attributives, adverbials, complements, and the like.
In step S450, the non-sensitive word is selected in the semantic vocabulary library according to the sentence component of the sensitive word itself.
Once it is determined what sentence component in the target text the sensitive word constitutes, the appropriate non-sensitive word can be looked up in the semantic vocabulary library accordingly.
Specifically, the word that the sensitive word acts on in the target text may first be determined according to the result of the semantic analysis; the non-sensitive word is then selected in the semantic vocabulary library based on the sentence component of the sensitive word itself and the meaning of the word it acts on.
In the following, the present example is explained in detail by means of two examples.
In the first example, it is assumed that the target text received in step S210 is "today's traffic is too idiotic, and the relevant departments are too idiotic". The target text contains the sensitive word "idiotic" twice. Semantic analysis shows that the first occurrence is an adjective serving as the predicate of its clause and describing the noun "traffic", and that the second occurrence is also an adjective serving as a predicate, describing the noun "departments". Given these results, a non-sensitive adjective suited to describing "traffic", e.g., "terrible", may first be selected from the semantic vocabulary library for the first occurrence. However, if the adjective chosen for "traffic" were also used to describe "departments", the sentence would not read smoothly and the audience might not understand what the user means. Therefore, a different non-sensitive adjective suited to describing "departments", e.g., "incompetent", may be selected from the semantic vocabulary library for the second occurrence. Thus, in the subsequent steps, the target text can be converted into "today's traffic is too terrible, and the relevant departments are too incompetent".
In the second example, it is assumed that the target texts received in step S210 are "how come the flight is delayed, bastard" and "how come I am eating gutter oil, bastard". Each target text contains the sensitive word "bastard". Semantic analysis shows that the first occurrence is a noun used as an independent element of the sentence to express the user's feelings about the flight delay, and that the second occurrence is also a noun used as an independent element, expressing the user's feelings about eating gutter oil. Given these results, a non-sensitive noun suited to a flight delay, e.g., "heavens", may first be selected from the semantic vocabulary library for the first sensitive word. However, if the same noun were also used for eating gutter oil, it would fit poorly and the audience might not accurately understand what the user wants to express. Therefore, a non-sensitive noun or noun phrase suited to eating gutter oil, e.g., "life is hard", may be selected from the semantic vocabulary library for the second sensitive word. Thus, in the subsequent steps, the target texts can be converted into "how come the flight is delayed, heavens" and "how come I am eating gutter oil, life is hard", respectively.
By further comparing the two examples above, it can be seen that the semantic analysis in the first example is more similar to a semantic analysis based on the sentence structure itself, while the semantic analysis in the second example is more similar to a semantic analysis based on the user context. There is a certain difference between the two, which is determined by the result of semantic analysis.
Obviously, in this example, through semantic replacement, the original sensitive word can be processed into a non-sensitive word with the same or similar meaning, so that the original meaning of the original target text is fully retained under the condition of removing the sensitivity, and the user is ensured to well express the own meaning without negative influence.
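The selection in step S450 can be illustrated with a lookup keyed by the sensitive word, its sentence component, and the word it acts on. This is a sketch under several assumptions: the semantic analysis of steps S430 and S440 is not implemented and its output is hard-coded, the English words are illustrative stand-ins, and the names `SEMANTIC_LIBRARY` and `select_non_sensitive_word`, as well as the fallback entry keyed with `None`, are hypothetical.

```python
# Hypothetical semantic vocabulary library.
# Key: (sensitive word, sentence component, word the sensitive word acts on).
# A key ending in None is a fallback used when no specific entry exists.
SEMANTIC_LIBRARY = {
    ("idiotic", "predicate", "traffic"): "terrible",
    ("idiotic", "predicate", "departments"): "incompetent",
    ("idiotic", "predicate", None): "bad",
}


def select_non_sensitive_word(sensitive_word, component, acted_on):
    """Pick the entry matching both the sentence component and the described word."""
    specific = SEMANTIC_LIBRARY.get((sensitive_word, component, acted_on))
    if specific is not None:
        return specific
    return SEMANTIC_LIBRARY.get((sensitive_word, component, None))


# Assumed result of the semantic analysis, listed in order of occurrence.
analysis = [
    {"word": "idiotic", "component": "predicate", "acted_on": "traffic"},
    {"word": "idiotic", "component": "predicate", "acted_on": "departments"},
]

text = "today's traffic is too idiotic, and the relevant departments are too idiotic"
for item in analysis:
    replacement = select_non_sensitive_word(
        item["word"], item["component"], item["acted_on"])
    # Replace only the first remaining occurrence, so occurrences are handled in order.
    text = text.replace(item["word"], replacement, 1)
print(text)
```

Keying on the described word is what lets the same sensitive word map to different non-sensitive words within one sentence.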
In a second example, it may be assumed that the sensitive word replacement rule is a spelling-based spelling replacement rule.
Spelling is the representation of the words of a language in letters and digits. The letter representation mainly refers to pinyin, that is, the process of spelling a syllable: according to the syllable formation rules of Mandarin, the initial, the medial, and the final are joined in quick succession and a tone is added to form a syllable. The digit representation mainly means that when a word includes Chinese numerals, Arabic numerals can be used directly to represent them.
In this example, the desensitization process may be performed on a sensitive word by finding a letter and/or number combination that corresponds to the sensitive word.
Fig. 5 schematically shows a flow chart of a second example of the step of determining non-sensitive words in an embodiment of the invention.
As shown in fig. 5, the step S310 may include:
in step S510, in response to the sensitive word replacement rule being a spelling replacement rule, a spelling vocabulary library is obtained, the spelling vocabulary library defining a correspondence between a sensitive word and a non-sensitive word, wherein the non-sensitive word is a set of alphanumeric characters corresponding to the sensitive word.
That is, in each sensitive word/non-sensitive word pair defined in the spelling vocabulary library, the non-sensitive word represents the meaning of the sensitive word by a set of letters and/or numbers.
For example, a non-sensitive word may be constructed by taking the first letter of the pinyin of each character in the sensitive word and combining these first letters, in the order of the characters in the sensitive word, into a set of first letters.
Alternatively, a non-sensitive word may be constructed by taking the first letter of the pinyin of each character that is not a Chinese numeral and the Arabic numeral corresponding to each Chinese numeral in the sensitive word, and combining these first letters and Arabic numerals, in their order in the sensitive word, into a set of alphanumeric characters.
In step S520, the spelling vocabulary library is determined as the replacement vocabulary library.
With continued reference to fig. 5, this step S320 may include:
in step S530, a specific set of alphanumeric characters is searched for in the spelling vocabulary library according to the sensitive word.
In step S540, the specific set of alphanumeric characters is determined to be the non-sensitive word.
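Steps S510 to S540 amount to a table lookup. The sketch below is a minimal illustration under stated assumptions: the dictionary `SPELLING_LIBRARY`, the function names, and the key "biantai" (a placeholder romanization standing in for a sensitive word) are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical spelling vocabulary library (S510): sensitive word -> alphanumeric set.
# The key is a placeholder romanization, not a term taken from a real word bank.
SPELLING_LIBRARY: Dict[str, str] = {"biantai": "BT"}


def lookup_spelling(sensitive_word: str) -> Optional[str]:
    """S530/S540: return the alphanumeric set for the word, or None if absent."""
    return SPELLING_LIBRARY.get(sensitive_word)


def replace_in_text(text: str, sensitive_word: str) -> str:
    """Replace the sensitive word only when the library holds an entry for it."""
    replacement = lookup_spelling(sensitive_word)
    if replacement is None:
        return text
    return text.replace(sensitive_word, replacement)
```

Called as `replace_in_text("this person is too biantai", "biantai")`, the sketch returns the text with the placeholder word replaced by "BT"; a word with no library entry leaves the text unchanged.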
Next, the present example will be specifically described by taking an example.
For example, assume that the target text received in step S210 is "this person is too perverted". The target text includes the sensitive word "perverted" (pinyin: bian tai). By looking it up in the spelling vocabulary library, the non-sensitive set of alphanumeric characters corresponding to this sensitive word is found to be "BT". Thus, the target text can be converted into "this person is too BT" in the subsequent step.
Obviously, in the present example, through spelling replacement, the original sensitive word can be treated as a non-sensitive word capable of expressing the same or similar meaning, so that the original meaning of the original target text is fully preserved with the sensitivity removed.
In a third example, it may be assumed that the sensitive word replacement rule is a dialect replacement rule.
A dialect, or regional dialect, is a variant of a language that arises from regional differences; it is a regional branch of the common national language and reflects the uneven development of the language across regions. Regional branches that have differentiated from the same language are called "dialects" when the society in which they are spoken has not fully split apart and their speakers still regard them as the same language.
In this example, finding the dialect word corresponding to the sensitive word can weaken the sensitivity of that word for most members of the public who are in a different region from the user. In addition, a non-sensitive dialect word corresponding to the sensitive dialect word can be further looked up, so that other members of the public in the same region as the user are not negatively affected and can grasp the intended meaning more closely.
Fig. 6 schematically shows a flow chart of a third example of the step of determining non-sensitive words in an embodiment of the invention.
As shown in fig. 6, the step S310 may include:
in step S610, in response to the sensitive word replacement rule being a dialect replacement rule, an Internet Protocol (IP) address of the user equipment is acquired.
Each host on the Internet must have a unique IP address. The IP protocol uses this address to transfer information between hosts, which is the basis on which the Internet operates. An IP address is 32 bits long (2^32 addresses in total) and is divided into 4 segments of 8 bits each; each segment is written as a decimal number in the range 0-255, and the segments are separated by periods, for example 159.226.1.1. An IP address can be regarded as consisting of two parts: a network identification number (the network address) and a host identification number (the host address).
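The address layout described above can be illustrated with Python's standard `ipaddress` module. This is a sketch only: the function name `describe_ipv4` is hypothetical, and the 16-bit network prefix used in the usage note is an assumption, since the text does not say where the network part ends.

```python
import ipaddress
from typing import List, Tuple


def describe_ipv4(dotted: str, prefix_len: int) -> Tuple[List[int], int, int, int]:
    """Split a dotted-decimal address into its segments and its two parts.

    prefix_len is the assumed number of leading bits that form the network
    identification number; the remaining bits identify the host.
    """
    addr = ipaddress.IPv4Address(dotted)
    value = int(addr)                      # the address as one 32-bit integer
    segments = list(addr.packed)           # four 8-bit segments, each 0-255
    host_bits = 32 - prefix_len
    network_part = value >> host_bits
    host_part = value & ((1 << host_bits) - 1)
    return segments, value, network_part, host_part
```

For the address 159.226.1.1 with an assumed 16-bit prefix, the segments are 159, 226, 1 and 1; the first two form the network part and the last two form the host part.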
In step S620, the geographical area where the user is located is determined according to the internet protocol address.
After the client's IP address is obtained, the client's approximate or detailed location can readily be determined with tools such as an information database. From that location, it can be determined which province or municipality, or even which city, district, or county, the user is in.
In step S630, a first dialect vocabulary library corresponding to the geographic area is obtained, where the first dialect vocabulary library defines a corresponding relationship between a sensitive word and a dialect synonym, where the dialect synonym has a lower sensitivity than the sensitive word and is a dialect vocabulary for expressing a meaning identical or similar to the sensitive word in the geographic area where the user is located.
For example, different dialect vocabulary libraries may be defined in advance for different geographic regions. Of course, to save storage space, the geographic area using the same dialect may be made to correspond to a dialect vocabulary library. Thus, after the geographic region of the user is determined, the dialect vocabulary library corresponding to the geographic region can be further searched.
In step S640, the first dialect vocabulary library is determined as the replacement vocabulary library.
With continued reference to fig. 6, this step S320 may include:
in step S650, a dialect synonym is searched for in the first dialect vocabulary library according to the sensitive word, and the dialect synonym is used as the non-sensitive word.
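Steps S610 to S650 can be sketched as two lookups: IP address to geographic area, then sensitive word to dialect synonym. Everything concrete below is an assumption made for illustration: the address range is a documentation-only range standing in for a real IP information database, and the key "<sensitive>" is a placeholder for a sensitive word.

```python
import ipaddress
from typing import Dict, Optional

# Hypothetical stand-in for an IP information database (S620).
REGION_BY_NETWORK = {
    ipaddress.ip_network("203.0.113.0/24"): "Sichuan",
}

# Hypothetical first dialect vocabulary libraries, one per geographic area (S630).
FIRST_DIALECT_LIBRARIES: Dict[str, Dict[str, str]] = {
    "Sichuan": {"<sensitive>": "haar"},
}


def region_for_ip(ip: str) -> Optional[str]:
    """S620: map the user's IP address to a geographic area, if known."""
    addr = ipaddress.ip_address(ip)
    for network, region in REGION_BY_NETWORK.items():
        if addr in network:
            return region
    return None


def dialect_synonym(ip: str, sensitive_word: str) -> Optional[str]:
    """S630-S650: look the word up in the library for the user's area."""
    region = region_for_ip(ip)
    if region is None:
        return None
    return FIRST_DIALECT_LIBRARIES.get(region, {}).get(sensitive_word)
```

Areas that share a dialect could point to the same inner dictionary, which matches the storage-saving remark above.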
Next, the present example will be specifically described by taking an example.
For example, assume that the target text received in step S210 is "this person is too mentally deficient" and that the corresponding client IP address indicates that the client is located in Sichuan Province. The target text includes the sensitive word "mentally deficient". By looking it up in the first dialect vocabulary library, the dialect synonym corresponding to this sensitive word is found to be "haar". In this way, the target text can be converted into "this person is too haar" in a subsequent step.
In this example, simple dialect replacement weakens the sensitivity of the sensitive word for most members of the public who are in a different region from the Sichuan user.
Alternatively, since even the weakened dialect word may still be sensitive to some extent, a non-sensitive dialect word corresponding to the sensitive dialect word may be further looked up, so as to further eliminate its negative impact on other members of the public in the same region as the user.
For this reason, with continued reference to fig. 6, the step S320 may also include:
in step S660, searching for a dialect synonym in the first dialect vocabulary library according to the sensitive word;
in step S670, obtaining a second dialect vocabulary library corresponding to the geographic area, the second dialect vocabulary library defining a corresponding relationship between dialect synonyms and dialect non-sensitive words, wherein the dialect non-sensitive words have lower sensitivity than the dialect synonyms and are dialect vocabularies for expressing the same or similar meaning as the dialect synonyms in the geographic area where the user is located; and
in step S680, a dialect non-sensitive word is searched in the second dialect vocabulary library according to the dialect synonym, and the dialect non-sensitive word is used as the non-sensitive word to replace the sensitive word.
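Steps S660 to S680 chain two lookups. The sketch below is illustrative only: the function name and the two dictionaries are hypothetical, "<sensitive>" is a placeholder key, and falling back to the dialect synonym when the second library has no entry is a design assumption that the text does not specify.

```python
from typing import Dict, Optional

# Hypothetical libraries for one geographic area.
FIRST_DIALECT_LIBRARY: Dict[str, str] = {"<sensitive>": "haar"}   # word -> dialect synonym
SECOND_DIALECT_LIBRARY: Dict[str, str] = {"haar": "silly"}        # synonym -> non-sensitive word


def two_stage_dialect_lookup(
    sensitive_word: str,
    first: Dict[str, str],
    second: Dict[str, str],
) -> Optional[str]:
    """S660: find the dialect synonym; S670/S680: soften it further if possible."""
    synonym = first.get(sensitive_word)
    if synonym is None:
        return None
    # Assumption: keep the dialect synonym when no softer dialect word is defined.
    return second.get(synonym, synonym)
```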
Next, the present example will be specifically described by taking an example.
For example, assume that the target text received in step S210 is "this person is too mentally deficient" and that the corresponding client IP address indicates that the client is located in Sichuan Province. The target text includes the sensitive word "mentally deficient". By looking it up in the first dialect vocabulary library, the dialect synonym corresponding to this sensitive word is found to be "haar". However, because "haar" is still somewhat profane, another, non-sensitive dialect word, "silly", corresponding to the sensitive dialect word "haar" can be further looked up in the second dialect vocabulary library. In this way, the target text can be converted into "this person is too silly" in a subsequent step.
Compared with the previous example, this example not only weakens the sensitivity of the sensitive word for most members of the public who are in a different region from the Sichuan user, but also converts the dialect word that still carries some sensitivity into a non-sensitive dialect word, so that other members of the public in the same region as the Sichuan user can readily understand what the user means without being offended.
Although the above embodiments of the step of determining the non-sensitive word have been described by taking the replacement vocabulary library as an example, the present invention is not limited thereto. For example, in another embodiment of the present invention, when the sensitive word replacement rule is a spelling replacement rule based on spelling processing, the non-sensitive word can be determined directly, without acquiring a replacement vocabulary library.
Fig. 7 schematically shows a flowchart of yet another example of the step of determining non-sensitive words in the embodiment of the present invention.
As shown in fig. 7, the step S230 may include:
in step S710, in response to the sensitive word replacement rule being a spelling replacement rule, a set of alphanumeric characters corresponding to the sensitive word is determined.
In step S720, the set of alphanumeric characters is determined to be the non-sensitive word.
As mentioned above, spelling is a direct representation of a word by letters and/or numbers, and as such, the alphabetic and/or numeric representation of the sensitive word can be produced directly, without any preset vocabulary library.
In one example, the sensitive words may all be represented using a set of letters.
Specifically, in this step S710, the first letter of the pinyin of each character in the sensitive word may be determined first; then, these first letters are combined, in the order of the characters in the sensitive word, into a set of first letters, which serves as the set of alphanumeric characters.
For example, assume that the target text received in step S210 is "this person is too perverted". The target text includes the sensitive word "perverted" (pinyin: bian tai). By performing pinyin analysis on the sensitive word, the set of first letters corresponding to it is found to be "BT". Thus, the target text can be converted into "this person is too BT" in the subsequent step.
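The first-letter construction needs no vocabulary library once the pinyin of each character is known. The sketch below assumes the pinyin syllables are supplied by some earlier pinyin analysis step, which is outside its scope; the function name is hypothetical.

```python
from typing import Sequence


def first_letter_set(pinyin_syllables: Sequence[str]) -> str:
    """Combine the first letter of each syllable, in order, as uppercase letters.

    Each syllable is assumed to be the pinyin of one character of the
    sensitive word; empty entries are skipped.
    """
    return "".join(syllable[0].upper() for syllable in pinyin_syllables if syllable)
```

With the syllables "bian" and "tai" as input, the function yields "BT".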
In another example, where Chinese numbers are included in the sensitive word, the sensitive word may be represented using a collection of letters and numbers.
Specifically, in this step S710, it may first be determined whether the sensitive word includes a Chinese numeral; then, in response to the sensitive word including Chinese numerals, the Arabic numeral corresponding to each Chinese numeral in the sensitive word is determined, and the first letter of the pinyin of each character that is not a Chinese numeral is determined; finally, the Arabic numerals and the first letters are combined, in the order of the characters in the sensitive word, into the set of alphanumeric characters.
For example, assume that the target text received in step S210 is "this person is really a fool". The target text includes the sensitive word "fool", which contains the Chinese numeral "two". By performing pinyin and numeral analysis on the sensitive word, the set of alphanumeric characters corresponding to the sensitive word "fool" is found to be "2S". Thus, the target text can be converted into "this person is really a 2S" in the subsequent step.
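The mixed numeral-and-letter construction can be sketched as follows. The function name, the input format (pairs of a character and its pinyin), and the digit table are assumptions; the two-character input in the usage note is an illustrative guess chosen to yield "2S", not a word confirmed by the text. Only single Chinese digits are handled; composite numerals such as "ten" are outside this sketch.

```python
from typing import Iterable, Tuple

# Single Chinese digits and their Arabic counterparts.
CHINESE_DIGITS = {
    "零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
    "五": "5", "六": "6", "七": "7", "八": "8", "九": "9",
}


def alphanumeric_set(chars_with_pinyin: Iterable[Tuple[str, str]]) -> str:
    """Map Chinese digits to Arabic numerals and other characters to pinyin initials."""
    parts = []
    for char, pinyin in chars_with_pinyin:
        if char in CHINESE_DIGITS:
            parts.append(CHINESE_DIGITS[char])   # Chinese numeral -> Arabic numeral
        else:
            parts.append(pinyin[0].upper())      # other character -> first pinyin letter
    return "".join(parts)
```

For instance, `alphanumeric_set([("二", "er"), ("傻", "sha")])` produces "2S".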
In step S240, the sensitive word is replaced with the non-sensitive word.
Finally, sensitive words may be replaced with non-sensitive words determined by any of the above-described manners.
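Taken together, steps S210 to S240 form a small pipeline. The sketch below is a simplified illustration, not the claimed implementation: it matches sensitive words by plain substring search, treats the replacement rule as a function from a sensitive word to a non-sensitive word (or None), and uses hypothetical names throughout.

```python
from typing import Callable, Iterable, Optional


def replace_sensitive_words(
    text: str,
    sensitive_word_bank: Iterable[str],
    rule: Callable[[str], Optional[str]],
) -> str:
    """S210: take the text; S220: find sensitive words; S230: map; S240: replace."""
    # Longer words first, so a short entry does not break up a longer match.
    for word in sorted(sensitive_word_bank, key=len, reverse=True):
        if word in text:                      # S220: search by the word bank
            replacement = rule(word)          # S230: apply the replacement rule
            if replacement is not None:
                text = text.replace(word, replacement)  # S240
    return text
```

Any of the rules described above (semantic, spelling, or dialect) could be passed as `rule`, for example the `get` method of a dictionary that serves as a replacement vocabulary library.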
In one embodiment of the present invention, since there may be a plurality of sensitive word replacement rules, preferably, different options may be further provided to the user, so that the user selects different sensitive word replacement rules as needed to meet the customization needs of the user.
Therefore, before step S230, the sensitive word replacing method of this embodiment may further include, for example:
in step S250, a plurality of replacement candidate rules are provided to the user.
In step S260, one replacement candidate rule selected by the user among the plurality of replacement candidate rules is received.
In step S270, the replacement candidate rule selected by the user is determined as the sensitive word replacement rule.
For example, a plurality of sensitive word replacement rules, such as semantic replacement rules based on semantic judgment, spelling replacement rules based on spelling processing, dialect replacement rules based on region matching, and the like, may be provided to a user through a graphical user interface at a user interface interaction device (e.g., client 102). Also, the sensitive word replacement rule that the user desires to use in step S230 is determined according to the selection operation performed by the user using the input device.
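Steps S250 to S270 can be sketched as offering a fixed set of candidate rules and validating the user's choice. The enum and function names are hypothetical, and how the candidates are shown in a graphical user interface is outside this sketch.

```python
from enum import Enum
from typing import List


class ReplacementRule(Enum):
    SEMANTIC = "semantic"
    SPELLING = "spelling"
    DIALECT = "dialect"


def candidate_rules() -> List[ReplacementRule]:
    """S250: the replacement candidate rules provided to the user."""
    return list(ReplacementRule)


def choose_rule(selection: str) -> ReplacementRule:
    """S260/S270: accept the user's selection and make it the active rule.

    Raises ValueError when the selection is not one of the candidates.
    """
    return ReplacementRule(selection)
```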
It should be noted that the above description takes as an example the case in which steps S250 to S270 are executed before step S230. However, the present invention is not limited thereto. Steps S250 to S270 may also be performed before step S220, or even before step S210.
With the technical solution of this embodiment, sensitive words in a text can be deliberately processed to desensitize them. The benefits are as follows: for users, negative sentiment is reduced, which is conducive to social harmony; for the system, the workload of manual review is reduced; culturally, the software reflects humanistic care and social harmony. Therefore, when a user publishes content to the Internet, even if sensitive words are mixed into the text, handling them properly fully protects the user's enthusiasm for publishing and improves the user's sense of participation.
Exemplary device
Having described the method of an exemplary embodiment of the present invention, a sensitive word replacing apparatus according to another exemplary embodiment of the present invention is described next.
Fig. 8 schematically shows a schematic diagram of a sensitive word replacing apparatus according to an embodiment of the present invention. As shown in fig. 8, the apparatus 800 may include:
a target text receiving unit 810 for receiving a target text;
a sensitive word searching unit 820, configured to search a sensitive word in the target text according to a sensitive word bank;
a non-sensitive word determining unit 830, configured to determine a non-sensitive word corresponding to the sensitive word according to a sensitive word replacement rule, where the non-sensitive word has a lower sensitivity than the sensitive word and is used to express a meaning identical or similar to the sensitive word; and
a non-sensitive word replacing unit 840, configured to replace the sensitive word with the non-sensitive word.
In an embodiment of the present invention, in order to determine the non-sensitive word corresponding to the sensitive word according to the sensitive word replacement rule, the non-sensitive word determining unit 830 may obtain a replacement vocabulary library according to the sensitive word replacement rule; and searching the non-sensitive words in the replacement vocabulary library according to the sensitive words.
In a specific example, in order to obtain a replacement vocabulary library according to the sensitive word replacement rule, the non-sensitive word determining unit 830 may obtain a semantic vocabulary library in response to the sensitive word replacement rule being a semantic replacement rule, the semantic vocabulary library defining a correspondence between sensitive words and non-sensitive words, wherein the sensitive words and the non-sensitive words corresponding to each other constitute the same sentence component in a sentence; and determining the semantic vocabulary library as the replacement vocabulary library.
In this specific example, in order to search the non-sensitive word in the replacement vocabulary library according to the sensitive word, the non-sensitive word determination unit 830 may perform semantic analysis on the target text; determining sentence components of the sensitive words according to the semantic analysis result; and selecting the non-sensitive word from the semantic vocabulary library according to the sentence component of the sensitive word.
Specifically, in order to select the non-sensitive word in the semantic vocabulary library according to the sentence component of the sensitive word itself, the non-sensitive word determining unit 830 may determine the word object acted on by the sensitive word in the target text according to the result of the semantic analysis; and select the non-sensitive word in the semantic vocabulary library according to the sentence component of the sensitive word and the meaning of the word object.
In another specific example, to obtain the replacement vocabulary library according to the sensitive word replacement rule, the non-sensitive word determining unit 830 may obtain a spelling vocabulary library, which defines a correspondence between the sensitive word and a non-sensitive word, in response to the sensitive word replacement rule being a spelling replacement rule, wherein the non-sensitive word is a set of alphanumeric characters corresponding to the sensitive word; and determine the spelling vocabulary library as the replacement vocabulary library.
In this specific example, in order to search for the non-sensitive word in the replacement vocabulary library according to the sensitive word, the non-sensitive word determination unit 830 may search for a specific set of alphanumeric characters in the spelling vocabulary library according to the sensitive word; and determine the specific set of alphanumeric characters as the non-sensitive word.
In yet another specific example, to obtain a replacement vocabulary library according to the sensitive word replacement rule, the non-sensitive word determining unit 830 may obtain an Internet Protocol (IP) address of the user equipment in response to the sensitive word replacement rule being a dialect replacement rule; determine the geographical area where the user is located according to the internet protocol address; acquire a first dialect vocabulary library corresponding to the geographic area, wherein the first dialect vocabulary library defines the corresponding relation between a sensitive word and a dialect synonym, and the dialect synonym has lower sensitivity than the sensitive word and is a dialect vocabulary used for expressing the same or similar meaning as the sensitive word in the geographic area where the user is located; and determine the first dialect vocabulary library as the replacement vocabulary library.
In this specific example, in order to search for the non-sensitive word in the replacement vocabulary library according to the sensitive word, the non-sensitive word determining unit 830 may search for a dialect synonym in the first dialect vocabulary library according to the sensitive word, as the non-sensitive word.
Alternatively, in order to search for the non-sensitive word in the replacement vocabulary library according to the sensitive word, the non-sensitive word determining unit 830 may search for a dialect synonym in the first dialect vocabulary library according to the sensitive word; obtain a second dialect vocabulary library corresponding to the geographic area, wherein the second dialect vocabulary library defines the corresponding relation between dialect synonyms and dialect non-sensitive words, and the dialect non-sensitive words have lower sensitivity than the dialect synonyms and are dialect vocabularies which are used for expressing the same or similar meanings as the dialect synonyms in the geographic area where the user is located; and search for a dialect non-sensitive word in the second dialect vocabulary library according to the dialect synonym, to serve as the non-sensitive word that replaces the sensitive word.
In one embodiment of the present invention, in order to determine a non-sensitive word corresponding to a sensitive word according to a sensitive word replacement rule, the non-sensitive word determination unit 830 may determine a set of alphanumeric characters corresponding to the sensitive word in response to the sensitive word replacement rule being a spelling replacement rule; and determine the set of alphanumeric characters as the non-sensitive word.
In a specific example, to determine the set of alphanumeric characters corresponding to the sensitive word, the non-sensitive word determination unit 830 may determine the first letter of the pinyin of each character in the sensitive word; and combine these first letters, in the order of the characters in the sensitive word, into a set of first letters, which serves as the set of alphanumeric characters.
In another specific example, to determine the set of alphanumeric characters corresponding to the sensitive word, the non-sensitive word determining unit 830 may determine whether Chinese numerals are included in the sensitive word; in response to the sensitive word including Chinese numerals, determine the Arabic numeral corresponding to each Chinese numeral in the sensitive word and the first letter of the pinyin of each character that is not a Chinese numeral; and combine the Arabic numerals and the first letters, in the order of the characters in the sensitive word, into the set of alphanumeric characters.
With continued reference to fig. 8, the apparatus 800 may further include:
a candidate rule providing unit 850 for providing a plurality of alternative candidate rules to the user;
a user selection receiving unit 860 for receiving one replacement candidate rule selected by a user among the plurality of replacement candidate rules; and
a replacement rule determining unit 870, configured to determine a replacement candidate rule selected by the user as the sensitive word replacement rule.
The specific configuration and operation of each unit in the sensitive word replacing device 800 according to an embodiment of the present application have been described in detail in the sensitive word replacing method described above with reference to fig. 1 to 7, and thus, a repetitive description thereof will be omitted.
Exemplary device
Having described the method and apparatus of an exemplary embodiment of the present invention, a sensitive word replacing apparatus according to another exemplary embodiment of the present invention is described next.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."
In some possible implementations, the sensitive word replacing apparatus according to the embodiment of the present invention may include at least one processing unit and at least one storage unit. Wherein the storage unit stores program code which, when executed by the processing unit, causes the processing unit to perform the steps of the sensitive word replacing method according to various exemplary embodiments of the present invention described in the above section "exemplary method" of the present specification. For example, the processing unit may perform the various steps as shown in fig. 2: in step S210, a target text is received; in step S220, searching for a sensitive word in the target text according to a sensitive word bank; in step S230, determining a non-sensitive word corresponding to the sensitive word according to a sensitive word replacement rule, the non-sensitive word having a lower sensitivity than the sensitive word and being used for expressing a meaning identical or similar to the sensitive word; and in step S240, replacing the sensitive word with the non-sensitive word.
Exemplary program product
In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product including program code which, when run on a user equipment, causes the user equipment to perform the steps in the sensitive word replacing method according to various exemplary embodiments of the present invention described in the above section "exemplary method" of this specification. For example, the user equipment may perform the steps shown in fig. 2: in step S210, a target text is received; in step S220, searching for a sensitive word in the target text according to a sensitive word bank; in step S230, determining a non-sensitive word corresponding to the sensitive word according to a sensitive word replacement rule, the non-sensitive word having a lower sensitivity than the sensitive word and being used for expressing a meaning identical or similar to the sensitive word; and in step S240, replacing the sensitive word with the non-sensitive word.
It should be noted that although several units or sub-units of the sensitive word replacing apparatus are mentioned in the above detailed description, this division is only illustrative and not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the units described above may be embodied in one unit. Conversely, the features and functions of one unit described above may be further divided so as to be embodied by a plurality of units.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, and that the division into aspects does not mean that features in those aspects cannot be combined to advantage; this division is merely for convenience of presentation. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (20)

1. A sensitive word replacement method, comprising:
receiving a target text;
searching for sensitive words in the target text according to the sensitive word bank;
determining a non-sensitive word corresponding to the sensitive word according to a sensitive word replacement rule, the non-sensitive word having a lower sensitivity than the sensitive word and being used to express a meaning that is the same as or similar to the sensitive word; and
replacing the sensitive word with the non-sensitive word;
wherein the sensitive words and the non-sensitive words are in one-to-one correspondence, one-to-many correspondence, or many-to-one correspondence; when a plurality of non-sensitive words exist for a sensitive word, automatically selecting one of the non-sensitive words according to the use habit of a user, or randomly selecting one of the non-sensitive words, or providing the non-sensitive words to the user and selecting the non-sensitive words by the user;
the sensitive word replacement rule is a semantic replacement rule, or a spelling replacement rule, or a dialect replacement rule; and
a word object acted on by the sensitive word in the target text is determined according to a semantic analysis result of the target text, and the non-sensitive word is then selected according to a sentence component of the sensitive word and the meaning of the word object;
wherein determining a non-sensitive word corresponding to the sensitive word according to the sensitive word replacement rule comprises:
acquiring a replacement vocabulary library according to the sensitive word replacement rule; and
searching the non-sensitive words in the replacement vocabulary library according to the sensitive words;
when the sensitive word replacement rule is a dialect replacement rule, acquiring a replacement vocabulary library according to the sensitive word replacement rule comprises:
in response to the sensitive word replacement rule being a dialect replacement rule, obtaining an Internet Protocol (IP) address of a user device;
determining the geographical area where the user is located according to the internet protocol address;
acquiring a first dialect vocabulary library corresponding to the geographic area, wherein the first dialect vocabulary library defines the corresponding relation between a sensitive word and a dialect synonym, and the dialect synonym has lower sensitivity than the sensitive word and is a dialect vocabulary used for expressing the same or similar meaning as the sensitive word in the geographic area where the user is located; and
determining the first dialect vocabulary library as the replacement vocabulary library;
wherein searching for the non-sensitive word in the replacement vocabulary library according to the sensitive word comprises:
searching a dialect synonym in the first dialect vocabulary library according to the sensitive word;
obtaining a second dialect vocabulary library corresponding to the geographic area, wherein the second dialect vocabulary library defines the corresponding relation between dialect synonyms and dialect non-sensitive words, and the dialect non-sensitive words have lower sensitivity than the dialect synonyms and are dialect vocabularies which are used for expressing the same or similar meanings as the dialect synonyms in the geographic area where the user is located; and
searching for a dialect non-sensitive word in the second dialect vocabulary library according to the dialect synonym, wherein the dialect non-sensitive word is used as the non-sensitive word to replace the sensitive word.
2. The method of claim 1, wherein when the sensitive word replacement rule is a semantic replacement rule, obtaining a replacement vocabulary library according to the sensitive word replacement rule comprises:
in response to the sensitive word replacement rule being a semantic replacement rule, acquiring a semantic vocabulary library, wherein the semantic vocabulary library defines a correspondence between sensitive words and non-sensitive words, and a sensitive word and its corresponding non-sensitive word serve as the same sentence component in a sentence; and
determining the semantic vocabulary library as the replacement vocabulary library.
3. The method of claim 2, wherein searching for the non-sensitive word in the replacement vocabulary library according to the sensitive word comprises:
performing semantic analysis on the target text;
determining the sentence component of the sensitive word according to the result of the semantic analysis; and
selecting the non-sensitive word in the semantic vocabulary library according to the sentence component of the sensitive word.
4. The method of claim 3, wherein selecting the non-sensitive word in the semantic vocabulary library according to the sentence component of the sensitive word itself comprises:
determining a word object acted on by the sensitive word in the target text according to the result of the semantic analysis; and
selecting the non-sensitive word in the semantic vocabulary library according to the sentence component of the sensitive word and the meaning of the word object.
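The selection step of claims 3 and 4 can be illustrated with a small sketch. It assumes the semantic analysis has already been performed elsewhere and has produced the sentence component of the sensitive word and a coarse meaning for the word object it acts on. The `Analysis` type, the library layout, and the English example words are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Analysis:
    """Assumed output of a semantic analysis step that is not shown here."""
    component: str                 # sentence component, e.g. "predicate"
    object_meaning: Optional[str]  # coarse meaning of the word object, e.g. "person"


# Hypothetical semantic vocabulary library. An entry whose object_meaning is
# None acts as a fallback that matches on the sentence component only.
SEMANTIC_LIBRARY = {
    "kill": [
        {"component": "predicate", "object_meaning": "person", "replacement": "defeat"},
        {"component": "predicate", "object_meaning": "software", "replacement": "stop"},
        {"component": "predicate", "object_meaning": None, "replacement": "end"},
    ],
}


def select_non_sensitive(sensitive_word, analysis):
    """Pick a replacement by sentence component, refined by the object's meaning."""
    fallback = None
    for entry in SEMANTIC_LIBRARY.get(sensitive_word, []):
        if entry["component"] != analysis.component:
            continue  # replacement must play the same sentence component
        if entry["object_meaning"] == analysis.object_meaning:
            return entry["replacement"]
        if entry["object_meaning"] is None:
            fallback = entry["replacement"]
    return fallback
```

The component-only fallback corresponds roughly to claim 3, and the refinement by the meaning of the word object to claim 4.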
5. The method of claim 1, wherein, when the sensitive word replacement rule is a spelling replacement rule, obtaining a replacement vocabulary library according to the sensitive word replacement rule comprises:
in response to the sensitive word replacement rule being a spelling replacement rule, obtaining a spelling vocabulary library, the spelling vocabulary library defining a correspondence between sensitive words and non-sensitive words, wherein each non-sensitive word is a set of alphanumeric characters corresponding to the sensitive word; and
determining the spelling vocabulary library as the replacement vocabulary library.
6. The method of claim 5, wherein searching for the non-sensitive word in the replacement vocabulary library according to the sensitive word comprises:
searching for a particular set of alphanumeric characters in the spelling vocabulary library according to the sensitive word; and
determining the particular set of alphanumeric characters as the non-sensitive word.
7. The method of claim 1, wherein determining the non-sensitive word corresponding to the sensitive word according to a sensitive word replacement rule comprises:
in response to the sensitive word replacement rule being a spelling replacement rule, determining a set of alphanumeric characters corresponding to the sensitive word; and
determining the set of alphanumeric characters as the non-sensitive word.
8. The method of claim 7, wherein determining the set of alphanumeric characters corresponding to the sensitive word comprises:
determining the first letter of the pinyin of each character in the sensitive word; and
combining the first letters of the pinyin of the characters, in the order in which the characters appear in the sensitive word, into an initial-letter set, wherein the initial-letter set is used as the set of alphanumeric characters.
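The initial-letter rule of claim 8 can be sketched as below. The tiny `PINYIN` table is a hypothetical stand-in for a full pinyin dictionary, the example word is illustrative only, and passing through characters that are missing from the table is an assumption of this sketch rather than something the claim specifies.

```python
# Hypothetical miniature pinyin table; a real system would need a full dictionary.
PINYIN = {
    "笨": "ben",
    "蛋": "dan",
    "傻": "sha",
    "瓜": "gua",
}


def initial_letter_set(sensitive_word):
    """Join the first pinyin letter of each character, in character order."""
    letters = []
    for character in sensitive_word:
        pinyin = PINYIN.get(character)
        # Assumption: characters without a known pinyin are kept unchanged.
        letters.append(pinyin[0] if pinyin else character)
    return "".join(letters)


print(initial_letter_set("笨蛋"))  # prints "bd"
```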
9. The method of claim 7, wherein determining the set of alphanumeric characters corresponding to the sensitive word comprises:
determining whether the sensitive word includes Chinese numerals;
in response to the sensitive word including Chinese numerals, determining the Arabic numeral corresponding to each Chinese numeral in the sensitive word, and determining the first letter of the pinyin of each character in the sensitive word that is not a Chinese numeral; and
combining the Arabic numeral corresponding to each Chinese numeral and the first letter of the pinyin of each remaining character, in the order in which the characters appear in the sensitive word, into the set of alphanumeric characters.
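A sketch of the mixed numeral and initial-letter rule of claim 9 follows. Two assumptions are made here and should be read as such: only the single digits 零 to 九 are treated as Chinese numerals (so 百 is handled as an ordinary character), and the `PINYIN` table is a hypothetical miniature dictionary.

```python
# Assumption: only these ten digit characters count as Chinese numerals.
CHINESE_DIGITS = {
    "零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
    "五": "5", "六": "6", "七": "7", "八": "8", "九": "9",
}

# Hypothetical miniature pinyin table for the non-numeral characters.
PINYIN = {"百": "bai", "笨": "ben", "蛋": "dan"}


def contains_chinese_numeral(sensitive_word):
    """The check recited first in the claim."""
    return any(character in CHINESE_DIGITS for character in sensitive_word)


def alphanumeric_set(sensitive_word):
    """Map numerals to Arabic digits and other characters to pinyin initials."""
    parts = []
    for character in sensitive_word:
        if character in CHINESE_DIGITS:
            parts.append(CHINESE_DIGITS[character])
        else:
            pinyin = PINYIN.get(character)
            # Assumption: characters without a known pinyin are kept unchanged.
            parts.append(pinyin[0] if pinyin else character)
    return "".join(parts)


print(alphanumeric_set("二百五"))  # prints "2b5"
```

When the word contains no Chinese numeral, the same loop reduces to the initial-letter rule, so the explicit check is kept as a separate helper only to mirror the wording of the claim.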
10. The method of claim 1, further comprising:
providing a plurality of replacement candidate rules to a user;
receiving a replacement candidate rule selected by a user among the plurality of replacement candidate rules; and
determining the replacement candidate rule selected by the user as the sensitive word replacement rule.
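The rule-selection steps of claim 10 amount to offering a fixed set of candidates and accepting the user's choice. A minimal sketch, in which the rule identifiers and function names are hypothetical and rejecting an unknown choice with an exception is an assumption of the sketch:

```python
# Hypothetical identifiers for the three rule types named in the claims.
CANDIDATE_RULES = ("semantic", "spelling", "dialect")


def provide_candidate_rules():
    """Return the replacement candidate rules to show to the user."""
    return list(CANDIDATE_RULES)


def determine_replacement_rule(selected):
    """Accept the user's selection as the sensitive word replacement rule."""
    if selected not in CANDIDATE_RULES:
        raise ValueError(f"unknown replacement rule: {selected!r}")
    return selected
```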
11. A sensitive word replacement apparatus comprising:
a target text receiving unit for receiving a target text;
a sensitive word searching unit for searching for a sensitive word in the target text according to a sensitive word library;
a non-sensitive word determination unit configured to determine a non-sensitive word corresponding to the sensitive word according to a sensitive word replacement rule, the non-sensitive word having a lower sensitivity than the sensitive word and being used to express a meaning identical or similar to the sensitive word; and
a non-sensitive word replacing unit, configured to replace the sensitive word with the non-sensitive word;
wherein the sensitive words and the non-sensitive words are in one-to-one, one-to-many, or many-to-one correspondence; when a plurality of non-sensitive words exist for one sensitive word, one of the non-sensitive words is selected automatically according to the usage habits of a user, or is selected at random, or the non-sensitive words are provided to the user and one is selected by the user;
the sensitive word replacement rule is a semantic replacement rule, a spelling replacement rule, or a dialect replacement rule; and
a word object acted on by the sensitive word in the target text is determined according to a semantic analysis result of the target text, and the non-sensitive word is then selected according to the sentence component of the sensitive word and the meaning of the word object;
wherein, in order to determine the non-sensitive word corresponding to the sensitive word according to the sensitive word replacement rule, the non-sensitive word determination unit is configured to:
acquire a replacement vocabulary library according to the sensitive word replacement rule; and
search for the non-sensitive word in the replacement vocabulary library according to the sensitive word;
wherein, in order to acquire the replacement vocabulary library according to the sensitive word replacement rule when the sensitive word replacement rule is a dialect replacement rule, the non-sensitive word determination unit is configured to:
in response to the sensitive word replacement rule being a dialect replacement rule, obtain an Internet Protocol (IP) address of a user device;
determine, according to the IP address, the geographic area where the user is located;
acquire a first dialect vocabulary library corresponding to the geographic area, wherein the first dialect vocabulary library defines a correspondence between sensitive words and dialect synonyms, each dialect synonym having lower sensitivity than the corresponding sensitive word and being a dialect word used, in the geographic area where the user is located, to express the same or a similar meaning as the sensitive word; and
determine the first dialect vocabulary library as the replacement vocabulary library;
wherein, in order to search for the non-sensitive word in the replacement vocabulary library according to the sensitive word, the non-sensitive word determination unit is configured to:
search for a dialect synonym in the first dialect vocabulary library according to the sensitive word;
obtain a second dialect vocabulary library corresponding to the geographic area, wherein the second dialect vocabulary library defines a correspondence between dialect synonyms and dialect non-sensitive words, each dialect non-sensitive word having lower sensitivity than the corresponding dialect synonym and being a dialect word used, in the geographic area where the user is located, to express the same or a similar meaning as the dialect synonym; and
search for a dialect non-sensitive word in the second dialect vocabulary library according to the dialect synonym, wherein the dialect non-sensitive word is used as the non-sensitive word to replace the sensitive word.
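The overall flow of the apparatus (receive text, search for sensitive words, determine a non-sensitive word, replace) together with the one-to-many selection can be sketched as below. Plain substring matching, the English example words, and the pluggable `choose` callback are assumptions of this sketch; the callback stands in for any of the three selection options named in the claim (user habit, random choice, or asking the user).

```python
import random


def find_sensitive_words(text, sensitive_library):
    """Sensitive word searching unit: return library words that occur in the text.

    Longer words are listed first so they are replaced before shorter ones.
    """
    ordered = sorted(sensitive_library, key=len, reverse=True)
    return [word for word in ordered if word in text]


def replace_sensitive_words(text, candidates, choose=random.choice):
    """Determine and substitute a non-sensitive word for each sensitive word.

    candidates maps each sensitive word to a list of non-sensitive words
    (one-to-many); choose picks one option from that list.
    """
    for word in find_sensitive_words(text, candidates.keys()):
        options = candidates[word]
        if options:
            text = text.replace(word, choose(options))
    return text


# Hypothetical library; always taking the first option imitates a fixed user habit.
CANDIDATES = {"stupid": ["silly", "unwise"]}
print(replace_sensitive_words("a stupid idea", CANDIDATES, choose=lambda options: options[0]))
# prints "a silly idea"
```

With the default `random.choice`, the replacement for a one-to-many entry varies between runs, which matches the random-selection option.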
12. The apparatus of claim 11, wherein, in order to acquire the replacement vocabulary library according to the sensitive word replacement rule when the sensitive word replacement rule is a semantic replacement rule, the non-sensitive word determination unit is configured to:
in response to the sensitive word replacement rule being a semantic replacement rule, acquire a semantic vocabulary library, wherein the semantic vocabulary library defines a correspondence between sensitive words and non-sensitive words, and a sensitive word and its corresponding non-sensitive word serve as the same sentence component in a sentence; and
determine the semantic vocabulary library as the replacement vocabulary library.
13. The apparatus of claim 12, wherein, in order to search for the non-sensitive word in the replacement vocabulary library according to the sensitive word, the non-sensitive word determination unit is configured to:
perform semantic analysis on the target text;
determine the sentence component of the sensitive word according to the result of the semantic analysis; and
select the non-sensitive word in the semantic vocabulary library according to the sentence component of the sensitive word.
14. The apparatus of claim 13, wherein, in order to select the non-sensitive word in the semantic vocabulary library according to the sentence component of the sensitive word itself, the non-sensitive word determination unit is configured to:
determine a word object acted on by the sensitive word in the target text according to the result of the semantic analysis; and
select the non-sensitive word in the semantic vocabulary library according to the sentence component of the sensitive word and the meaning of the word object.
15. The apparatus of claim 11, wherein, in order to acquire the replacement vocabulary library according to the sensitive word replacement rule when the sensitive word replacement rule is a spelling replacement rule, the non-sensitive word determination unit is configured to:
in response to the sensitive word replacement rule being a spelling replacement rule, obtain a spelling vocabulary library, the spelling vocabulary library defining a correspondence between sensitive words and non-sensitive words, wherein each non-sensitive word is a set of alphanumeric characters corresponding to the sensitive word; and
determine the spelling vocabulary library as the replacement vocabulary library.
16. The apparatus of claim 15, wherein, in order to search for the non-sensitive word in the replacement vocabulary library according to the sensitive word, the non-sensitive word determination unit is configured to:
search for a particular set of alphanumeric characters in the spelling vocabulary library according to the sensitive word; and
determine the particular set of alphanumeric characters as the non-sensitive word.
17. The apparatus of claim 11, wherein, in order to determine the non-sensitive word corresponding to the sensitive word according to the sensitive word replacement rule, the non-sensitive word determination unit is configured to:
in response to the sensitive word replacement rule being a spelling replacement rule, determine a set of alphanumeric characters corresponding to the sensitive word; and
determine the set of alphanumeric characters as the non-sensitive word.
18. The apparatus of claim 17, wherein, in order to determine the set of alphanumeric characters corresponding to the sensitive word, the non-sensitive word determination unit is configured to:
determine the first letter of the pinyin of each character in the sensitive word; and
combine the first letters of the pinyin of the characters, in the order in which the characters appear in the sensitive word, into an initial-letter set, wherein the initial-letter set is used as the set of alphanumeric characters.
19. The apparatus of claim 17, wherein, in order to determine the set of alphanumeric characters corresponding to the sensitive word, the non-sensitive word determination unit is configured to:
determine whether the sensitive word includes Chinese numerals;
in response to the sensitive word including Chinese numerals, determine the Arabic numeral corresponding to each Chinese numeral in the sensitive word, and determine the first letter of the pinyin of each character in the sensitive word that is not a Chinese numeral; and
combine the Arabic numeral corresponding to each Chinese numeral and the first letter of the pinyin of each remaining character, in the order in which the characters appear in the sensitive word, into the set of alphanumeric characters.
20. The apparatus of claim 11, further comprising:
a candidate rule providing unit for providing a plurality of replacement candidate rules to a user;
a user selection receiving unit for receiving a replacement candidate rule selected by the user among the plurality of replacement candidate rules; and
a replacement rule determining unit for determining the replacement candidate rule selected by the user as the sensitive word replacement rule.
CN201510446574.6A 2015-07-27 2015-07-27 Sensitive word replacing method and device Active CN105183761B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510446574.6A CN105183761B (en) 2015-07-27 2015-07-27 Sensitive word replacing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510446574.6A CN105183761B (en) 2015-07-27 2015-07-27 Sensitive word replacing method and device

Publications (2)

Publication Number Publication Date
CN105183761A CN105183761A (en) 2015-12-23
CN105183761B true CN105183761B (en) 2020-04-07

Family

ID=54905845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510446574.6A Active CN105183761B (en) 2015-07-27 2015-07-27 Sensitive word replacing method and device

Country Status (1)

Country Link
CN (1) CN105183761B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105574203A (en) * 2016-01-07 2016-05-11 沈文策 Information storage method and device
CN105808527A (en) * 2016-02-24 2016-07-27 北京百度网讯科技有限公司 Oriented translation method and device based on artificial intelligence
CN106372062A (en) * 2016-09-18 2017-02-01 长沙军鸽软件有限公司 Method and device for recognizing non-civilized terms in communication message
CN106453366A (en) * 2016-10-27 2017-02-22 北京锐安科技有限公司 Information transmission method and system, sending terminal and receiving terminal
CN107547513B (en) * 2017-07-14 2021-02-05 新华三信息安全技术有限公司 Message processing method, device, network equipment and storage medium
CN108228704B (en) * 2017-11-03 2021-07-13 创新先进技术有限公司 Method, device and equipment for identifying risk content
CN109962958B (en) * 2017-12-26 2022-05-03 阿里巴巴(中国)有限公司 Document processing method and device
CN108564950A (en) * 2018-02-28 2018-09-21 上海与德科技有限公司 Method, intelligent terminal and the computer storage media of speech-to-text
CN109213468B (en) * 2018-08-23 2020-04-28 阿里巴巴集团控股有限公司 Voice playing method and device
CN110472234A (en) * 2019-07-19 2019-11-19 平安科技(深圳)有限公司 Sensitive text recognition method, device, medium and computer equipment
CN110874398B (en) * 2020-01-14 2020-06-02 广东博智林机器人有限公司 Forbidden word processing method and device, electronic equipment and storage medium
CN111918173B (en) * 2020-07-22 2021-10-29 浙江大丰实业股份有限公司 Protection system of stage sound equipment and use method
CN112559776A (en) * 2020-12-21 2021-03-26 绿瘦健康产业集团有限公司 Sensitive information positioning method and system
CN112599212A (en) * 2021-02-26 2021-04-02 北京妙医佳健康科技集团有限公司 Data processing method
CN113033217B (en) * 2021-04-19 2023-09-15 广州欢网科技有限责任公司 Automatic shielding translation method and device for subtitle sensitive information
CN114706942B (en) * 2022-03-16 2023-11-24 马上消费金融股份有限公司 Text conversion model training method, text conversion device and electronic equipment
CN115963954A (en) * 2023-03-14 2023-04-14 北京中科智媒融媒体技术有限公司 Information publishing method, device, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101136867A (en) * 2006-08-30 2008-03-05 腾讯科技(深圳)有限公司 Method and device for transmitting prompt message to client terminal of chat room
CN101470700A (en) * 2007-12-28 2009-07-01 日电(中国)有限公司 Text template generator, text generation equipment, text checking equipment and method thereof
CN101901325A (en) * 2010-07-21 2010-12-01 赵步 Copyright protection method
CN102339361A (en) * 2011-11-03 2012-02-01 厦门市智业软件工程有限公司 Method for monitoring sensitive words in segment quoting of electronic medical record
CN104317781A (en) * 2014-11-14 2015-01-28 移康智能科技(上海)有限公司 Sensitive word editor

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7546334B2 (en) * 2000-11-13 2009-06-09 Digital Doors, Inc. Data security system and method with adaptive filter
US20100082332A1 (en) * 2008-09-26 2010-04-01 Rite-Solutions, Inc. Methods and apparatus for protecting users from objectionable text

Also Published As

Publication number Publication date
CN105183761A (en) 2015-12-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant