CN106156017A - Information identifying method and information identification system - Google Patents

Publication number
CN106156017A
Authority
CN
China
Prior art keywords
variation
word
feature words
mode
key word
Legal status
Pending
Application number
CN201510128025.4A
Other languages
Chinese (zh)
Inventor
刘克松
杨建武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Original Assignee
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides an information identification method and an information identification system. The method includes: obtaining feature words of the current data through a feature word acquisition module; determining, through a keyword association module, the keyword in a keyword database that is associated with a feature word; determining a plurality of variant words of the keyword through a variant word determination module; and matching, through a matching module, the feature word against each of the plurality of variant words, so as to determine from the matching result whether to identify the feature word as the keyword. With this technical solution, sensitive information that has been disguised through variation can be detected accurately, which makes detection of sensitive information effective and comprehensive and avoids missed detections.

Description

Information identifying method and information identification system
Technical field
The present invention relates to the technical field of information identification, and in particular to an information identification method and an information identification system.
Background art
At present, with the rapid development of the Internet, users can obtain information promptly and conveniently. However, because information travels across the Internet extremely fast, sensitive information such as content with violent or pornographic tendencies and uncivil content also spreads rapidly. This degrades the overall environment of the Internet and, in the form of "content threats", gradually comes to threaten public security.
To address this problem, solutions in the related art check and filter sensitive information such as advertisements, abusive speech, and objectionable vocabulary against a keyword database in order to manage information on the Internet. However, these solutions cannot detect and filter out sensitive information that has been altered through variation, which leads to missed detections.
Therefore, how to detect varied sensitive information comprehensively and accurately, and thereby avoid missed detections, has become a problem that urgently needs to be solved.
Summary of the invention
In view of the above problem, the present invention proposes a new technical solution that can accurately detect sensitive information that has undergone variation, so that sensitive information is detected effectively and comprehensively and missed detections are avoided.
Accordingly, one aspect of the present invention provides an information identification method, including: obtaining feature words of the current data through a feature word acquisition module; determining, through a keyword association module, the keyword in a keyword database that is associated with a feature word; determining a plurality of variant words of the keyword through a variant word determination module; and matching, through a matching module, the feature word against each of the plurality of variant words, so as to determine from the matching result whether to identify the feature word as the keyword.
In this technical solution, while the feature words of the data currently to be processed are obtained, the keyword is subjected to variation processing to produce a plurality of variant words. Each feature word in the current data is then matched against each of these variant words to decide whether the feature word should be identified as the keyword. In this way, a feature word can still be detected accurately even after it has been varied, so feature words that contain sensitive information are detected effectively and comprehensively rather than missed. This in turn helps control the spread of sensitive information on the Internet and provides important support for keeping the online environment clean.
In the above technical solution, preferably, obtaining the feature words of the current data through the feature word acquisition module specifically includes: preprocessing the current data through the feature word acquisition module to obtain the feature words of the current data, wherein the preprocessing includes at least one of the following, or a combination thereof: merging of adjacent segmented tokens, background noise filtering, English translation, and Traditional-to-Simplified Chinese conversion.
In this technical solution, the current data carrying sensitive information contains various kinds of noise that can distort feature word identification. To make identification more accurate, the current data is therefore preprocessed so that its feature words are obtained comprehensively and accurately, which makes it possible to determine reliably whether a feature word contains sensitive information. The preprocessing includes, but is not limited to, at least one of the following or a combination thereof: merging of adjacent segmented tokens, background noise filtering, English translation, and Traditional-to-Simplified Chinese conversion.
As an example of adjacent-token merging, suppose the current data contains the phrase "发飘" (fā piāo), which stands in for "发票" (fā piào, "invoice"). When a related-art solution segments the current data to obtain feature words, it splits "发飘" into the two single characters "发" and "飘", so the sensitive feature word "invoice" cannot be recognized. By merging adjacent tokens during preprocessing, the present invention can obtain "invoice" as a feature word, so the set of feature words obtained is not incomplete. Background noise filtering removes semantically empty interference characters such as #, *, and % from the current data. English translation converts English in the current data into Chinese. Traditional-to-Simplified conversion restores Traditional Chinese characters in the current data to Simplified Chinese characters. Removing these kinds of noise ensures that segmenting the current data yields feature words accurately and comprehensively and avoids missing feature words that contain sensitive information, which helps control the spread of sensitive information on the Internet and provides important support for keeping the online environment clean.
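Two of the preprocessing modes described above, background noise filtering and Traditional-to-Simplified conversion, can be sketched as follows. This is only a minimal illustration: the function names and the two-entry character table are assumptions made for the example, and a real system would need a complete conversion table.

```python
# Hypothetical helpers; the character tables are tiny illustrative samples.
NOISE_CHARS = set("#*%")  # interference characters named in the text

# Assumed Traditional -> Simplified sample table (not a complete mapping).
TRAD_TO_SIMP = {"發": "发", "門": "门"}


def filter_noise(text: str) -> str:
    """Drop semantically empty interference characters such as '#', '*', '%'."""
    return "".join(ch for ch in text if ch not in NOISE_CHARS)


def to_simplified(text: str) -> str:
    """Map each Traditional character to its Simplified form when known."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


def preprocess(text: str) -> str:
    """Apply noise filtering first, then Traditional-to-Simplified conversion."""
    return to_simplified(filter_noise(text))


print(preprocess("發#*票%"))  # 发票
```

Filtering before conversion keeps the conversion table free of punctuation entries; the order is a design choice of this sketch, not something the text prescribes.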
In the above technical solution, preferably, determining the plurality of variant words of the keyword through the variant word determination module specifically includes: performing, through the variant word determination module, Chinese character pronunciation variation processing and/or Chinese character glyph variation processing on the keyword to obtain the plurality of variant words of the keyword, wherein the pronunciation variation processing includes homophone and near-homophone substitution, liaison (slurred-syllable) contraction substitution, and letter abbreviation substitution, and the glyph variation processing includes near-shape character substitution and glyph disassembly.
In this technical solution, the keyword is varied by pronunciation and/or by glyph, so that the variant words of the keyword can be used to determine whether a feature word in the current data contains sensitive information. Even a feature word that has been varied can thus be checked accurately and comprehensively. Pronunciation variation alters how a Chinese character sounds and includes, but is not limited to, homophone and near-homophone substitution, liaison contraction substitution, and letter abbreviation substitution. For example, replacing "票" with "飘" so that "发票" becomes "发飘" is near-homophone substitution; replacing "知道" with "造" in "你造吗?" is liaison contraction substitution; and writing "FP" in place of "发票" ("invoice") is letter abbreviation substitution. Glyph variation alters how a Chinese character is written and includes, but is not limited to, near-shape character substitution and glyph disassembly. For example, writing "入民" in place of "人民" ("the people") is near-shape character substitution, and writing "发西示" in place of "发票" is glyph disassembly.
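The three pronunciation-based variation modes can be sketched as below, using the examples from the text: "发飘" for "发票" ("invoice"), "造" for "知道", and "FP" for "发票". The lookup tables and function names are hypothetical; the text does not say how such tables are built, and a real system would presumably derive them from a pinyin lexicon.

```python
from itertools import product
from typing import Dict, List, Set

# Assumed sample tables, for illustration only.
NEAR_HOMOPHONES: Dict[str, List[str]] = {"票": ["飘", "漂"]}
CONTRACTIONS: Dict[str, str] = {"知道": "造"}
PINYIN_INITIAL: Dict[str, str] = {"发": "F", "票": "P"}


def homophone_variants(keyword: str) -> Set[str]:
    """All strings obtained by swapping characters for near-homophones."""
    choices = [[ch] + NEAR_HOMOPHONES.get(ch, []) for ch in keyword]
    variants = {"".join(combo) for combo in product(*choices)}
    variants.discard(keyword)  # the unchanged keyword is not a variant
    return variants


def contraction_variants(keyword: str) -> Set[str]:
    """Replace a slurred multi-character sequence with its contracted form."""
    return {
        keyword.replace(src, dst)
        for src, dst in CONTRACTIONS.items()
        if src in keyword
    }


def abbreviation_variant(keyword: str) -> str:
    """Concatenate pinyin initials; unknown characters are kept unchanged."""
    return "".join(PINYIN_INITIAL.get(ch, ch) for ch in keyword)


print(homophone_variants("发票"))
print(contraction_variants("你知道吗"))
print(abbreviation_variant("发票"))  # FP
```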
In the above technical solution, preferably, matching the feature word against each of the plurality of variant words through the matching module specifically includes: calculating, in the matching module, a matching score between the feature word and the keyword using a matching formula, wherein the matching formula is:
$$ s = \sum_{i} \delta\bigl(f_i(w),\, f_i(t)\bigr), \quad i \in [1, n] $$
where $s$ denotes the sum of the matching scores between the feature word and each variant word, $n$ denotes the number of variant words, $w$ denotes the feature word, $t$ denotes the $i$-th variant word among the plurality of variant words, $f_i(w)$ denotes the variation mapping of the feature word, $f_i(t)$ denotes the variation mapping of the $i$-th variant word, and $\delta$ denotes the matching score between the feature word and the $i$-th variant word.
In this technical solution, the matching formula $s = \sum_i \delta(f_i(w), f_i(t))$, $i \in [1, n]$, is used to calculate the matching score between the feature word and the keyword, and this score determines whether the feature word is identified as the keyword. The symbols $s$, $n$, $w$, $t$, $f_i(w)$, $f_i(t)$, and $\delta$ have the meanings given above. For example, $\delta$ takes the value 1 when the feature word matches the $i$-th variant word and 0 otherwise. Adding up the results $\delta$ of matching the feature word against all variant words of the keyword gives the matching score $s$ between the feature word and the keyword. If $s$ is nonzero, the feature word matches the keyword, that is, the feature word contains sensitive information, and the current data in which the feature word appears is filtered out, which keeps the online environment clean comprehensively and accurately.
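One possible reading of the matching formula is sketched below: each $f_i$ is treated as a normalizing map applied to both the feature word $w$ and the word $t$ it is compared with, and $\delta$ is an equality indicator. The two sample mappings (`by_sound`, `by_shape`) and their tables are assumptions made for illustration; the text does not define the concrete form of the variation mappings.

```python
from typing import Callable, Sequence

Mapping = Callable[[str], str]


def delta(a: str, b: str) -> int:
    """Return 1 when the two mapped forms agree, otherwise 0."""
    return 1 if a == b else 0


def match_score(w: str, t: str, mappings: Sequence[Mapping]) -> int:
    """Compute s as the sum over i of delta(f_i(w), f_i(t))."""
    return sum(delta(f(w), f(t)) for f in mappings)


# Hypothetical tables for the two sample mappings.
TONELESS_PINYIN = {"发": "fa", "票": "piao", "飘": "piao"}
UNDO_NEAR_SHAPE = {"入": "人"}


def by_sound(s: str) -> str:
    """Map a word to a toneless pinyin string (unknown characters kept)."""
    return " ".join(TONELESS_PINYIN.get(ch, ch) for ch in s)


def by_shape(s: str) -> str:
    """Map look-alike characters back to a canonical character."""
    return "".join(UNDO_NEAR_SHAPE.get(ch, ch) for ch in s)


mappings = [by_sound, by_shape]
print(match_score("发飘", "发票", mappings))  # 1
print(match_score("电话", "发票", mappings))  # 0
```

Here "发飘" and "发票" agree under the sound mapping but not under the shape mapping, so the score is 1, which is nonzero and therefore counts as a match under the rule stated in the text.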
In the above technical solution, preferably, after the matching score between the feature word and the keyword is calculated using the matching formula, the method further includes: determining whether the matching score falls within a preset matching score range, wherein the feature word is identified as the keyword when the matching score is determined to fall within the preset matching score range.
In this technical solution, the feature word is identified as the keyword when the matching score between the feature word and the keyword falls within the preset matching score range. Feature words that contain sensitive information can thus be detected more accurately and are not missed, which helps control the spread of sensitive information on the Internet and provides important support for keeping the online environment clean.
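The range check can be sketched in a few lines. The default bounds below are assumptions: the text only says the range is preset, so a lower bound of 1 is chosen here to mirror the "nonzero score" rule described earlier.

```python
def in_preset_range(score: float, low: float = 1, high: float = float("inf")) -> bool:
    """Identify the feature word as the keyword when low <= score <= high."""
    return low <= score <= high


print(in_preset_range(1))           # True
print(in_preset_range(0))           # False
print(in_preset_range(4, 2, 3))     # False
```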
Another aspect of the present invention provides an information identification system, including: a feature word acquisition module, configured to obtain feature words of the current data; a keyword association module, configured to determine, in a keyword database, the keyword associated with a feature word; a variant word determination module, configured to determine a plurality of variant words of the keyword; and a matching module, configured to match the feature word against each of the plurality of variant words, so as to determine from the matching result whether to identify the feature word as the keyword.
In this technical solution, while the feature words of the data currently to be processed are obtained, the keyword is subjected to variation processing to produce a plurality of variant words. Each feature word in the current data is then matched against each of these variant words to decide whether the feature word should be identified as the keyword. In this way, a feature word can still be detected accurately even after it has been varied, so feature words that contain sensitive information are detected effectively and comprehensively rather than missed. This in turn helps control the spread of sensitive information on the Internet and provides important support for keeping the online environment clean.
In the above technical solution, preferably, the feature word acquisition module is specifically configured to: preprocess the current data to obtain the feature words of the current data, wherein the preprocessing includes at least one of the following, or a combination thereof: merging of adjacent segmented tokens, background noise filtering, English translation, and Traditional-to-Simplified Chinese conversion.
In this technical solution, the current data carrying sensitive information contains various kinds of noise that can distort feature word identification. To make identification more accurate, the current data is therefore preprocessed so that its feature words are obtained comprehensively and accurately, which makes it possible to determine reliably whether a feature word contains sensitive information. The preprocessing includes, but is not limited to, at least one of the following or a combination thereof: merging of adjacent segmented tokens, background noise filtering, English translation, and Traditional-to-Simplified Chinese conversion.
As an example of adjacent-token merging, suppose the current data contains the phrase "发飘" (fā piāo), which stands in for "发票" (fā piào, "invoice"). When a related-art solution segments the current data to obtain feature words, it splits "发飘" into the two single characters "发" and "飘", so the sensitive feature word "invoice" cannot be recognized. By merging adjacent tokens during preprocessing, the present invention can obtain "invoice" as a feature word, so the set of feature words obtained is not incomplete. Background noise filtering removes semantically empty interference characters such as #, *, and % from the current data. English translation converts English in the current data into Chinese. Traditional-to-Simplified conversion restores Traditional Chinese characters in the current data to Simplified Chinese characters. Removing these kinds of noise ensures that segmenting the current data yields feature words accurately and comprehensively and avoids missing feature words that contain sensitive information, which helps control the spread of sensitive information on the Internet and provides important support for keeping the online environment clean.
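Adjacent-token merging can be illustrated with a short sketch. It assumes that a word segmenter has already produced a token list and that runs of consecutive single-character tokens should be rejoined; both the rule and the function name are assumptions for this example, since the text does not specify the merging criterion. The sample phrase "发飘" is the stand-in for "发票" ("invoice") discussed above.

```python
from typing import List


def merge_adjacent_singles(tokens: List[str]) -> List[str]:
    """Join each run of consecutive one-character tokens into one token."""
    merged: List[str] = []
    run: List[str] = []
    for tok in tokens:
        if len(tok) == 1:
            run.append(tok)  # extend the current run of single characters
        else:
            if run:
                merged.append("".join(run))
                run = []
            merged.append(tok)
    if run:  # flush a run that reaches the end of the input
        merged.append("".join(run))
    return merged


# A segmenter has split the disguised word into "发" and "飘".
tokens = ["代开", "发", "飘", "电话"]
print(merge_adjacent_singles(tokens))  # ['代开', '发飘', '电话']
```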
In the above technical solution, preferably, the variant word determination module is specifically configured to: perform Chinese character pronunciation variation processing and/or Chinese character glyph variation processing on the keyword to obtain the plurality of variant words of the keyword, wherein the pronunciation variation processing includes homophone and near-homophone substitution, liaison (slurred-syllable) contraction substitution, and letter abbreviation substitution, and the glyph variation processing includes near-shape character substitution and glyph disassembly.
In this technical solution, the keyword is varied by pronunciation and/or by glyph, so that the variant words of the keyword can be used to determine whether a feature word in the current data contains sensitive information. Even a feature word that has been varied can thus be checked accurately and comprehensively. Pronunciation variation alters how a Chinese character sounds and includes, but is not limited to, homophone and near-homophone substitution, liaison contraction substitution, and letter abbreviation substitution. For example, replacing "票" with "飘" so that "发票" becomes "发飘" is near-homophone substitution; replacing "知道" with "造" in "你造吗?" is liaison contraction substitution; and writing "FP" in place of "发票" ("invoice") is letter abbreviation substitution. Glyph variation alters how a Chinese character is written and includes, but is not limited to, near-shape character substitution and glyph disassembly. For example, writing "入民" in place of "人民" ("the people") is near-shape character substitution, and writing "发西示" in place of "发票" is glyph disassembly.
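The two glyph-based variation modes can be sketched as follows, using the examples from the text: "入民" for "人民" ("the people") and "发西示" for "发票" ("invoice"). The one-entry tables and the function names are hypothetical; the text does not describe how look-alike characters or component decompositions are sourced.

```python
from typing import Dict, List, Set

# Assumed sample tables, for illustration only.
NEAR_SHAPE: Dict[str, List[str]] = {"人": ["入"]}
COMPONENTS: Dict[str, str] = {"票": "西示"}


def near_shape_variants(keyword: str) -> Set[str]:
    """Swap one character at a time for a visually similar character."""
    out: Set[str] = set()
    for i, ch in enumerate(keyword):
        for alt in NEAR_SHAPE.get(ch, []):
            out.add(keyword[:i] + alt + keyword[i + 1:])
    return out


def disassembly_variants(keyword: str) -> Set[str]:
    """Replace one character at a time with its component characters."""
    out: Set[str] = set()
    for i, ch in enumerate(keyword):
        if ch in COMPONENTS:
            out.add(keyword[:i] + COMPONENTS[ch] + keyword[i + 1:])
    return out


print(near_shape_variants("人民"))   # {'入民'}
print(disassembly_variants("发票"))  # {'发西示'}
```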
In the above technical solution, preferably, the matching module is specifically configured to: calculate a matching score between the feature word and the keyword using a matching formula, wherein the matching formula is:
$$ s = \sum_{i} \delta\bigl(f_i(w),\, f_i(t)\bigr), \quad i \in [1, n] $$
where $s$ denotes the sum of the matching scores between the feature word and each variant word, $n$ denotes the number of variant words, $w$ denotes the feature word, $t$ denotes the $i$-th variant word among the plurality of variant words, $f_i(w)$ denotes the variation mapping of the feature word, $f_i(t)$ denotes the variation mapping of the $i$-th variant word, and $\delta$ denotes the matching score between the feature word and the $i$-th variant word.
In this technical solution, the matching formula $s = \sum_i \delta(f_i(w), f_i(t))$, $i \in [1, n]$, is used to calculate the matching score between the feature word and the keyword, and this score determines whether the feature word is identified as the keyword. The symbols $s$, $n$, $w$, $t$, $f_i(w)$, $f_i(t)$, and $\delta$ have the meanings given above. For example, $\delta$ takes the value 1 when the feature word matches the $i$-th variant word and 0 otherwise. Adding up the results $\delta$ of matching the feature word against all variant words of the keyword gives the matching score $s$ between the feature word and the keyword. If $s$ is nonzero, the feature word matches the keyword, that is, the feature word contains sensitive information, and the current data in which the feature word appears is filtered out, which keeps the online environment clean comprehensively and accurately.
In the above technical solution, preferably, the matching module includes: an identification module, configured to determine, after the matching score between the feature word and the keyword is calculated using the matching formula, whether the matching score falls within a preset matching score range, wherein the feature word is identified as the keyword when the matching score is determined to fall within the preset matching score range.
In this technical solution, the feature word is identified as the keyword when the matching score between the feature word and the keyword falls within the preset matching score range. Feature words that contain sensitive information can thus be detected more accurately and are not missed, which helps control the spread of sensitive information on the Internet and provides important support for keeping the online environment clean.
With the technical solution of the present invention, feature words that have been varied and carry sensitive information can be detected accurately, so that feature words carrying sensitive information are detected effectively and comprehensively and are not missed.
Brief description of the drawings
Fig. 1 shows a schematic flowchart of an information identification method according to an embodiment of the present invention;
Fig. 2 shows a schematic structural diagram of an information identification system according to an embodiment of the present invention;
Fig. 3 shows a schematic diagram of the principle of an information identification system according to an embodiment of the present invention;
Fig. 4 shows a schematic diagram of the principle of an information identification system according to another embodiment of the present invention.
Detailed description of the invention
So that the above objects, features, and advantages of the present invention can be understood more clearly, the invention is described in further detail below with reference to the drawings and specific embodiments. It should be noted that, provided they do not conflict, the embodiments of this application and the features in those embodiments may be combined with one another.
Many specific details are set forth in the following description to provide a thorough understanding of the present invention. However, the invention may also be practiced in ways other than those described here, so its scope of protection is not limited by the specific embodiments disclosed below.
Fig. 1 shows a schematic flowchart of an information identification method according to an embodiment of the present invention.
As shown in Fig. 1, the information identification method according to an embodiment of the present invention includes:
Step 102: obtain feature words of the current data through the feature word acquisition module.
Step 104: determine, through the keyword association module, the keyword in the keyword database that is associated with a feature word.
Step 106: determine a plurality of variant words of the keyword through the variant word determination module.
Step 108: match, through the matching module, the feature word against each of the plurality of variant words, so as to determine from the matching result whether to identify the feature word as the keyword.
In this technical solution, while the feature words of the data currently to be processed are obtained, the keyword is subjected to variation processing to produce a plurality of variant words. Each feature word in the current data is then matched against each of these variant words to decide whether the feature word should be identified as the keyword. In this way, a feature word can still be detected accurately even after it has been varied, so feature words that contain sensitive information are detected effectively and comprehensively rather than missed. This in turn helps control the spread of sensitive information on the Internet and provides important support for keeping the online environment clean.
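Steps 102 to 108 can be strung together as in the sketch below. Every concrete choice here is an assumption made for illustration: whitespace splitting stands in for real segmentation, the variant sets are precomputed by hand rather than generated, and "associated" is taken to mean that the keyword shares a character with the feature word or lists it as a variant. The text does not specify any of these rules.

```python
from typing import Dict, List, Optional, Set

# Hypothetical keyword database with hand-written variants (step 106 stand-in).
KEYWORD_VARIANTS: Dict[str, Set[str]] = {
    "发票": {"发飘", "FP", "发西示"},
    "人民": {"入民"},
}


def get_feature_words(data: str) -> List[str]:
    """Step 102 stand-in: split on whitespace instead of real segmentation."""
    return data.split()


def associated_keywords(feature_word: str) -> List[str]:
    """Step 104 stand-in: keywords that share a character with the feature
    word, or that list the feature word among their variants."""
    return [
        kw
        for kw, variants in KEYWORD_VARIANTS.items()
        if set(kw) & set(feature_word) or feature_word in variants
    ]


def identify(feature_word: str) -> Optional[str]:
    """Steps 106-108: return the keyword that the feature word equals or
    whose variant set contains it, or None when nothing matches."""
    for kw in associated_keywords(feature_word):
        if feature_word == kw or feature_word in KEYWORD_VARIANTS[kw]:
            return kw
    return None


for word in get_feature_words("代开 发飘 入民 电话"):
    print(word, "->", identify(word))
```

In this run "发飘" resolves to "发票" and "入民" to "人民", while "代开" and "电话" resolve to `None`.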
In the above technical solution, preferably, step 102 specifically includes: preprocessing the current data through the feature word acquisition module to obtain the feature words of the current data, wherein the preprocessing includes at least one of the following, or a combination thereof: merging of adjacent segmented tokens, background noise filtering, English translation, and Traditional-to-Simplified Chinese conversion.
In this technical solution, the current data carrying sensitive information contains various kinds of noise that can distort feature word identification. To make identification more accurate, the current data is therefore preprocessed so that its feature words are obtained comprehensively and accurately, which makes it possible to determine reliably whether a feature word contains sensitive information. The preprocessing includes, but is not limited to, at least one of the following or a combination thereof: merging of adjacent segmented tokens, background noise filtering, English translation, and Traditional-to-Simplified Chinese conversion.
As an example of adjacent-token merging, suppose the current data contains the phrase "发飘" (fā piāo), which stands in for "发票" (fā piào, "invoice"). When a related-art solution segments the current data to obtain feature words, it splits "发飘" into the two single characters "发" and "飘", so the sensitive feature word "invoice" cannot be recognized. By merging adjacent tokens during preprocessing, the present invention can obtain "invoice" as a feature word, so the set of feature words obtained is not incomplete. Background noise filtering removes semantically empty interference characters such as #, *, and % from the current data. English translation converts English in the current data into Chinese. Traditional-to-Simplified conversion restores Traditional Chinese characters in the current data to Simplified Chinese characters. Removing these kinds of noise ensures that segmenting the current data yields feature words accurately and comprehensively and avoids missing feature words that contain sensitive information, which helps control the spread of sensitive information on the Internet and provides important support for keeping the online environment clean.
In the above technical solution, preferably, step 106 specifically includes: performing, through the variant word determination module, Chinese character pronunciation variation processing and/or Chinese character glyph variation processing on the keyword to obtain the plurality of variant words of the keyword, wherein the pronunciation variation processing includes homophone and near-homophone substitution, liaison (slurred-syllable) contraction substitution, and letter abbreviation substitution, and the glyph variation processing includes near-shape character substitution and glyph disassembly.
In this technical solution, the keyword is varied by pronunciation and/or by glyph, so that the variant words of the keyword can be used to determine whether a feature word in the current data contains sensitive information. Even a feature word that has been varied can thus be checked accurately and comprehensively. Pronunciation variation alters how a Chinese character sounds and includes, but is not limited to, homophone and near-homophone substitution, liaison contraction substitution, and letter abbreviation substitution. For example, replacing "票" with "飘" so that "发票" becomes "发飘" is near-homophone substitution; replacing "知道" with "造" in "你造吗?" is liaison contraction substitution; and writing "FP" in place of "发票" ("invoice") is letter abbreviation substitution. Glyph variation alters how a Chinese character is written and includes, but is not limited to, near-shape character substitution and glyph disassembly. For example, writing "入民" in place of "人民" ("the people") is near-shape character substitution, and writing "发西示" in place of "发票" is glyph disassembly.
In the above technical solution, matching the feature word against each of the multiple variant words by the matching module preferably includes: calculating, in the matching module, a matching score between the feature word and the keyword using a matching formula, where the matching formula is:
$$s = \sum_{i} \delta\left(f_i(w),\, f_i(t)\right), \quad i \in [1, n]$$
where $s$ is the sum of the matching scores of the feature word against each variant word, $n$ is the number of variant words, $w$ is the feature word, $t$ is the $i$-th variant word, $f_i(w)$ is the variation mapping of the feature word, $f_i(t)$ is the variation mapping of the $i$-th variant word, and $\delta$ is the matching score of the feature word against the $i$-th variant word.
In this technical solution, the matching formula $s = \sum_{i} \delta(f_i(w), f_i(t)),\ i \in [1, n]$ is used to calculate the matching score between the feature word and the keyword, and whether to identify the feature word as the keyword is determined from that score. Here $s$ is the sum of the matching scores of the feature word against each variant word, $n$ is the number of variant words, $w$ is the feature word, $t$ is the $i$-th variant word, $f_i(w)$ is the variation mapping of the feature word, $f_i(t)$ is the variation mapping of the $i$-th variant word, and $\delta$ is the matching score of the feature word against the $i$-th variant word. For example, $\delta$ is 1 when the feature word matches the $i$-th variant word and 0 otherwise. Adding the matching results $\delta$ for all variant words of the keyword gives the matching score $s$ between the feature word and the keyword. If $s$ is nonzero, the feature word matches the keyword, that is, the feature word carries sensitive information; the current data containing the feature word is then filtered, which cleans up the Internet space comprehensively and accurately.
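A literal reading of the formula can be sketched in a few lines. The names `delta`, `match_score`, `to_pinyin`, and `identity`, the tiny `PINYIN` table, and the pairing of one mapping with each variant word are assumptions made for illustration; the source does not specify how the mappings are implemented.

```python
# Hypothetical pinyin table covering only the characters used below.
PINYIN = {"发": "fa", "票": "piao", "漂": "piao", "飘": "piao"}


def to_pinyin(word):
    """Sound mapping: replace each known character with its pinyin."""
    return tuple(PINYIN.get(ch, ch) for ch in word)


def identity(word):
    """Trivial mapping that compares the words as written."""
    return word


def delta(a, b):
    """1 when the two mapped forms match, 0 otherwise."""
    return 1 if a == b else 0


def match_score(w, variant_words, mappings):
    """s = sum over i of delta(f_i(w), f_i(t_i)), i in [1, n]."""
    return sum(delta(f(w), f(t)) for f, t in zip(mappings, variant_words))


variant_words = ["发漂", "发西示", "FP"]      # variant words of the keyword 发票
mappings = [to_pinyin, identity, identity]  # f_1 .. f_n

print(match_score("发飘", variant_words, mappings))  # nonzero: treated as a match
print(match_score("天气", variant_words, mappings))  # zero: no match
```

The first call scores 1 because 发飘 and 发漂 share the same pinyin under the sound mapping; the second scores 0 because no mapped form agrees.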
In the above technical solution, after the matching score between the feature word and the keyword has been calculated using the matching formula, the method preferably further includes: determining whether the matching score falls within a preset matching score range, where the feature word is identified as the keyword when the matching score is determined to fall within the preset matching score range.
In this technical solution, whether to identify the feature word as the keyword is decided by whether the matching score between the feature word and the keyword falls within the preset matching score range. Feature words that carry sensitive information can thus be detected more accurately and are not missed, which effectively controls the spread of sensitive information on the Internet and provides important support for cleaning up the Internet space.
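The range check itself is a simple comparison. In the sketch below the function name `in_preset_range` and the default bounds are hypothetical; the source does not state what the preset range is, so a lower bound of 1 (any nonzero score) is assumed.

```python
def in_preset_range(score, lower=1, upper=float("inf")):
    """Return True when the matching score lies in [lower, upper]."""
    return lower <= score <= upper


# With the assumed defaults, any nonzero score identifies the keyword.
print(in_preset_range(0))
print(in_preset_range(2))
```

A stricter deployment could raise `lower` so that a feature word must match under several mappings before it is identified as the keyword.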
Fig. 2 shows a schematic structural diagram of an information identification system according to an embodiment of the present invention.
As shown in Fig. 2, an information identification system 200 according to an embodiment of the present invention includes: a feature word acquisition module 202, configured to obtain a feature word of current data; a keyword association module 204, configured to determine, in a keyword database, a keyword associated with the feature word; a variant word determination module 206, configured to determine multiple variant words of the keyword; and a matching module 208, configured to match the feature word against each of the multiple variant words, so that whether to identify the feature word as the keyword can be determined from the matching result.
In this technical solution, while the feature words of the data to be processed are being obtained, the keyword undergoes variation processing to produce multiple variant words. Each feature word in the current data is then matched against each variant word to determine whether the feature word should be identified as the keyword. Even a feature word that has been varied can therefore be detected accurately, so feature words that carry sensitive information are detected effectively and comprehensively rather than missed. This in turn helps control the spread of sensitive information on the Internet and provides important support for cleaning up the Internet space.
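How the four modules cooperate can be sketched as a small driver function. Everything here is illustrative: the function `identify` and its callable parameters are hypothetical stand-ins for modules 202, 204, 206, and 208, and the romanized toy data replaces real segmentation and dictionaries.

```python
def identify(current_data, get_feature_words, associate_keywords,
             make_variants, matches):
    """Return (feature_word, keyword) pairs found in current_data."""
    hits = []
    for word in get_feature_words(current_data):          # module 202
        for keyword in associate_keywords(word):          # module 204
            variant_words = make_variants(keyword)        # module 206
            if any(matches(word, v) for v in variant_words):  # module 208
                hits.append((word, keyword))
                break  # one matching keyword per feature word is enough
    return hits


# Toy stand-ins for the four modules (hypothetical, for illustration only).
hits = identify(
    "fa-piao for sale",
    get_feature_words=lambda text: text.split(),
    associate_keywords=lambda word: ["fapiao"],
    make_variants=lambda keyword: [keyword, "fa-piao", "FP"],
    matches=lambda word, variant: word == variant,
)
print(hits)
```

Passing the modules in as callables keeps the driver independent of any particular segmentation or variation technique.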
In the above technical solution, the feature word acquisition module 202 is preferably configured to preprocess the current data to obtain the feature words of the current data, where the preprocessing includes at least one of the following or a combination thereof: neighbor merging of segmented words, background noise filtering, English translation, and Traditional Chinese reduction.
In this technical solution, current data that carries sensitive information usually also contains various kinds of noise, which degrades feature word identification. To improve identification accuracy, the current data is preprocessed so that its feature words can be obtained comprehensively and accurately, and so that whether a feature word carries sensitive information can be determined comprehensively and efficiently, without missing such feature words. The preprocessing includes, without limitation, at least one of the following or a combination thereof: neighbor merging of segmented words, background noise filtering, English translation, and Traditional Chinese reduction.
Neighbor merging addresses cases such as the following. Suppose the current data contains the phrase 发飘 (literally "unsteady", used here to mean 发票, "invoice"). When the related art segments the current data to obtain feature words, it cuts 发飘 into the two single characters 发 and 飘, so the sensitive feature word "invoice" cannot be identified. The present invention merges adjacent segmentation results during preprocessing, which recovers the phrase as a single feature word and keeps the set of obtained feature words complete. Background noise filtering removes meaningless interference characters such as #, *, and % from the current data. English translation translates English in the current data into Chinese, and Traditional Chinese reduction converts Traditional Chinese characters in the current data into Simplified Chinese characters. Removing these kinds of noise keeps segmentation of the current data accurate and comprehensive, avoids missing feature words that carry sensitive information, helps control the spread of sensitive information on the Internet, and thereby provides important support for cleaning up the Internet space.
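Neighbor merging can be sketched as a pass over the segmenter's output that glues runs of single characters back together. The function name `merge_neighbors` and the token list are hypothetical, and the sketch assumes punctuation has already been removed so that every one-character token is a candidate for merging.

```python
def merge_neighbors(tokens):
    """Join each run of adjacent single-character tokens into one string."""
    merged, run = [], []
    for token in tokens:
        if len(token) == 1:
            run.append(token)          # extend the current single-character run
        else:
            if run:
                merged.append("".join(run))
                run = []
            merged.append(token)       # multi-character words pass through
    if run:
        merged.append("".join(run))    # flush a run at the end of the input
    return merged


# Assumed segmenter output for a text containing the disguised word 发飘.
print(merge_neighbors(["公司", "代", "开", "正轨", "发", "飘"]))
```

The merged strings, together with the ordinary words, then serve as the feature words that are matched against keyword variants.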
In the above technical solution, the variant word determination module 206 is preferably configured to perform pronunciation variation processing and/or glyph variation processing on the keyword to obtain multiple variant words of the keyword. The pronunciation variation processing includes same-or-near-sound substitution, liaison substitution, and letter-abbreviation substitution; the glyph variation processing includes near-shape character substitution and character decomposition.
In this technical solution, the keyword is varied by pronunciation variation and/or glyph variation, so that the variant words of the keyword can be used to determine whether a feature word in the current data carries sensitive information. Even a feature word that has itself been varied can thus be detected accurately and comprehensively. Pronunciation variation alters how a word is sounded out and includes, without limitation, same-or-near-sound substitution, liaison substitution, and letter-abbreviation substitution. For example, writing 发漂, with 漂 ("drift", piao) in place of the 票 ("ticket", piao) of 发票 ("invoice"), is same-or-near-sound substitution; writing 造 (zao, "make") for 知道 (zhidao, "know") in 你造吗? ("do you know?") is liaison substitution; and writing "FP" for 发票 (fapiao) is letter-abbreviation substitution. Glyph variation alters how a character is written and includes, without limitation, near-shape character substitution and character decomposition. For example, writing 入民 for 人民 ("the people") is near-shape character substitution, and writing 发西示 for 发票 is character decomposition, since 票 splits into 西 and 示.
In the above technical solution, the matching module 208 is preferably configured to calculate a matching score between the feature word and the keyword using a matching formula, where the matching formula is:
$$s = \sum_{i} \delta\left(f_i(w),\, f_i(t)\right), \quad i \in [1, n]$$
where $s$ is the sum of the matching scores of the feature word against each variant word, $n$ is the number of variant words, $w$ is the feature word, $t$ is the $i$-th variant word, $f_i(w)$ is the variation mapping of the feature word, $f_i(t)$ is the variation mapping of the $i$-th variant word, and $\delta$ is the matching score of the feature word against the $i$-th variant word.
In this technical solution, the matching formula $s = \sum_{i} \delta(f_i(w), f_i(t)),\ i \in [1, n]$ is used to calculate the matching score between the feature word and the keyword, and whether to identify the feature word as the keyword is determined from that score. Here $s$ is the sum of the matching scores of the feature word against each variant word, $n$ is the number of variant words, $w$ is the feature word, $t$ is the $i$-th variant word, $f_i(w)$ is the variation mapping of the feature word, $f_i(t)$ is the variation mapping of the $i$-th variant word, and $\delta$ is the matching score of the feature word against the $i$-th variant word. For example, $\delta$ is 1 when the feature word matches the $i$-th variant word and 0 otherwise. Adding the matching results $\delta$ for all variant words of the keyword gives the matching score $s$ between the feature word and the keyword. If $s$ is nonzero, the feature word matches the keyword, that is, the feature word carries sensitive information; the current data containing the feature word is then filtered, which cleans up the Internet space comprehensively and accurately.
In the above technical solution, the matching module 208 preferably includes an identification module 2082, configured to determine, after the matching score between the feature word and the keyword has been calculated using the matching formula, whether the matching score falls within a preset matching score range, and to identify the feature word as the keyword when the matching score is determined to fall within the preset matching score range.
In this technical solution, whether to identify the feature word as the keyword is decided by whether the matching score between the feature word and the keyword falls within the preset matching score range. Feature words that carry sensitive information can thus be detected more accurately and are not missed, which effectively controls the spread of sensitive information on the Internet and provides important support for cleaning up the Internet space.
Fig. 3 shows a schematic diagram of the principle of an information identification system according to an embodiment of the present invention.
As shown in Fig. 3, an information identification system 300 according to an embodiment of the present invention (equivalent to the information identification system 200 of the embodiment shown in Fig. 2) includes a text preprocessing module 302, a multi-dimensional variation mapping module 304 for sensitive information, and a matching module 306. The three modules are described in detail below:
1. Text preprocessing module 302. Because the current data contains various kinds of noise, and to make the feature words obtained from it more accurate and comprehensive, the text preprocessing module 302 preprocesses the current data before the feature words are obtained. The preprocessing includes at least one of the following or a combination thereof: neighbor merging of segmented words, background noise filtering, English translation, and Traditional Chinese reduction.
The text preprocessing module 302 also segments the current data. Segmentation techniques in the related art generally cut text according to a learned model of correct words, and whatever the machine learning model, the training set is usually standard, well-formed text. Variant feature words, however, are abnormal uses of language. For example, related-art segmentation cuts 发飘 (used to mean 发票, "invoice") into the two single characters 发 and 飘, so the filtering system cannot recognize the implied word "invoice". The technical solution of the present invention can therefore merge adjacent words in the segmentation result.
Characters used normally do not affect segmentation accuracy. Current data that carries sensitive information, however, often contains deliberately inserted meaningless interference characters such as @, #, &, *, and %, added to evade conventional filtering systems. Mixed into the current data, these symbols seriously degrade segmentation accuracy, so they are removed before the current data is segmented. For example, the advertising keyword 发票 ("invoice") may appear as "***发***票***" once background noise symbols are introduced; the asterisks are the background noise to be filtered out. Some feature words are also replaced with English or with Traditional Chinese. For example, the English word "government" stands for the Chinese word 政府; English translation can be applied during preprocessing to replace the English with Chinese. Likewise, where Traditional Chinese has been substituted for Simplified Chinese, Traditional Chinese reduction is applied during preprocessing to convert it back to Simplified Chinese.
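The three cleaning steps described here can be sketched with the standard `re` module. The noise character set follows the symbols listed above; the tables `TRANSLATIONS` and `TRADITIONAL` and the function names are hypothetical, and each table holds a single assumed entry where a real system would use a translation service and a full conversion table.

```python
import re

# Interference characters named in the text.
NOISE = re.compile(r"[@#&*%]+")

# Hypothetical one-entry tables, for illustration only.
TRANSLATIONS = {"government": "政府"}  # English -> Chinese
TRADITIONAL = {"發": "发"}             # Traditional -> Simplified


def strip_noise(text):
    """Background noise filtering: delete the interference characters."""
    return NOISE.sub("", text)


def preprocess(text):
    """Apply noise filtering, English translation, and Traditional reduction."""
    text = strip_noise(text)
    for english, chinese in TRANSLATIONS.items():
        text = re.sub(re.escape(english), chinese, text, flags=re.IGNORECASE)
    return text.translate(str.maketrans(TRADITIONAL))


print(preprocess("***發***票***"))
print(preprocess("Government#公告"))
```

Running the filter before segmentation matters because the inserted symbols would otherwise split the keyword into isolated characters.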
With the above technical solution, the text preprocessing module 302 therefore not only extracts feature words but also removes the noise introduced into the current data by insertion and replacement, making the obtained feature words more comprehensive and accurate.
2. Multi-dimensional variation mapping module 304. This module applies multi-dimensional variation mappings to the feature word and the keyword. Specifically, a keyword related to the feature word is obtained from the keyword database, and the keyword is varied along multiple dimensions by same-or-near-sound substitution, liaison substitution, letter-abbreviation substitution, near-shape character substitution, and character decomposition to obtain its variant words. Whether a variant word matches the feature word can then be determined, and with it whether the feature word is the keyword to which that variant word corresponds.
Same-or-near-sound substitution, liaison substitution, and letter-abbreviation substitution are variations of Chinese pronunciation. For example, in 发漂, 漂 replaces 票, and the two are homophones; in 大学僧, 僧 (seng) replaces the 生 (sheng) of 大学生 ("university student"), and the two are near-homophones; in 你造吗?, 造 replaces 知道, the former being a liaison of the latter; and "FP" replaces 发票, using the initial letters of the phrase's pinyin. To detect variations of this kind, the keyword is varied by same-or-near-sound substitution, liaison substitution, and letter-abbreviation substitution, with near-sound and liaison substitution performed on its pinyin. Near sounds, also called approximate or fuzzy sounds, arise mainly from substituting similar initials and finals in the pinyin of Chinese characters, for example (z, zh), (c, ch), (s, sh), (l, n), (f, h), (r, l), (an, ang), (en, eng), (in, ing), (ian, iang), and (uan, uang), where each parenthesized pair is a pair of similar pinyin elements. The pinyin of 知道, [zhi, dao], is run together by liaison into [zhao], which is in turn approximated as [zao] and converted to the character 造. The pinyin of 发票, [fa, piao], is abbreviated by its initial letters to "FP".
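The pinyin operations described here can be sketched as plain string functions. The fuzzy pairs come from the list above, with the first pair read as (z, zh) to agree with the [zhao] to [zao] example; the function names, the choice of which member of each pair is canonical, and the liaison rule (initial of the first syllable plus final of the second) are assumptions made for illustration.

```python
# Pinyin initials, two-letter ones first so that "zh" is not read as "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

# Each fuzzy pair is collapsed onto one canonical member (assumed choice).
FUZZY_INITIALS = {"zh": "z", "ch": "c", "sh": "s", "n": "l", "h": "f", "r": "l"}
FUZZY_FINALS = {"ang": "an", "eng": "en", "ing": "in",
                "iang": "ian", "uang": "uan"}


def split_syllable(syllable):
    """Split a toneless pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # zero-initial syllable such as "an"


def fuzzy(syllable):
    """Map a syllable to its canonical fuzzy-sound form."""
    initial, final = split_syllable(syllable.lower())
    return FUZZY_INITIALS.get(initial, initial) + FUZZY_FINALS.get(final, final)


def liaison(first, second):
    """Run two syllables together: initial of the first, final of the second."""
    return split_syllable(first)[0] + split_syllable(second)[1]


def abbreviate(syllables):
    """Letter abbreviation: the upper-cased first letter of each syllable."""
    return "".join(s[0] for s in syllables).upper()


print(liaison("zhi", "dao"), fuzzy(liaison("zhi", "dao")))
print(fuzzy("sheng"), fuzzy("seng"))
print(abbreviate(["fa", "piao"]))
```

Two words can then be compared by applying `fuzzy` to each of their syllables, so that near-homophones collapse to the same key.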
Near-shape character substitution and character decomposition are variations of Chinese glyphs. For example, 入民 replaces 人民 ("the people"), since the glyphs 入 and 人 are similar, and 发西示 replaces 发票, the former being a decomposition of the latter. Longer passages can also be written entirely in decomposed characters. To handle such cases, identifying the radicals of Chinese characters and combining them with neighboring characters is the key. Note that some substitutions are both near-sound and near-shape: for example, 帮派 ("faction") may be replaced with 邦派, where 帮 and 邦 are similar in both sound and shape.
Pronunciation variation can be represented directly by fuzzy sounds in the pinyin encoding. For glyph variation, by contrast, there is no fuzzy-shape code that reflects the similarity of glyph encodings. Instead, the structure of the glyph is used, such as top-(middle-)bottom or left-(middle-)right structure: a Chinese character is decomposed, its glyph is encoded from its internal component units, and the similarity between those components is used to compare the composition of characters and to score their similarity.
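A much-simplified version of the glyph mapping can be sketched with two lookup tables. The tables `COMPOSE` and `LOOKALIKE` and the function names are hypothetical and contain only the characters from the examples; the sketch uses exact table lookups and does not attempt the component-similarity scoring that the text describes.

```python
# Hypothetical tables with one assumed entry each.
COMPOSE = {"西示": "票"}   # adjacent components -> the character they form
LOOKALIKE = {"入": "人"}   # near-shape character -> canonical character


def recompose(text):
    """Merge adjacent component pairs back into whole characters."""
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in COMPOSE:
            out.append(COMPOSE[pair])
            i += 2                      # both components are consumed
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def canonical_shape(text):
    """Glyph mapping: recompose split characters, then fold look-alikes."""
    return "".join(LOOKALIKE.get(ch, ch) for ch in recompose(text))


print(canonical_shape("发西示"))
print(canonical_shape("入民"))
```

Because both a keyword and a feature word can be passed through `canonical_shape`, the two can be compared in the same normalized form.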
3. Matching module 306. This module matches the feature word against the keyword under the multi-dimensional variation mappings. Specifically, a matching formula is used to calculate the matching score between the feature word and the keyword, where the matching formula is:
$$s = \sum_{i} \delta\left(f_i(w),\, f_i(t)\right), \quad i \in [1, n]$$
where $s$ is the sum of the matching scores of the feature word against each variant word, $n$ is the number of variant words, $w$ is the feature word, $t$ is the $i$-th variant word, $f_i(w)$ is the variation mapping of the feature word, $f_i(t)$ is the variation mapping of the $i$-th variant word, and $\delta$ is the matching score of the feature word against the $i$-th variant word.
After the matching score between the feature word and the keyword has been calculated using the matching formula, it is determined whether the matching score falls within a preset matching score range; when it does, the feature word is identified as the keyword. For example, $\delta$ is 1 when the feature word matches the $i$-th variant word and 0 otherwise. Adding the matching results $\delta$ for all variant words of the keyword gives the matching score $s$ between the feature word and the keyword. If $s$ is nonzero, the feature word matches the keyword, that is, the feature word carries sensitive information, and the current data containing the feature word is filtered.
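As a worked instance of the formula, assume a keyword with $n = 3$ variant words and a feature word that matches only the first of them under its mapping. The values are illustrative and do not come from the source.

```latex
\begin{aligned}
s &= \delta\big(f_1(w), f_1(t)\big) + \delta\big(f_2(w), f_2(t)\big) + \delta\big(f_3(w), f_3(t)\big) \\
  &= 1 + 0 + 0 \\
  &= 1 \neq 0
\end{aligned}
```

Because $s$ is nonzero, the feature word in this example would be identified as the keyword, provided the preset matching score range admits a score of 1.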
Fig. 4 shows a schematic diagram of the principle of an information identification system according to another embodiment of the present invention.
As shown in Fig. 4, an information identification system according to another embodiment of the present invention first preprocesses the acquired current data, including removing the various interference characters in it, and then obtains feature words from the preprocessed current data using a common segmentation method based on string matching or on statistics. Such segmentation obtains well-formed feature words accurately but cannot recognize varied feature words; here, feature words include both common words and variant words. The result of segmenting the current data contains common words and single characters. The adjacent single characters lying between two common words form a single-character string; the set of single-character strings forms the single-character string set, and the set of common words forms the common word set. Because a varied feature word is cut into several adjacent single characters by Chinese word segmentation, any word that contains a variation must lie within a single-character string. The words in the common word set and the single-character string set are therefore taken as the feature words. Keywords related to the feature words are then obtained from the keyword database, and each feature word is matched against the keywords under the multi-dimensional variation mappings. If a feature word matches a keyword under these mappings, the feature word carries sensitive information, and the current data can be judged to contain sensitive information.
For example, suppose the current data is "公司代开正轨发飘,huo到付kuan" (a disguised form of "company issues proper invoices on others' behalf, cash on delivery"). Segmentation yields "公司/n 代/v 开/v 正轨/n 发/v 飘/v ,/w huo/x 到/v 付/v kuan/x". Merging adjacent single characters after segmentation yields 代开, 发飘, and huo到付kuan. When the feature words are matched against keywords under the multi-dimensional variation mappings, both the keywords in the keyword database and the feature words in the current data can be mapped by same-or-near-sound substitution, liaison substitution, letter-abbreviation substitution, near-shape character substitution, character decomposition, and similar variations. In this example, the word 发飘 in the current data matches the keyword 发票 ("invoice") after the homophone variation mapping, and "huo到付kuan" matches the keyword 货到付款 ("cash on delivery") after the homophone variation mapping. In addition, for variation expansion, an index can be built over the keyword database according to the sound and shape variation mappings. During actual detection, all variation mappings are traversed until either a match is found, that is, a suitable variation mapping has been selected, or no mapping matches, which completes the identification of varied sensitive information.
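Indexing the keyword database by variation mapping, and then traversing the mappings at detection time, can be sketched as follows. The `PINYIN` table, the two mappings, and the names `build_index` and `lookup` are hypothetical, and the Chinese keywords are an assumed reading of the translated example; only a literal mapping and a sound mapping are shown.

```python
# Hypothetical pinyin table covering only the characters used below.
PINYIN = {"发": "fa", "票": "piao", "飘": "piao",
          "货": "huo", "到": "dao", "付": "fu", "款": "kuan"}


def literal_key(word):
    """Identity mapping: the word as written."""
    return word


def sound_key(word):
    """Sound mapping: known characters become pinyin, Latin letters pass through."""
    return "".join(PINYIN.get(ch, ch) for ch in word)


# The variation mappings, in the order in which they are tried.
MAPPINGS = [("literal", literal_key), ("sound", sound_key)]


def build_index(keywords):
    """Index every keyword under every variation mapping."""
    index = {}
    for keyword in keywords:
        for name, mapping in MAPPINGS:
            index[(name, mapping(keyword))] = keyword
    return index


def lookup(word, index):
    """Traverse the mappings until one matches; return None if none does."""
    for name, mapping in MAPPINGS:
        keyword = index.get((name, mapping(word)))
        if keyword is not None:
            return keyword, name
    return None


index = build_index(["发票", "货到付款"])
print(lookup("发飘", index))
print(lookup("huo到付kuan", index))
print(lookup("公司", index))
```

Precomputing the mapped forms of the keywords means each feature word costs one dictionary lookup per mapping, rather than a comparison against every variant of every keyword.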
The technical solution of the present invention has been described in detail above with reference to the accompanying drawings. With this technical solution, feature words that have been varied and that carry sensitive information can be detected accurately, so that such feature words are detected effectively and comprehensively and missed detections are avoided.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Those skilled in the art may make various modifications and changes to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. An information identification method, characterized by comprising:
obtaining a feature word of current data by a feature word acquisition module;
determining, by a keyword association module, a keyword associated with the feature word in a keyword database;
determining multiple variant words of the keyword by a variant word determination module; and
matching, by a matching module, the feature word against each of the multiple variant words, so as to determine, according to the matching result, whether to identify the feature word as the keyword.
2. The information identification method according to claim 1, characterized in that obtaining the feature word of the current data by the feature word acquisition module specifically comprises:
preprocessing the current data by the feature word acquisition module to obtain the feature word of the current data, wherein the preprocessing includes at least one of the following or a combination thereof:
neighbor merging of segmented words, background noise filtering, English translation, and Traditional Chinese reduction.
3. The information identification method according to claim 2, characterized in that determining the multiple variant words of the keyword by the variant word determination module specifically comprises:
performing, by the variant word determination module, pronunciation variation processing and/or glyph variation processing on the keyword to obtain the multiple variant words of the keyword, wherein
the pronunciation variation processing includes same-or-near-sound substitution, liaison substitution, and letter-abbreviation substitution, and
the glyph variation processing includes near-shape character substitution and character decomposition.
4. The information identification method according to any one of claims 1 to 3, characterized in that matching, by the matching module, the feature word against each of the multiple variant words specifically comprises:
calculating, in the matching module, a matching score between the feature word and the keyword using a matching formula, wherein the matching formula is:
$$s = \sum_{i} \delta\left(f_i(w),\, f_i(t)\right), \quad i \in [1, n]$$
wherein $s$ is the sum of the matching scores of the feature word against each variant word, $n$ is the number of the multiple variant words, $w$ is the feature word, $t$ is the $i$-th variant word among the multiple variant words, $f_i(w)$ is the variation mapping of the feature word, $f_i(t)$ is the variation mapping of the $i$-th variant word, and $\delta$ is the matching score of the feature word against the $i$-th variant word.
5. The information identification method according to claim 4, characterized in that, after the matching score between the feature word and the keyword is calculated using the matching formula, the method further comprises:
determining whether the matching score falls within a preset matching score range, wherein the feature word is identified as the keyword when the matching score is determined to fall within the preset matching score range.
6. An information identification system, characterized by comprising:
a feature word acquisition module, configured to obtain a feature word of current data;
a keyword association module, configured to determine, in a keyword database, a keyword associated with the feature word;
a variant word determination module, configured to determine multiple variant words of the keyword; and
a matching module, configured to match the feature word against each of the multiple variant words, so as to determine, according to the matching result, whether to identify the feature word as the keyword.
7. The information identification system according to claim 6, characterized in that the feature word acquisition module is specifically configured to:
preprocess the current data to obtain the feature word of the current data, wherein the preprocessing includes at least one of the following or a combination thereof: neighbor merging of segmented words, background noise filtering, English translation, and Traditional Chinese reduction.
8. The information identification system according to claim 7, characterized in that the variant word determination module is specifically configured to:
perform pronunciation variation processing and/or glyph variation processing on the keyword to obtain the multiple variant words of the keyword, wherein the pronunciation variation processing includes same-or-near-sound substitution, liaison substitution, and letter-abbreviation substitution, and the glyph variation processing includes near-shape character substitution and character decomposition.
9. The information identification system according to any one of claims 6 to 8, characterized in that the matching module is specifically configured to:
calculate a matching score between the feature word and the keyword using a matching formula, wherein the matching formula is:
$$s = \sum_{i} \delta\left(f_i(w),\, f_i(t)\right), \quad i \in [1, n]$$
wherein $s$ is the sum of the matching scores of the feature word against each variant word, $n$ is the number of the multiple variant words, $w$ is the feature word, $t$ is the $i$-th variant word among the multiple variant words, $f_i(w)$ is the variation mapping of the feature word, $f_i(t)$ is the variation mapping of the $i$-th variant word, and $\delta$ is the matching score of the feature word against the $i$-th variant word.
10. The information identification system according to claim 9, characterized in that the matching module includes:
an identification module configured to determine, after the matching score between the feature word and the keyword has been calculated using the matching formula, whether the matching score falls within a preset matching score range, wherein the feature word is identified as the keyword when the matching score is determined to fall within the preset matching score range.
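The range check performed by the identification module can be sketched as follows. The function name and the choice of an inclusive range are assumptions; the claim only states that the score must fall within a preset matching score range.

```python
from typing import Tuple


def is_keyword(score: float, score_range: Tuple[float, float]) -> bool:
    """Return True when the score lies inside the preset range (inclusive)."""
    low, high = score_range
    if low > high:
        raise ValueError("score_range must be given as (low, high)")
    return low <= score <= high
```

For instance, with a preset range of `(1, 3)`, a matching score of 2 identifies the feature word as the keyword, while a score of 0 does not.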
CN201510128025.4A 2015-03-23 2015-03-23 Information identifying method and information identification system Pending CN106156017A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510128025.4A CN106156017A (en) 2015-03-23 2015-03-23 Information identifying method and information identification system

Publications (1)

Publication Number Publication Date
CN106156017A true CN106156017A (en) 2016-11-23

Family

ID=58063302

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510128025.4A Pending CN106156017A (en) 2015-03-23 2015-03-23 Information identifying method and information identification system

Country Status (1)

Country Link
CN (1) CN106156017A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101082909A (en) * 2007-06-28 2007-12-05 腾讯科技(深圳)有限公司 Method and system for dividing Chinese sentences for recognizing deriving word
CN101729520A (en) * 2008-10-28 2010-06-09 北京大学 Method and device for detecting sensitive information
US20110029301A1 (en) * 2009-07-31 2011-02-03 Samsung Electronics Co., Ltd. Method and apparatus for recognizing speech according to dynamic display
CN101719122A (en) * 2009-12-04 2010-06-02 中国人民解放军信息工程大学 Method for extracting Chinese named entity from text data
CN101876968A (en) * 2010-05-06 2010-11-03 复旦大学 Method for carrying out harmful content recognition on network text and short message service
CN101976231A (en) * 2010-08-25 2011-02-16 孙强国 Network supervision method for multi-language short messages

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘蔚琴: ""网络敏感信息监控系统研究"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106844508A (en) * 2016-12-27 2017-06-13 北京五八信息技术有限公司 deformation word recognition method and device
WO2018166099A1 (en) * 2017-03-17 2018-09-20 平安科技(深圳)有限公司 Information leakage detection method and device, server, and computer-readable storage medium
CN107341256A (en) * 2017-07-12 2017-11-10 深圳市乐唯科技开发有限公司 It is a kind of that the solution method that sensitive subjects filter in scene is exchanged based on information
CN108228704B (en) * 2017-11-03 2021-07-13 创新先进技术有限公司 Method, device and equipment for identifying risk content
CN108228704A (en) * 2017-11-03 2018-06-29 阿里巴巴集团控股有限公司 Identify method and device, the equipment of Risk Content
CN107943954A (en) * 2017-11-24 2018-04-20 杭州安恒信息技术有限公司 Detection method, device and the electronic equipment of webpage sensitive information
CN107943954B (en) * 2017-11-24 2020-07-10 杭州安恒信息技术股份有限公司 Method and device for detecting webpage sensitive information and electronic equipment
CN108182246A (en) * 2017-12-28 2018-06-19 东软集团股份有限公司 Sensitive word detection filter method, device and computer equipment
CN108182246B (en) * 2017-12-28 2020-10-30 东软集团股份有限公司 Sensitive word detection and filtering method and device and computer equipment
CN108804413A (en) * 2018-04-28 2018-11-13 百度在线网络技术(北京)有限公司 The recognition methods of text cheating and device
CN111092803A (en) * 2018-10-23 2020-05-01 阿里巴巴集团控股有限公司 Message processing method, device, system and storage medium
CN109597987A (en) * 2018-10-25 2019-04-09 阿里巴巴集团控股有限公司 A kind of text restoring method, device and electronic equipment
CN109408824A (en) * 2018-11-05 2019-03-01 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN109408824B (en) * 2018-11-05 2023-04-25 百度在线网络技术(北京)有限公司 Method and device for generating information
CN111612284A (en) * 2019-02-25 2020-09-01 阿里巴巴集团控股有限公司 Data processing method, device and equipment
CN111612284B (en) * 2019-02-25 2023-06-20 阿里巴巴集团控股有限公司 Data processing method, device and equipment
CN111078827A (en) * 2019-12-23 2020-04-28 上海米哈游天命科技有限公司 Keyword judgment method, device, equipment and medium
CN113468856A (en) * 2020-03-31 2021-10-01 阿里巴巴集团控股有限公司 Variant text generation method, variant text translation model training method, variant text classification device and variant text translation model training device
CN112364153A (en) * 2020-11-10 2021-02-12 中数通信息有限公司 Keyword identification method and device based on interference characteristics
CN113657867A (en) * 2021-08-27 2021-11-16 广东智源机器人科技有限公司 Automatic reply control method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN106156017A (en) Information identifying method and information identification system
CN106777275B (en) Entity attribute and property value extracting method based on more granularity semantic chunks
CN107305768B (en) Error-prone character calibration method in voice interaction
CN103493041B (en) Use the automatic sentence evaluation device of shallow parsing device automatic evaluation sentence and error-detecting facility thereof and method
US6487532B1 (en) Apparatus and method for distinguishing similar-sounding utterances speech recognition
CN100358006C (en) Sound identifying method for geographic information and its application in navigation system
JP2005084681A (en) Method and system for semantic language modeling and reliability measurement
CN110188347A (en) Relation extraction method is recognized between a kind of knowledget opic of text-oriented
CN106294396A (en) Keyword expansion method and keyword expansion system
KR20140021838A (en) Method for detecting grammar error and apparatus thereof
CN104008123B (en) The method and system matched for Chinese Name
Darwish et al. Using Stem-Templates to Improve Arabic POS and Gender/Number Tagging.
US11386269B2 (en) Fault-tolerant information extraction
CN104485106B (en) Audio recognition method, speech recognition system and speech recognition apparatus
CN105183716B (en) A kind of intelligent interactive method based on abstract semantics
Gandhe et al. Using web text to improve keyword spotting in speech
CN106294315B (en) The natural language predicate verb recognition methods merged based on syntactic property with statistics
Jiang et al. Improvements on a trainable letter-to-sound converter
JP5097802B2 (en) Japanese automatic recommendation system and method using romaji conversion
CN109460554A (en) A kind of method and device of filtering shielding word
CN103049434B (en) A kind of alternative word identification system and identification method
Tachbelie et al. Morpheme-based automatic speech recognition for a morphologically rich language-Amharic.
Wang et al. Combining statistical and knowledge-based spoken language understanding in conditional models
KS et al. Automatic error detection and correction in malayalam
Tachbelie et al. Morpheme-based and factored language modeling for Amharic speech recognition

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20161123