CN106156017A - Information identifying method and information identification system - Google Patents
- Publication number
- CN106156017A (application number CN201510128025.4)
- Authority
- CN
- China
- Prior art keywords
- variation
- word
- feature words
- mode
- key word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides an information identification method and an information identification system. The method includes: obtaining a feature word of current data by a feature word acquisition module; determining, by a keyword association module, a keyword associated with the feature word in a keyword database; determining a plurality of variant words of the keyword by a variant word determination module; and matching, by a matching module, the feature word against each of the plurality of variant words, so as to determine from the matching result whether to identify the feature word as the keyword. With this technical solution, sensitive information that has been disguised by variation can be detected accurately, which enables effective and comprehensive detection of sensitive information and avoids missed detections.
Description
Technical field
The present invention relates to the technical field of information identification, and in particular to an information identification method and an information identification system.
Background art
With the rapid development of the Internet, users can obtain information promptly and conveniently. However, because information travels across the Internet extremely fast, sensitive information, such as violent, pornographic, or uncivil content, also spreads rapidly. This degrades the overall environment of the Internet and, in the form of "content threats", increasingly endangers public security.
To address this problem, solutions in the related art use a keyword database to check and filter sensitive information such as advertisements, harmful speech, and inharmonious vocabulary, and thereby manage information on the Internet. However, these solutions cannot check and filter out sensitive information that has been disguised by variation, so such information goes undetected.
Therefore, how to detect varied sensitive information comprehensively and accurately, and avoid missed detections, has become an urgent problem.
Summary of the invention
In view of the above problem, the present invention provides a new technical solution that can accurately detect sensitive information that has been varied, so that sensitive information is detected effectively and comprehensively and missed detections are avoided.
In view of this, one aspect of the present invention provides an information identification method, including: obtaining a feature word of current data by a feature word acquisition module; determining, by a keyword association module, a keyword associated with the feature word in a keyword database; determining a plurality of variant words of the keyword by a variant word determination module; and matching, by a matching module, the feature word against each of the plurality of variant words, so as to determine from the matching result whether to identify the feature word as the keyword.
In this technical solution, while the feature words of the data currently to be processed are obtained, the keyword is subjected to variation processing to obtain its variant words. Each feature word in the current data is then matched against each of these variant words to determine whether the feature word should be identified as the keyword. In this way, a feature word can still be detected accurately even after it has been varied, so feature words containing sensitive information are detected effectively and comprehensively rather than missed. The spread of sensitive information on the Internet can thus be controlled effectively, which provides important support for purifying the Internet environment.
In the above technical solution, preferably, obtaining the feature word of the current data by the feature word acquisition module specifically includes: preprocessing the current data by the feature word acquisition module to obtain the feature word of the current data, where the preprocessing includes at least one of the following or a combination thereof: word-segmentation neighbor merging, background noise filtering, English translation, and Traditional-to-Simplified Chinese conversion.
In this technical solution, current data that carries sensitive information usually also contains various kinds of noise, which affects the result of feature word identification. To improve the accuracy of that result, the current data is preprocessed, so that the feature words in the current data are obtained comprehensively and accurately, and whether a feature word contains sensitive information can be determined comprehensively and efficiently without missed detections. The preprocessing includes, but is not limited to, at least one of the following or a combination thereof: word-segmentation neighbor merging, background noise filtering, English translation, and Traditional-to-Simplified Chinese conversion.
As an example of neighbor merging, suppose the current data contains the phrase "发飘" (fā piāo, used to mean "发票", "invoice"). When a related-art solution segments the current data to obtain feature words, it cuts "发飘" into two single characters, "发" and "飘", and therefore fails to identify the sensitive feature word "invoice". The present invention preprocesses the current data by neighbor merging, so that the feature word standing for "invoice" is obtained whole and no feature word is lost.
Background noise filtering removes interference characters that carry no meaning, such as #, *, and %, from the current data. English translation translates English in the current data into Chinese. Traditional-to-Simplified Chinese conversion converts Traditional Chinese characters in the current data into Simplified Chinese characters. Removing these kinds of noise ensures that feature words are obtained accurately and comprehensively when the current data is segmented, which avoids missing feature words that contain sensitive information, effectively controls the spread of sensitive information on the Internet, and provides important support for purifying the Internet environment.
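Two of the preprocessing modes described above, background noise filtering and neighbor merging, can be illustrated with a minimal sketch. The names `NOISE_CHARS`, `strip_noise`, and `merge_neighbors` are hypothetical, and the merging rule (join runs of adjacent single-character tokens) is an assumption; the patent does not specify how either step is implemented.

```python
from __future__ import annotations

# Hypothetical sketch; the patent does not specify these algorithms.
NOISE_CHARS = set("#*%")  # interference characters named in the text


def strip_noise(text: str) -> str:
    """Background noise filtering: drop characters that carry no meaning."""
    return "".join(ch for ch in text if ch not in NOISE_CHARS)


def merge_neighbors(tokens: list[str]) -> list[str]:
    """Neighbor merging: join runs of adjacent single-character tokens.

    Assumption: a segmenter that does not know a disguised word splits it
    into single characters, and those characters stay next to each other.
    """
    merged: list[str] = []
    run = ""
    for tok in tokens:
        if len(tok) == 1:
            run += tok  # extend the current run of single characters
        else:
            if run:
                merged.append(run)
                run = ""
            merged.append(tok)
    if run:
        merged.append(run)
    return merged
```

For instance, under these assumptions `merge_neighbors(["代开", "发", "飘", "联系"])` rejoins the two stray characters into the single candidate feature word "发飘", which can then be compared with the variant words of "发票" ("invoice").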
In the above technical solution, preferably, determining the plurality of variant words of the keyword by the variant word determination module specifically includes: performing, by the variant word determination module, Chinese character pronunciation variation processing and/or Chinese character glyph variation processing on the keyword to obtain the plurality of variant words of the keyword, where the pronunciation variation processing includes homophone or near-homophone substitution, liaison (fused-syllable) substitution, and letter abbreviation substitution, and the glyph variation processing includes similar-shape character substitution and character decomposition.
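The variation modes listed above can be sketched as table-driven substitution. This is only an illustration under stated assumptions: the tables `SUBSTITUTIONS` and `ABBREVIATIONS` and the function `variant_words` are hypothetical, the entries are a handful of hand-picked examples, and the patent does not say how variant words are actually generated.

```python
from __future__ import annotations

# Hypothetical substitution tables; a real system would need far larger ones.
SUBSTITUTIONS: dict[str, list[str]] = {
    "票": ["飘", "漂", "西示"],  # two near-homophones, then a glyph split
    "人": ["入"],                # similar-shaped character
    "知道": ["造"],              # liaison: two syllables fused into one
}
ABBREVIATIONS: dict[str, list[str]] = {"发票": ["FP"]}  # pinyin initials


def variant_words(keyword: str) -> set[str]:
    """Return variants made by applying one substitution rule to the keyword."""
    variants: set[str] = set(ABBREVIATIONS.get(keyword, []))
    for source, replacements in SUBSTITUTIONS.items():
        if source in keyword:
            for repl in replacements:
                variants.add(keyword.replace(source, repl))
    return variants
```

With these tables, the keyword "发票" ("invoice") yields the variants "FP", "发飘", "发漂", and "发西示". Each rule is applied on its own here; combining several rules in one variant is left out to keep the sketch short.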
In this technical solution, the keyword is varied by pronunciation variation and/or glyph variation, so that the variant words of the keyword can be used to determine whether a feature word in the current data contains sensitive information. Even after a feature word has been varied, it can therefore still be detected accurately and comprehensively. Pronunciation variation refers to varying the pronunciation of Chinese characters and includes, but is not limited to, homophone or near-homophone substitution, liaison substitution, and letter abbreviation substitution. For example, in "发漂", replacing "票" (piào) with "漂" (piāo) is near-homophone substitution; in "你造吗?", replacing "知道" (zhīdào, "know") with "造" (zào) is liaison substitution; and writing "FP" for "发票" (fāpiào, "invoice") is letter abbreviation substitution. Glyph variation refers to varying the written form of Chinese characters and includes, but is not limited to, similar-shape character substitution and character decomposition. For example, writing "入民" for "人民" ("people") is similar-shape character substitution, and writing "发西示" for "发票" is character decomposition, because "票" is split into "西" and "示".
In the above technical solution, preferably, matching the feature word against each of the plurality of variant words by the matching module specifically includes: calculating, in the matching module, a matching score of the feature word and the keyword using a matching formula, where the matching formula is:
$$s = \sum_{i} \delta\bigl(f_i(w),\, f_i(t)\bigr), \quad i \in [1, n]$$
where $s$ is the sum of the matching scores of the feature word and each variant word, $n$ is the number of variant words, $w$ is the feature word, $t$ is the $i$-th of the plurality of variant words, $f_i(w)$ is the variation mapping of the feature word, $f_i(t)$ is the variation mapping of the $i$-th variant word, and $\delta$ is the matching score of the feature word and the $i$-th variant word.
In this technical solution, the matching formula $s = \sum_{i} \delta(f_i(w), f_i(t)),\ i \in [1, n]$ is used to calculate the matching score of the feature word and the keyword, and this score determines whether the feature word is identified as the keyword. Here, $s$ is the sum of the matching scores of the feature word and each variant word, $n$ is the number of variant words, $w$ is the feature word, $t$ is the $i$-th of the plurality of variant words, $f_i(w)$ is the variation mapping of the feature word, $f_i(t)$ is the variation mapping of the $i$-th variant word, and $\delta$ is the matching score of the feature word and the $i$-th variant word. For example, $\delta$ is 1 when the feature word matches the $i$-th variant word and 0 otherwise. Adding up the matching results $\delta$ of the feature word against all variant words of the keyword gives the matching score $s$ of the feature word and the keyword. If $s$ is nonzero, the feature word matches the keyword, that is, the feature word contains sensitive information, so the current data containing the feature word is filtered, which purifies the Internet environment comprehensively and accurately.
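The score calculation can be sketched as follows, taking δ to be the 0/1 indicator from the example above. The function names `delta` and `match_score` are hypothetical, and using one shared `mapping` in place of the per-index variation mapping f_i is a simplifying assumption, since the text does not define f_i further.

```python
from __future__ import annotations

from typing import Callable, Iterable


def delta(a: str, b: str) -> int:
    """Return 1 when the two mapped words are equal, otherwise 0."""
    return 1 if a == b else 0


def match_score(
    feature_word: str,
    variants: Iterable[str],
    mapping: Callable[[str], str] = lambda word: word,
) -> int:
    """Sum delta(f(w), f(t)) over every variant word t of one keyword.

    Assumption: `mapping` stands in for the variation mapping f_i and
    defaults to the identity; one mapping is shared by all variants.
    """
    mapped_feature = mapping(feature_word)
    return sum(delta(mapped_feature, mapping(t)) for t in variants)
```

A nonzero score then corresponds to the case described above in which the feature word is treated as matching the keyword. Passing `mapping=str.lower`, for example, would let "fp" match the variant "FP".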
In the above technical solution, preferably, after the matching score of the feature word and the keyword is calculated using the matching formula, the method further includes: determining whether the matching score is within a preset matching score range, where the feature word is identified as the keyword when the matching score is determined to be within the preset matching score range.
In this technical solution, the feature word is identified as the keyword if the matching score of the feature word and the keyword is within the preset matching score range. Feature words containing sensitive information can thus be detected more accurately and are not missed, the spread of sensitive information on the Internet can be controlled effectively, and important support is provided for purifying the Internet environment.
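The range check can be sketched in a few lines. The function name `is_identified`, the inclusive bounds, and the default lower bound of 1 (any nonzero score) are assumptions; the text only says that the range is preset.

```python
from __future__ import annotations


def is_identified(score: int, low: int = 1, high: int | None = None) -> bool:
    """Return True when the score lies in the preset matching score range.

    Assumptions: both bounds are inclusive, and `high=None` means the
    range has no upper bound.
    """
    if score < low:
        return False
    return high is None or score <= high
```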
Another aspect of the present invention provides an information identification system, including: a feature word acquisition module, configured to obtain a feature word of current data; a keyword association module, configured to determine a keyword associated with the feature word in a keyword database; a variant word determination module, configured to determine a plurality of variant words of the keyword; and a matching module, configured to match the feature word against each of the plurality of variant words, so as to determine from the matching result whether to identify the feature word as the keyword.
In this technical solution, while the feature words of the data currently to be processed are obtained, the keyword is subjected to variation processing to obtain its variant words. Each feature word in the current data is then matched against each of these variant words to determine whether the feature word should be identified as the keyword. In this way, a feature word can still be detected accurately even after it has been varied, so feature words containing sensitive information are detected effectively and comprehensively rather than missed. The spread of sensitive information on the Internet can thus be controlled effectively, which provides important support for purifying the Internet environment.
In the above technical solution, preferably, the feature word acquisition module is specifically configured to preprocess the current data to obtain the feature word of the current data, where the preprocessing includes at least one of the following or a combination thereof: word-segmentation neighbor merging, background noise filtering, English translation, and Traditional-to-Simplified Chinese conversion.
In this technical solution, current data that carries sensitive information usually also contains various kinds of noise, which affects the result of feature word identification. To improve the accuracy of that result, the current data is preprocessed, so that the feature words in the current data are obtained comprehensively and accurately, and whether a feature word contains sensitive information can be determined comprehensively and efficiently without missed detections. The preprocessing includes, but is not limited to, at least one of the following or a combination thereof: word-segmentation neighbor merging, background noise filtering, English translation, and Traditional-to-Simplified Chinese conversion.
As an example of neighbor merging, suppose the current data contains the phrase "发飘" (fā piāo, used to mean "发票", "invoice"). When a related-art solution segments the current data to obtain feature words, it cuts "发飘" into two single characters, "发" and "飘", and therefore fails to identify the sensitive feature word "invoice". The present invention preprocesses the current data by neighbor merging, so that the feature word standing for "invoice" is obtained whole and no feature word is lost.
Background noise filtering removes interference characters that carry no meaning, such as #, *, and %, from the current data. English translation translates English in the current data into Chinese. Traditional-to-Simplified Chinese conversion converts Traditional Chinese characters in the current data into Simplified Chinese characters. Removing these kinds of noise ensures that feature words are obtained accurately and comprehensively when the current data is segmented, which avoids missing feature words that contain sensitive information, effectively controls the spread of sensitive information on the Internet, and provides important support for purifying the Internet environment.
In the above technical solution, preferably, the variant word determination module is specifically configured to perform Chinese character pronunciation variation processing and/or Chinese character glyph variation processing on the keyword to obtain the plurality of variant words of the keyword, where the pronunciation variation processing includes homophone or near-homophone substitution, liaison (fused-syllable) substitution, and letter abbreviation substitution, and the glyph variation processing includes similar-shape character substitution and character decomposition.
In this technical solution, the keyword is varied by pronunciation variation and/or glyph variation, so that the variant words of the keyword can be used to determine whether a feature word in the current data contains sensitive information. Even after a feature word has been varied, it can therefore still be detected accurately and comprehensively. Pronunciation variation refers to varying the pronunciation of Chinese characters and includes, but is not limited to, homophone or near-homophone substitution, liaison substitution, and letter abbreviation substitution. For example, in "发漂", replacing "票" (piào) with "漂" (piāo) is near-homophone substitution; in "你造吗?", replacing "知道" (zhīdào, "know") with "造" (zào) is liaison substitution; and writing "FP" for "发票" (fāpiào, "invoice") is letter abbreviation substitution. Glyph variation refers to varying the written form of Chinese characters and includes, but is not limited to, similar-shape character substitution and character decomposition. For example, writing "入民" for "人民" ("people") is similar-shape character substitution, and writing "发西示" for "发票" is character decomposition, because "票" is split into "西" and "示".
In the above technical solution, preferably, the matching module is specifically configured to calculate a matching score of the feature word and the keyword using a matching formula, where the matching formula is:
$$s = \sum_{i} \delta\bigl(f_i(w),\, f_i(t)\bigr), \quad i \in [1, n]$$
where $s$ is the sum of the matching scores of the feature word and each variant word, $n$ is the number of variant words, $w$ is the feature word, $t$ is the $i$-th of the plurality of variant words, $f_i(w)$ is the variation mapping of the feature word, $f_i(t)$ is the variation mapping of the $i$-th variant word, and $\delta$ is the matching score of the feature word and the $i$-th variant word.
In this technical solution, the matching formula $s = \sum_{i} \delta(f_i(w), f_i(t)),\ i \in [1, n]$ is used to calculate the matching score of the feature word and the keyword, and this score determines whether the feature word is identified as the keyword. Here, $s$ is the sum of the matching scores of the feature word and each variant word, $n$ is the number of variant words, $w$ is the feature word, $t$ is the $i$-th of the plurality of variant words, $f_i(w)$ is the variation mapping of the feature word, $f_i(t)$ is the variation mapping of the $i$-th variant word, and $\delta$ is the matching score of the feature word and the $i$-th variant word. For example, $\delta$ is 1 when the feature word matches the $i$-th variant word and 0 otherwise. Adding up the matching results $\delta$ of the feature word against all variant words of the keyword gives the matching score $s$ of the feature word and the keyword. If $s$ is nonzero, the feature word matches the keyword, that is, the feature word contains sensitive information, so the current data containing the feature word is filtered, which purifies the Internet environment comprehensively and accurately.
In the above technical solution, preferably, the matching module includes an identification module, configured to determine, after the matching score of the feature word and the keyword is calculated using the matching formula, whether the matching score is within a preset matching score range, where the feature word is identified as the keyword when the matching score is determined to be within the preset matching score range.
In this technical solution, the feature word is identified as the keyword if the matching score of the feature word and the keyword is within the preset matching score range. Feature words containing sensitive information can thus be detected more accurately and are not missed, the spread of sensitive information on the Internet can be controlled effectively, and important support is provided for purifying the Internet environment.
With the technical solution of the present invention, feature words that have been varied and that contain sensitive information can be detected accurately, so that feature words containing sensitive information are detected effectively and comprehensively and are not missed.
Brief description of the drawings
Fig. 1 shows a schematic flowchart of an information identification method according to an embodiment of the present invention;
Fig. 2 shows a schematic structural diagram of an information identification system according to an embodiment of the present invention;
Fig. 3 shows a schematic diagram of the principle of an information identification system according to an embodiment of the present invention;
Fig. 4 shows a schematic diagram of the principle of an information identification system according to another embodiment of the present invention.
Detailed description
In order that the above objects, features, and advantages of the present invention can be understood more clearly, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, provided there is no conflict, the embodiments of the present application and the features in the embodiments can be combined with one another.
Many specific details are set forth in the following description to provide a thorough understanding of the present invention. However, the present invention can also be implemented in ways other than those described here, so the scope of protection of the present invention is not limited by the specific embodiments disclosed below.
Fig. 1 shows a schematic flowchart of an information identification method according to an embodiment of the present invention.
As shown in Fig. 1, the information identification method according to an embodiment of the present invention includes:
Step 102: obtain a feature word of current data by a feature word acquisition module.
Step 104: determine, by a keyword association module, a keyword associated with the feature word in a keyword database.
Step 106: determine a plurality of variant words of the keyword by a variant word determination module.
Step 108: match, by a matching module, the feature word against each of the plurality of variant words, so as to determine from the matching result whether to identify the feature word as the keyword.
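Steps 102 to 108 can be read as a small pipeline. The sketch below is an illustration only: the function `identify` and its four callable parameters are hypothetical stand-ins for the four modules, and how each module works internally is left open because the steps themselves do not fix it.

```python
from __future__ import annotations

from typing import Callable, Iterable


def identify(
    data: str,
    acquire: Callable[[str], Iterable[str]],      # feature word acquisition
    associate: Callable[[str], Iterable[str]],    # keyword association
    variants_of: Callable[[str], Iterable[str]],  # variant word determination
    matches: Callable[[str, str], bool],          # matching
) -> dict[str, str]:
    """Map each feature word identified as a keyword to that keyword."""
    identified: dict[str, str] = {}
    for word in acquire(data):                   # step 102
        for keyword in associate(word):          # step 104
            candidates = variants_of(keyword)    # step 106
            if any(matches(word, v) for v in candidates):  # step 108
                identified[word] = keyword
                break
    return identified
```

As a toy usage with made-up English data, calling `identify("buy inv0ice now", str.split, lambda w: ["invoice"], lambda k: ["inv0ice", "FP"], lambda a, b: a == b)` reports that the feature word "inv0ice" is identified as the keyword "invoice".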
In this technical solution, while the feature words of the data currently to be processed are obtained, the keyword is subjected to variation processing to obtain its variant words. Each feature word in the current data is then matched against each of these variant words to determine whether the feature word should be identified as the keyword. In this way, a feature word can still be detected accurately even after it has been varied, so feature words containing sensitive information are detected effectively and comprehensively rather than missed. The spread of sensitive information on the Internet can thus be controlled effectively, which provides important support for purifying the Internet environment.
In the above technical solution, preferably, step 102 specifically includes: preprocessing the current data by the feature word acquisition module to obtain the feature word of the current data, where the preprocessing includes at least one of the following or a combination thereof: word-segmentation neighbor merging, background noise filtering, English translation, and Traditional-to-Simplified Chinese conversion.
In this technical solution, current data that carries sensitive information usually also contains various kinds of noise, which affects the result of feature word identification. To improve the accuracy of that result, the current data is preprocessed, so that the feature words in the current data are obtained comprehensively and accurately, and whether a feature word contains sensitive information can be determined comprehensively and efficiently without missed detections. The preprocessing includes, but is not limited to, at least one of the following or a combination thereof: word-segmentation neighbor merging, background noise filtering, English translation, and Traditional-to-Simplified Chinese conversion.
As an example of neighbor merging, suppose the current data contains the phrase "发飘" (fā piāo, used to mean "发票", "invoice"). When a related-art solution segments the current data to obtain feature words, it cuts "发飘" into two single characters, "发" and "飘", and therefore fails to identify the sensitive feature word "invoice". The present invention preprocesses the current data by neighbor merging, so that the feature word standing for "invoice" is obtained whole and no feature word is lost.
Background noise filtering removes interference characters that carry no meaning, such as #, *, and %, from the current data. English translation translates English in the current data into Chinese. Traditional-to-Simplified Chinese conversion converts Traditional Chinese characters in the current data into Simplified Chinese characters. Removing these kinds of noise ensures that feature words are obtained accurately and comprehensively when the current data is segmented, which avoids missing feature words that contain sensitive information, effectively controls the spread of sensitive information on the Internet, and provides important support for purifying the Internet environment.
In the above technical solution, preferably, step 106 specifically includes: performing, by the variant word determination module, Chinese character pronunciation variation processing and/or Chinese character glyph variation processing on the keyword to obtain the plurality of variant words of the keyword, where the pronunciation variation processing includes homophone or near-homophone substitution, liaison (fused-syllable) substitution, and letter abbreviation substitution, and the glyph variation processing includes similar-shape character substitution and character decomposition.
In this technical solution, the keyword is varied by pronunciation variation and/or glyph variation, so that the variant words of the keyword can be used to determine whether a feature word in the current data contains sensitive information. Even after a feature word has been varied, it can therefore still be detected accurately and comprehensively. Pronunciation variation refers to varying the pronunciation of Chinese characters and includes, but is not limited to, homophone or near-homophone substitution, liaison substitution, and letter abbreviation substitution. For example, in "发漂", replacing "票" (piào) with "漂" (piāo) is near-homophone substitution; in "你造吗?", replacing "知道" (zhīdào, "know") with "造" (zào) is liaison substitution; and writing "FP" for "发票" (fāpiào, "invoice") is letter abbreviation substitution. Glyph variation refers to varying the written form of Chinese characters and includes, but is not limited to, similar-shape character substitution and character decomposition. For example, writing "入民" for "人民" ("people") is similar-shape character substitution, and writing "发西示" for "发票" is character decomposition, because "票" is split into "西" and "示".
In the above technical solution, preferably, matching the feature word against each of the multiple variation words by the matching module specifically includes: using a matching formula in the matching module to calculate the matching score of the feature word and the keyword, where the matching formula is:

$$s = \sum_{i=1}^{n} \delta\big(f_i(w), f_i(t)\big)$$

where $s$ is the sum of the matching scores of the feature word and each variation word, $n$ is the number of variation words, $w$ is the feature word, $t$ is the i-th variation word among the multiple variation words, $f_i(w)$ is the variation mapping of the feature word, $f_i(t)$ is the variation mapping of the i-th variation word, and $\delta$ is the matching score of the feature word and the i-th variation word.
In this technical solution, the matching formula $s = \sum_{i=1}^{n} \delta(f_i(w), f_i(t))$ is used to calculate the matching score of the feature word and the keyword, and that score determines whether the feature word is identified as the keyword. Here $s$ is the sum of the matching scores of the feature word and each variation word, $n$ is the number of variation words, $w$ is the feature word, $t$ is the i-th variation word among the multiple variation words, $f_i(w)$ is the variation mapping of the feature word, $f_i(t)$ is the variation mapping of the i-th variation word, and $\delta$ is the matching score of the feature word and the i-th variation word. For example, $\delta$ takes the value 1 when the feature word matches the i-th variation word and 0 otherwise. Summing $\delta$ over all variation words of the keyword gives the matching score $s$ of the feature word and the keyword. If $s$ is nonzero, the feature word matches the keyword, that is, the feature word contains sensitive information, so the current data containing the feature word is filtered, which allows the Internet space to be cleaned up comprehensively and accurately.
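The scoring rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the mapping functions `identity` and `homophone_key`, the toy table `HOMOPHONES`, the Chinese forms used for the "invoice" example, and the treatment of `t` as the keyword-side term are all hypothetical choices made for the example.

```python
from typing import Callable, Sequence

# Hypothetical toy table: characters mapped to a canonical homophone.
HOMOPHONES = {"漂": "票", "飘": "票"}


def identity(word: str) -> str:
    """Trivial mapping: leaves the word unchanged."""
    return word


def homophone_key(word: str) -> str:
    """Toy mapping: replace each character by its canonical homophone."""
    return "".join(HOMOPHONES.get(ch, ch) for ch in word)


def delta(a: str, b: str) -> int:
    """Indicator score: 1 if the two mapped forms are equal, else 0."""
    return 1 if a == b else 0


def match_score(w: str, t: str, mappings: Sequence[Callable[[str], str]]) -> int:
    """Compute s as the sum over i of delta(f_i(w), f_i(t))."""
    return sum(delta(f(w), f(t)) for f in mappings)


MAPPINGS = [identity, homophone_key]
score_hit = match_score("发漂", "发票", MAPPINGS)   # matches only under homophone_key
score_miss = match_score("发货", "发票", MAPPINGS)  # matches under no mapping
```

A nonzero score means at least one variation mapping makes the two words coincide, which is the condition the text uses to flag the feature word.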
In the above technical solution, preferably, after the matching score of the feature word and the keyword is calculated with the matching formula, the method further includes: determining whether the matching score falls within a preset matching score range, where the feature word is identified as the keyword when the matching score is determined to fall within the preset matching score range.
In this technical solution, the feature word is identified as the keyword if its matching score with the keyword falls within the preset matching score range. Feature words that contain sensitive information can thus be detected more accurately and missed detections avoided, which effectively controls the spread of sensitive information on the Internet and provides important support for cleaning up the Internet space.
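The range check can be sketched as a small predicate. The source does not state the bounds of the preset matching score range, so the default lower bound of 1 and the optional upper bound below are assumptions made only for illustration, as is the function name `in_preset_range`.

```python
from typing import Optional


def in_preset_range(score: int, low: int = 1, high: Optional[int] = None) -> bool:
    """Return True when low <= score <= high (no upper bound if high is None)."""
    if score < low:
        return False
    return high is None or score <= high
```

With the default bounds, any nonzero score identifies the feature word as the keyword, which agrees with the nonzero rule given for the matching score.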
Fig. 2 shows a structural schematic diagram of an information identification system according to an embodiment of the present invention.
As shown in Fig. 2, an information identification system 200 according to an embodiment of the present invention includes: a feature word acquisition module 202, configured to obtain a feature word of current data; a keyword association module 204, configured to determine, in a keyword database, a keyword associated with the feature word; a variation word determination module 206, configured to determine multiple variation words of the keyword; and a matching module 208, configured to match the feature word against each of the multiple variation words and to determine, according to the matching result, whether to identify the feature word as the keyword.
In this technical solution, while the feature word of the data currently to be processed is obtained, the keyword is subjected to variation processing to obtain its multiple variation words, and the feature word in the current data is then matched against each of those variation words to determine whether to identify the feature word as the keyword. Thus, even if the feature word has been varied, it can still be detected accurately. Feature words that contain sensitive information are detected effectively and comprehensively rather than missed, so the spread of sensitive information on the Internet can be effectively controlled, which provides important support for cleaning up the Internet space.
In the above technical solution, preferably, the feature word acquisition module 202 is specifically configured to: preprocess the current data to obtain the feature word of the current data, where the preprocessing includes at least one of the following or a combination thereof: merging of adjacent segmentation results, background noise filtering, English translation, and restoration of Traditional Chinese to Simplified Chinese.
In this technical solution, current data that carries sensitive information contains various kinds of noise, which affects the result of feature word identification. To improve the accuracy of that result, the current data is preprocessed so that the feature words in it can be obtained comprehensively and accurately, whether they contain sensitive information can be identified comprehensively and efficiently, and missed detections are avoided. The preprocessing includes, but is not limited to, at least one of the following or a combination thereof: merging of adjacent segmentation results, background noise filtering, English translation, and restoration of Traditional Chinese to Simplified Chinese. As an example of merging adjacent segmentation results, suppose the current data contains the phrase "发飘" (standing for "发票", "invoice"). When the related art segments the current data to obtain feature words, it cuts "发飘" into the two single characters "发" and "飘", so the sensitive feature word "invoice" cannot be identified. The present invention therefore preprocesses the current data by merging adjacent segmentation results, so that the disguised "invoice" is obtained as a single feature word and the set of feature words obtained is not incomplete. Background noise filtering removes the various semantically empty interference characters in the current data, such as @, #, * and %. English translation translates English in the current data into Chinese, and Traditional Chinese restoration converts Traditional Chinese in the current data into Simplified Chinese. Removing these kinds of noise ensures that segmentation of the current data yields feature words accurately and comprehensively and avoids missed detection of feature words that contain sensitive information, so the spread of sensitive information on the Internet can be effectively controlled, which provides important support for cleaning up the Internet space.
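Background noise filtering, as described above, can be sketched with a single regular expression. The character set below is taken from the interference characters the text lists (@, #, &, *, %); the names `NOISE` and `strip_noise` are hypothetical, and a real filter would likely need a broader, configurable set.

```python
import re

# Interference characters named in the text; treated as carrying no meaning.
NOISE = re.compile(r"[@#&*%]")


def strip_noise(text: str) -> str:
    """Remove the listed interference characters before word segmentation."""
    return NOISE.sub("", text)


cleaned = strip_noise("***发***票***")
```

Filtering before segmentation lets the segmenter see the keyword characters as neighbours again instead of as isolated fragments separated by symbols.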
In the above technical solution, preferably, the variation word determination module 206 is specifically configured to: apply Chinese character pronunciation variation processing and/or Chinese character shape variation processing to the keyword to obtain the multiple variation words of the keyword, where the pronunciation variation processing includes homophone/near-sound substitution, liaison-blending substitution, and letter-abbreviation substitution, and the shape variation processing includes similar-shape character substitution and character decomposition.
In this technical solution, the keyword is varied by Chinese character pronunciation variation and/or Chinese character shape variation, so that the multiple variation words of the keyword can be used to identify whether a feature word in the current data contains sensitive information. Even if the feature word has been deliberately varied, it can therefore still be detected accurately and comprehensively. Pronunciation variation alters how a character sounds and includes, but is not limited to, homophone/near-sound substitution, liaison-blending substitution, and letter-abbreviation substitution. For example, writing "发漂" for "发票" (fapiao, "invoice"), with "漂" replacing "票", is homophone/near-sound substitution; writing "你造吗?" with "造" (zao) replacing "知道" (zhidao, "know") is liaison-blending substitution; and writing "FP" for "发票" is letter-abbreviation substitution. Shape variation alters how a character looks and includes, but is not limited to, similar-shape character substitution and character decomposition. For example, writing "入民" for "人民" ("people") is similar-shape character substitution, and writing "发西示" for "发票", with "票" split into "西" and "示", is character decomposition.
In the above technical solution, preferably, the matching module 208 is specifically configured to: use a matching formula to calculate the matching score of the feature word and the keyword, where the matching formula is:

$$s = \sum_{i=1}^{n} \delta\big(f_i(w), f_i(t)\big)$$

where $s$ is the sum of the matching scores of the feature word and each variation word, $n$ is the number of variation words, $w$ is the feature word, $t$ is the i-th variation word among the multiple variation words, $f_i(w)$ is the variation mapping of the feature word, $f_i(t)$ is the variation mapping of the i-th variation word, and $\delta$ is the matching score of the feature word and the i-th variation word.
In this technical solution, the matching formula $s = \sum_{i=1}^{n} \delta(f_i(w), f_i(t))$ is used to calculate the matching score of the feature word and the keyword, and that score determines whether the feature word is identified as the keyword. Here $s$ is the sum of the matching scores of the feature word and each variation word, $n$ is the number of variation words, $w$ is the feature word, $t$ is the i-th variation word among the multiple variation words, $f_i(w)$ is the variation mapping of the feature word, $f_i(t)$ is the variation mapping of the i-th variation word, and $\delta$ is the matching score of the feature word and the i-th variation word. For example, $\delta$ takes the value 1 when the feature word matches the i-th variation word and 0 otherwise. Summing $\delta$ over all variation words of the keyword gives the matching score $s$ of the feature word and the keyword. If $s$ is nonzero, the feature word matches the keyword, that is, the feature word contains sensitive information, so the current data containing the feature word is filtered, which allows the Internet space to be cleaned up comprehensively and accurately.
In the above technical solution, preferably, the matching module 208 includes: an identification module 2082, configured to determine, after the matching score of the feature word and the keyword is calculated with the matching formula, whether the matching score falls within a preset matching score range, where the feature word is identified as the keyword when the matching score is determined to fall within the preset matching score range.
In this technical solution, the feature word is identified as the keyword if its matching score with the keyword falls within the preset matching score range. Feature words that contain sensitive information can thus be detected more accurately and missed detections avoided, which effectively controls the spread of sensitive information on the Internet and provides important support for cleaning up the Internet space.
Fig. 3 shows a schematic diagram of the principle of an information identification system according to an embodiment of the present invention.
As shown in Fig. 3, an information identification system 300 according to an embodiment of the present invention (equivalent to the information identification system 200 of the embodiment shown in Fig. 2) includes: a text preprocessing module 302, a multidimensional variation mapping module 304 for sensitive information, and a matching module 306. These three modules are described in detail below:
1. Text preprocessing module 302. The current data contains various kinds of noise, and the feature words obtained from it should be as accurate and comprehensive as possible. Therefore, before the feature words are obtained, the text preprocessing module 302 preprocesses the current data, where the preprocessing includes at least one of the following or a combination thereof: merging of adjacent segmentation results, background noise filtering, English translation, and restoration of Traditional Chinese to Simplified Chinese.
The text preprocessing module 302 is also used to segment the current data into words. Word segmentation in the related art generally cuts text according to a model learned from correct words, and whatever the machine learning model, the training set is usually standard written text. Variations of feature words, however, are non-standard uses of language. For example, related-art segmentation cuts "发飘" (meaning "发票", "invoice") into the two single characters "发" and "飘", so the filtering system cannot identify the hidden word "invoice". The technical solution of the present invention can therefore merge adjacent words in the segmentation result.
Normally used characters do not affect the accuracy of word segmentation. Current data that contains sensitive information, however, often has semantically empty interference characters such as @, #, &, * and % deliberately inserted to evade conventional filtering systems. Mixed into the current data, these symbols seriously degrade segmentation accuracy, so they are removed before the current data is segmented. For the advertising keyword "发票" ("invoice"), for example, introducing background-noise symbols may give "***发***票***", where the asterisks are the background noise to be filtered out. Some feature words may also be replaced with English or with Traditional Chinese. For example, the English word "government" stands for "政府". The current data can be preprocessed by English translation so that the English is replaced with Chinese. Likewise, because some feature words use Traditional Chinese in place of Simplified Chinese, the current data is preprocessed by Traditional Chinese restoration so that Traditional Chinese is converted to Simplified Chinese.
Thus, with the above technical solution, the text preprocessing module 302 not only extracts feature words but also removes the various kinds of noise introduced into the current data by insertion and replacement, so that the feature words obtained are more comprehensive and accurate.
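The English translation and Traditional Chinese restoration steps can be sketched as table lookups. The two tiny tables below are hypothetical stand-ins for a real translation dictionary and a full Traditional-to-Simplified conversion table, and the function name `normalize` is likewise an assumption; the source does not say how either step is implemented.

```python
import re

# Hypothetical lookup tables; real systems would use far larger resources.
EN_TO_ZH = {"government": "政府"}
TRAD_TO_SIMP = {"發": "发", "貨": "货"}


def normalize(text: str) -> str:
    """Translate known English words, then map Traditional to Simplified."""
    def translate(match: "re.Match[str]") -> str:
        word = match.group(0)
        # Unknown Latin-letter runs (for example pinyin) are left untouched.
        return EN_TO_ZH.get(word.lower(), word)

    text = re.sub(r"[A-Za-z]+", translate, text)
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)
```

For instance, `normalize("發票")` yields the Simplified form, while a pinyin fragment such as "huo" passes through unchanged so that later pronunciation matching can still handle it.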
2. Multidimensional variation mapping module 304, which applies multidimensional variation mapping to the feature word and the keyword. Specifically, the keyword related to the feature word is obtained from the keyword database, and the keyword is varied along multiple dimensions by homophone/near-sound substitution, liaison-blending substitution, letter-abbreviation substitution, similar-shape character substitution, and character decomposition to obtain the variation words of the keyword. Whether a variation word matches the feature word can then be determined, and hence whether the feature word is the keyword corresponding to that variation word.
Homophone/near-sound substitution, liaison-blending substitution, and letter-abbreviation substitution are variations of Chinese character pronunciation. For example, "漂" replaces "票" in "发漂", the two being homophones; "僧" (seng) replaces "生" (sheng) in "大学僧", the two being near-sounds; "造" replaces "知道" in "你造吗?", the former being a liaison blend of the latter; and "FP" replaces "发票", using the initial letters of the phrase. To detect variations of this kind, the keyword is varied by homophone/near-sound substitution, liaison-blending substitution, and letter-abbreviation substitution, that is, near-sound and liaison-blending replacements are applied to its pinyin. Near-sounds (also called approximate or fuzzy sounds) mainly arise from replacing similar initials and finals in the pinyin of a character, for example:
(z, zh), (c, ch), (s, sh), (l, n), (f, h), (r, l),
(an, ang), (en, eng), (in, ing), (ian, iang), (uan, uang),
where each bracket holds a pair of similar pinyin elements. The pinyin [zhi, dao] of "知道" blends by liaison into [zhao], which is in turn approximated as [zao] and converted back to the character "造". The pinyin [fa, piao] of "发票" is abbreviated by its initial letters to "FP".
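The three pronunciation variations can be sketched on pinyin syllables as follows. This is a minimal sketch under stated assumptions: the fuzzy tables reuse the similar-pinyin pairs listed in the text (reading the first pair as z/zh), each pair is collapsed onto one arbitrarily chosen member, and the helper names (`split_syllable`, `fuzzy_key`, `liaison`, `abbreviation`) are hypothetical. Converting characters to pinyin is assumed to happen elsewhere.

```python
from typing import List, Tuple

# Pinyin initials, two-letter ones first so they are matched before z/c/s.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")
# Each similar pair collapses onto one member (the choice is arbitrary).
FUZZY_INITIAL = {"zh": "z", "ch": "c", "sh": "s", "n": "l", "h": "f", "r": "l"}
FUZZY_FINAL = {"ang": "an", "eng": "en", "ing": "in",
               "iang": "ian", "uang": "uan"}


def split_syllable(syllable: str) -> Tuple[str, str]:
    """Split a toneless pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # zero-initial syllable such as "an"


def fuzzy_key(syllable: str) -> str:
    """Canonical form under which near-sound syllables compare equal."""
    initial, final = split_syllable(syllable)
    return FUZZY_INITIAL.get(initial, initial) + FUZZY_FINAL.get(final, final)


def liaison(first: str, second: str) -> str:
    """Blend two syllables: initial of the first plus final of the second."""
    initial, _ = split_syllable(first)
    _, final = split_syllable(second)
    return initial + final


def abbreviation(syllables: List[str]) -> str:
    """Upper-case initial letters of each syllable."""
    return "".join(s[0].upper() for s in syllables)
```

Under these helpers, `liaison("zhi", "dao")` gives "zhao", whose fuzzy key equals that of "zao", and `abbreviation(["fa", "piao"])` gives "FP", mirroring the two worked examples in the text.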
Similar-shape character substitution and character decomposition are variations of Chinese character shape. For example, "入民" replaces "人民" ("people"), the characters "入" and "人" being similar in shape, and "发西示" replaces "发票", the latter being decomposed into the former. [A further example passage written in split characters is garbled in the source.] For such text, identifying the radicals of the characters and combining them with adjacent characters is the key. Note that some substitutions are both near-sound and similar-shape substitutions: in "邦派" for "帮派" ("faction"), "帮" and "邦" are alike in both sound and shape.
Variation in the pronunciation of a character can be represented directly by fuzzy sounds in its pinyin encoding. For variation in character shape, however, there is no ready-made notion of fuzzy shape that reflects similarity between shape encodings. Instead, the structural layout of the character's shape, such as top-(middle-)bottom or left-(middle-)right, is used to decompose the character, and the similarity between internal component units is used to encode the shape, measure how the composition of two characters compares, and assign a similarity score.
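Shape decomposition and component-based similarity can be sketched as below. The source does not specify the shape encoding or the similarity measure, so the component table `COMPONENTS` and the use of Jaccard overlap between component sets are assumptions chosen only to make the idea concrete; a real system would draw on a full character-decomposition resource.

```python
from typing import Tuple

# Hypothetical decomposition table: character -> component units.
COMPONENTS = {
    "票": ("西", "示"),  # top-bottom structure
    "帮": ("邦", "巾"),  # top-bottom structure
}


def components(ch: str) -> Tuple[str, ...]:
    """Component units of a character; undecomposed characters map to themselves."""
    return COMPONENTS.get(ch, (ch,))


def disassemble(word: str) -> str:
    """Rewrite a word with every decomposable character split into its parts."""
    return "".join(part for ch in word for part in components(ch))


def shape_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the component sets of two characters (assumed measure)."""
    parts_a, parts_b = set(components(a)), set(components(b))
    return len(parts_a & parts_b) / len(parts_a | parts_b)
```

Here `disassemble("发票")` reproduces the split form used in the decomposition example, and two characters that share a component receive a score strictly between 0 and 1.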
3. Matching module 306, which matches the feature word against the keyword under the multidimensional variation mapping. Specifically, a matching formula is used to calculate the matching score of the feature word and the keyword, where the matching formula is:

$$s = \sum_{i=1}^{n} \delta\big(f_i(w), f_i(t)\big)$$

where $s$ is the sum of the matching scores of the feature word and each variation word, $n$ is the number of variation words, $w$ is the feature word, $t$ is the i-th variation word among the multiple variation words, $f_i(w)$ is the variation mapping of the feature word, $f_i(t)$ is the variation mapping of the i-th variation word, and $\delta$ is the matching score of the feature word and the i-th variation word.
After the matching score of the feature word and the keyword is calculated with the matching formula, it is determined whether the matching score falls within a preset matching score range, and the feature word is identified as the keyword when it does. For example, $\delta$ takes the value 1 when the feature word matches the i-th variation word and 0 otherwise; summing $\delta$ over all variation words of the keyword gives the matching score $s$ of the feature word and the keyword. If $s$ is nonzero, the feature word matches the keyword, that is, the feature word contains sensitive information, and the current data containing the feature word is filtered.
Fig. 4 shows a schematic diagram of the principle of an information identification system according to another embodiment of the present invention.
As shown in Fig. 4, the information identification system according to another embodiment of the present invention first preprocesses the acquired current data, including removing the various interference characters in it, and then obtains feature words from the preprocessed current data using a common word segmentation method based on string matching or on statistics. Such methods obtain standard feature words accurately but cannot recognize varied feature words, where feature words include ordinary words and variation words. The result of segmenting the current data contains ordinary words and single characters. The adjacent single characters between two ordinary words form a single-character string; the set of single-character strings forms the single-character string set, and the set of ordinary words forms the ordinary word set. Because a varied feature word is cut into several adjacent single characters by Chinese word segmentation, any word that carries variation must lie in a single-character string. The words in the ordinary word set and in the single-character string set are therefore taken as feature words, the keywords related to the feature words are obtained from the keyword database, and the feature words are matched against the keywords under the multidimensional variation mapping. If a feature word matches a keyword under the multidimensional variation mapping, the feature word contains sensitive information, and the current data can be judged to contain sensitive information.
For example, suppose the current data is "公司代开正规发飘，huo到付kuan". Segmentation gives "公司/n 代/v 开/v 正规/n 发/v 飘/v ，/w huo/x 到/v 付/v kuan/x". After segmentation, adjacent single characters are merged, giving "代开", "发飘", and "huo到付kuan". When feature words are matched against keywords under the multidimensional variation mapping, the keywords in the keyword database and the feature words in the current data can be mapped according to variations such as homophone/near-sound substitution, liaison-blending substitution, letter-abbreviation substitution, similar-shape character substitution, and character decomposition. In this example, "发飘" in the current data matches the keyword "发票" ("invoice") after the homophone variation mapping, and "huo到付kuan" in the current data matches the keyword "货到付款" ("cash on delivery") after the homophone variation mapping. In addition, for variation expansion, an index can be built on the keyword database according to the sound and shape variation mappings. In actual detection, all variation mappings are traversed until a match is found, that is, until a suitable variation mapping is selected, or until no mapping matches, which completes the identification of varied sensitive information.
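The merging of adjacent single characters after segmentation can be sketched as follows. The input format (a list of token and part-of-speech tag pairs), the tag conventions (`w` for punctuation, `x` for non-word strings such as pinyin), and the function name `merge_neighbors` are assumptions for illustration, and the token list is an assumed Chinese rendering of the example sentence in the text; the segmenter itself is not shown.

```python
from typing import List, Tuple


def merge_neighbors(tagged: List[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Split a segmented sentence into ordinary words and merged fragment runs.

    A fragment is a single character or a token tagged "x"; consecutive
    fragments are joined into one string. Punctuation (tag "w") ends a run.
    """
    ordinary: List[str] = []
    runs: List[str] = []
    current: List[str] = []
    for token, tag in tagged:
        is_fragment = tag == "x" or (tag != "w" and len(token) == 1)
        if is_fragment:
            current.append(token)
            continue
        if current:  # an ordinary word or punctuation closes the open run
            runs.append("".join(current))
            current = []
        if tag != "w":
            ordinary.append(token)
    if current:
        runs.append("".join(current))
    return ordinary, runs


segmented = [("公司", "n"), ("代", "v"), ("开", "v"), ("正规", "n"),
             ("发", "v"), ("飘", "v"), ("，", "w"), ("huo", "x"),
             ("到", "v"), ("付", "v"), ("kuan", "x")]
ordinary_words, fragment_runs = merge_neighbors(segmented)
```

Both returned lists would then be treated as feature words and passed to the variation-mapping match, with the fragment runs being where disguised keywords are expected to appear.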
The technical solution of the present invention has been described in detail above with reference to the accompanying drawings. With this technical solution, feature words that have been varied and carry sensitive information can be detected accurately, so that such feature words are detected effectively and comprehensively and missed detections are avoided.
The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present invention shall fall within its scope of protection.
Claims (10)
1. An information identifying method, characterized by comprising:
obtaining a feature word of current data by a feature word acquisition module;
determining, by a keyword association module, a keyword associated with the feature word in a keyword database;
determining multiple variation words of the keyword by a variation word determination module; and
matching, by a matching module, the feature word against each of the multiple variation words, so as to determine, according to the matching result, whether to identify the feature word as the keyword.
2. The information identifying method according to claim 1, characterized in that obtaining the feature word of the current data by the feature word acquisition module specifically includes:
preprocessing the current data by the feature word acquisition module to obtain the feature word of the current data, where the preprocessing includes at least one of the following or a combination thereof:
merging of adjacent segmentation results, background noise filtering, English translation, and restoration of Traditional Chinese to Simplified Chinese.
3. The information identifying method according to claim 2, characterized in that determining the multiple variation words of the keyword by the variation word determination module specifically includes:
applying, by the variation word determination module, Chinese character pronunciation variation processing and/or Chinese character shape variation processing to the keyword to obtain the multiple variation words of the keyword, where
the pronunciation variation processing includes: homophone/near-sound substitution, liaison-blending substitution, and letter-abbreviation substitution, and
the shape variation processing includes: similar-shape character substitution and character decomposition.
4. The information identifying method according to any one of claims 1 to 3, characterized in that matching, by the matching module, the feature word against each of the multiple variation words specifically includes:
using a matching formula in the matching module to calculate the matching score of the feature word and the keyword, where the matching formula is:

$$s = \sum_{i=1}^{n} \delta\big(f_i(w), f_i(t)\big)$$

where $s$ is the sum of the matching scores of the feature word and each variation word, $n$ is the number of variation words, $w$ is the feature word, $t$ is the i-th variation word among the multiple variation words, $f_i(w)$ is the variation mapping of the feature word, $f_i(t)$ is the variation mapping of the i-th variation word, and $\delta$ is the matching score of the feature word and the i-th variation word.
5. The information identifying method according to claim 4, characterized in that, after the matching score of the feature word and the keyword is calculated with the matching formula, the method further includes:
determining whether the matching score falls within a preset matching score range, where the feature word is identified as the keyword when the matching score is determined to fall within the preset matching score range.
6. an information identification system, it is characterised in that including:
Feature Words acquisition module, for obtaining the Feature Words of described current data;
Key word relating module, is associated with described Feature Words for determining in keyword database
Key word;
Variation word determines module, for determining multiple variation words of described key word;
Matching module, for carrying out described Feature Words with each variation word in the plurality of variation word
Coupling, for according to matching result, it is determined whether described Feature Words is identified as described key word.
Information identification system the most according to claim 6, it is characterised in that described Feature Words
Acquisition module specifically for:
By Feature Words acquisition module, described current data is carried out pretreatment, to obtain described current number
According to described Feature Words, wherein, the mode of described pretreatment includes at least one of or a combination thereof:
Participle neighbour merges mode, background noise filter type, translator of English mode, Chinese-traditional reduction side
Formula.
Information identification system the most according to claim 7, it is characterised in that described variation word
Determine module specifically for:
By described variation word determine module described key word is carried out Chinese character pronunciation variation process and/or
Chinese character pattern variation processes, to obtain the plurality of variation word of described key word, wherein, the described Chinese
The mode that word pronunciation variation processes includes: with nearly sound substitute mode, liaison bonding substitute mode and letter
Abbreviation substitute mode, and the mode that described Chinese character pattern variation processes includes: nearly shape Chinese character replacement side
Formula and Chinese character pattern disassemble mode.
9. according to the information identification system according to any one of claim 6 to 8, it is characterised in that
Described matching module specifically for:
Matching formula is used to calculate described Feature Words and the coupling mark of described key word, wherein, described
Matching formula is:
Wherein, s represents the sum of the described coupling mark of described Feature Words and described each variation word, n table
Showing the quantity of the plurality of variation word, w represents described Feature Words, and t represents in the plurality of variation word
I-th variation word, fiW () represents that the variation of described Feature Words maps, fiT () represents described i-th
The variation of variation word maps, and δ represents that the described coupling of described Feature Words and described i-th variation word is divided
Number.
10. The information identification system according to claim 9, characterized in that
the matching module includes:
an identification module configured to, after the matching score between the feature
word and the key word has been calculated using the matching formula, determine
whether the matching score falls within a preset matching score range, wherein, when
the matching score is determined to fall within the preset matching score range, the
feature word is identified as the key word.
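Claims 9 and 10 describe summing a per-variation-word score and then checking the total against a preset range. The matching formula itself does not survive in this text, so the sketch below assumes one reading that is consistent with the symbol definitions in claim 9: s = \sum_{i=1}^{n} \delta(f_i(w), f_i(t_i)). Everything else is also an assumption made for illustration: `normalize` stands in for the variation mapping f_i, `delta` scores 1 for an exact match after mapping and 0 otherwise, and the default range accepts any score of at least 1. None of these names or choices come from the patent.

```python
from typing import Callable, Optional, Sequence

Mapping = Callable[[str], str]


def normalize(text: str) -> str:
    """Hypothetical variation mapping f_i: drop whitespace, lowercase Latin letters."""
    return "".join(text.split()).lower()


def delta(mapped_feature: str, mapped_variant: str) -> int:
    """Assumed per-pair score: 1 if the mapped strings are equal, otherwise 0."""
    return 1 if mapped_feature == mapped_variant else 0


def matching_score(
    feature_word: str,
    variation_words: Sequence[str],
    mappings: Optional[Sequence[Mapping]] = None,
) -> int:
    """Sum delta(f_i(w), f_i(t_i)) over all n variation words (assumed formula)."""
    if mappings is None:
        mappings = [normalize] * len(variation_words)
    if len(mappings) != len(variation_words):
        raise ValueError("one mapping is required per variation word")
    return sum(
        delta(f(feature_word), f(variant))
        for f, variant in zip(mappings, variation_words)
    )


def is_key_word(score: int, low: int = 1, high: Optional[int] = None) -> bool:
    """Claim 10: accept when the score lies within the preset range [low, high]."""
    return score >= low and (high is None or score <= high)
```

For example, with the variation words `["wx", "威信", "微芯"]`, the feature word `"W X"` maps to `"wx"`, scores 1, and is identified as the key word, while `"天气"` scores 0 and is not. A graded `delta` (for instance an edit-distance similarity) would fit the same structure and only require a different preset range.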
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510128025.4A CN106156017A (en) | 2015-03-23 | 2015-03-23 | Information identifying method and information identification system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106156017A (en) | 2016-11-23 |
Family
ID=58063302
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510128025.4A Pending CN106156017A (en) | 2015-03-23 | 2015-03-23 | Information identifying method and information identification system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106156017A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101082909A (en) * | 2007-06-28 | 2007-12-05 | 腾讯科技(深圳)有限公司 | Method and system for dividing Chinese sentences for recognizing deriving word |
CN101719122A (en) * | 2009-12-04 | 2010-06-02 | 中国人民解放军信息工程大学 | Method for extracting Chinese named entity from text data |
CN101729520A (en) * | 2008-10-28 | 2010-06-09 | 北京大学 | Method and device for detecting sensitive information |
CN101876968A (en) * | 2010-05-06 | 2010-11-03 | 复旦大学 | Method for carrying out harmful content recognition on network text and short message service |
US20110029301A1 (en) * | 2009-07-31 | 2011-02-03 | Samsung Electronics Co., Ltd. | Method and apparatus for recognizing speech according to dynamic display |
CN101976231A (en) * | 2010-08-25 | 2011-02-16 | 孙强国 | Network supervision method for multi-language short messages |
Non-Patent Citations (1)
Title |
---|
刘蔚琴: "网络敏感信息监控系统研究" [Research on a network sensitive information monitoring system], 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106844508A (en) * | 2016-12-27 | 2017-06-13 | 北京五八信息技术有限公司 | deformation word recognition method and device |
WO2018166099A1 (en) * | 2017-03-17 | 2018-09-20 | 平安科技(深圳)有限公司 | Information leakage detection method and device, server, and computer-readable storage medium |
CN107341256A (en) * | 2017-07-12 | 2017-11-10 | 深圳市乐唯科技开发有限公司 | Solution method for filtering sensitive topics in information exchange scenarios |
CN108228704B (en) * | 2017-11-03 | 2021-07-13 | 创新先进技术有限公司 | Method, device and equipment for identifying risk content |
CN108228704A (en) * | 2017-11-03 | 2018-06-29 | 阿里巴巴集团控股有限公司 | Method, device and equipment for identifying risk content |
CN107943954A (en) * | 2017-11-24 | 2018-04-20 | 杭州安恒信息技术有限公司 | Method and device for detecting webpage sensitive information and electronic equipment |
CN107943954B (en) * | 2017-11-24 | 2020-07-10 | 杭州安恒信息技术股份有限公司 | Method and device for detecting webpage sensitive information and electronic equipment |
CN108182246A (en) * | 2017-12-28 | 2018-06-19 | 东软集团股份有限公司 | Sensitive word detection and filtering method and device and computer equipment |
CN108182246B (en) * | 2017-12-28 | 2020-10-30 | 东软集团股份有限公司 | Sensitive word detection and filtering method and device and computer equipment |
CN108804413A (en) * | 2018-04-28 | 2018-11-13 | 百度在线网络技术(北京)有限公司 | The recognition methods of text cheating and device |
CN111092803A (en) * | 2018-10-23 | 2020-05-01 | 阿里巴巴集团控股有限公司 | Message processing method, device, system and storage medium |
CN109597987A (en) * | 2018-10-25 | 2019-04-09 | 阿里巴巴集团控股有限公司 | Text restoration method, device and electronic equipment |
CN109408824A (en) * | 2018-11-05 | 2019-03-01 | 百度在线网络技术(北京)有限公司 | Method and device for generating information |
CN109408824B (en) * | 2018-11-05 | 2023-04-25 | 百度在线网络技术(北京)有限公司 | Method and device for generating information |
CN111612284A (en) * | 2019-02-25 | 2020-09-01 | 阿里巴巴集团控股有限公司 | Data processing method, device and equipment |
CN111612284B (en) * | 2019-02-25 | 2023-06-20 | 阿里巴巴集团控股有限公司 | Data processing method, device and equipment |
CN111078827A (en) * | 2019-12-23 | 2020-04-28 | 上海米哈游天命科技有限公司 | Keyword judgment method, device, equipment and medium |
CN113468856A (en) * | 2020-03-31 | 2021-10-01 | 阿里巴巴集团控股有限公司 | Variant text generation method, variant text translation model training method, variant text classification device and variant text translation model training device |
CN112364153A (en) * | 2020-11-10 | 2021-02-12 | 中数通信息有限公司 | Keyword identification method and device based on interference characteristics |
CN113657867A (en) * | 2021-08-27 | 2021-11-16 | 广东智源机器人科技有限公司 | Automatic reply control method, device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106156017A (en) | Information identifying method and information identification system | |
CN106777275B (en) | Entity attribute and property value extracting method based on more granularity semantic chunks | |
CN107305768B (en) | Error-prone character calibration method in voice interaction | |
CN103493041B (en) | Use the automatic sentence evaluation device of shallow parsing device automatic evaluation sentence and error-detecting facility thereof and method | |
US6487532B1 (en) | Apparatus and method for distinguishing similar-sounding utterances speech recognition | |
CN100358006C (en) | Sound identifying method for geographic information and its application in navigation system | |
JP2005084681A (en) | Method and system for semantic language modeling and reliability measurement | |
CN110188347A (en) | Relation extraction method is recognized between a kind of knowledge topic of text-oriented | |
CN106294396A (en) | Keyword expansion method and keyword expansion system | |
KR20140021838A (en) | Method for detecting grammar error and apparatus thereof | |
CN104008123B (en) | The method and system matched for Chinese Name | |
Darwish et al. | Using Stem-Templates to Improve Arabic POS and Gender/Number Tagging. | |
US11386269B2 (en) | Fault-tolerant information extraction | |
CN104485106B (en) | Audio recognition method, speech recognition system and speech recognition apparatus | |
CN105183716B (en) | A kind of intelligent interactive method based on abstract semantics | |
Gandhe et al. | Using web text to improve keyword spotting in speech | |
CN106294315B (en) | The natural language predicate verb recognition methods merged based on syntactic property with statistics | |
Jiang et al. | Improvements on a trainable letter-to-sound converter | |
JP5097802B2 (en) | Japanese automatic recommendation system and method using romaji conversion | |
CN109460554A (en) | A kind of method and device of filtering shielding word | |
CN103049434B (en) | A kind of alternative word identification system and identification method | |
Tachbelie et al. | Morpheme-based automatic speech recognition for a morphologically rich language-Amharic. | |
Wang et al. | Combining statistical and knowledge-based spoken language understanding in conditional models | |
KS et al. | Automatic error detection and correction in malayalam | |
Tachbelie et al. | Morpheme-based and factored language modeling for Amharic speech recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20161123 |