CN109657243A - Sensitive information recognition method, system, equipment and storage medium - Google Patents

Sensitive information recognition method, system, equipment and storage medium

Info

Publication number
CN109657243A
Authority
CN
China
Prior art keywords
sensitive information
term vector
word
text sentence
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811544301.5A
Other languages
Chinese (zh)
Inventor
王东
沙韬伟
罗竞佳
邓金秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Manyun Software Technology Co Ltd
Original Assignee
Jiangsu Manyun Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Manyun Software Technology Co Ltd filed Critical Jiangsu Manyun Software Technology Co Ltd
Priority to CN201811544301.5A priority Critical patent/CN109657243A/en
Publication of CN109657243A publication Critical patent/CN109657243A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Abstract

The present invention provides a sensitive information recognition method, system, equipment and storage medium. The method comprises: performing word segmentation on a text sentence to be identified to obtain each composition word; looking up the term vector of each composition word in a trained term vector library; averaging the term vectors of the composition words to obtain an average vector value; inputting the average vector value into a trained sensitive information identification model to obtain a sensitive information probability value; and judging, according to the sensitive information probability value, whether the text sentence includes sensitive information. With this scheme, sensitivity classification is performed on vectorized text, so that whether a text sentence includes sensitive information can be identified quickly and effectively, improving the accuracy of text recognition. The invention can be applied to identifying comments in various types of forums, where the corresponding comment can be deleted when its text sentence includes sensitive information; it can also be applied to identifying sensitive information in other scenarios.

Description

Sensitive information recognition method, system, equipment and storage medium
Technical field
The present invention relates to the technical field of text recognition, and in particular to a sensitive information recognition method, system, equipment and storage medium.
Background
In managing an online forum, some sensitive information needs to be identified and deleted to keep the forum atmosphere positive. Sensitive information may be, for example, illegal or non-compliant content such as negative, reactionary, pornographic or violent material. The main components of post data are text, emoticons, numbers, characters and the like. The data format is very disorderly and the semantics are rich, so feeding post data directly into an existing sensitive information identification model is difficult and gives poor results.
There are two main existing sensitive information identification schemes. The first is brute-force sensitive-word matching, which causes many false positives: text that is not sensitive information is likely to be identified as sensitive. The second is conventional segmentation-based classification, in which a sentence is divided into multiple words and Bayesian classification is then performed on word frequencies. This scheme performs poorly on short sentences: for a short sentence of only three or four words, the text is short both before and after segmentation, so the Bayes classifier cannot produce a good classification result, and it cannot make good use of the correlation between words to obtain an accurate recognition result.
Summary of the invention
In view of the problems in the prior art, the purpose of the present invention is to provide a sensitive information recognition method, system, equipment and storage medium that perform sensitivity classification on vectorized text and can quickly and effectively identify whether a text sentence includes sensitive information.
An embodiment of the present invention provides a sensitive information recognition method, comprising the following steps:
Performing word segmentation on a text sentence to be identified to obtain each composition word;
Looking up the term vector of each composition word in a trained term vector library;
Averaging the term vectors of the composition words to obtain an average vector value;
Inputting the average vector value into a trained sensitive information identification model to obtain a sensitive information probability value, and judging, according to the sensitive information probability value, whether the text sentence includes sensitive information.
Optionally, performing word segmentation on the text sentence to be identified comprises the following step:
Performing word segmentation on the text sentence to be identified using the Jieba segmentation method.
Optionally, the trained term vector library includes a plurality of term vectors trained with GloVe.
Optionally, the trained term vector library includes the term vectors of a plurality of preset sensitive words; when the term vector of each composition word is looked up in the trained term vector library, a default term vector is used for any composition word not found in the library.
Optionally, averaging the term vectors of the composition words comprises taking the column-wise average of the term vectors of the composition words.
Optionally, the method further comprises the step of collecting, as a training set, a plurality of text sentences for which it is known whether they include sensitive information, and training the sensitive information identification model with the training set.
Optionally, training the sensitive information identification model with the training set comprises the following steps:
Performing word segmentation on each text sentence for which it is known whether it includes sensitive information, to obtain the composition words of each text sentence;
Looking up the term vector of each composition word in the trained term vector library;
Averaging the term vectors of the composition words of each text sentence to obtain the average vector value of each text sentence;
Adding a label to the average vector value of each text sentence according to whether that text sentence includes sensitive information;
Training the sensitive information identification model with the average vector values of the text sentences and the labels.
Optionally, judging, according to the sensitive information probability value, whether the text sentence includes sensitive information comprises the following step:
Judging whether the sensitive information probability value is greater than a preset threshold; if so, the text sentence includes sensitive information.
An embodiment of the present invention also provides a sensitive information identification system applying the sensitive information recognition method, the system comprising:
A text word segmentation module, for performing word segmentation on a text sentence to be identified to obtain each composition word;
A term vector query module, for looking up the term vector of each composition word in a trained term vector library;
An average vector value computing module, for averaging the term vectors of the composition words to obtain an average vector value;
A sensitive information identification module, for inputting the average vector value into a trained sensitive information identification model to obtain a sensitive information probability value, and judging, according to the sensitive information probability value, whether the text sentence includes sensitive information.
An embodiment of the present invention also provides sensitive information identification equipment, comprising:
A processor;
A memory storing instructions executable by the processor;
Wherein the processor is configured to perform the steps of the sensitive information recognition method by executing the executable instructions.
An embodiment of the present invention also provides a computer readable storage medium for storing a program which, when executed, implements the steps of the sensitive information recognition method.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the disclosure.
The sensitive information recognition method, system, equipment and storage medium provided by the present invention have the following advantages:
The present invention solves the problems in the prior art. The text sentence is first segmented to obtain each composition word, and the term vectors of the composition words are then averaged to obtain an average vector value. Sensitivity classification is performed on the vectorized text, so that whether a text sentence includes sensitive information can be identified quickly and effectively, improving the accuracy of text recognition. The invention can be applied to identifying comments in various types of forums, where the corresponding comment can be deleted when its text sentence includes sensitive information; it can also be applied to identifying sensitive information in other scenarios.
Brief description of the drawings
Other features, objects and advantages of the invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings.
Fig. 1 is the flow chart of the sensitive information recognition method of one embodiment of the invention;
Fig. 2 is the flow chart of training the sensitive information identification model in one embodiment of the invention;
Fig. 3 is the structural schematic diagram of the sensitive information identifying system of one embodiment of the invention;
Fig. 4 is the schematic diagram of the sensitive information identification equipment of one embodiment of the invention;
Fig. 5 is the schematic diagram of the computer readable storage medium of one embodiment of the invention.
Detailed description of the embodiments
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments can, however, be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures or characteristics may be combined in any suitable manner in one or more embodiments.
In addition, the drawings are merely schematic illustrations of the disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, so repeated description of them is omitted. Some of the block diagrams shown in the drawings are functional entities, which do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different network and/or processor devices and/or microcontroller devices.
As shown in Fig. 1, in order to solve the above technical problem, an embodiment of the present invention provides a sensitive information recognition method, comprising the following steps:
S100: performing word segmentation on a text sentence to be identified to obtain each composition word;
S200: looking up the term vector of each composition word in a trained term vector library;
S300: averaging the term vectors of the composition words to obtain an average vector value;
S400: inputting the average vector value into a trained sensitive information identification model to obtain a sensitive information probability value, and judging, according to the sensitive information probability value, whether the text sentence includes sensitive information.
Thus, the present invention first segments the text sentence in step S100 to obtain each composition word, then looks up the term vector of each composition word in step S200, then averages the term vectors of the composition words in step S300 to obtain the average vector value, and in step S400 performs sensitivity classification on the vectorized text. Whether a text sentence includes sensitive information can thereby be identified quickly and effectively, improving the accuracy of text recognition.
In this embodiment, in step S100, performing word segmentation on the text sentence to be identified includes segmenting it using the Jieba segmentation method. The Jieba segmentation algorithm performs an efficient word-graph scan based on a prefix dictionary, generating a directed acyclic graph of all the possible words that the Chinese characters in the sentence can form. It then uses dynamic programming to find the maximum-probability path, that is, the best segmentation combination based on word frequency, so that the original text sentence to be identified is cut into several composition words. In addition, step S100 may also remove words without substantive meaning, such as modal particles, convert emoticons to Chinese text, convert traditional Chinese characters to simplified Chinese characters, convert lowercase letters to uppercase, and so on. Words without substantive meaning, including modal particles, can be found and deleted by means such as regular expression matching.
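The segmentation idea described above can be illustrated with a minimal sketch. It is not the Jieba library itself: the dictionary `WORD_FREQ`, the particle list `MODAL_PARTICLES` and the function names are hypothetical, and only the particle removal and upper-casing parts of the normalization are shown. The sketch builds the directed acyclic graph of dictionary words and picks the maximum-probability path by dynamic programming.

```python
import math
import re

# Hypothetical toy dictionary (word -> frequency); a real lexicon is far larger.
WORD_FREQ = {"货车": 50, "司机": 40, "货车司机": 30, "论坛": 20,
             "货": 5, "车": 10, "司": 2, "机": 8, "论": 3, "坛": 1}
TOTAL = sum(WORD_FREQ.values())

# Hypothetical set of modal particles to strip before segmentation.
MODAL_PARTICLES = re.compile("[啊呀吧呢嘛哦]")


def normalize(text):
    """Remove modal particles and convert lowercase letters to uppercase."""
    return MODAL_PARTICLES.sub("", text).upper()


def build_dag(sentence):
    """Map each start index to every end index that closes a dictionary word."""
    dag = {}
    for start in range(len(sentence)):
        ends = [end for end in range(start, len(sentence))
                if sentence[start:end + 1] in WORD_FREQ]
        dag[start] = ends or [start]  # an unknown character stands alone
    return dag


def segment(sentence):
    """Choose the maximum-probability path through the DAG, right to left."""
    dag = build_dag(sentence)
    n = len(sentence)
    # best[i] = (best log-probability of sentence[i:], end index of first word)
    best = {n: (0.0, n)}
    for start in range(n - 1, -1, -1):
        best[start] = max(
            (math.log(WORD_FREQ.get(sentence[start:end + 1], 1))
             - math.log(TOTAL) + best[end + 1][0], end)
            for end in dag[start]
        )
    words, i = [], 0
    while i < n:
        end = best[i][1]
        words.append(sentence[i:end + 1])
        i = end + 1
    return words
```

With this toy dictionary, the single word "货车司机" has a higher probability than the pair "货车" and "司机", so the longer word is chosen. In practice a maintained segmenter would replace this sketch.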
In this embodiment, the trained term vector library includes a plurality of term vectors trained with GloVe. GloVe, short for Global Vectors for Word Representation, is a word representation tool based on global word-frequency statistics. It expresses a word as a vector of real numbers, and these vectors capture some semantic properties between words, such as similarity and analogy. Further, the trained term vector library includes the term vectors of a plurality of preset sensitive words. The sensitive information recognition method of the embodiment of the present invention may also include a step of training sensitive term vectors. Specifically, training the sensitive term vectors includes the following steps:
Collecting a plurality of preset sensitive words in advance; counting the co-occurrence matrix of the sensitive words; and training the term vectors of the sensitive words from the co-occurrence matrix using the GloVe method.
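As an illustration of the co-occurrence counting step only, the following sketch counts how often word pairs appear near each other in already segmented sentences. The function name, the `window` parameter and the 1/distance weighting (a convention from the GloVe approach) are assumptions not stated in the text; fitting the term vectors to these counts is left out.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count symmetric, distance-weighted co-occurrences of word pairs.

    `sentences` is a list of word lists; `window` is the maximum distance
    between two words for them to count as co-occurring (an assumed default).
    """
    counts = defaultdict(float)
    for words in sentences:
        for i, center in enumerate(words):
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)  # closer pairs contribute more
                counts[(center, words[j])] += weight
                counts[(words[j], center)] += weight
    return dict(counts)
```

The resulting sparse dictionary plays the role of the co-occurrence matrix; a GloVe trainer would then fit vectors whose dot products approximate the logarithm of these counts.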
In step S200, the term vector of each sensitive word can be found in the trained sensitive term vector library; for a composition word not found in the trained term vector library, a default term vector is used, for example a zero vector of the same length.
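The lookup with a zero-vector fallback can be sketched as follows, assuming the term vector library is held as a plain dictionary from word to list of floats; the names `lookup_vectors`, `vector_library` and `dim` are hypothetical.

```python
def lookup_vectors(words, vector_library, dim):
    """Return one vector per word, using a zero vector for unknown words."""
    zero = [0.0] * dim
    # list(...) copies each vector so callers cannot mutate the library.
    return [list(vector_library.get(word, zero)) for word in words]
```

One consequence of this choice is that unknown words pull the later average toward zero rather than being ignored, which matches the default-vector behaviour described in the text.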
In step S300 of this embodiment, averaging the term vectors of the composition words includes taking the column-wise average of the term vectors. Column-wise averaging here means that the value at each position of one composition word's term vector is averaged with the values at the same position of the other composition words' term vectors, finally yielding an average vector of the same length as each term vector. Each value in the average vector is the mean of the values at the same position in the term vectors of the composition words.
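A minimal sketch of the column-wise average, assuming each term vector is a list of floats of equal length (the function name is hypothetical):

```python
def column_average(vectors):
    """Average equal-length vectors position by position."""
    if not vectors:
        raise ValueError("at least one term vector is required")
    count = len(vectors)
    # zip(*vectors) groups the values found at the same position.
    return [sum(column) / count for column in zip(*vectors)]
```

For example, averaging `[1.0, 2.0]` and `[3.0, 4.0]` gives `[2.0, 3.0]`, a vector of the same length as each input.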
In step S400, judging, according to the sensitive information probability value, whether the text sentence includes sensitive information includes the following step:
Judging whether the sensitive information probability value is greater than a preset threshold; if so, the text sentence includes sensitive information.
In this embodiment, the sensitive information recognition method further includes the step of collecting, as a training set, a plurality of text sentences for which it is known whether they include sensitive information, and training the sensitive information identification model with the training set.
As shown in Fig. 2, specifically, training the sensitive information identification model with the training set includes the following steps:
S210: performing word segmentation on each text sentence for which it is known whether it includes sensitive information, to obtain the composition words of each text sentence;
S220: looking up the term vector of each composition word in the trained term vector library;
S230: averaging the term vectors of the composition words of each text sentence to obtain the average vector value of each text sentence;
S240: adding a label to the average vector value of each text sentence according to whether that text sentence includes sensitive information;
S250: training the sensitive information identification model with the average vector values of the text sentences and the labels. The sensitive information identification model may be a model from the prior art, such as a convolutional neural network model, a support vector machine or a DeepFM model.
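To make the training step concrete, here is a small sketch that fits a classifier on labelled average vectors and returns a sensitive information probability. Logistic regression is used purely as a simple stand-in; the text names convolutional neural networks, support vector machines and DeepFM, not this model. All function names and the hyperparameter defaults are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, bias, vector):
    """Probability that the sentence behind `vector` is sensitive."""
    return sigmoid(sum(w * x for w, x in zip(weights, vector)) + bias)


def train_logistic_regression(features, labels, epochs=500, learning_rate=0.5):
    """Fit weights and bias by batch gradient descent on the log loss.

    `features` are average vector values; `labels` are 1 (sensitive) or 0.
    """
    dim = len(features[0])
    weights, bias = [0.0] * dim, 0.0
    count = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for vector, label in zip(features, labels):
            error = predict_probability(weights, bias, vector) - label
            for k in range(dim):
                grad_w[k] += error * vector[k]
            grad_b += error
        weights = [w - learning_rate * g / count
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / count
    return weights, bias
```

After training, a sentence would be flagged when `predict_probability(...)` exceeds the preset threshold; the threshold value itself is a deployment choice that the text leaves open.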
The method of the invention can be applied to sensitive information identification of content posted by users in a forum, for example to text recognition of drivers' comments in a truck driver forum. When training the sensitive term vector library, some sensitive words that often appear in the forum can first be selected manually; these sensitive words may differ depending on the application scenario of the method. The co-occurrence matrix of the selected sensitive words is then obtained and term vectors are trained, giving the trained term vector library.
When training the sensitive information identification model, the text sentences of some comments can be chosen from the forum, and whether each chosen text sentence includes sensitive information is determined manually. The text sentence of each comment is then segmented, the average vector value of the text sentence of each comment is obtained according to steps S210 to S230, and the labelled average vector values are used to train the sensitive information identification model. Once the sensitive term vector library and the sensitive information identification model have both been trained, the trained library and model can be used to identify comment content automatically, which greatly improves the efficiency and accuracy of text recognition. After the sensitive information identification model yields a sensitive information probability, if the probability value is greater than the preset threshold, the corresponding comment is a sensitive comment and needs to be deleted or blocked, or the user needs to be warned; otherwise the corresponding comment is a normal comment.
Therefore, the present invention can be applied to identifying comments in various types of forums, and the corresponding comment can be deleted when its text sentence includes sensitive information. The invention is not limited to this, however; in other alternative embodiments it can also be applied to identifying sensitive information in other scenarios.
As shown in Fig. 3, an embodiment of the present invention also provides a sensitive information identification system applying the sensitive information recognition method, the system comprising:
A text word segmentation module M100, for performing word segmentation on a text sentence to be identified to obtain each composition word;
A term vector query module M200, for looking up the term vector of each composition word in a trained term vector library;
An average vector value computing module M300, for averaging the term vectors of the composition words to obtain an average vector value;
A sensitive information identification module M400, for inputting the average vector value into a trained sensitive information identification model to obtain a sensitive information probability value, and judging, according to the sensitive information probability value, whether the text sentence includes sensitive information.
Thus, the present invention first segments the text sentence with the text word segmentation module M100 to obtain each composition word, then looks up the term vector of each composition word with the term vector query module M200, then averages the term vectors with the average vector value computing module M300 to obtain the average vector value, and performs sensitivity classification on the vectorized text with the sensitive information identification module M400. Whether a text sentence includes sensitive information can thereby be identified quickly and effectively, improving the accuracy of text recognition.
In this embodiment, the text word segmentation module M100 can use the Jieba segmentation method. The sensitive information identification system can also include a term vector library training module M500, which is used for collecting a plurality of preset sensitive words in advance, counting the co-occurrence matrix of the sensitive words, and training the term vectors of the sensitive words from the co-occurrence matrix using the GloVe method. Further, the sensitive information identification system can also include a sensitive information identification model training module M600, which is used for collecting, as a training set, a plurality of text sentences for which it is known whether they include sensitive information, and training the sensitive information identification model with the training set. The specific training steps can be steps S210 to S250 above.
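One way to picture how modules M100 to M400 fit together is a small class that chains four pluggable callables. This is a structural sketch only: the class name, constructor parameters and the default threshold of 0.5 are assumptions, and each callable stands in for the corresponding module.

```python
class SensitiveInformationRecognizer:
    """Chains segmentation, vector lookup, averaging and classification."""

    def __init__(self, segment, lookup, average, model, threshold=0.5):
        self.segment = segment      # role of M100: sentence -> words
        self.lookup = lookup        # role of M200: words -> term vectors
        self.average = average      # role of M300: vectors -> average vector
        self.model = model          # role of M400: average vector -> probability
        self.threshold = threshold  # preset threshold (assumed default)

    def recognize(self, sentence):
        """Return (probability, is_sensitive) for one text sentence."""
        words = self.segment(sentence)
        vectors = self.lookup(words)
        mean_vector = self.average(vectors)
        probability = self.model(mean_vector)
        return probability, probability > self.threshold
```

Keeping each stage as an injected callable means the segmenter, vector library or classifier can be swapped without touching the pipeline, which mirrors the module separation in the system description.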
An embodiment of the present invention also provides sensitive information identification equipment, including a processor and a memory storing instructions executable by the processor, wherein the processor is configured to perform the steps of the sensitive information recognition method by executing the executable instructions.
Therefore, in the sensitive information identification equipment of the present invention, the processor executes the steps of the sensitive information recognition method: the text sentence is segmented into composition words, the term vectors of the composition words are averaged into an average vector value, and sensitivity classification is performed on the vectorized text, so that whether a text sentence includes sensitive information can be identified quickly and effectively.
Those of ordinary skill in the art will understand that various aspects of the invention can be implemented as a system, a method or a program product. Accordingly, various aspects of the invention can be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may be collectively referred to here as a "circuit", "module" or "platform".
The electronic equipment 600 according to this embodiment of the invention is described below with reference to Fig. 4. The electronic equipment 600 shown in Fig. 4 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present invention.
As shown in Fig. 4, the electronic equipment 600 takes the form of a general-purpose computing device. The components of the electronic equipment 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 connecting the different platform components (including the storage unit 620 and the processing unit 610), a display unit 640, and so on.
The storage unit stores program code that can be executed by the processing unit 610, so that the processing unit 610 performs the steps of the various illustrative embodiments of the invention described in the sensitive information recognition method part of this specification above. For example, the processing unit 610 can perform the steps shown in Fig. 1.
The storage unit 620 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM) 6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.
The storage unit 620 may also include a program/utility 6204 having a set of (at least one) program modules 6205. Such program modules 6205 include, but are not limited to: an operating system, one or more application programs, other program modules and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
The bus 630 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
The electronic equipment 600 may also communicate with one or more external devices 700 (such as a keyboard, a pointing device or a Bluetooth device), with one or more devices that enable a user to interact with the electronic equipment 600, and/or with any device (such as a router or modem) that enables the electronic equipment 600 to communicate with one or more other computing devices. This communication can take place through an input/output (I/O) interface 650. The electronic equipment 600 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 660. The network adapter 660 can communicate with the other modules of the electronic equipment 600 through the bus 630. It should be understood that, although not shown in the drawings, other hardware and/or software modules can be used in conjunction with the electronic equipment 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives and data backup storage platforms.
An embodiment of the present invention also provides a computer readable storage medium for storing a program which, when executed, implements the steps of the sensitive information recognition method. In some possible embodiments, various aspects of the invention can also be implemented in the form of a program product comprising program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps of the various illustrative embodiments of the invention described in the sensitive information recognition method part of this specification above.
Referring to Fig. 5, a program product 800 for implementing the above method according to an embodiment of the present invention is described. It may take the form of a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the invention is not limited to this. In this document, a readable storage medium may be any tangible medium that contains or stores a program, which can be used by or in connection with an instruction execution system, apparatus or device.
The program product may use any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
A readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device. The program code contained on a readable medium can be transmitted over any suitable medium, including but not limited to wireless, wired, optical cable, RF, or any suitable combination of the above.
Program code for carrying out the operations of the present invention can be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user computing device, partly on the user device, as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on a remote computing device or server. Where a remote computing device is involved, the remote computing device can be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, through the Internet using an Internet service provider).
In conclusion compared with prior art, sensitive information recognition methods provided by the present invention, system, equipment and deposit Storage media has the advantage that
The present invention solves the problems of the prior art, carries out cutting to text sentence first and obtains each composition word, so Average computation is carried out according to the term vector of each composition word afterwards, obtains average vector value, sensitive journey is carried out based on vectorization text Degree classification, can quickly and efficiently identify whether text sentence includes sensitive information, improves the accuracy rate of text identification;The present invention It can be applied to the identification commented in various types of forums, can be commented corresponding when including sensitive information in text sentence By deletion, the present invention also can be applied to the identification of the sensitive information of other scenes.
The above content is a further detailed description of the present invention in conjunction with specific preferred embodiments, and it cannot be said that Specific implementation of the invention is only limited to these instructions.For those of ordinary skill in the art to which the present invention belongs, exist Under the premise of not departing from present inventive concept, a number of simple deductions or replacements can also be made, all shall be regarded as belonging to of the invention Protection scope.

Claims (11)

1. A sensitive information recognition method, characterized by comprising the following steps:
performing word segmentation on a text sentence to be identified to obtain each composition word;
looking up the term vector of each composition word in a trained term vector library;
calculating the average of the term vectors of the composition words to obtain an average vector value;
inputting the average vector value into a trained sensitive information identification model to obtain a sensitive information probability value, and judging, according to the sensitive information probability value, whether the text sentence includes sensitive information.
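The steps of claim 1 can be restated compactly in symbols. The notation is ours and does not appear in the patent: w_1, ..., w_n are the composition words of the sentence, v_i is the d-dimensional term vector looked up for w_i, f is the trained sensitive information identification model, and \tau is a decision threshold. Claim 1 only requires that the judgment be made according to the probability value; the threshold comparison shown is the variant stated in claim 8.

```latex
\bar{v} = \frac{1}{n} \sum_{i=1}^{n} v_i , \qquad v_i \in \mathbb{R}^{d}
\qquad
p = f(\bar{v}) \in [0, 1]
\qquad
\text{sentence includes sensitive information} \iff p > \tau
```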
2. The sensitive information recognition method according to claim 1, characterized in that performing word segmentation on the text sentence to be identified comprises the following step:
segmenting the text sentence to be identified using the Jieba word segmentation method.
3. The sensitive information recognition method according to claim 1, characterized in that the trained term vector library includes a plurality of term vectors trained with GloVe.
4. The sensitive information recognition method according to claim 1, characterized in that the trained term vector library includes the term vectors of a plurality of preset sensitive words, and, when looking up the term vector of each composition word in the trained term vector library, a default term vector is used for any composition word not found in the trained term vector library.
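The fallback lookup of claim 4 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the name `lookup_vectors`, the toy library contents, and the choice of a zero vector as the default term vector are all assumptions, since the claim does not say what the default vector is.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def lookup_vectors(words: Sequence[str],
                   library: Dict[str, Vector],
                   default: Vector) -> List[Vector]:
    """Return one term vector per word; words missing from the library get `default`."""
    return [library.get(word, default) for word in words]


# Toy 3-dimensional term vector library (values invented for illustration).
library = {
    "freight": [0.2, 0.1, 0.4],
    "scam": [0.9, 0.8, 0.7],
}
# Assumption: a zero vector stands in for the unspecified default term vector.
default_vector = [0.0, 0.0, 0.0]

vectors = lookup_vectors(["freight", "unknownword", "scam"], library, default_vector)
```

Because every word yields a vector of the same dimension, the later averaging step never has to special-case out-of-vocabulary words.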
5. The sensitive information recognition method according to claim 1, characterized in that calculating the average of the term vectors of the composition words comprises averaging the term vectors of the composition words column by column.
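Column averaging as described in claim 5 can be illustrated with a short sketch: the term vectors are stacked as rows, and each column (dimension) is averaged separately, so the result has the same dimension as a single term vector. The function name `column_average` is hypothetical.

```python
from typing import List, Sequence


def column_average(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a list of equal-length vectors dimension by dimension."""
    if not vectors:
        raise ValueError("at least one term vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all term vectors must have the same dimension")
    n = len(vectors)
    # zip(*vectors) iterates over columns, i.e. over each dimension in turn.
    return [sum(column) / n for column in zip(*vectors)]


average_vector = column_average([[1.0, 2.0], [3.0, 6.0]])
```

For the two vectors above, the first column averages to 2.0 and the second to 4.0.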
6. The sensitive information recognition method according to claim 1, characterized by further comprising the step of collecting a plurality of text sentences known to include or not include sensitive information as a training set, and training the sensitive information identification model using the training set.
7. The sensitive information recognition method according to claim 6, characterized in that training the sensitive information identification model using the training set comprises the following steps:
performing word segmentation on each text sentence known to include or not include sensitive information, to obtain the composition words corresponding to each text sentence;
looking up the term vector of each composition word in the trained term vector library;
calculating the average of the term vectors of the composition words of each text sentence, to obtain the average vector value of each text sentence;
adding a label to the average vector value of each text sentence according to whether that text sentence includes sensitive information;
training the sensitive information identification model using the average vector value and the label of each text sentence.
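The final training step of claim 7 can be sketched as follows. The claims do not fix the type of the sensitive information identification model, so this sketch assumes a plain logistic regression trained by stochastic gradient descent as a stand-in; the function names, hyperparameters, and toy data are all invented for illustration. The inputs are already-averaged sentence vectors paired with 0/1 labels.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Sensitive information probability for one average vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic(samples: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   epochs: int = 500,
                   lr: float = 0.5) -> Tuple[List[float], float]:
    """Fit logistic regression with per-sample gradient steps on the log loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = predict_proba(weights, bias, x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Toy training set: average vectors labelled 1 (sensitive) or 0 (not sensitive).
train_x = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
train_y = [1, 1, 0, 0]
weights, bias = train_logistic(train_x, train_y)
```

Any classifier that outputs a probability could take the place of the logistic regression here; the point is only that the model is fitted on (average vector, label) pairs.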
8. The sensitive information recognition method according to claim 1, characterized in that judging, according to the sensitive information probability value, whether the text sentence includes sensitive information comprises the following step:
judging whether the sensitive information probability value is greater than a preset threshold, and if so, the text sentence includes sensitive information.
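The threshold decision of claim 8 amounts to a single strict comparison. In the sketch below, the function name and the default threshold of 0.5 are assumptions; the claim only says that the threshold is preset.

```python
def contains_sensitive_info(probability: float, threshold: float = 0.5) -> bool:
    """Return True when the probability is strictly greater than the preset threshold."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return probability > threshold


flagged = contains_sensitive_info(0.8)        # above the assumed default threshold
borderline = contains_sensitive_info(0.5)     # equal to the threshold, so not flagged
strict = contains_sensitive_info(0.2, threshold=0.1)
```

Lowering the threshold flags more sentences at the cost of more false positives, so in practice it would be tuned on held-out labelled data.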
9. A sensitive information identification system, characterized in that it is applied to the sensitive information recognition method according to any one of claims 1 to 8, the system comprising:
a text word segmentation module, configured to perform word segmentation on a text sentence to be identified to obtain each composition word;
a term vector query module, configured to look up the term vector of each composition word in a trained term vector library;
an average vector value calculation module, configured to calculate the average of the term vectors of the composition words to obtain an average vector value;
a sensitive information identification module, configured to input the average vector value into a trained sensitive information identification model to obtain a sensitive information probability value, and to judge, according to the sensitive information probability value, whether the text sentence includes sensitive information.
10. A sensitive information identification device, characterized by comprising:
a processor;
a memory storing instructions executable by the processor;
wherein the processor is configured to perform, by executing the executable instructions, the steps of the sensitive information recognition method according to any one of claims 1 to 8.
11. A computer-readable storage medium for storing a program, characterized in that the program, when executed, implements the steps of the sensitive information recognition method according to any one of claims 1 to 8.
CN201811544301.5A 2018-12-17 2018-12-17 Sensitive information recognition methods, system, equipment and storage medium Pending CN109657243A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811544301.5A CN109657243A (en) 2018-12-17 2018-12-17 Sensitive information recognition methods, system, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN109657243A true CN109657243A (en) 2019-04-19

Family

ID=66113290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811544301.5A Pending CN109657243A (en) 2018-12-17 2018-12-17 Sensitive information recognition methods, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109657243A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160321243A1 (en) * 2014-01-10 2016-11-03 Cluep Inc. Systems, devices, and methods for automatic detection of feelings in text
CN106445998A (en) * 2016-05-26 2017-02-22 达而观信息科技(上海)有限公司 Text content auditing method and system based on sensitive word
CN107679075A (en) * 2017-08-25 2018-02-09 北京德塔精要信息技术有限公司 Method for monitoring network and equipment
CN108984530A (en) * 2018-07-23 2018-12-11 北京信息科技大学 A kind of detection method and detection system of network sensitive content

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134966A (en) * 2019-05-21 2019-08-16 中电健康云科技有限公司 A kind of sensitive information determines method and device
CN110222182A (en) * 2019-06-06 2019-09-10 腾讯科技(深圳)有限公司 A kind of statement classification method and relevant device
CN110222182B (en) * 2019-06-06 2022-12-27 腾讯科技(深圳)有限公司 Statement classification method and related equipment
CN110469753A (en) * 2019-07-16 2019-11-19 盐城师范学院 A kind of digital content dispensing device
CN110633577B (en) * 2019-08-22 2023-08-29 创新先进技术有限公司 Text desensitization method and device
CN110633577A (en) * 2019-08-22 2019-12-31 阿里巴巴集团控股有限公司 Text desensitization method and device
CN110737818A (en) * 2019-09-06 2020-01-31 平安科技(深圳)有限公司 Network release data processing method and device, computer equipment and storage medium
CN110737818B (en) * 2019-09-06 2024-02-27 平安科技(深圳)有限公司 Network release data processing method, device, computer equipment and storage medium
CN110674414A (en) * 2019-09-20 2020-01-10 北京字节跳动网络技术有限公司 Target information identification method, device, equipment and storage medium
CN112560472B (en) * 2019-09-26 2023-07-11 腾讯科技(深圳)有限公司 Method and device for identifying sensitive information
CN112560472A (en) * 2019-09-26 2021-03-26 腾讯科技(深圳)有限公司 Method and device for identifying sensitive information
CN110727880B (en) * 2019-10-18 2022-06-17 西安电子科技大学 Sensitive corpus detection method based on word bank and word vector model
CN110727880A (en) * 2019-10-18 2020-01-24 西安电子科技大学 Sensitive corpus detection method based on word bank and word vector model
CN111143884B (en) * 2019-12-31 2022-07-12 北京懿医云科技有限公司 Data desensitization method and device, electronic equipment and storage medium
CN111159354A (en) * 2019-12-31 2020-05-15 中国银行股份有限公司 Sensitive information detection method, device, equipment and system
CN111143884A (en) * 2019-12-31 2020-05-12 北京懿医云科技有限公司 Data desensitization method and device, electronic equipment and storage medium
CN111275327A (en) * 2020-01-19 2020-06-12 深圳前海微众银行股份有限公司 Resource allocation method, device, equipment and storage medium
CN113538002A (en) * 2020-04-14 2021-10-22 北京沃东天骏信息技术有限公司 Method and device for auditing texts
CN111782811A (en) * 2020-07-03 2020-10-16 湖南大学 E-government affair sensitive text detection method based on convolutional neural network and support vector machine
CN112732912B (en) * 2020-12-30 2024-04-09 平安科技(深圳)有限公司 Sensitive trend expression detection method, device, equipment and storage medium
CN112732912A (en) * 2020-12-30 2021-04-30 平安科技(深圳)有限公司 Sensitive tendency expression detection method, device, equipment and storage medium
CN113128220A (en) * 2021-04-30 2021-07-16 北京奇艺世纪科技有限公司 Text distinguishing method and device, electronic equipment and storage medium
CN113128220B (en) * 2021-04-30 2023-07-18 北京奇艺世纪科技有限公司 Text discrimination method, text discrimination device, electronic equipment and storage medium
CN113569046A (en) * 2021-07-19 2021-10-29 北京华宇元典信息服务有限公司 Judgment document character relation identification method and device and electronic equipment
CN114239591B (en) * 2021-12-01 2023-08-18 马上消费金融股份有限公司 Sensitive word recognition method and device
CN114239591A (en) * 2021-12-01 2022-03-25 马上消费金融股份有限公司 Sensitive word recognition method and device
CN115544240A (en) * 2022-11-24 2022-12-30 闪捷信息科技有限公司 Text sensitive information identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190419