CN112560472A - Method and device for identifying sensitive information - Google Patents

Method and device for identifying sensitive information

Info

Publication number
CN112560472A
Authority
CN
China
Prior art keywords
word
vector
information
words
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910918780.0A
Other languages
Chinese (zh)
Other versions
CN112560472B (en)
Inventor
赵妍妍
罗观柱
秦兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Tencent Technology Shenzhen Co Ltd
Original Assignee
Harbin Institute of Technology
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology, Tencent Technology Shenzhen Co Ltd filed Critical Harbin Institute of Technology
Priority to CN201910918780.0A priority Critical patent/CN112560472B/en
Publication of CN112560472A publication Critical patent/CN112560472A/en
Application granted granted Critical
Publication of CN112560472B publication Critical patent/CN112560472B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present application relates to a method and an apparatus for identifying sensitive information, and belongs to the field of information processing. The method comprises the following steps: acquiring word vectors of m words included in text information to be recognized, wherein the word vector of a first word is a semantic representation of the first word, the first word is any one of the m words, and m is an integer greater than 1; generating a hidden layer vector of the first word based on the word vector of the first word and the word vector of a second word, the second word being a word adjacent to the first word (preceding or following it), the hidden layer vector of the first word being a semantic representation of the first word and of its context information; and identifying whether the text information is sensitive information according to the hidden layer vectors of the m words. The method and the apparatus can improve the accuracy of identifying sensitive information.

Description

Method and device for identifying sensitive information
Technical Field
The present application relates to the field of information processing, and in particular, to a method and an apparatus for identifying sensitive information.
Background
In today's highly informatized world, the internet has become one of the important media through which people acquire information, bringing great convenience to work and daily life. The conversation service robot, a product of new internet technology, uses a powerful knowledge base and computing capability to provide question-and-answer services and intelligent operations for people, and has attracted wide attention from academia and industry. Although research on conversation service robots has made considerable progress, the openness and immediacy of conversation content have also made it an important channel through which lawbreakers spread sensitive information such as vulgar and obscene content.
To prevent the propagation of sensitive information, sensitive information can currently be identified as follows. A sensitive dictionary containing a large number of sensitive words is set in advance. The text information to be recognized is checked for sensitive words from the sensitive dictionary; if any sensitive word is found, the text information is recognized as sensitive information. The text information may be a single sentence or several sentences composed of multiple words.
In the process of implementing the present application, the inventors found that the above approach has at least the following defect:
recognition of sensitive information currently relies on the sensitive dictionary, so when new sensitive words that are not in the sensitive dictionary appear in text information, the text information cannot be recognized as sensitive, which reduces the accuracy of identifying sensitive information.
Disclosure of Invention
The embodiment of the application provides a method and a device for identifying sensitive information, so as to improve the accuracy of identifying sensitive information. The technical scheme is as follows:
in one aspect, a method of identifying sensitive information is provided, the method comprising:
acquiring word vectors of m words included in text information to be recognized, wherein the word vector of a first word is a semantic representation of the first word, the first word is one of the m words, and m is an integer greater than 1;
generating a hidden layer vector of the first word based on the word vector of the first word and the word vector of a second word, the second word being a word adjacent to the first word (immediately preceding or following it), the hidden layer vector of the first word being a semantic representation of the first word and of its context information;
and identifying whether the text information is sensitive information according to the hidden vector of the m words.
Optionally, the generating a hidden layer vector of the first word based on the word vector of the first word and the word vector of the second word includes:
inputting the word vector of each of the m words into a context information classification model in the order in which each word appears in the text information, the context information classification model being used for generating the hidden layer vector of the first word based on the word vector of the first word and the word vector of the second word;
obtaining the hidden layer vector of each word output by the context information classification model.
Optionally, the identifying whether the text information is sensitive information according to the hidden vectors of the m words includes:
setting a first weight of each of the m words according to the hidden layer vectors of the m words, wherein the first weight of a word represents the word's contribution to the text information being sensitive information;
acquiring an information matrix of the text information according to the first weight of each word and the hidden vector of each word;
and determining whether the text information is sensitive information according to the information matrix of the text information.
Optionally, the setting a first weight of each of the m words according to the hidden vector of the m words includes:
inputting the hidden layer vectors of the m words into a weight assignment model, the weight assignment model being used for setting the first weight of each of the m words based on the hidden layer vectors of the m words;
and obtaining the first weight of each word output by the weight assignment model.
Optionally, the determining whether the text information is sensitive information according to the information matrix of the text information includes:
inputting the information matrix of the text information into a dimension reduction model, wherein the dimension reduction model is used for carrying out dimension reduction processing on the information matrix of the text information to obtain a two-dimensional vector of the text information;
and determining whether the text information is sensitive information or not through a classification function according to the two-dimensional vector of the text information.
Optionally, before the obtaining the information matrix of the text information according to the first weight of each word and the hidden layer vector of each word, the method further includes:
adjusting the first weight of each word according to a first dictionary and a second dictionary to obtain a second weight of each word, wherein the first dictionary is used for storing sensitive words, and the second dictionary is used for storing words whose frequency of co-occurrence with sensitive words meets a preset condition;
the obtaining of the information matrix of the text information according to the first weight of each word and the hidden vector of each word includes:
and acquiring an information matrix of the text information according to the second weight of each word and the hidden layer vector of each word.
Optionally, the adjusting the first weight of each word according to the first dictionary and the second dictionary to obtain the second weight of each word includes:
when the first word is a sensitive word in the first dictionary, increasing the first weight of the first word to obtain a second weight;
when the first word is a word in the second dictionary, reducing the first weight of the first word to obtain a second weight.
In another aspect, the present application provides an apparatus for identifying sensitive information, the apparatus comprising:
the device comprises an acquisition module, a recognition module and a recognition module, wherein the acquisition module is used for acquiring word vectors of m words included in text information to be recognized, the word vector of a first word is semantic representation of the first word, the first word is one of the m words, and m is an integer greater than 1;
a generating module, configured to generate a hidden layer vector of the first word based on the word vector of the first word and the word vector of a second word, where the second word is a word adjacent to the first word (immediately preceding or following it), and the hidden layer vector of the first word is a semantic representation of the first word and of its context information;
and the identification module is used for identifying whether the text information is sensitive information according to the hidden vector of the m words.
Optionally, the generating module is configured to:
inputting the word vector of each of the m words into a context information classification model in the order in which each word appears in the text information, the context information classification model being used for generating the hidden layer vector of the first word based on the word vector of the first word and the word vector of the second word;
obtaining the hidden layer vector of each word output by the context information classification model.
Optionally, the identification module is configured to:
setting a first weight of each of the m words according to the hidden layer vectors of the m words, wherein the first weight of a word represents the word's contribution to the text information being sensitive information;
acquiring an information matrix of the text information according to the first weight of each word and the hidden vector of each word;
and determining whether the text information is sensitive information according to the information matrix of the text information.
Optionally, the identification module is configured to:
inputting the hidden layer vectors of the m words into a weight assignment model, the weight assignment model being used for setting the first weight of each of the m words based on the hidden layer vectors of the m words;
and obtaining the first weight of each word output by the weight assignment model.
Optionally, the identification module is configured to:
inputting the information matrix of the text information into a dimension reduction model, wherein the dimension reduction model is used for carrying out dimension reduction processing on the information matrix of the text information to obtain a two-dimensional vector of the text information;
and determining whether the text information is sensitive information or not through a classification function according to the two-dimensional vector of the text information.
Optionally, the apparatus further comprises:
the adjusting module is used for adjusting the first weight of each word according to a first dictionary and a second dictionary to obtain a second weight of each word, wherein the first dictionary is used for storing sensitive words, and the second dictionary is used for storing words whose frequency of co-occurrence with sensitive words meets a preset condition;
optionally, the identification module is configured to:
and acquiring an information matrix of the text information according to the second weight of each word and the hidden layer vector of each word.
Optionally, the adjusting module is configured to:
when the first word is a sensitive word in the first dictionary, increasing the first weight of the first word to obtain a second weight;
when the first word is a word in the second dictionary, reducing the first weight of the first word to obtain a second weight.
In another aspect, the present application provides an electronic device comprising at least one processor and at least one memory, the at least one memory storing at least one instruction, the at least one instruction being loaded and executed by the at least one processor to implement the above-mentioned method.
In another aspect, the present application provides a computer-readable storage medium for storing at least one instruction which is loaded and executed by a processor to implement the above-described method.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
Word vectors of the m words included in the text information to be recognized are obtained, where the word vector of a first word is a semantic representation of that word and the first word is any one of the m words. A hidden layer vector of the first word is generated based on the word vector of the first word and the word vector of a second word, where the second word is a word adjacent to the first word (preceding or following it), and the hidden layer vector of the first word is a semantic representation of the first word and of its context information. Whether the text information is sensitive information is then identified according to the hidden layer vectors of the m words. Because recognition is based on the semantics of the whole text information rather than on a dictionary, text information can be identified as sensitive even when it contains newly appearing sensitive words that are absent from any sensitive dictionary, which improves the accuracy of identifying sensitive information.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow chart of a method for identifying sensitive information according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a word vector provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of hidden layer vectors provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of assigning first weights provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of adjusting first weights according to an embodiment of the present application;
FIG. 6 is a flowchart of a training method provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an apparatus for identifying sensitive information according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a terminal according to an embodiment of the present application.
The drawings above illustrate specific embodiments of the present application, which are described in more detail below. These drawings and the written description are not intended to limit the scope of the inventive concept in any way, but rather to illustrate it to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The conversation service robot is a computer program that conducts conversations via speech or text and can simulate human dialogue. It can be used for applications such as customer service and information acquisition. When interacting with a conversation service robot, people may input sensitive information, including pornographic and obscene content.
To prevent the propagation of such sensitive information, it is necessary to recognize whether the text information input into the conversation service robot is sensitive. In the present application, an intelligent model for identifying sensitive information can be trained, and this model can recognize whether text information is sensitive based on its semantics. The intelligent model is composed of several components: a context information classification model, a weight assignment model, a dimension reduction model, and a classification function.
Referring to fig. 1, an embodiment of the present application provides a method for identifying sensitive information, including:
step 101: the method comprises the steps of obtaining word vectors of m words included in text information to be recognized, wherein the word vector of the first word is semantic representation of the first word, the first word is one of the m words, and m is an integer larger than 1.
In this step, the text information to be recognized can be segmented to obtain m words; and acquiring a word vector of each word from the corresponding relation between the word and the word vector according to each word in the m words.
For each record in the correspondence of a word to a word vector, the word vector stored in the record is established based on the semantics of the word stored in the record, so the word vector is a semantic representation of the word.
For any word, a multi-dimensional word vector may be generated based on the semantics of the word and used to represent it. For example, a d-dimensional word vector is generated based on the word's semantics and replaces the word, where d may be, for example, 100, 200, or 300.
The corresponding relation between the words and the word vectors can be downloaded from a network.
For example, assume the text information to be recognized is "someone is drowning in uniform temptation." Segmenting this text information yields 5 words: "someone", "drowning", "in", "uniform temptation", and "."; the word vector of each of the 5 words is then obtained from the correspondence between words and word vectors. As shown in fig. 2, the word vector of each of the 5 words is a d-dimensional vector, that is, each word vector consists of d floating point numbers.
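Step 101 can be sketched as follows; the segmentation is taken as given, and the vocabulary, the small dimension d = 4, and the random vector table are hypothetical stand-ins for a downloaded word-to-word-vector correspondence:

```python
import numpy as np

d = 4  # embedding dimension; kept small here, the text suggests e.g. 100, 200, or 300
rng = np.random.default_rng(0)

# Hypothetical pretrained correspondence between words and word vectors
vocab = ["someone", "drowning", "in", "uniform temptation", "."]
word_vectors = {w: rng.standard_normal(d) for w in vocab}

def text_to_word_vectors(words, table):
    """Step 101: look up the d-dimensional word vector of each segmented word."""
    return np.stack([table[w] for w in words])

W = text_to_word_vectors(vocab, word_vectors)
print(W.shape)  # (5, 4): m = 5 words, each represented by a d-dimensional vector
```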
Step 102: and inputting the word vectors of the m words into a context information classification model, and acquiring the hidden layer vector of each word output by the context information classification model.
The context information classification model is used for generating a hidden layer vector of a first word based on a word vector of the first word and a word vector of a second word, wherein the second word is a word adjacent to the first word.
The context information classification model is obtained by training with a first deep learning algorithm in advance; the first deep learning algorithm may be a BiLSTM (bidirectional long short-term memory network).
The hidden layer vector of the first term is a semantic representation of the first term and a semantic representation of the context information.
In this step, the word vector of each of the m words is input to the context information classification model in the order of each word in the text information. The context information classification model generates a hidden layer vector of the first word according to the word vector of the first word, the word vectors of x second words which are positioned in front of the first word and adjacent to the first word, and the word vectors of x second words which are positioned behind the first word and adjacent to the first word, wherein x is an integer which is greater than or equal to 1.
The hidden vector of the first word includes semantic information of a second word preceding the first word and/or semantic information of a second word following the first word.
And under the condition that the first word is the first word in the text information, the context information classification model generates a hidden layer vector of the first word according to the word vector of the first word and the word vectors of x second words which are positioned behind the first word and adjacent to the first word. And under the condition that the first word is the last word in the text information, the context information classification model generates a hidden layer vector of the first word according to the word vector of the first word and the word vectors of x second words which are positioned before the first word and are adjacent to the first word.
The hidden layer vector of the first word is a k-dimensional vector, where k may be equal to, greater than, or smaller than d.
For example, in this step, assuming x is 1, the word vectors of "someone", "drowning", "in", "uniform temptation", and "." are input to the context information classification model in order. For the word "someone", the model generates the hidden layer vector of "someone" from the word vectors of "someone" and "drowning". For the word "drowning", the model generates the hidden layer vector of "drowning" from the word vectors of "someone", "drowning", and "in"; the hidden layer vectors of "in" and "uniform temptation" are generated similarly. For the word ".", the model generates its hidden layer vector from the word vectors of "uniform temptation" and ".". Referring to fig. 3, the hidden layer vectors of the five words output by the context information classification model are then obtained.
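The hidden layer vector computation of step 102 can be sketched as follows; a plain tanh recurrent network run in both directions stands in for the trained BiLSTM, and all weights are random, untrained placeholders:

```python
import numpy as np

def bidirectional_hidden_vectors(W, k=3, seed=1):
    """Step 102 sketch: a simple tanh RNN run forward and backward stands in
    for the BiLSTM. Each word's hidden layer vector concatenates the two
    directions, so it encodes the word together with its left and right
    context. The weights here are random, i.e. untrained placeholders."""
    rng = np.random.default_rng(seed)
    m, d = W.shape
    Wx = 0.1 * rng.standard_normal((k, d))  # input-to-hidden weights
    Wh = 0.1 * rng.standard_normal((k, k))  # hidden-to-hidden weights

    def run(seq):
        h, out = np.zeros(k), []
        for x in seq:
            h = np.tanh(Wx @ x + Wh @ h)  # each state depends on all earlier words
            out.append(h)
        return out

    forward = run(W)                # left-to-right pass
    backward = run(W[::-1])[::-1]   # right-to-left pass, realigned to word order
    return np.stack([np.concatenate(pair) for pair in zip(forward, backward)])

W = np.random.default_rng(0).standard_normal((5, 4))  # m = 5 word vectors, d = 4
H = bidirectional_hidden_vectors(W)
print(H.shape)  # (5, 6): one hidden layer vector of dimension 2k per word
```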
Step 103: inputting the hidden layer vectors of the m words into a weight assignment model, and obtaining the first weight of each word output by the weight assignment model.
The weight assignment model is used for setting a first weight for each of the m words based on their hidden layer vectors; the first weight of a word represents the word's contribution to the text information being sensitive information.
The weight assignment model is obtained by training a second deep learning algorithm in advance, and the second deep learning algorithm may be self-attention.
The weight assignment model is an attention mechanism that imitates the attention of the human brain. Intuitively, it allocates limited resources: at any given moment, human attention focuses on one salient part of a picture and ignores the rest.
When the hidden layer vectors of the m words are input into the weight assignment model, the model allocates more attention to words that may be sensitive: the first weight assigned to a possibly sensitive word is larger than the first weights assigned to the other words.
Sensitive words refer to words related to gambling, pornography, drugs, or politics.
In this step, the hidden layer vectors of the m words are input to the weight assignment model. The model sets the first weight of each of the m words according to these hidden layer vectors and outputs it; the first weight of each word output by the model is then obtained.
For example, referring to fig. 4, the hidden layer vectors of "someone", "drowning", "in", "uniform temptation", and "." are input to the weight assignment model. Based on these five hidden layer vectors, the model assigns the first weight α1 to "someone", α2 to "drowning", α3 to "in", α4 to "uniform temptation", and α5 to ".".
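A minimal sketch of the weight assignment of step 103, assuming a simple learned-query attention scorer; the query vector here is a random placeholder, not the trained self-attention model:

```python
import numpy as np

def first_weights(H, seed=2):
    """Step 103 sketch: a query vector (standing in for learned attention
    parameters) scores every hidden layer vector, and a softmax turns the
    scores into first weights that are positive and sum to 1."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(H.shape[1])  # hypothetical learned query
    scores = H @ q                       # one scalar score per word
    e = np.exp(scores - scores.max())    # numerically stable softmax
    return e / e.sum()

H = np.random.default_rng(0).standard_normal((5, 6))  # m = 5 hidden layer vectors
alpha = first_weights(H)
print(alpha.shape)  # (5,): one first weight per word
```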
Step 104: and adjusting the first weight of each word according to the first dictionary and the second dictionary to obtain the second weight of each word.
The first dictionary is used for storing sensitive words, and the second dictionary is used for storing words whose frequency of co-occurrence with sensitive words meets a preset condition.
The first dictionary may be downloaded from a network. The words stored in the second dictionary are suspected sensitive words, and the intersection of the first dictionary and the second dictionary is empty. A word in the second dictionary is not a true sensitive word, but it may often appear in the same text information as sensitive words. For example, for the pornography-related sensitive word "uniform temptation", the word "drowning", which often appears in the same text information as "uniform temptation", is a suspected sensitive word.
To build the second dictionary, a large amount of text information can be collected in advance, and each piece of text information is segmented into a plurality of words. These words are searched for sensitive words from the first dictionary. If a sensitive word is present, the co-occurrence count of every other word that is not in the first dictionary (every non-sensitive word) is incremented; the initial value of each non-sensitive word's co-occurrence count is 0. After processing every piece of collected text information in this way, the non-sensitive words whose co-occurrence count with sensitive words exceeds a preset frequency threshold are taken as suspected sensitive words and form the second dictionary.
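The construction of the second dictionary described above can be sketched as follows; the corpus and threshold are illustrative:

```python
from collections import Counter

def build_second_dictionary(corpus, first_dictionary, threshold):
    """For every segmented text that contains a sensitive word, increment the
    co-occurrence count of each non-sensitive word in it; keep the words
    whose count exceeds the threshold as suspected sensitive words."""
    counts = Counter()
    for words in corpus:
        if any(w in first_dictionary for w in words):
            for w in words:
                if w not in first_dictionary:
                    counts[w] += 1
    return {w for w, c in counts.items() if c > threshold}

corpus = [
    ["someone", "drowning", "in", "uniform temptation", "."],
    ["drowning", "uniform temptation"],
    ["ordinary", "text"],
]
second = build_second_dictionary(corpus, {"uniform temptation"}, threshold=1)
print(second)  # {'drowning'}: it co-occurred with a sensitive word twice
```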
In this step, when the first word is a sensitive word in the first dictionary, its first weight is increased to obtain the second weight. When the first word is a word in the second dictionary, its first weight is reduced to obtain the second weight. When the first word is in neither the first dictionary nor the second dictionary, its first weight is used as the second weight.
A first word that is in the first dictionary is a true sensitive word, so its first weight needs to be increased. A first word that is in the second dictionary is not a sensitive word, but the model easily mistakes it for one, so its first weight is reduced. A first word that is in neither dictionary may or may not be a newly appearing sensitive word absent from the first dictionary, so its weight is kept unchanged.
Concretely, the first weight of the first word is multiplied by a first coefficient greater than 1 to increase it, or multiplied by a second coefficient smaller than 1 to reduce it, yielding the second weight.
For example, referring to fig. 5, "someone", "drowning", "in", "uniform temptation", and "." are looked up in the first dictionary and the second dictionary, which finds that "drowning" is in the second dictionary and "uniform temptation" is in the first dictionary. Thus the second weight of "uniform temptation" is β4 = α4 × a with a > 1, and the second weight of "drowning" is β2 = α2 × b with b < 1. The second weights of the remaining words are unchanged: β1 = α1 for "someone", β3 = α3 for "in", and β5 = α5 for ".".
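A sketch of the weight adjustment of step 104; the coefficients a = 1.5 and b = 0.5 are illustrative values, since the text only requires a > 1 and b < 1:

```python
def second_weights(words, alpha, first_dict, second_dict, a=1.5, b=0.5):
    """Step 104 sketch: boost true sensitive words by a > 1, damp suspected
    false positives from the second dictionary by b < 1, and leave all other
    weights unchanged. a and b are illustrative values, not from the text."""
    beta = []
    for word, weight in zip(words, alpha):
        if word in first_dict:
            beta.append(weight * a)      # real sensitive word: increase
        elif word in second_dict:
            beta.append(weight * b)      # suspected false positive: reduce
        else:
            beta.append(weight)          # possibly a new sensitive word: keep
    return beta

words = ["someone", "drowning", "in", "uniform temptation", "."]
alpha = [0.1, 0.3, 0.1, 0.4, 0.1]
beta = second_weights(words, alpha, {"uniform temptation"}, {"drowning"})
print([round(x, 2) for x in beta])  # [0.1, 0.15, 0.1, 0.6, 0.1]
```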
Step 105: and acquiring an information matrix of the text information according to the second weight of each word and the hidden layer vector of each word.
In this step, the hidden layer vectors of the m words are formed into a d×k first matrix H, the second weights of the m words are formed into a weight vector, and the information matrix R of the text information is obtained from the first matrix H and the transposed vector of the weight vector according to the following first formula.
The first formula is R = H · βᵀ, where β is the weight vector and βᵀ is its transposed vector.
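The first formula is a weighted combination of the hidden layer vectors. A dependency-free sketch, assuming the column count k of H equals the word count m (the function name is illustrative):

```python
def information_matrix(H, beta):
    """Compute R = H · βᵀ per the first formula.

    H    : d×m list of lists; column j is the hidden layer vector of word j.
    beta : length-m list of (second) weights, one per word.
    Returns R as a length-d list: each hidden dimension aggregates the
    weighted contributions of all m words.
    """
    d, m = len(H), len(beta)
    assert all(len(row) == m for row in H)
    return [sum(H[i][j] * beta[j] for j in range(m)) for i in range(d)]
```

With d = 2 hidden dimensions and m = 2 words, `information_matrix([[1.0, 2.0], [3.0, 4.0]], [0.5, 1.0])` yields the 2-element vector `[2.5, 5.5]`.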
Step 104 is an optional step, that is, step 104 may not be executed, and step 105 is directly executed after step 103 is executed. Thus, in step 105, an information matrix of the text information is obtained according to the first weight of each word and the hidden layer vector of each word. When the method is implemented, hidden layer vectors of the m words are formed into a first matrix H of dxk, first weights of the m words are formed into weight vectors, and an information matrix R of the text information is obtained according to the first matrix H and a transposed vector of the weight vectors according to the first formula.
Step 106: and inputting the information matrix of the text information into a dimension reduction model, and acquiring the two-dimensional vector of the text information output by the dimension reduction model.
The dimension reduction model is used for carrying out dimension reduction processing on an information matrix of the text information to obtain a two-dimensional vector of the text information.
The dimension reduction model is obtained by training a third deep learning algorithm in advance, and the third deep learning algorithm can be a full connection layer.
Step 107: and determining whether the text information is sensitive information or not through a classification function according to the two-dimensional vector of the text information.
In this step, the two-dimensional vector of the text information is input to the classification function. The classification function calculates the probability that the text information is sensitive information according to the two-dimensional vector, and when the probability is greater than a preset probability threshold value, the text information is determined to be sensitive information.
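The classification step can be sketched as follows, assuming a softmax classification function with index 1 as the "sensitive" class and a default threshold of 0.5 — the patent does not fix these details, so they are illustrative:

```python
import math

def is_sensitive(two_dim_vector, threshold=0.5):
    """Softmax over the 2-D vector, then compare the 'sensitive'
    probability against a preset threshold (step 107 sketch)."""
    z0, z1 = two_dim_vector
    m = max(z0, z1)                        # subtract max for numerical stability
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    p_sensitive = e1 / (e0 + e1)           # probability the text is sensitive
    return p_sensitive > threshold, p_sensitive
```

For example, `is_sensitive((0.0, 2.0))` reports the text as sensitive with probability ≈ 0.88, while `is_sensitive((2.0, 0.0))` reports it as non-sensitive.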
In the embodiment of the application, word vectors of m words included in text information to be recognized are obtained, the word vector of a first word is a semantic representation of the first word, and the first word is one of the m words. And generating a hidden layer vector of the first word based on the word vector of the first word and the word vector of the second word, wherein the second word is a word adjacent to the first word in front and back, and the hidden layer vector of the first word is a semantic representation of the first word and a semantic representation of context information. And identifying whether the text information is sensitive information according to the hidden vectors of the m words. Therefore, whether the text information is sensitive information or not is identified based on the semantic information of the whole text information, and whether the text information is sensitive information or not can be identified even if the text information does not include newly appeared sensitive words, so that the accuracy of identifying the sensitive information is improved. In addition, after the first weight of each word is set, the first weight of each word can be adjusted according to the first dictionary including the sensitive words and the second dictionary including the suspected sensitive words to obtain the second weight of each word, whether the text information is sensitive information or not is identified based on the second weight of each word, and the accuracy of identifying the sensitive information can be further improved.
For the intelligent model for identifying sensitive information, a sample set may be set in advance, where the sample set includes a plurality of text messages and label information corresponding to each text message, and the label information corresponding to a text message may be whether the text message is sensitive information. For example, the label information corresponding to the text information may be 1, which indicates that the text information is sensitive information, or the label information corresponding to the text information may be 0, which indicates that the text information is non-sensitive information. Or, the label information corresponding to the text information may be 0, which indicates that the text information is sensitive information, or the label information corresponding to the text information may be 1, which indicates that the text information is non-sensitive information.
After the sample set is set, the sample set can be used for training a deep learning algorithm to obtain an intelligent model for identifying sensitive information. Referring to fig. 6, the training method includes:
for any text information in the sample set, it is first identified whether the text information is sensitive information using the flow of steps 201 to 207 as follows.
Step 201: this step is the same as step 101 and is not described in detail here.
Step 202: and inputting the word vectors of the m words into a first deep learning network, and acquiring the hidden layer vector of each word output by the first deep learning network.
The first deep learning network is used for generating a hidden layer vector of the first word based on the word vector of the first word and the word vector of a second word, where the second word is a word adjacent to the first word in front of and behind. The first deep learning algorithm may be BiLSTM.
The detailed implementation process of this step is the same as the detailed implementation process of step 102, and the detailed implementation process of this step can be obtained only by replacing the context information classification model in step 102 with the first deep learning network, which is not described in detail herein.
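To illustrate the bidirectional information flow that makes each hidden vector reflect both left and right context, here is a toy stand-in: a real implementation would use a BiLSTM, whereas this sketch substitutes simple running sums over scalar "word vectors" purely to show the forward/backward passes:

```python
def hidden_vectors(word_vectors):
    """Toy bidirectional encoding: the hidden vector of word i pairs a
    left-to-right prefix sum (left context) with a right-to-left suffix
    sum (right context). The running sums are an illustrative stand-in
    for the BiLSTM's forward and backward recurrences."""
    n = len(word_vectors)
    fwd, acc = [], 0.0
    for v in word_vectors:            # forward pass: accumulate left context
        acc += v
        fwd.append(acc)
    bwd, acc = [0.0] * n, 0.0
    for i in range(n - 1, -1, -1):    # backward pass: accumulate right context
        acc += word_vectors[i]
        bwd[i] = acc
    return list(zip(fwd, bwd))
```

For the input `[1.0, 2.0, 3.0]`, the middle word's pair `(3.0, 5.0)` encodes both its left neighborhood (1 + 2) and its right neighborhood (2 + 3), mirroring how a BiLSTM hidden state carries context from both directions.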
Step 203: and inputting the hidden vectors of the m words into a second deep learning network, and acquiring a first weight of each word output by the second deep learning network.
The second deep learning network is used for setting a first weight of each of the m words based on the hidden vectors of the m words; the first weight of a word represents the contribution of the word to the text information being sensitive information. The second deep learning algorithm may be self-attention.
The detailed implementation process of this step is the same as the detailed implementation process of step 103, and the detailed implementation process of this step can be obtained only by replacing the weight distribution model in step 103 with the second deep learning network, which is not described in detail herein.
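A minimal sketch of the self-attention weight layer: each word's hidden vector is scored against a learned query vector and the scores are softmax-normalized so the m first weights sum to 1. The dot-product scoring and the `query` parameter are assumptions for illustration; the patent only states that self-attention is used:

```python
import math

def first_weights(hidden_vectors, query):
    """Score each hidden vector against a learned query vector, then
    softmax-normalize so the weights form a distribution over the m words."""
    scores = [sum(q * h for q, h in zip(query, hv)) for hv in hidden_vectors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # max-shift for stability
    total = sum(exps)
    return [e / total for e in exps]
```

A word whose hidden vector aligns with the query receives a larger weight, i.e. a larger estimated contribution to the text being sensitive.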
Step 204-: respectively, as in steps 104-105, and will not be described in detail herein.
Step 206: and inputting the information matrix of the text information into a third deep learning algorithm to obtain the two-dimensional vector of the text information output by the third deep learning algorithm.
And the third deep learning algorithm is used for performing dimension reduction processing on the information matrix of the text information to obtain a two-dimensional vector of the text information. The third deep learning algorithm may be a fully connected layer.
Step 207: and determining whether the text information is sensitive information or not through a classification function according to the two-dimensional vector of the text information.
In this step, the two-dimensional vector of the text information is input to the classification function. The classification function calculates the probability that the text information is sensitive information according to the two-dimensional vector, and when the probability is greater than a preset probability threshold value, the text information is determined to be sensitive information.
For each other text message in the sample set, whether each other sample message is sensitive message is identified in the above manner.
Step 208: and comparing the marking information corresponding to each text information in the sample set with the identification result to obtain the comparison result of each text information.
The identification result of the text information is that the text information is sensitive information or the text information is non-sensitive information. And comparing the identification result of the text information with the labeling information to obtain a comparison result, wherein the identification result of the text information is the same as the labeling information or the identification result of the text information is different from the labeling information.
In this step, the value "1" may indicate that the recognition result of the text information is the same as the label information, and the value "0" may indicate that the recognition result of the text information is different from the label information. Alternatively, the value "0" may indicate that the recognition result of the text information is the same as the label information, and the value "1" may indicate that the recognition result of the text information is different from the label information.
Step 209: and obtaining a cost value through a preset cost function according to the comparison result of each text message, and executing step 210 when the cost value is not the minimum cost value.
In this step, the comparison result of each text message is combined into an input vector, the input vector is used as an independent variable and is input into the cost function, and the cost value is calculated through the cost function.
The preset cost function is a curved surface, and the lowest point of the curved surface is the minimum cost value of the cost function.
Step 210: and adjusting the network parameters of the deep learning algorithm according to the cost value, and returning to execute the step 201.
In this step, the network parameters of the first deep learning algorithm, the network parameters of the second deep learning algorithm, the network parameters of the third deep learning algorithm and the network parameters of the classification function are adjusted according to the cost value.
And when the cost value is the minimum cost value, ending the process, and respectively taking the current first deep learning algorithm, the current second deep learning algorithm, the current third deep learning algorithm and the current classification function as a trained context information classification model, a trained weight distribution model, a trained dimension reduction model and a trained classification function.
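The iterate-until-minimum-cost loop of steps 209-210 can be sketched as gradient descent: compute the cost, stop when it is (numerically) at the lowest point of the surface, otherwise adjust the parameters and repeat. The finite-difference gradients, learning rate, and stopping tolerance are illustrative choices, not details from the patent:

```python
def train(cost, params, lr=0.1, eps=1e-6, max_iters=1000, tol=1e-4):
    """Generic descent loop mirroring steps 209-210. Gradients are taken
    by finite differences so the sketch stays dependency-free."""
    for _ in range(max_iters):
        grads = []
        for i in range(len(params)):
            bumped = list(params)
            bumped[i] += eps
            grads.append((cost(bumped) - cost(params)) / eps)
        if all(abs(g) < tol for g in grads):   # cost is at its minimum
            break
        params = [p - lr * g for p, g in zip(params, grads)]
    return params
```

For example, minimizing the quadratic cost `(p - 3)²` from a starting parameter of 0 converges to p ≈ 3, the lowest point of that cost surface.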
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 7, an embodiment of the present application provides an apparatus 300 for identifying sensitive information, where the apparatus 300 includes:
an obtaining module 301, configured to obtain word vectors of m words included in text information to be recognized, where a word vector of a first word is a semantic representation of the first word, the first word is one of the m words, and m is an integer greater than 1;
a generating module 302, configured to generate a hidden layer vector of the first word based on a word vector of the first word and a word vector of a second word, where the second word is a word adjacent to the first word in front of and behind, and the hidden layer vector of the first word is a semantic representation of the first word and a semantic representation of context information;
and the identifying module 303 is configured to identify whether the text information is sensitive information according to the hidden vectors of the m words.
Optionally, the generating module 302 is configured to:
inputting a word vector of each of the m words to a context information classification model in an order of the each word in the text information, the context information classification model for generating a hidden layer vector of the first word based on a word vector of the first word and a word vector of a second word;
obtaining a hidden layer vector of each word output by the context classification model.
Optionally, the identifying module 303 is configured to:
setting a first weight of each word in the m words according to the hidden vector of the m words, wherein the first weight of a word is used for representing the contribution of the word to the text information as sensitive information;
acquiring an information matrix of the text information according to the first weight of each word and the hidden vector of each word;
and determining whether the text information is sensitive information according to the information matrix of the text information.
Optionally, the identifying module 303 is configured to:
inputting the hidden vectors of the m words to a weight assignment model for setting a first weight of each of the m words based on the hidden vectors of the m words;
and acquiring a first weight of each word output by the weight distribution model.
Optionally, the identifying module 303 is configured to:
inputting the information matrix of the text information into a dimension reduction model, wherein the dimension reduction model is used for carrying out dimension reduction processing on the information matrix of the text information to obtain a two-dimensional vector of the text information;
and determining whether the text information is sensitive information or not through a classification function according to the two-dimensional vector of the text information.
Optionally, the apparatus 300 further includes:
the adjusting module is used for adjusting the first weight of each word according to a first dictionary and a second dictionary to obtain a second weight of each word, wherein the first dictionary is used for storing sensitive words, and the second dictionary is used for storing suspected sensitive words whose frequency of occurrence meets a preset condition;
optionally, the identifying module 303 is configured to:
and acquiring an information matrix of the text information according to the second weight of each word and the hidden layer vector of each word.
Optionally, the adjusting module is configured to:
when the first word is a sensitive word in the first dictionary, increasing the first weight of the first word to obtain a second weight;
when the first word is a word in the second dictionary, reducing the first weight of the first word to obtain a second weight.
In the embodiment of the application, the obtaining module obtains word vectors of m words included in the text information to be recognized, the word vector of the first word is a semantic representation of the first word, and the first word is one of the m words. The generation module generates a hidden layer vector of the first word based on the word vector of the first word and the word vector of the second word, the second word is a word adjacent to the first word in front and back, and the hidden layer vector of the first word is a semantic representation of the first word and a semantic representation of context information. The identification module identifies whether the text information is sensitive information according to the hidden vectors of the m words. Therefore, whether the text information is sensitive information is identified based on the semantic information of the whole text information, and the text information can be identified as sensitive even if it does not include a newly appearing sensitive word, thereby improving the accuracy of identifying sensitive information.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 8 shows a block diagram of a terminal 400 according to an exemplary embodiment of the present invention. The terminal 400 may be used to perform the above-mentioned method for identifying sensitive information or training method, and may be a portable mobile terminal, such as: a smartphone, a tablet, a laptop, or a desktop computer. The terminal 400 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, etc.
Generally, the terminal 400 includes: a processor 401 and a memory 402.
Processor 401 may include one or more processing cores, such as a 4-core processor, an 8-core processor, or the like. The processor 401 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 401 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 401 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed by the display screen. In some embodiments, the processor 401 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 402 may include one or more computer-readable storage media, which may be non-transitory. Memory 402 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 402 is used to store at least one instruction for execution by processor 401 to implement the method of identifying sensitive information or the training method provided by the method embodiments herein.
In some embodiments, the terminal 400 may further optionally include: a peripheral interface 403 and at least one peripheral. The processor 401, memory 402 and peripheral interface 403 may be connected by bus or signal lines. Each peripheral may be connected to the peripheral interface 403 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 404, touch screen display 405, camera 406, audio circuitry 407, positioning components 408, and power supply 409.
The peripheral interface 403 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 401 and the memory 402. In some embodiments, processor 401, memory 402, and peripheral interface 403 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 401, the memory 402 and the peripheral interface 403 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.
The Radio Frequency circuit 404 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 404 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 404 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 404 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 404 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 404 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 405 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 405 is a touch display screen, the display screen 405 also has the ability to capture touch signals on or over the surface of the display screen 405. The touch signal may be input to the processor 401 as a control signal for processing. At this point, the display screen 405 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display screen 405 may be one, providing the front panel of the terminal 400; in other embodiments, the display screen 405 may be at least two, respectively disposed on different surfaces of the terminal 400 or in a folded design; in still other embodiments, the display 405 may be a flexible display disposed on a curved surface or a folded surface of the terminal 400. Even further, the display screen 405 may be arranged in a non-rectangular irregular pattern, i.e. a shaped screen. The Display screen 405 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and other materials.
The camera assembly 406 is used to capture images or video. Optionally, camera assembly 406 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 406 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuit 407 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 401 for processing, or inputting the electric signals to the radio frequency circuit 404 for realizing voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 400. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 401 or the radio frequency circuit 404 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 407 may also include a headphone jack.
The positioning component 408 is used to locate the current geographic position of the terminal 400 for navigation or LBS (Location Based Service). The positioning component 408 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 409 is used to supply power to the various components in the terminal 400. The power source 409 may be alternating current, direct current, disposable or rechargeable. When the power source 409 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the terminal 400 also includes one or more sensors 410. The one or more sensors 410 include, but are not limited to: acceleration sensor 411, gyro sensor 412, pressure sensor 413, fingerprint sensor 414, optical sensor 415, and proximity sensor 416.
The acceleration sensor 411 may detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 400. For example, the acceleration sensor 411 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 401 may control the touch display screen 405 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 411. The acceleration sensor 411 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 412 may detect a body direction and a rotation angle of the terminal 400, and the gyro sensor 412 may cooperate with the acceleration sensor 411 to acquire a 3D motion of the terminal 400 by the user. From the data collected by the gyro sensor 412, the processor 401 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 413 may be disposed on a side bezel of the terminal 400 and/or a lower layer of the touch display screen 405. When the pressure sensor 413 is disposed on the side frame of the terminal 400, a user's holding signal to the terminal 400 can be detected, and the processor 401 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 413. When the pressure sensor 413 is disposed at the lower layer of the touch display screen 405, the processor 401 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 405. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 414 is used for collecting a fingerprint of the user, and the processor 401 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 414, or the fingerprint sensor 414 identifies the identity of the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, processor 401 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 414 may be disposed on the front, back, or side of the terminal 400. When a physical key or vendor Logo is provided on the terminal 400, the fingerprint sensor 414 may be integrated with the physical key or vendor Logo.
The optical sensor 415 is used to collect the ambient light intensity. In one embodiment, the processor 401 may control the display brightness of the touch display screen 405 based on the ambient light intensity collected by the optical sensor 415. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 405 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 405 is turned down. In another embodiment, the processor 401 may also dynamically adjust the shooting parameters of the camera assembly 406 according to the ambient light intensity collected by the optical sensor 415.
A proximity sensor 416, also known as a distance sensor, is typically disposed on the front panel of the terminal 400. The proximity sensor 416 is used to collect the distance between the user and the front surface of the terminal 400. In one embodiment, when the proximity sensor 416 detects that the distance between the user and the front surface of the terminal 400 gradually decreases, the processor 401 controls the touch display screen 405 to switch from the bright screen state to the dark screen state; when the proximity sensor 416 detects that the distance between the user and the front surface of the terminal 400 gradually becomes larger, the processor 401 controls the touch display screen 405 to switch from the breath screen state to the bright screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 8 is not intended to be limiting of terminal 400 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method of identifying sensitive information, the method comprising:
acquiring word vectors of m words included in text information to be recognized, wherein the word vector of a first word is a semantic representation of the first word, the first word is one of the m words, and m is an integer greater than 1;
generating a hidden layer vector of the first word based on a word vector of the first word and a word vector of a second word, the second word being a word that is adjacent to the first word in front of and behind, the hidden layer vector of the first word being a semantic representation of the first word and a semantic representation of context information;
and identifying whether the text information is sensitive information according to the hidden vector of the m words.
2. The method of claim 1, wherein the generating the hidden layer vector of the first word based on the word vector of the first word and the word vector of the second word comprises:
inputting the word vector of each of the m words to a context information classification model in the order in which the words appear in the text information, the context information classification model being configured to generate the hidden layer vector of the first word based on the word vector of the first word and the word vector of the second word; and
obtaining the hidden layer vector of each word output by the context information classification model.
3. The method of claim 1, wherein the identifying whether the text information is sensitive information according to the hidden layer vectors of the m words comprises:
setting a first weight of each of the m words according to the hidden layer vectors of the m words, wherein the first weight of a word represents the contribution of that word to the text information being sensitive information;
acquiring an information matrix of the text information according to the first weight of each word and the hidden layer vector of each word; and
determining whether the text information is sensitive information according to the information matrix of the text information.
4. The method of claim 3, wherein the setting a first weight of each of the m words according to the hidden layer vectors of the m words comprises:
inputting the hidden layer vectors of the m words to a weight assignment model, the weight assignment model being configured to set the first weight of each of the m words based on the hidden layer vectors of the m words; and
acquiring the first weight of each word output by the weight assignment model.
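The per-word first weights of claims 3 and 4 resemble an attention mechanism: each word's hidden layer vector is scored and the scores are normalized into contributions. The sketch below uses a random scoring vector as a stand-in for the trained weight assignment model; all sizes and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical hidden layer vectors for m = 4 words (dimension 12 each).
m, h = 4, 12
hidden = rng.normal(size=(m, h))

# An attention-style scoring vector stands in for the weight assignment
# model; in a trained system this would be a learned parameter.
score_vec = rng.normal(size=h)

scores = hidden @ score_vec
weights = np.exp(scores - scores.max())
weights /= weights.sum()        # first weights: one per word, summing to 1

# Information matrix: each word's hidden layer vector scaled by its
# contribution, as in the acquiring step of claim 3.
info_matrix = weights[:, None] * hidden
```

The softmax normalization keeps every weight positive and the weights comparable across texts of the same length, which makes the resulting information matrix a contribution-scaled view of the sequence.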
5. The method of claim 3, wherein the determining whether the text information is sensitive information according to the information matrix of the text information comprises:
inputting the information matrix of the text information into a dimension reduction model, the dimension reduction model being configured to perform dimension reduction processing on the information matrix of the text information to obtain a two-dimensional vector of the text information; and
determining whether the text information is sensitive information through a classification function applied to the two-dimensional vector of the text information.
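The dimension reduction and classification function of claim 5 can be sketched as a linear projection to two dimensions followed by a softmax over the two classes (sensitive / not sensitive). The projection matrix, bias, and input sizes below are illustrative assumptions, not the patent's trained parameters.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical information matrix for one text (4 words, dimension 12),
# flattened so a single projection can reduce it to two dimensions.
info_matrix = rng.normal(size=(4, 12))
flat = info_matrix.reshape(-1)

# A linear layer stands in for the dimension reduction model; in a trained
# system W and b would be learned parameters.
W = rng.normal(size=(2, flat.size)) * 0.1
b = np.zeros(2)
two_dim = W @ flat + b

# Classification function: softmax over the two-dimensional vector gives
# a probability for each class; index 1 is taken as "sensitive" here.
exp = np.exp(two_dim - two_dim.max())
probs = exp / exp.sum()
is_sensitive = bool(probs[1] > 0.5)
```

Reducing to exactly two dimensions before the classification function means the final decision is a simple comparison of two class scores.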
6. The method of claim 3, wherein before the acquiring an information matrix of the text information according to the first weight of each word and the hidden layer vector of each word, the method further comprises:
adjusting the first weight of each word according to a first dictionary and a second dictionary to obtain a second weight of each word, wherein the first dictionary stores sensitive words, and the second dictionary stores words whose frequency of co-occurrence with the sensitive words meets a preset condition; and
the acquiring an information matrix of the text information according to the first weight of each word and the hidden layer vector of each word comprises:
acquiring the information matrix of the text information according to the second weight of each word and the hidden layer vector of each word.
7. The method of claim 6, wherein the adjusting the first weight of each word according to a first dictionary and a second dictionary to obtain a second weight of each word comprises:
when the first word is a sensitive word in the first dictionary, increasing the first weight of the first word to obtain the second weight; and
when the first word is a word in the second dictionary, reducing the first weight of the first word to obtain the second weight.
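The dictionary-based adjustment of claims 6 and 7 can be sketched as follows. The example dictionaries and the boost/damping factors are hypothetical; the patent specifies only that weights of first-dictionary words are increased and weights of second-dictionary words are reduced, not by how much.

```python
# Hypothetical dictionaries; the patent does not give example contents.
sensitive_dict = {"leak", "exploit"}   # first dictionary: sensitive words
cooccur_dict = {"report", "the"}       # second dictionary: frequent co-occurrers

def adjust_weight(word, first_weight, boost=1.5, damp=0.5):
    """Raise the weight of known sensitive words, lower the weight of words
    that merely co-occur with them often; otherwise leave it unchanged.
    The boost and damp factors are illustrative assumptions."""
    if word in sensitive_dict:
        return first_weight * boost
    if word in cooccur_dict:
        return first_weight * damp
    return first_weight

words = ["the", "exploit", "report", "works"]
first = [0.1, 0.4, 0.3, 0.2]
second = [adjust_weight(w, fw) for w, fw in zip(words, first)]
```

Damping the second dictionary's words counteracts a failure mode of pure attention: words that frequently appear near sensitive words (but are themselves innocuous) would otherwise accumulate inflated weights.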
8. An apparatus for identifying sensitive information, the apparatus comprising:
an acquisition module configured to acquire word vectors of m words included in text information to be recognized, wherein the word vector of a first word is a semantic representation of the first word, the first word is one of the m words, and m is an integer greater than 1;
a generation module configured to generate a hidden layer vector of the first word based on the word vector of the first word and a word vector of a second word, the second word being a word adjacent to the first word (the word immediately preceding or following it), the hidden layer vector of the first word being a semantic representation of both the first word and its context information; and
an identification module configured to identify whether the text information is sensitive information according to the hidden layer vectors of the m words.
9. An electronic device comprising at least one processor and at least one memory, the at least one memory storing at least one instruction, the at least one instruction being loaded and executed by the at least one processor to implement the method of any one of claims 1 to 7.
10. A computer-readable storage medium storing at least one instruction, the at least one instruction being loaded and executed by a processor to implement the method of any one of claims 1 to 7.
CN201910918780.0A 2019-09-26 2019-09-26 Method and device for identifying sensitive information Active CN112560472B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910918780.0A CN112560472B (en) 2019-09-26 2019-09-26 Method and device for identifying sensitive information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910918780.0A CN112560472B (en) 2019-09-26 2019-09-26 Method and device for identifying sensitive information

Publications (2)

Publication Number Publication Date
CN112560472A true CN112560472A (en) 2021-03-26
CN112560472B CN112560472B (en) 2023-07-11

Family

ID=75030055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910918780.0A Active CN112560472B (en) 2019-09-26 2019-09-26 Method and device for identifying sensitive information

Country Status (1)

Country Link
CN (1) CN112560472B (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140278379A1 (en) * 2013-03-15 2014-09-18 Google Inc. Integration of semantic context information
CN107515877A (en) * 2016-06-16 2017-12-26 百度在线网络技术(北京)有限公司 The generation method and device of sensitive theme word set
CN107918633A (en) * 2017-03-23 2018-04-17 广州思涵信息科技有限公司 Sensitive public sentiment content identification method and early warning system based on semantic analysis technology
US20180121801A1 (en) * 2016-10-28 2018-05-03 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and device for classifying questions based on artificial intelligence
CN108228554A (en) * 2016-12-09 2018-06-29 富士通株式会社 The method, apparatus of term vector and electronic equipment are generated based on semantic expressiveness model
CN108984530A (en) * 2018-07-23 2018-12-11 北京信息科技大学 A kind of detection method and detection system of network sensitive content
US20190080100A1 (en) * 2017-09-08 2019-03-14 Citrix Systems, Inc. Identify and protect sensitive text in graphics data
CN109543084A (en) * 2018-11-09 2019-03-29 西安交通大学 A method of establishing the detection model of the hidden sensitive text of network-oriented social media
CN109657243A (en) * 2018-12-17 2019-04-19 江苏满运软件科技有限公司 Sensitive information recognition methods, system, equipment and storage medium
US20190164539A1 (en) * 2017-11-28 2019-05-30 International Business Machines Corporation Automatic blocking of sensitive data contained in an audio stream
CN109918506A (en) * 2019-03-07 2019-06-21 安徽省泰岳祥升软件有限公司 A kind of file classification method and device
CN109977416A (en) * 2019-04-03 2019-07-05 中山大学 A kind of multi-level natural language anti-spam text method and system
CN110209818A (en) * 2019-06-04 2019-09-06 南京邮电大学 A kind of analysis method of Semantic-Oriented sensitivity words and phrases
CN110232914A (en) * 2019-05-20 2019-09-13 平安普惠企业管理有限公司 A kind of method for recognizing semantics, device and relevant device
CN110263122A (en) * 2019-05-08 2019-09-20 北京奇艺世纪科技有限公司 A kind of keyword acquisition methods, device and computer readable storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIN Xuefeng; XIA Yuanyi; GUO Jinlong; YU Xiaowen: "Sensitive Document Detection Method Based on Convolutional Neural Networks", Computer and Modernization, no. 07 *
WANG Bin: "From Information Retrieval to Search Engines", Terminology Standardization & Information Technology, no. 04 *

Also Published As

Publication number Publication date
CN112560472B (en) 2023-07-11

Similar Documents

Publication Publication Date Title
CN110059744B (en) Method for training neural network, method and equipment for processing image and storage medium
WO2020211607A1 (en) Video generation method, apparatus, electronic device, and medium
CN110750992A (en) Named entity recognition method, device, electronic equipment and medium
CN111104980B (en) Method, device, equipment and storage medium for determining classification result
CN110933468A (en) Playing method, playing device, electronic equipment and medium
CN111857793A (en) Network model training method, device, equipment and storage medium
CN110675473A (en) Method, device, electronic equipment and medium for generating GIF dynamic graph
CN112135191A (en) Video editing method, device, terminal and storage medium
CN111341317A (en) Method and device for evaluating awakening audio data, electronic equipment and medium
CN110837557A (en) Abstract generation method, device, equipment and medium
CN111327819A (en) Method, device, electronic equipment and medium for selecting image
CN110853124A (en) Method, device, electronic equipment and medium for generating GIF dynamic graph
CN112925922A (en) Method, device, electronic equipment and medium for obtaining address
CN112860046A (en) Method, apparatus, electronic device and medium for selecting operation mode
CN114925667A (en) Content classification method, device, equipment and computer readable storage medium
CN112882094B (en) First-arrival wave acquisition method and device, computer equipment and storage medium
CN114360494A (en) Rhythm labeling method and device, computer equipment and storage medium
CN112001442B (en) Feature detection method, device, computer equipment and storage medium
CN109816047B (en) Method, device and equipment for providing label and readable storage medium
CN113139614A (en) Feature extraction method and device, electronic equipment and storage medium
CN109388732B (en) Music map generating and displaying method, device and storage medium
CN112487162A (en) Method, device and equipment for determining text semantic information and storage medium
CN111858983A (en) Picture type determining method and device, electronic equipment and storage medium
CN112560472B (en) Method and device for identifying sensitive information
CN112214115A (en) Input mode identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant