CN112560472B - Method and device for identifying sensitive information - Google Patents


Info

Publication number
CN112560472B
CN112560472B (granted patent for application CN201910918780.0A)
Authority
CN
China
Prior art keywords: word, information, words, vector, weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910918780.0A
Other languages
Chinese (zh)
Other versions
CN112560472A (en)
Inventor
赵妍妍
罗观柱
秦兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Tencent Technology Shenzhen Co Ltd
Original Assignee
Harbin Institute of Technology
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology and Tencent Technology (Shenzhen) Co Ltd
Priority to CN201910918780.0A
Publication of CN112560472A
Application granted
Publication of CN112560472B
Legal status: Active

Classifications

    • G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches (G — Physics; G06 — Computing; Calculating or Counting; G06F — Electric digital data processing; G06F18/00 — Pattern recognition; G06F18/20 — Analysing; G06F18/24 — Classification techniques)
    • G06N3/044 — Recurrent networks, e.g. Hopfield networks (G06N — Computing arrangements based on specific computational models; G06N3/00 — Computing arrangements based on biological models; G06N3/02 — Neural networks; G06N3/04 — Architecture, e.g. interconnection topology)
    • G06N3/045 — Combinations of networks
    • G06N3/08 — Learning methods
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management (Y02D — Climate change mitigation technologies in information and communication technologies)

Abstract

The application relates to a method and a device for identifying sensitive information, and belongs to the field of information processing. The method includes: acquiring word vectors of m words included in text information to be recognized, wherein the word vector of a first word is a semantic representation of the first word, the first word is one of the m words, and m is an integer greater than 1; generating a hidden layer vector of the first word based on the word vector of the first word and the word vector of a second word, wherein the second word is a word adjacent to the first word (before or after it), and the hidden layer vector of the first word is a semantic representation of both the first word and its context information; and identifying whether the text information is sensitive information according to the hidden layer vectors of the m words. The method and the device can improve the accuracy of identifying sensitive information.

Description

Method and device for identifying sensitive information
Technical Field
The present disclosure relates to the field of information processing, and in particular, to a method and apparatus for identifying sensitive information.
Background
With the rapid development of informatization, the internet has become one of the most important media through which people acquire information, bringing great convenience to work and life. As a product of new internet technology, the conversation service robot uses a powerful knowledge base and computing power to provide question-answering services and intelligent operations, and has attracted wide attention in academia and industry. Although research on conversation service robots has made some progress, because of their openness and real-time nature, conversation content often becomes an important channel for lawbreakers to spread sensitive information such as vulgar and pornographic content.
To prevent the propagation of sensitive information, sensitive information can be identified as follows. A sensitive dictionary containing a large number of sensitive words is set in advance. The system then detects whether the text information to be recognized includes any sensitive word from the dictionary; if it does, the text information is recognized as sensitive information. The text information may be a single sentence or multiple sentences composed of multiple words.
In the process of implementing the present application, the inventors found that the above approach has at least the following drawback:
identification relies entirely on the sensitive dictionary, so when the text information contains new sensitive words that are not in the dictionary, the text information cannot be recognized as sensitive information, which reduces the accuracy of identifying sensitive information.
Disclosure of Invention
The embodiment of the application provides a method and a device for identifying sensitive information, so as to improve the accuracy of identifying the sensitive information. The technical scheme is as follows:
in one aspect, a method of identifying sensitive information is provided, the method comprising:
acquiring word vectors of m words included in text information to be recognized, wherein the word vector of a first word is a semantic representation of the first word, the first word is one of the m words, and m is an integer greater than 1;
generating a hidden layer vector of the first word based on the word vector of the first word and the word vector of a second word, wherein the second word is a word adjacent to the first word (before or after it), and the hidden layer vector of the first word is a semantic representation of both the first word and its context information;
and identifying whether the text information is sensitive information or not according to the hidden layer vectors of the m words.
Optionally, the generating the hidden layer vector of the first word based on the word vector of the first word and the word vector of the second word includes:
inputting word vectors of each word in the m words into a context information classification model according to the sequence of each word in the text information, wherein the context information classification model is used for generating hidden layer vectors of the first word based on word vectors of the first word and word vectors of the second word;
and obtaining the hidden layer vector of each word output by the context information classification model.
Optionally, the identifying whether the text information is sensitive information according to the hidden layer vector of the m words includes:
setting a first weight of each word in the m words according to the hidden layer vector of the m words, wherein the first weight of the word is used for representing the contribution of the word to the text information as sensitive information;
Acquiring an information matrix of the text information according to the first weight of each word and the hidden layer vector of each word;
and determining whether the text information is sensitive information according to the information matrix of the text information.
Optionally, the setting the first weight of each word in the m words according to the hidden layer vector of the m words includes:
inputting hidden layer vectors of the m words into a weight distribution model, wherein the weight distribution model is used for setting first weights of each word in the m words based on the hidden layer vectors of the m words;
and acquiring the first weight of each word output by the weight distribution model.
Optionally, the determining whether the text information is sensitive information according to the information matrix of the text information includes:
inputting the information matrix of the text information into a dimension reduction model, wherein the dimension reduction model is used for carrying out dimension reduction processing on the information matrix of the text information to obtain a two-dimensional vector of the text information;
and determining whether the text information is sensitive information or not through a classification function according to the two-dimensional vector of the text information.
Optionally, before the obtaining the information matrix of the text information according to the first weight of each word and the hidden layer vector of each word, the method further includes:
The first weight of each word is adjusted according to a first dictionary and a second dictionary to obtain a second weight of each word, wherein the first dictionary is used for storing sensitive words and the second dictionary is used for storing words that co-occur with sensitive words with a frequency meeting a preset condition;
the obtaining the information matrix of the text information according to the first weight of each word and the hidden layer vector of each word comprises the following steps:
and acquiring an information matrix of the text information according to the second weight of each word and the hidden layer vector of each word.
Optionally, the adjusting the first weight of each word according to the first dictionary and the second dictionary to obtain the second weight of each word includes:
when the first word is a sensitive word in the first dictionary, increasing a first weight of the first word to obtain a second weight;
and when the first word is a word in the second dictionary, reducing the first weight of the first word to obtain a second weight.
In another aspect, the present application provides an apparatus for identifying sensitive information, the apparatus comprising:
an acquisition module, configured to acquire word vectors of m words included in text information to be identified, wherein the word vector of a first word is a semantic representation of the first word, the first word is one of the m words, and m is an integer greater than 1;
a generation module, configured to generate a hidden layer vector of the first word based on the word vector of the first word and the word vector of a second word, wherein the second word is a word adjacent to the first word (before or after it), and the hidden layer vector of the first word is a semantic representation of both the first word and its context information;
and the identification module is used for identifying whether the text information is sensitive information or not according to the hidden layer vectors of the m words.
Optionally, the generating module is configured to:
inputting word vectors of each word in the m words into a context information classification model according to the sequence of each word in the text information, wherein the context information classification model is used for generating hidden layer vectors of the first word based on word vectors of the first word and word vectors of the second word;
and obtaining the hidden layer vector of each word output by the context information classification model.
Optionally, the identification module is configured to:
setting a first weight of each word in the m words according to the hidden layer vector of the m words, wherein the first weight of the word is used for representing the contribution of the word to the text information as sensitive information;
Acquiring an information matrix of the text information according to the first weight of each word and the hidden layer vector of each word;
and determining whether the text information is sensitive information according to the information matrix of the text information.
Optionally, the identification module is configured to:
inputting hidden layer vectors of the m words into a weight distribution model, wherein the weight distribution model is used for setting first weights of each word in the m words based on the hidden layer vectors of the m words;
and acquiring the first weight of each word output by the weight distribution model.
Optionally, the identification module is configured to:
inputting the information matrix of the text information into a dimension reduction model, wherein the dimension reduction model is used for carrying out dimension reduction processing on the information matrix of the text information to obtain a two-dimensional vector of the text information;
and determining whether the text information is sensitive information or not through a classification function according to the two-dimensional vector of the text information.
Optionally, the apparatus further includes:
an adjusting module, configured to adjust the first weight of each word according to a first dictionary and a second dictionary to obtain the second weight of each word, wherein the first dictionary is used for storing sensitive words and the second dictionary is used for storing words that co-occur with sensitive words with a frequency meeting a preset condition;
Optionally, the identification module is configured to:
and acquiring an information matrix of the text information according to the second weight of each word and the hidden layer vector of each word.
Optionally, the adjusting module is configured to:
when the first word is a sensitive word in the first dictionary, increasing a first weight of the first word to obtain a second weight;
and when the first word is a word in the second dictionary, reducing the first weight of the first word to obtain a second weight.
In another aspect, the present application provides an electronic device comprising at least one processor and at least one memory for storing at least one instruction loaded and executed by the at least one processor to implement the above-described method.
In another aspect, the present application provides a computer readable storage medium storing at least one instruction for loading and execution by a processor to implement the method described above.
The technical scheme provided by the embodiment of the application can comprise the following beneficial effects:
by obtaining word vectors of the m words included in the text information to be recognized, the word vector of the first word is a semantic representation of the first word, which is one of the m words. A hidden layer vector of the first word is generated based on the word vector of the first word and the word vector of the second word, the second word being a word adjacent to the first word, and the hidden layer vector of the first word being a semantic representation of both the first word and its context information. Whether the text information is sensitive information is then identified according to the hidden layer vectors of the m words. Because the identification is based on the semantic information of the whole text, even if the text information includes newly appearing sensitive words that are absent from any sensitive dictionary, it can still be identified, which improves the accuracy of identifying sensitive information.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flow chart of a method for identifying sensitive information provided by an embodiment of the present application;
FIG. 2 is a word vector diagram provided by an embodiment of the present application;
FIG. 3 is a schematic view of hidden layer vectors according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of assigning a first weight provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of adjusting a first weight provided in an embodiment of the present application;
FIG. 6 is a flowchart of a training method provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of an apparatus for identifying sensitive information according to an embodiment of the present application;
fig. 8 is a schematic diagram of a terminal structure according to an embodiment of the present application.
Specific embodiments thereof have been shown by way of example in the drawings and will herein be described in more detail. These drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but to illustrate the concepts of the present application to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
The conversation service robot is a computer program that conducts conversations by voice or text and can simulate human dialogue. It can be used in applications such as customer service or information acquisition. When a person interacts with the conversation service robot, the person may input sensitive information, such as pornographic or obscene content, to the robot.
To prevent the propagation of sensitive information, it is necessary to recognize whether the text information input into the conversation service robot is sensitive information. In the present application, a smart model for identifying sensitive information may be trained, and whether text information is sensitive information may be identified based on its semantics using this smart model. The smart model is composed of a plurality of models, including a context information classification model, a weight distribution model, a dimension reduction model, and a classification function.
Referring to fig. 1, an embodiment of the present application provides a method for identifying sensitive information, including:
step 101: the method comprises the steps of obtaining word vectors of m words included in text information to be recognized, wherein the word vector of a first word is a semantic representation of the first word, the first word is one of the m words, and m is an integer greater than 1.
In the step, the text information to be identified can be segmented to obtain m words; and according to each word in the m words, acquiring the word vector of each word from the corresponding relation between the word and the word vector.
For each record in the correspondence of a word to a word vector, the word vector stored in the record is established based on the semantics of the word stored in the record, so the word vector is a semantic representation of the word.
For any word, a multi-dimensional word vector may be generated based on the semantics of the word and used to represent it. For example, a d-dimensional word vector generated from the semantics of the word may be used in place of the word, where d may be, for example, 100, 200, or 300.
The corresponding relation between the words and the word vectors can be downloaded from a network.
For example, assume the text information to be recognized is "someone indulges in uniform temptation." Segmenting this text yields 5 words: "someone", "indulges", "in", "uniform temptation", and "."; the word vector of each of the 5 words is then obtained from the correspondence between words and word vectors. As shown in fig. 2, the word vector of each of the 5 words is a d-dimensional vector, i.e., each word vector consists of d floating point numbers.
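The lookup in step 101 can be sketched as follows. The toy vocabulary, the dimension d = 4, and the random vectors are illustrative assumptions; the patent only requires that a word-to-word-vector correspondence be prepared in advance (e.g., downloaded from a network).

```python
import numpy as np

# Minimal sketch of step 101: segment the text, then look up a
# d-dimensional word vector for each of the m words. The vocabulary,
# d = 4, and the random vectors are toy assumptions for illustration.
d = 4
rng = np.random.default_rng(0)
vocabulary = ["someone", "indulges", "in", "uniform temptation", "."]
word_to_vector = {w: rng.standard_normal(d) for w in vocabulary}

def get_word_vectors(words):
    """Return an (m, d) array with one word vector per row."""
    return np.stack([word_to_vector[w] for w in words])

vectors = get_word_vectors(vocabulary)
print(vectors.shape)  # (5, 4): m = 5 words, each a d = 4 dimensional vector
```

Each row of the result corresponds to one of the m segmented words, in sentence order.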
Step 102: and inputting word vectors of the m words into a context information classification model, and obtaining hidden layer vectors of each word output by the context information classification model.
The context information classification model is used for generating hidden layer vectors of the first words based on word vectors of the first words and word vectors of second words, wherein the second words are words adjacent to the first words in front of and behind the first words.
The context information classification model is obtained by training a first deep learning algorithm in advance; the first deep learning algorithm may be a bidirectional long short-term memory (BiLSTM) network.
The hidden layer vector of the first term is a semantic representation of the first term and a semantic representation of the context information.
In this step, the word vector of each of the m words is input to the context information classification model in the order of each word in the text information. The context information classification model generates a hidden layer vector of the first word according to word vectors of the first word, word vectors of x second words which are positioned before and adjacent to the first word and word vectors of x second words which are positioned after and adjacent to the first word, wherein x is an integer greater than or equal to 1.
The hidden layer vector of the first word includes semantic information of a second word preceding the first word and/or semantic information of a second word following the first word.
In the case that the first word is the first word in the text information, the context information classification model generates a hidden layer vector of the first word from the word vector of the first word and word vectors of x second words located after and adjacent to the first word. In the case that the first word is the last word in the text information, the context information classification model generates a hidden layer vector of the first word from the word vector of the first word and word vectors of x second words located before and adjacent to the first word.
The hidden layer vector of the first word is a vector of k dimensions, and k may be equal to d, may be greater than d, or may be less than d.
For example, assume x = 1 in this step. The word vectors of "someone", "indulges", "in", "uniform temptation", and "." are input to the context information classification model in order. For the word "someone", the model generates its hidden layer vector from the word vector of "someone" and the word vector of "indulges". For the word "indulges", the model generates its hidden layer vector from the word vectors of "someone", "indulges", and "in"; the hidden layer vectors of "in" and "uniform temptation" are generated in a similar way. For the word ".", the model generates its hidden layer vector from the word vectors of "uniform temptation" and ".". Referring to fig. 3, the hidden layer vectors of the five words output by the context information classification model are then obtained.
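As a rough illustration of what the context information classification model produces — the patent specifies a trained BiLSTM, so the unweighted neighbour mean below is only a stand-in showing how each hidden layer vector depends on the word itself and its x adjacent words on each side:

```python
import numpy as np

def context_hidden_vectors(vectors, x=1):
    # Toy stand-in for the context information classification model:
    # each word's hidden vector mixes its own word vector with the
    # vectors of up to x neighbouring words on each side (boundary
    # words simply have fewer neighbours). A trained BiLSTM would
    # learn this mixing; the unweighted mean is only illustrative.
    m = len(vectors)
    hidden = []
    for i in range(m):
        lo, hi = max(0, i - x), min(m, i + x + 1)
        hidden.append(vectors[lo:hi].mean(axis=0))
    return np.stack(hidden)
```

With x = 1, the first word's hidden vector depends only on itself and the word after it, and the last word's only on itself and the word before it, mirroring the boundary cases described above.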
Step 103: and inputting the hidden layer vectors of the m words into a weight distribution model, and acquiring the first weight of each word output by the weight distribution model.
The weight distribution model is used for setting first weights of each word in the m words based on hidden layer vectors of the m words; the first weight of a term is used to represent the contribution of the term to the text information as sensitive information.
The weight distribution model is obtained by training a second deep learning algorithm in advance; the second deep learning algorithm may be a self-attention mechanism.
The weight distribution model is an attention mechanism. The attention mechanism simulates the attention of the human brain and can be intuitively understood as an allocation of resources: at any given moment, attention is focused on a particular focal part of a scene while other parts are largely ignored.
When the hidden layer vectors of the m words are input to the weight distribution model, the model allocates more attention to words that may be sensitive words: the first weights assigned to such words are larger than the first weights assigned to the other words.
Sensitive words refer to words related to gambling, pornography, drugs, and politics.
In this step, the hidden layer vectors of the m words are input to the weight distribution model. The weight distribution model sets the first weight of each word in the m words according to the hidden layer vector of the m words, and outputs the first weight of each word. A first weight of each term output by the weight distribution model is obtained.
For example, referring to FIG. 4, the hidden layer vectors of "someone", "indulges", "in", "uniform temptation", and "." are input to the weight distribution model. Based on these hidden layer vectors, the weight distribution model assigns the first weight α1 to "someone", the first weight α2 to "indulges", the first weight α3 to "in", the first weight α4 to "uniform temptation", and the first weight α5 to ".".
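A minimal sketch of how a self-attention-style weight distribution model could turn hidden layer vectors into first weights. The query vector here is an illustrative assumption (in a trained model it would be learned), but the softmax scoring shape is typical of attention mechanisms:

```python
import numpy as np

def attention_weights(hidden, query):
    # Sketch of the weight distribution step: score each hidden layer
    # vector against a (normally learned) query vector, then softmax
    # the scores into first weights alpha that are positive and sum
    # to 1. Words whose hidden vectors align with the query receive
    # larger first weights.
    scores = hidden @ query
    exp = np.exp(scores - scores.max())  # shift for numerical stability
    return exp / exp.sum()
```

Hidden vectors resembling sensitive-word patterns score higher against the query and therefore receive a larger share of the attention.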
Step 104: and adjusting the first weight of each word according to the first dictionary and the second dictionary to obtain the second weight of each word.
The first dictionary is used for storing sensitive words, and the second dictionary is used for storing words that co-occur with sensitive words with a frequency meeting a preset condition.
The first dictionary may be downloaded from a network. The words stored in the second dictionary are suspected sensitive words, and the intersection of the first dictionary and the second dictionary is empty. Words in the second dictionary are not truly sensitive, but often appear in the same text information as sensitive words. For example, for the pornography-related sensitive word "uniform temptation", the word "indulges", which often occurs in the same text information as "uniform temptation", is a suspected sensitive word.
The second dictionary may be built as follows. A large amount of text information is collected in advance, and each text is segmented into words. The words are searched for sensitive words from the first dictionary. If a sensitive word is present, the co-occurrence frequency of each other word not in the first dictionary is increased; the initial value of each non-sensitive word's co-occurrence frequency is 0. The remaining collected text information is processed in the same way to obtain the frequency with which each non-sensitive word co-occurs with sensitive words, and the non-sensitive words whose co-occurrence frequency exceeds a preset frequency threshold are taken as suspected sensitive words, forming the second dictionary.
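The dictionary-building procedure above can be sketched as follows. All names are illustrative; `corpus` stands in for the collected, pre-segmented text information:

```python
from collections import Counter

def build_second_dictionary(corpus, first_dictionary, threshold):
    # Sketch of the second-dictionary construction: for each collected
    # text that contains at least one sensitive word, increase the
    # co-occurrence count of every other word not in the first
    # dictionary, then keep the words whose count exceeds the preset
    # frequency threshold. `corpus` is a list of word lists.
    counts = Counter()
    for words in corpus:
        if any(w in first_dictionary for w in words):
            for w in set(words):
                if w not in first_dictionary:
                    counts[w] += 1
    return {w for w, c in counts.items() if c > threshold}
```

The result is disjoint from the first dictionary by construction, matching the requirement that the intersection of the two dictionaries be empty.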
In this step, when the first word is a sensitive word in the first dictionary, the first weight of the first word is increased to obtain the second weight. When the first word is a word in the second dictionary, reducing the first weight of the first word results in a second weight. When the first word is not a word in the first dictionary nor a word in the second dictionary, the first weight of the first word is used as the second weight.
When the first word is a sensitive word in the first dictionary, the first word is a true sensitive word, so its first weight needs to be increased. When the first word is a word in the second dictionary, the first word is not a sensitive word, but the model easily mistakes it for one, so its first weight is reduced. When the first word is in neither dictionary, it may or may not be a newly appearing sensitive word that is absent from the first dictionary, so its weight is kept unchanged.
In the step, the first weight of the first word is multiplied by a first coefficient to obtain a second weight, and the first coefficient is larger than 1 so as to increase the first weight of the first word. Or multiplying the first weight of the first word by a second coefficient to obtain a second weight, wherein the second coefficient is smaller than 1, so as to reduce the first weight of the first word.
For example, referring to fig. 5, "someone", "indulges", "in", "uniform temptation", and "." are searched for in the first dictionary and the second dictionary. The word "indulges" is found in the second dictionary and "uniform temptation" in the first dictionary. The second weight of "uniform temptation" is therefore β4 = α4 × a, with a > 1, and the second weight of "indulges" is β2 = α2 × b, with b < 1. The second weight of "someone" is β1 = α1, the second weight of "in" is β3 = α3, and the second weight of "." is β5 = α5.
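Step 104 as described can be sketched directly. The concrete factors a = 1.5 and b = 0.5 are illustrative assumptions; the patent only requires a > 1 and b < 1:

```python
def adjust_weights(words, first_weights, first_dictionary, second_dictionary,
                   a=1.5, b=0.5):
    # Sketch of step 104: multiply the first weight by a > 1 for true
    # sensitive words (first dictionary), by b < 1 for suspected
    # sensitive words (second dictionary), and leave other words'
    # weights unchanged. a and b are illustrative values.
    second_weights = []
    for word, alpha in zip(words, first_weights):
        if word in first_dictionary:
            second_weights.append(alpha * a)
        elif word in second_dictionary:
            second_weights.append(alpha * b)
        else:
            second_weights.append(alpha)
    return second_weights
```

Because the two dictionaries are disjoint, each word matches at most one branch, so the adjustment is unambiguous.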
Step 105: and obtaining an information matrix of the text information according to the second weight of each word and the hidden layer vector of each word.
In this step, the hidden layer vectors of the m words form a d×k first matrix H, the second weights of the m words form a weight vector β, and the information matrix R of the text information is obtained from the first matrix H and the transpose of the weight vector according to the following first formula.
The first formula is R = H × β^T, where β^T is the transpose of the weight vector β.
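With plain Python lists standing in for tensors, the first formula amounts to a matrix–vector product; the sizes d = 2 and m = 3 and all numeric values below are illustrative assumptions:

```python
def info_matrix(H, beta):
    # H: d x m matrix (column j holds the hidden layer vector of word j);
    # beta: length-m weight vector. Returns R = H x beta^T, a length-d vector.
    return [sum(h_ij * b_j for h_ij, b_j in zip(row, beta)) for row in H]

H = [[1.0, 0.0, 2.0],    # d = 2 hidden dimensions, m = 3 words
     [0.0, 1.0, 1.0]]
beta = [0.2, 0.3, 0.5]   # second weights of the 3 words
R = info_matrix(H, beta) # weighted combination of the hidden vectors
```

The result is the weighted sum of the hidden layer vectors, with each word contributing in proportion to its weight.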
Step 104 is an optional step, i.e. step 104 may be skipped and step 105 may be performed directly after step 103. In that case, in step 105, the information matrix of the text information is obtained based on the first weight of each word and the hidden layer vector of each word: the hidden layer vectors of the m words form the d×k first matrix H, the first weights of the m words form the weight vector, and the information matrix R of the text information is obtained from the first formula and the transpose of the weight vector.
Step 106: and inputting the information matrix of the text information into a dimension reduction model, and obtaining the two-dimensional vector of the text information output by the dimension reduction model.
The dimension reduction model is used for carrying out dimension reduction processing on the information matrix of the text information to obtain a two-dimensional vector of the text information.
The dimension reduction model is obtained by training a third deep learning algorithm in advance, and the third deep learning algorithm may be a fully connected layer.
Step 107: and determining whether the text information is sensitive information according to the two-dimensional vector of the text information through a classification function.
In this step, a two-dimensional vector of the text information is input to the classification function. And the classification function calculates the probability of the text information as the sensitive information according to the two-dimensional vector, and determines the text information as the sensitive information when the probability is larger than a preset probability threshold.
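Steps 106 and 107 can be sketched as a fully connected projection to two dimensions followed by a softmax over the two classes. The projection weights, bias and the 0.5 threshold below are illustrative assumptions:

```python
import math

def reduce_dim(r, W, b):
    # Fully connected layer: project the information vector r (length d)
    # down to a two-dimensional vector.
    return [sum(w_i * r_i for w_i, r_i in zip(row, r)) + b_j
            for row, b_j in zip(W, b)]

def softmax(v):
    # Numerically stable softmax over the two-dimensional vector.
    exps = [math.exp(x - max(v)) for x in v]
    s = sum(exps)
    return [e / s for e in exps]

r = [1.2, 0.8]                      # information vector from step 105
W = [[0.5, -0.2], [-0.3, 0.4]]      # 2 x d projection matrix (assumed)
b = [0.0, 0.1]                      # bias (assumed)
two_dim = reduce_dim(r, W, b)
probs = softmax(two_dim)            # probs[1]: probability of "sensitive"
is_sensitive = probs[1] > 0.5       # preset probability threshold
```

The text is flagged as sensitive only when the computed probability exceeds the preset threshold.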
In the embodiment of the application, a word vector of m words included in text information to be recognized is obtained, the word vector of a first word is a semantic representation of the first word, and the first word is one of the m words. Generating a hidden layer vector of the first word based on the word vector of the first word and the word vector of the second word, the second word being a word adjacent to the first word, the hidden layer vector of the first word being a semantic representation of the first word and a semantic representation of the context information. And identifying whether the text information is sensitive information according to the hidden layer vectors of the m words. Therefore, whether the text information is sensitive information or not is identified based on the semantic information of the whole text information, and even if the text information does not comprise newly appeared sensitive words, whether the text information is sensitive information or not can be identified, and accuracy of identifying the sensitive information is improved. In addition, after the first weight of each word is set, the first weight of each word can be adjusted according to the first dictionary including the sensitive word and the second dictionary including the suspected sensitive word to obtain the second weight of each word, and whether the text information is sensitive information or not is recognized based on the second weight of each word, so that the accuracy of recognizing the sensitive information can be further improved.
For the intelligent model for identifying the sensitive information, a sample set may be set in advance, where the sample set includes a plurality of text information and label information corresponding to each text information, and the label information corresponding to the text information may be whether the text information is the sensitive information. For example, the labeling information corresponding to the text information may be 1, which indicates that the text information is sensitive information, or the labeling information corresponding to the text information may be 0, which indicates that the text information is non-sensitive information. Alternatively, the labeling information corresponding to the text information may be 0, which indicates that the text information is sensitive information, or the labeling information corresponding to the text information may be 1, which indicates that the text information is non-sensitive information.
After the sample set is set, the sample set can be used for training a deep learning algorithm to obtain an intelligent model for identifying sensitive information. Referring to fig. 6, the training method includes:
for any text information in the sample set, it is first identified whether the text information is sensitive information using the flow of steps 201 to 207 as follows.
Step 201: the same as step 101; this is not described in detail here.
Step 202: and inputting word vectors of the m words into a first deep learning network, and obtaining hidden layer vectors of each word output by the first deep learning network.
The first deep learning network is used for generating the hidden layer vector of the first word based on the word vector of the first word and the word vector of the second word, wherein the second word is a word adjacent to the first word in front of and behind it. The first deep learning network may be a BiLSTM (bidirectional long short-term memory network).
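A BiLSTM combines a forward pass and a backward pass over the word vectors so that each hidden state mixes a word with both its left and right context. The toy recurrence below replaces the LSTM cell with a simple decaying sum purely to illustrate that structure; the one-dimensional "word vectors" and the 0.5 decay are assumptions, not the actual model:

```python
def bi_context(word_vectors, decay=0.5):
    # Forward pass: each state mixes the current word with the left context.
    fwd, state = [], 0.0
    for v in word_vectors:
        state = decay * state + v
        fwd.append(state)
    # Backward pass: the same recurrence over the right context.
    bwd, state = [], 0.0
    for v in reversed(word_vectors):
        state = decay * state + v
        bwd.append(state)
    bwd.reverse()
    # The hidden "vector" of word i concatenates both directions.
    return list(zip(fwd, bwd))

hidden = bi_context([1.0, 0.0, 2.0])
```

In the real network the decaying sum is replaced by gated LSTM cells, but the concatenation of a left-to-right and a right-to-left state per word is the same.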
The detailed implementation process of this step is the same as the detailed implementation process of step 102, and the detailed implementation process of this step can be obtained only by replacing the context information classification model in step 102 with the first deep learning network, which is not described in detail here.
Step 203: and inputting the hidden layer vectors of the m words into a second deep learning network, and acquiring the first weight of each word output by the second deep learning network.
The second deep learning network is used for setting the first weight of each word in the m words based on the hidden layer vectors of the m words; the first weight of a word is used to represent the contribution of the word to the text information being sensitive information. The second deep learning network may be a self-attention network.
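The first weights can be sketched as a single-query attention over the hidden vectors: score each hidden vector against a learned query, then normalise with softmax so the weights sum to 1. The query values and hidden vectors below are illustrative assumptions:

```python
import math

def attention_weights(hidden_vectors, query):
    # Dot-product score of each hidden vector against the query...
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query))
              for h in hidden_vectors]
    # ...normalised with softmax so each weight reflects that word's
    # relative contribution.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # m = 3 words, d = 2
alpha = attention_weights(hidden, query=[0.5, 0.5])
```

Words whose hidden vectors align better with the query receive larger first weights.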
The detailed implementation process of this step is the same as the detailed implementation process of step 103, and the detailed implementation process of this step can be obtained only by replacing the weight distribution model in step 103 with the second deep learning network, which is not described in detail here.
Steps 204-205: steps 104-105 are identical to each other and will not be described in detail herein.
Step 206: and inputting the information matrix of the text information into a third deep learning algorithm, and obtaining the two-dimensional vector of the text information output by the third deep learning algorithm.
And the third deep learning algorithm is used for performing dimension reduction processing on the information matrix of the text information to obtain a two-dimensional vector of the text information. The third deep learning algorithm may be a fully connected layer.
Step 207: and determining whether the text information is sensitive information according to the two-dimensional vector of the text information through a classification function.
In this step, a two-dimensional vector of the text information is input to the classification function. And the classification function calculates the probability of the text information as the sensitive information according to the two-dimensional vector, and determines the text information as the sensitive information when the probability is larger than a preset probability threshold.
For each other text information in the sample set, whether it is sensitive information is identified in the manner described above.
Step 208: and comparing the labeling information corresponding to each text information in the sample set with the identification result to obtain a comparison result of each text information.
The text information is identified as sensitive or as non-sensitive. And comparing the identification result of the text information with the labeling information, wherein the obtained comparison result is that the identification result of the text information is the same as the labeling information or different from the labeling information.
In this step, the numerical value "1" may be used to indicate that the identification result of the text information is the same as the labeling information, and the numerical value "0" may be used to indicate that the identification result of the text information is different from the labeling information. Alternatively, the recognition result of the text information may be represented by a value "0" that is the same as the labeling information, and the recognition result of the text information may be represented by a value "1" that is different from the labeling information.
Step 209: and obtaining a cost value through a preset cost function according to the comparison result of each text information, and executing step 210 when the cost value is not the minimum cost value.
In this step, the comparison results of each text information form an input vector, the input vector is input as an argument to the cost function, and the cost value is calculated by the cost function.
The preset cost function is a curved surface, and the lowest point of the curved surface is the minimum cost value of the cost function.
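A common choice of cost function for such a binary task is the cross-entropy loss averaged over the sample set, whose surface has the single minimum that training descends toward. Using cross-entropy here is an assumption for illustration; the application does not name a specific function:

```python
import math

def cross_entropy_cost(labels, probs, eps=1e-12):
    # labels: 1 = sensitive, 0 = non-sensitive;
    # probs: predicted probability that each text is sensitive.
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

cost = cross_entropy_cost([1, 0, 1], [0.9, 0.2, 0.6])
```

The closer the predictions sit to the labels, the smaller the cost value, so a prediction matching its label drives the cost toward the minimum.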
Step 210: and adjusting network parameters of the deep learning algorithm according to the cost value, and returning to the step 201.
In this step, the network parameters of the first deep learning algorithm, the network parameters of the second deep learning algorithm, the network parameters of the third deep learning algorithm, and the network parameters of the classification function are adjusted according to the cost value.
And when the cost value is the minimum cost value, ending the flow, and taking the current first deep learning algorithm, the second deep learning algorithm, the third deep learning algorithm and the classification function as a trained context information classification model, a trained weight distribution model, a trained dimension reduction model and a trained classification function respectively.
The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.
Referring to fig. 7, an embodiment of the present application provides an apparatus 300 for identifying sensitive information, the apparatus 300 including:
an obtaining module 301, configured to obtain a word vector of m words included in text information to be identified, where the word vector of a first word is a semantic representation of the first word, the first word is one of the m words, and m is an integer greater than 1;
a generating module 302, configured to generate a hidden layer vector of the first term based on a term vector of the first term and a term vector of a second term, where the second term is a term that is adjacent to the first term, and the hidden layer vector of the first term is a semantic representation of the first term and a semantic representation of context information;
And the identifying module 303 is configured to identify whether the text information is sensitive information according to the hidden layer vectors of the m words.
Optionally, the generating module 302 is configured to:
inputting word vectors of each word in the m words into a context information classification model according to the sequence of each word in the text information, wherein the context information classification model is used for generating hidden layer vectors of the first word based on word vectors of the first word and word vectors of the second word;
and obtaining the hidden layer vector of each word output by the context classification model.
Optionally, the identifying module 303 is configured to:
setting a first weight of each word in the m words according to the hidden layer vector of the m words, wherein the first weight of the word is used for representing the contribution of the word to the text information as sensitive information;
acquiring an information matrix of the text information according to the first weight of each word and the hidden layer vector of each word;
and determining whether the text information is sensitive information according to the information matrix of the text information.
Optionally, the identifying module 303 is configured to:
inputting hidden layer vectors of the m words into a weight distribution model, wherein the weight distribution model is used for setting first weights of each word in the m words based on the hidden layer vectors of the m words;
And acquiring the first weight of each word output by the weight distribution model.
Optionally, the identifying module 303 is configured to:
inputting the information matrix of the text information into a dimension reduction model, wherein the dimension reduction model is used for carrying out dimension reduction processing on the information matrix of the text information to obtain a two-dimensional vector of the text information;
and determining whether the text information is sensitive information or not through a classification function according to the two-dimensional vector of the text information.
Optionally, the apparatus 300 further includes:
the adjusting module is used for adjusting the first weight of each word according to a first dictionary and a second dictionary to obtain the second weight of each word, the first dictionary is used for storing sensitive words, and the second dictionary is used for storing words which are simultaneously appeared with the sensitive words and have the frequency meeting preset conditions;
optionally, the identifying module 303 is configured to:
and acquiring an information matrix of the text information according to the second weight of each word and the hidden layer vector of each word.
Optionally, the adjusting module is configured to:
when the first word is a sensitive word in the first dictionary, increasing a first weight of the first word to obtain a second weight;
And when the first word is a word in the second dictionary, reducing the first weight of the first word to obtain a second weight.
In the embodiment of the application, the obtaining module obtains word vectors of m words included in the text information to be recognized, the word vector of the first word is a semantic representation of the first word, and the first word is one of the m words. The generation module generates a hidden layer vector of the first word based on the word vector of the first word and the word vector of the second word, the second word being a word adjacent to the first word in front of and behind it, the hidden layer vector of the first word being a semantic representation of the first word and a semantic representation of the context information. The identification module identifies whether the text information is sensitive information according to the hidden layer vectors of the m words. Thus, whether the text information is sensitive information is identified based on the semantic information of the whole text information; even if the text information does not include newly appearing sensitive words, whether it is sensitive information can still be identified, thereby improving the accuracy of identifying sensitive information.
The specific manner in which the various modules perform their operations in the apparatus of the above embodiments has been described in detail in the embodiments of the method, and will not be described in detail here.
Fig. 8 shows a block diagram of a terminal 400 according to an exemplary embodiment of the present invention. The terminal 400 may be used to perform the above method for identifying sensitive information or training method, and may be a portable mobile terminal, such as: smart phones, tablet computers, notebook computers or desktop computers. The terminal 400 may also be referred to by other names as user equipment, portable terminal, laptop terminal, desktop terminal, etc.
In general, the terminal 400 includes: a processor 401 and a memory 402.
Processor 401 may include one or more processing cores, such as a 4-core processor, an 8-core processor, etc. The processor 401 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array) and a PLA (Programmable Logic Array). The processor 401 may also include a main processor and a coprocessor. The main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 401 may integrate a GPU (Graphics Processing Unit) for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 401 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 402 may include one or more computer-readable storage media, which may be non-transitory. Memory 402 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 402 is used to store at least one instruction for execution by processor 401 to implement the method of identifying sensitive information or training method provided by the method embodiments herein.
In some embodiments, the terminal 400 may further optionally include: a peripheral interface 403 and at least one peripheral. The processor 401, memory 402, and peripheral interface 403 may be connected by a bus or signal line. The individual peripheral devices may be connected to the peripheral device interface 403 via buses, signal lines or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 404, a touch display 405, a camera 406, audio circuitry 407, a positioning component 408, and a power supply 409.
Peripheral interface 403 may be used to connect at least one Input/Output (I/O) related peripheral to processor 401 and memory 402. In some embodiments, processor 401, memory 402, and peripheral interface 403 are integrated on the same chip or circuit board; in some other embodiments, either or both of the processor 401, memory 402, and peripheral interface 403 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The radio frequency circuit 404 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 404 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 404 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 404 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 404 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the world wide web, metropolitan area networks, intranets, mobile communication networks of each generation (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 404 may also include NFC (Near Field Communication) related circuitry, which is not limited in this application.
The display screen 405 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 405 is a touch display screen, the display screen 405 also has the ability to collect touch signals at or above the surface of the display screen 405. The touch signal may be input as a control signal to the processor 401 for processing. At this time, the display screen 405 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 405 may be one, providing a front panel of the terminal 400; in other embodiments, the display 405 may be at least two, and disposed on different surfaces of the terminal 400 or in a folded design; in still other embodiments, the display 405 may be a flexible display disposed on a curved surface or a folded surface of the terminal 400. Even more, the display screen 405 may be arranged in an irregular pattern that is not rectangular, i.e. a shaped screen. The display 405 may be made of LCD (Liquid Crystal Display ), OLED (Organic Light-Emitting Diode) or other materials.
The camera assembly 406 is used to capture images or video. Optionally, the camera assembly 406 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so as to realize a background blurring function by fusing the main camera and the depth-of-field camera, panoramic shooting and VR (Virtual Reality) shooting functions by fusing the main camera and the wide-angle camera, or other fusion shooting functions. In some embodiments, the camera assembly 406 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
The audio circuit 407 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 401 for processing, or inputting the electric signals to the radio frequency circuit 404 for realizing voice communication. For the purpose of stereo acquisition or noise reduction, a plurality of microphones may be respectively disposed at different portions of the terminal 400. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 401 or the radio frequency circuit 404 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, audio circuit 407 may also include a headphone jack.
The positioning component 408 is used to locate the current geographic location of the terminal 400 to enable navigation or LBS (Location Based Service). The positioning component 408 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 409 is used to power the various components in the terminal 400. The power supply 409 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When power supply 409 comprises a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the terminal 400 further includes one or more sensors 410. The one or more sensors 410 include, but are not limited to: acceleration sensor 411, gyroscope sensor 412, pressure sensor 413, fingerprint sensor 414, optical sensor 415, and proximity sensor 416.
The acceleration sensor 411 may detect the magnitudes of accelerations on three coordinate axes of the coordinate system established with the terminal 400. For example, the acceleration sensor 411 may be used to detect components of gravitational acceleration on three coordinate axes. The processor 401 may control the touch display screen 405 to display a user interface in a lateral view or a longitudinal view according to the gravitational acceleration signal acquired by the acceleration sensor 411. The acceleration sensor 411 may also be used for the acquisition of motion data of a game or a user.
The gyro sensor 412 may detect a body direction and a rotation angle of the terminal 400, and the gyro sensor 412 may collect a 3D motion of the user to the terminal 400 in cooperation with the acceleration sensor 411. The processor 401 may implement the following functions according to the data collected by the gyro sensor 412: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
The pressure sensor 413 may be disposed at a side frame of the terminal 400 and/or at a lower layer of the touch display 405. When the pressure sensor 413 is disposed at a side frame of the terminal 400, a grip signal of the terminal 400 by a user may be detected, and the processor 401 performs a left-right hand recognition or a shortcut operation according to the grip signal collected by the pressure sensor 413. When the pressure sensor 413 is disposed at the lower layer of the touch display screen 405, the processor 401 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 405. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 414 is used to collect a fingerprint of the user, and the processor 401 identifies the identity of the user based on the fingerprint collected by the fingerprint sensor 414, or the fingerprint sensor 414 identifies the identity of the user based on the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the user is authorized by the processor 401 to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying for and changing settings, etc. The fingerprint sensor 414 may be provided on the front, back or side of the terminal 400. When a physical key or vendor Logo is provided on the terminal 400, the fingerprint sensor 414 may be integrated with the physical key or vendor Logo.
The optical sensor 415 is used to collect the ambient light intensity. In one embodiment, the processor 401 may control the display brightness of the touch display screen 405 according to the ambient light intensity collected by the optical sensor 415. Specifically, when the intensity of the ambient light is high, the display brightness of the touch display screen 405 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 405 is turned down. In another embodiment, the processor 401 may also dynamically adjust the shooting parameters of the camera assembly 406 according to the ambient light intensity collected by the optical sensor 415.
A proximity sensor 416, also referred to as a distance sensor, is typically provided on the front panel of the terminal 400. The proximity sensor 416 is used to collect the distance between the user and the front of the terminal 400. In one embodiment, when the proximity sensor 416 detects a gradual decrease in the distance between the user and the front face of the terminal 400, the processor 401 controls the touch display 405 to switch from the bright screen state to the off screen state; when the proximity sensor 416 detects that the distance between the user and the front surface of the terminal 400 gradually increases, the processor 401 controls the touch display screen 405 to switch from the off-screen state to the on-screen state.
Those skilled in the art will appreciate that the structure shown in fig. 8 is not limiting of the terminal 400 and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (8)

1. A method of identifying sensitive information, the method comprising:
acquiring word vectors of m words included in text information to be recognized, wherein the word vector of a first word is a semantic representation of the first word, the first word is one of the m words, and m is an integer greater than 1;
generating a hidden layer vector of the first word based on the word vector of the first word and the word vector of a second word, wherein the second word is a word adjacent to the first word in front of and behind the first word, and the hidden layer vector of the first word is a semantic representation of the first word and a semantic representation of context information;
Setting a first weight of each word in the m words according to the hidden layer vector of the m words, wherein the first weight of the word is used for representing the contribution of the word to the text information as sensitive information;
the first weight of each word is adjusted according to a first dictionary and a second dictionary, so that a second weight of each word is obtained, the first dictionary is used for storing sensitive words, and the second dictionary is used for storing words which are simultaneously appeared with the sensitive words and have the frequency meeting preset conditions;
acquiring an information matrix of the text information according to the second weight of each word and the hidden layer vector of each word;
and determining whether the text information is sensitive information according to the information matrix of the text information.
2. The method of claim 1, wherein the generating the hidden layer vector of the first word based on the word vector of the first word and the word vector of the second word comprises:
inputting the word vector of each of the m words into a context information classification model according to the order in which the words appear in the text information, wherein the context information classification model is used for generating the hidden layer vector of the first word based on the word vector of the first word and the word vector of the second word;
and obtaining the hidden layer vector of each word output by the context information classification model.
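The patent does not fix the architecture of the context information classification model; a bidirectional recurrent network is the usual choice for producing context-aware hidden vectors. As a minimal illustrative sketch (the neighbour-averaging here is an assumption standing in for such a model), each word's hidden layer vector can be formed from its own word vector and those of the immediately adjacent words:

```python
import numpy as np

def context_hidden(word_vecs):
    # Simplified stand-in for the context information classification
    # model: each word's hidden layer vector mixes its own word vector
    # with those of the immediately adjacent words. A real
    # implementation would typically use a bidirectional recurrent
    # network; averaging over the neighbourhood is an illustrative
    # assumption, not the claimed model.
    m, _ = word_vecs.shape
    h = np.empty_like(word_vecs)
    for i in range(m):
        lo, hi = max(0, i - 1), min(m, i + 2)  # word i and its neighbours
        h[i] = word_vecs[lo:hi].mean(axis=0)
    return h
```

Each output row thus encodes the word and its local context, matching the claim's requirement that the hidden layer vector represent both the first word and its context information.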
3. The method of claim 1, wherein the setting the first weight for each of the m words according to the hidden layer vectors of the m words comprises:
inputting the hidden layer vectors of the m words into a weight distribution model, wherein the weight distribution model is used for setting the first weight of each of the m words based on the hidden layer vectors of the m words;
and obtaining the first weight of each word output by the weight distribution model.
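The claim leaves the weight distribution model unspecified; an attention-style scorer over the hidden layer vectors is one common realisation. The sketch below assumes an additive-attention scoring function with hypothetical learned parameters `w_proj` and `v` — the patent only requires that the weights be derived from the hidden layer vectors:

```python
import numpy as np

def first_weights(hidden, w_proj, v):
    # One plausible weight distribution model (additive attention):
    # score each hidden layer vector, then normalise with softmax so
    # the m first weights are positive and sum to 1. The scoring
    # function and the parameters w_proj, v are assumptions for
    # illustration.
    scores = np.tanh(hidden @ w_proj) @ v   # (m,) one score per word
    e = np.exp(scores - scores.max())       # numerically stable softmax
    return e / e.sum()
```

A word whose hidden vector scores higher receives a larger first weight, i.e. a larger presumed contribution to the text being sensitive.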
4. The method of claim 1, wherein the determining whether the text information is sensitive information based on the information matrix of the text information comprises:
inputting the information matrix of the text information into a dimension reduction model, wherein the dimension reduction model is used for carrying out dimension reduction processing on the information matrix of the text information to obtain a two-dimensional vector of the text information;
and determining whether the text information is sensitive information or not through a classification function according to the two-dimensional vector of the text information.
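A minimal sketch of claim 4's final stage, under stated assumptions: the information matrix is aggregated over words (summation is an illustrative choice) and mapped by hypothetical learned parameters `w_reduce` and `b` to a two-dimensional vector, to which a softmax classification function is applied:

```python
import numpy as np

def classify(info_matrix, w_reduce, b):
    # Dimension reduction model (sketch): pool the (m, d) information
    # matrix over words, project to a two-dimensional vector, then
    # apply a softmax classification function. The pooling choice and
    # the parameters w_reduce (d, 2) and b (2,) are illustrative
    # assumptions.
    pooled = info_matrix.sum(axis=0)        # (d,) aggregate over words
    logits = pooled @ w_reduce + b          # (2,) two-dimensional vector
    e = np.exp(logits - logits.max())
    probs = e / e.sum()
    return probs[1] > probs[0], probs       # True => sensitive
```

The two softmax outputs can be read as the probabilities of "not sensitive" and "sensitive", with the larger one deciding the label.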
5. The method of claim 1, wherein the adjusting the first weight of each word according to the first dictionary and the second dictionary to obtain the second weight of each word comprises:
when the first word is a sensitive word in the first dictionary, increasing the first weight of the first word to obtain the second weight;
and when the first word is a word in the second dictionary, reducing the first weight of the first word to obtain the second weight.
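Claim 5's adjustment can be sketched as a dictionary lookup per word. The multiplicative `boost`/`damp` factors below are illustrative assumptions — the claim only specifies that dictionary hits increase the weight and co-occurrence hits decrease it:

```python
import numpy as np

def adjust_weights(words, w1, sensitive_dict, cooccur_dict,
                   boost=2.0, damp=0.5):
    # Raise the first weight of words found in the sensitive-word
    # dictionary, lower it for words found in the co-occurrence
    # dictionary, and leave the rest unchanged. The multiplicative
    # factors boost/damp are assumptions; the claim only requires
    # increase/decrease.
    w2 = []
    for w, t in zip(w1, words):
        if t in sensitive_dict:
            w2.append(w * boost)
        elif t in cooccur_dict:
            w2.append(w * damp)
        else:
            w2.append(w)
    return np.array(w2)
```

The resulting second weights then scale the corresponding hidden layer vectors to form the information matrix of claim 1.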
6. An apparatus for identifying sensitive information, the apparatus comprising:
an acquisition module, configured to acquire word vectors of m words included in text information to be recognized, wherein the word vector of a first word is a semantic representation of the first word, the first word is one of the m words, and m is an integer greater than 1;
a generation module, configured to generate a hidden layer vector of the first word based on the word vector of the first word and the word vector of a second word, wherein the second word is a word immediately preceding or following the first word, and the hidden layer vector of the first word is a semantic representation of the first word together with its context information;
a recognition module, configured to set a first weight for each of the m words according to the hidden layer vectors of the m words, wherein the first weight of a word represents the word's contribution to the text information being sensitive information;
an adjustment module, configured to adjust the first weight of each word according to a first dictionary and a second dictionary to obtain a second weight of each word, wherein the first dictionary stores sensitive words and the second dictionary stores words that co-occur with the sensitive words with a frequency meeting a preset condition;
the recognition module being further configured to obtain an information matrix of the text information according to the second weight of each word and the hidden layer vector of each word, and to determine whether the text information is sensitive information according to the information matrix of the text information.
7. An electronic device comprising at least one processor and at least one memory for storing at least one instruction to be loaded and executed by the at least one processor to implement the method of any one of claims 1 to 5.
8. A computer readable storage medium storing at least one instruction for loading and execution by a processor to implement the method of any one of claims 1 to 5.
CN201910918780.0A 2019-09-26 2019-09-26 Method and device for identifying sensitive information Active CN112560472B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910918780.0A CN112560472B (en) 2019-09-26 2019-09-26 Method and device for identifying sensitive information


Publications (2)

Publication Number Publication Date
CN112560472A CN112560472A (en) 2021-03-26
CN112560472B true CN112560472B (en) 2023-07-11

Family

ID=75030055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910918780.0A Active CN112560472B (en) 2019-09-26 2019-09-26 Method and device for identifying sensitive information

Country Status (1)

Country Link
CN (1) CN112560472B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107515877A (en) * 2016-06-16 2017-12-26 百度在线网络技术(北京)有限公司 Method and device for generating a sensitive topic word set
CN107918633A (en) * 2017-03-23 2018-04-17 广州思涵信息科技有限公司 Sensitive public opinion content identification method and early-warning system based on semantic analysis technology
CN108228554A (en) * 2016-12-09 2018-06-29 富士通株式会社 Method, apparatus and electronic device for generating word vectors based on a semantic representation model
CN108984530A (en) * 2018-07-23 2018-12-11 北京信息科技大学 Detection method and system for sensitive network content
CN109543084A (en) * 2018-11-09 2019-03-29 西安交通大学 Method for building a detection model for covert sensitive text in social media
CN109657243A (en) * 2018-12-17 2019-04-19 江苏满运软件科技有限公司 Sensitive information recognition method, system, device and storage medium
CN109918506A (en) * 2019-03-07 2019-06-21 安徽省泰岳祥升软件有限公司 Text classification method and device
CN109977416A (en) * 2019-04-03 2019-07-05 中山大学 Multi-level natural-language anti-spam text method and system
CN110209818A (en) * 2019-06-04 2019-09-06 南京邮电大学 Semantics-oriented analysis method for sensitive words and phrases
CN110232914A (en) * 2019-05-20 2019-09-13 平安普惠企业管理有限公司 Semantic recognition method, device and related equipment
CN110263122A (en) * 2019-05-08 2019-09-20 北京奇艺世纪科技有限公司 Keyword acquisition method, device and computer-readable storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9558743B2 (en) * 2013-03-15 2017-01-31 Google Inc. Integration of semantic context information
CN106503236B (en) * 2016-10-28 2020-09-11 北京百度网讯科技有限公司 Artificial intelligence based problem classification method and device
US11100237B2 (en) * 2017-09-08 2021-08-24 Citrix Systems, Inc. Identify and protect sensitive text in graphics data
US10453447B2 (en) * 2017-11-28 2019-10-22 International Business Machines Corporation Filtering data in an audio stream

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
From Information Retrieval to Search Engines; Wang Bin; Terminology Standardization and Information Technology (04); full text *
Sensitive document detection method based on convolutional neural networks; Lin Xuefeng; Xia Yuanyi; Guo Jinlong; Yu Xiaowen; Computer and Modernization (07); full text *

Also Published As

Publication number Publication date
CN112560472A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
CN110059686B (en) Character recognition method, device, equipment and readable storage medium
CN112052354A (en) Video recommendation method, video display method and device and computer equipment
CN110737692A (en) data retrieval method, index database establishment method and device
CN112860046B (en) Method, device, electronic equipment and medium for selecting operation mode
CN110853124B (en) Method, device, electronic equipment and medium for generating GIF dynamic diagram
CN111563201A (en) Content pushing method, device, server and storage medium
CN113361376B (en) Method and device for acquiring video cover, computer equipment and readable storage medium
CN114925667A (en) Content classification method, device, equipment and computer readable storage medium
CN112001442B (en) Feature detection method, device, computer equipment and storage medium
CN111145723B (en) Method, device, equipment and storage medium for converting audio
CN112560472B (en) Method and device for identifying sensitive information
CN114817709A (en) Sorting method, device, equipment and computer readable storage medium
CN109816047B (en) Method, device and equipment for providing label and readable storage medium
CN111125095B (en) Method, device, electronic equipment and medium for adding data prefix
CN112487162A (en) Method, device and equipment for determining text semantic information and storage medium
CN109388732B (en) Music map generating and displaying method, device and storage medium
CN108733831B (en) Method and device for processing word stock
CN111858983A (en) Picture type determining method and device, electronic equipment and storage medium
CN111782767A (en) Question answering method, device, equipment and storage medium
CN112311652A (en) Message sending method, device, terminal and storage medium
CN111581481B (en) Search term recommendation method and device, electronic equipment and storage medium
CN113593521B (en) Speech synthesis method, device, equipment and readable storage medium
CN111007969B (en) Method, device, electronic equipment and medium for searching application
CN112989198B (en) Push content determination method, device, equipment and computer-readable storage medium
CN112214115B (en) Input mode identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant