CN117496977A

CN117496977A - Gateway-based data desensitization method

Info

Publication number: CN117496977A
Application number: CN202311444921.2A
Authority: CN
Inventors: 谢雨航; 刘明礼; 庄恩贵
Original assignee: Beijing Jingan Yun Xin Technology Co ltd
Current assignee: Beijing Jingan Yun Xin Technology Co ltd
Priority date: 2023-11-02
Filing date: 2023-11-02
Publication date: 2024-02-02
Anticipated expiration: 2043-11-02
Also published as: CN117496977B

Abstract

The invention relates to the field of data desensitization, and in particular to a gateway-based data desensitization method. Voice data received by a gateway is converted into text data and divided into a plurality of text sentences. The vocabulary in each text sentence is compared with a plurality of forbidden vocabularies stored in a sample database to determine whether a feature vocabulary exists in the text sentence. A plurality of sample sentences containing the sample associated vocabulary are extracted from the sample database. The sentence structure of each text sentence in which a feature vocabulary exists is analyzed and compared with the sentence structures of the extracted sample sentences, and a structure fitting parameter is calculated to judge the association relationship between the text sentence and the sample sentences. A corresponding desensitization strategy is then executed based on that association relationship. The method thereby addresses the poor desensitization effect that arises when the actual pronunciation differs in tone from a forbidden vocabulary, and improves the desensitization effect by adaptively adjusting the desensitization method.

Description

Gateway-based data desensitization method

Technical Field

The invention relates to the field of data desensitization, in particular to a gateway-based data desensitization method.

Background

With the continuous development of information technology and the wide application of intelligent devices, sensitive information is increasingly easy to acquire and abuse. Technology for desensitizing data at a gateway has therefore emerged to protect private data and prevent data leakage.

Chinese patent publication No. CN116760588A relates to the field of data desensitization and discloses a data desensitization system and a desensitization method. The data desensitization system comprises a gateway, at least one business micro-service, and an authentication center micro-service, wherein the business micro-service and the authentication center micro-service are each in communication connection with the gateway. The gateway is configured to receive a desensitization processing request from a user, query the authentication center micro-service for a desensitization rule according to the request, and judge whether the desensitization rule exists; if so, the gateway executes the desensitization rule and responds to the client; if not, it responds to the client directly. By handling desensitization uniformly at the gateway, burdensome development of desensitization logic in each business micro-service is avoided, and there is no need to develop two separate sets of logic for calls between micro-services and calls by users, which reduces workload and cost and eases maintenance.

However, the prior art has the following problems:

In the prior art, when voice is desensitized, if the actual pronunciation differs in tone from a forbidden vocabulary, misjudgment easily occurs when the voice is converted into text sentences for desensitization, and the desensitization effect is poor. Existing desensitization methods do not take this factor into account and do not adaptively adjust the desensitization method according to the characteristics of the converted text sentences to improve the desensitization effect.

Disclosure of Invention

Therefore, the invention provides a gateway-based data desensitization method to solve the following problems in the prior art: when voice is desensitized, if the actual pronunciation differs in tone from a forbidden vocabulary, misjudgment easily occurs when the voice is converted into text sentences for desensitization, and existing desensitization methods do not adaptively adjust for this factor.

To achieve the above object, the present invention provides a gateway-based data desensitization method, which includes:

step S1, voice data received by a gateway are converted into text data, the text data are divided into a plurality of text sentences, and vocabularies in the text sentences are compared with a plurality of forbidden vocabularies stored in a sample database to determine whether characteristic vocabularies exist in the text sentences;

step S2, extracting from the sample database a plurality of sample sentences containing a sample associated vocabulary, wherein the sample associated vocabulary is a forbidden vocabulary having the same pinyin characteristics as the feature vocabulary;

step S3, analyzing the sentence structure of the text sentence in which the feature vocabulary exists, comparing it with the sentence structures of the extracted sample sentences, and calculating a structure fitting parameter to judge the association relationship between the text sentence and the sample sentences;

step S4, based on the association relation between the text sentence and the sample sentence, executing a corresponding desensitization strategy, comprising,

analyzing the semantic association degree between the feature vocabulary and the rest of the sentence, and desensitizing the text sentence when the semantic association degree is smaller than a preset standard;

or determining the non-feature words in the text sentence, comparing the non-feature words with the extracted sample sentences, calculating an association characterization value according to the association degree between the non-feature words in the text sentence and the sample sentences so as to judge whether the feature vocabulary is a forbidden vocabulary, and desensitizing the text sentence when the feature vocabulary is a forbidden vocabulary.

Further, in the step S1, the process of determining whether the feature vocabulary exists in the text sentence based on the comparison result of the vocabulary in the text sentence and the forbidden vocabularies stored in the sample database comprises,

if a forbidden vocabulary in the sample database has the same pinyin characteristics as a vocabulary in the text sentence, determining that a feature vocabulary exists in the text sentence.
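
The matching rule in step S1 can be illustrated with a minimal sketch. It rests on assumptions that are not in the source: "same pinyin characteristics" is read here as identical syllables with tones ignored, and the table `PINYIN` is a small hand-written, hypothetical stand-in for a real pinyin converter. The function name `find_feature_words` is likewise illustrative.

```python
# Hypothetical toneless pinyin table (assumption); a real gateway would
# derive pinyin with a proper converter rather than a fixed dictionary.
PINYIN = {
    "事故": ("shi", "gu"),
    "世故": ("shi", "gu"),
    "故事": ("gu", "shi"),
    "天气": ("tian", "qi"),
}


def find_feature_words(sentence_words, forbidden_words):
    """Return the words whose toneless pinyin equals that of a forbidden word."""
    forbidden_pinyin = {PINYIN[w] for w in forbidden_words if w in PINYIN}
    # Words missing from the table map to None, which never matches.
    return [w for w in sentence_words if PINYIN.get(w) in forbidden_pinyin]


# Usage: with "事故" treated as the forbidden vocabulary for illustration,
# "世故" shares its pinyin and is flagged, while "故事" and "天气" are not.
flagged = find_feature_words(["天气", "世故", "故事"], ["事故"])
```

A flagged word is only a candidate at this point; whether it is actually a forbidden vocabulary is decided by the later steps.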

Further, in the step S3, the process of calculating the structure fitting parameter based on the comparison result of the sentence structure of the text sentence in which the feature vocabulary exists and the sentence structure of the extracted several sample sentences includes,

determining the number of feature sample sentences among the extracted sample sentences, and taking the ratio of the number of feature sample sentences to the number of extracted sample sentences as the structure fitting parameter, wherein a feature sample sentence is a sample sentence having the same sentence structure as the text sentence in which the feature vocabulary exists.

Further, in the step S3, the process of determining the association relationship between the text sentence and the sample sentence based on the structure fitting parameter includes,

comparing the structural fitting parameter with a preset fitting comparison threshold value,

if the structural fitting parameter is larger than or equal to the fitting comparison threshold, judging that the association relationship between the text sentence and the sample sentence is a strong association relationship;

and if the structure fitting parameter is smaller than the fitting comparison threshold, judging that the association relationship between the text sentence and the sample sentence is a weak association relationship.
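
The ratio and the threshold comparison described for step S3 can be sketched as follows. This is an illustration under assumptions: a sentence structure is represented as a tuple of grammatical roles (the source does not fix a representation), the function names are hypothetical, and the default threshold 0.8 is simply one value from the interval [0.7, 0.9] given in the embodiment.

```python
def structure_fitting_parameter(text_structure, sample_structures):
    """Ratio of extracted sample sentences sharing the text sentence's structure."""
    if not sample_structures:
        return 0.0  # assumption: with no extracted samples, treat the fit as zero
    same = sum(1 for s in sample_structures if s == text_structure)
    return same / len(sample_structures)


def association_relation(fit, threshold=0.8):
    """Classify as 'strong' when fit >= threshold, otherwise 'weak'."""
    return "strong" if fit >= threshold else "weak"


# Usage: 3 of 4 sample sentences share the subject-verb-object structure.
text = ("subject", "verb", "object")
samples = [text, text, ("subject", "verb"), text]
fit = structure_fitting_parameter(text, samples)  # 0.75
relation = association_relation(fit)              # "weak" at threshold 0.8
```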

Further, in the step S4, the process of determining the executed desensitization strategy based on the association relation between the text sentence and the sample sentence comprises,

if the association relationship between the text sentence and the sample sentences is a strong association relationship, analyzing the semantic association degree between the feature vocabulary and the rest of the sentence, and desensitizing the text sentence when the semantic association degree is smaller than a preset standard;

if the association relationship between the text sentence and the sample sentences is a weak association relationship, determining the remaining vocabularies in the text sentence other than the feature vocabulary, comparing the remaining vocabularies with the extracted sample sentences, calculating an association characterization value for each extracted sample sentence, judging whether the feature vocabulary is a forbidden vocabulary based on the maximum of the association characterization values, and desensitizing the text sentence when the feature vocabulary is a forbidden vocabulary.
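
The two branches of step S4 can be sketched as a single decision function. This is a hedged illustration: it assumes the semantic association degree and the association characterization values have already been computed as numbers, the function and parameter names are hypothetical, and the default thresholds are example values taken from the intervals [0.5, 0.8] and [0.6, 0.8] given in the embodiment.

```python
def should_desensitize(relation, semantic_degree, characterization_values,
                       preset_standard=0.6, characterization_threshold=0.7):
    """Decide whether a text sentence containing a feature vocabulary is desensitized."""
    if relation == "strong":
        # Strong association: a low semantic association degree between the
        # feature vocabulary and the rest of the sentence triggers desensitization.
        return semantic_degree < preset_standard
    # Weak association: the feature vocabulary is judged forbidden when the
    # largest association characterization value reaches the threshold.
    largest = max(characterization_values, default=0.0)
    return largest >= characterization_threshold
```

In the weak branch an empty list of characterization values is treated as "not forbidden"; that default is a choice of this sketch, not something stated in the source.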

Further, in the step S4, the process of desensitizing the text sentence includes deleting the voice data corresponding to the text sentence.

Further, the process of calculating the association characterization value according to the semantic association degree of each non-characteristic word in the text sentence and the sample sentence comprises,

calculating the semantic association degree between each non-feature word in the text sentence and the sample sentence, and taking the average of these association degrees as the association characterization value.
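
The averaging step can be sketched as below. The per-word semantic association degree is not specified at this point in the source, so the sketch takes it as a function argument; `overlap_degree` is a deliberately crude, hypothetical stand-in used only to make the example runnable.

```python
from statistics import mean


def overlap_degree(word, sample_sentence_words):
    """Hypothetical stand-in degree: 1.0 if the word occurs in the sample sentence."""
    return 1.0 if word in sample_sentence_words else 0.0


def association_characterization_value(non_feature_words, sample_sentence_words,
                                       degree=overlap_degree):
    """Average the per-word association degrees against one sample sentence."""
    if not non_feature_words:
        return 0.0  # assumption: no non-feature words means no supporting evidence
    return mean(degree(w, sample_sentence_words) for w in non_feature_words)


# Usage: two of the four non-feature words occur in the sample sentence.
value = association_characterization_value(["a", "b", "c", "d"], ["a", "b"])
```

One such value is computed per extracted sample sentence, which yields the set of values whose maximum is compared against the threshold.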

Further, in the step S4, the process of determining whether the feature vocabulary is a forbidden vocabulary based on the maximum of the association characterization values includes,

determining the maximum of the association characterization values, and comparing the maximum with a preset association characterization comparison threshold,

and if the maximum value is greater than or equal to the association characterization comparison threshold value, judging that the characteristic vocabulary is forbidden vocabulary.

Further, in the step S4, when the ratio of the number of text sentences to be desensitized to the number of text sentences in the text data exceeds a predetermined ratio, an alarm message is sent to indicate that the received voice data is abnormal.
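
The alarm condition can be sketched as a simple ratio check. The function name is hypothetical, and the default of 0.15 is only an example value from the interval [0.1, 0.2] given in the embodiment.

```python
def should_alarm(num_desensitized, num_sentences, predetermined_ratio=0.15):
    """Return True when the share of desensitized sentences exceeds the ratio."""
    if num_sentences == 0:
        return False  # assumption: empty text data never raises an alarm
    return num_desensitized / num_sentences > predetermined_ratio
```

The comparison is strict because the text says the ratio "exceeds" the predetermined ratio.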

Further, in the step S1, the data amount of the voice data received by the gateway at a time does not exceed a predetermined data amount threshold.

Compared with the prior art, the invention has the following advantages. Voice data received by a gateway is converted into text data and divided into a plurality of text sentences; the vocabulary in each text sentence is compared with a plurality of forbidden vocabularies stored in the sample database to determine whether a feature vocabulary exists in the text sentence; a plurality of sample sentences containing the sample associated vocabulary are extracted from the sample database; the sentence structure of the text sentence in which the feature vocabulary exists is analyzed and compared with the sentence structures of the extracted sample sentences; a structure fitting parameter is calculated to judge the association relationship between the text sentence and the sample sentences; and a corresponding desensitization strategy is executed based on that association relationship. Through this process, the poor desensitization effect that arises when the actual pronunciation differs in tone from a forbidden vocabulary is taken into account, and the desensitization effect of the gateway on data is improved.

In particular, the invention compares the vocabulary in each text sentence with the forbidden vocabularies stored in the sample database to determine whether a feature vocabulary exists in the text sentence. In practice, the conversion of voice data into text data is affected by factors such as the tone of the speech, so the converted text sentences have the same pinyin characteristics as the voice data but may not reproduce the exact words spoken. The comparison therefore identifies the vocabulary that has the same pinyin characteristics as a forbidden vocabulary in the sample database, namely the feature vocabulary, so that text sentences containing a feature vocabulary can subsequently be given specific processing, which improves the desensitization effect of the gateway.

In particular, the invention judges the association relationship between the text sentence and the sample sentences by calculating the structure fitting parameter, which characterizes how similar the sentence structure of the text sentence containing the feature vocabulary is to the sentence structures of the extracted sample sentences. In practice, a higher degree of sentence structure similarity between two sentences means that they have similar grammatical structures and that the meanings they express are more similar, that is, the association relationship between the text sentence and the sample sentences is stronger. The invention therefore classifies the association relationship according to the structure fitting parameter so that a desensitization strategy matched to the strength of the association relationship can be applied, which improves the desensitization effect of the gateway on data.

In particular, if the association relationship between the text sentence and the sample sentences is judged to be a strong association relationship, the invention analyzes the semantic association degree between the feature vocabulary and the rest of the sentence and desensitizes the text sentence when the semantic association degree is smaller than a preset standard. A strong association relationship indicates that the meanings expressed by the text sentence and the sample sentences are highly similar. In practice, if the semantic association degree between the feature vocabulary and the rest of the sentence is small, that is, the collocation is unreasonable, the feature vocabulary is taken to be a forbidden vocabulary, and the text sentence is therefore desensitized.

In particular, if the association relationship between the text sentence and the sample sentences is judged to be a weak association relationship, the invention calculates an association characterization value between the text sentence and each extracted sample sentence, judges whether the feature vocabulary is a forbidden vocabulary based on the maximum of these values, and desensitizes the text sentence when the feature vocabulary is judged to be a forbidden vocabulary. A weak association relationship indicates that the meanings expressed by the text sentence and the sample sentences have low similarity, so it cannot yet be determined whether the feature vocabulary is a forbidden vocabulary and further judgment is needed. The association characterization value indicates the degree of association between the non-feature words of the text sentence and a sample sentence; if the maximum of the association characterization values is greater than or equal to the preset association characterization comparison threshold, the feature vocabulary is a forbidden vocabulary. Through this process, the text sentence is judged further in cases where it cannot be determined directly whether the feature vocabulary is a forbidden vocabulary, which improves the desensitization effect of the gateway on data.

Drawings

FIG. 1 is a schematic diagram of steps of a gateway-based data desensitization method according to an embodiment of the invention;

FIG. 2 is a flowchart for judging whether a feature vocabulary exists in a text sentence according to an embodiment of the present invention;

FIG. 3 is a flowchart for determining the association between a text sentence and a sample sentence according to an embodiment of the present invention;

FIG. 4 is a flow chart of determining whether a feature vocabulary is forbidden according to an embodiment of the present invention.

Detailed Description

In order that the objects and advantages of the invention will become more apparent, the invention will be further described with reference to the following examples; it should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are merely for explaining the technical principles of the present invention, and are not intended to limit the scope of the present invention.

Referring to FIG. 1 to FIG. 4, which are respectively a schematic diagram of the steps of the gateway-based data desensitization method, a flowchart for judging whether a feature vocabulary exists in a text sentence, a flowchart for judging the association relationship between a text sentence and sample sentences, and a flowchart for judging whether a feature vocabulary is a forbidden vocabulary, according to an embodiment of the present invention, the gateway-based data desensitization method of the present invention includes steps S1 to S4 as set out in the Disclosure of Invention above.

Specifically, the manner of converting the voice data into text data is not limited. For example, a language processing algorithm capable of converting voice data into text data and of identifying the end of each sentence in the text data may be preset in the gateway, so that the voice data received by the gateway is converted into text data and appropriate punctuation marks are automatically inserted at the end of each text sentence. This is prior art and is not described in detail here.

Specifically, the manner of dividing the text data into text sentences is not limited. For example, a sentence segmentation algorithm may be preset in the gateway to identify the end of each text sentence from the punctuation marks and to split the text data at the end of each text sentence. Other approaches may also be used, provided that they divide the text data into text sentences; this is not described in detail here.
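
Punctuation-based sentence division of this kind can be sketched with a regular expression. This is only one possible approach, as the text itself notes; the set of sentence-ending marks and the function name `split_sentences` are assumptions of the sketch.

```python
import re

# Assumed sentence-ending marks: Chinese and ASCII full-stop-like punctuation
# except the ASCII period, which is left out to avoid splitting decimals.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]?")


def split_sentences(text):
    """Split text data into text sentences, keeping the closing punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


# Usage: trailing text without punctuation is kept as its own sentence.
sentences = split_sentences("今天天气很好。你吃饭了吗？好")
```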

Specifically, the manner of parsing the sentence structures of the text sentences and the sample sentences is not limited. For example, grammar rules that have been trained in advance to identify sentence structure may be imported into the gateway, and the samples used for training may be text data having a plurality of different sentence structures. Any approach that identifies the sentence structures of the text sentence and of the sample sentences may be used. This is prior art and is not described in detail here.

Specifically, the manner of computing the semantic association degree between the feature vocabulary and the rest of the sentence is not limited. For example, a pre-trained word vector model may be imported into the gateway, the feature vocabulary and the rest of the sentence may each be represented as vectors by the word vector model, and the Euclidean distance between the two vectors may be used as the semantic association degree between the feature vocabulary and the rest of the sentence. Other approaches may also be used. For example, a Word2Vec word vector model represents words as dense vectors, so that the semantic association degree between two words can be measured by the distance between their word vectors, and the semantic association degree between a word and a sentence can be taken as the average of the semantic association degrees between that word and each word in the sentence. This is prior art and is not described in detail here.
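
The distance-based measure can be sketched with toy vectors. Everything concrete here is an assumption: `VECTORS` is a hypothetical two-dimensional table standing in for a pre-trained word vector model, and the function names are illustrative. Note that a Euclidean distance is small when two vectors are close; the source does not say how the distance is scaled before it is compared with the preset standard, so the sketch returns the raw distance.

```python
import math

# Hypothetical 2-D embeddings (assumption); a real gateway would load a
# pre-trained word vector model such as Word2Vec.
VECTORS = {
    "bank": (1.0, 0.0),
    "money": (1.0, 1.0),
    "river": (4.0, 4.0),
}


def word_distance(a, b):
    """Euclidean distance between the vectors of two words."""
    return math.dist(VECTORS[a], VECTORS[b])


def word_sentence_distance(word, sentence_words):
    """Average distance from one word to each known word of a sentence."""
    distances = [word_distance(word, w) for w in sentence_words if w in VECTORS]
    return sum(distances) / len(distances) if distances else math.inf


# Usage: distances 1.0 (money) and 5.0 (river) average to 3.0.
average = word_sentence_distance("bank", ["money", "river"])
```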

Specifically, the manner of building the sample database is not limited. For example, forbidden vocabularies may be extracted by machine learning methods from large volumes of data such as social media, news reports, and user comments, or obtained in other ways. A person skilled in the art may select a suitable method for extracting forbidden vocabularies according to the specific application scenario; this is not described in detail here.

Specifically, if the association relationship between the text sentence and the sample sentences is judged to be a strong association relationship, the invention analyzes the semantic association degree between the feature vocabulary and the rest of the sentence and desensitizes the text sentence when the semantic association degree is smaller than a preset standard. A strong association relationship indicates that the meanings expressed by the text sentence and the sample sentences are highly similar. In practice, if the semantic association degree between the feature vocabulary and the rest of the sentence is small, that is, the collocation is unreasonable, the feature vocabulary is taken to be a forbidden vocabulary, and the text sentence is therefore desensitized.

Specifically, if the association relationship between the text sentence and the sample sentences is judged to be a weak association relationship, the invention calculates an association characterization value between the text sentence and each extracted sample sentence, judges whether the feature vocabulary is a forbidden vocabulary based on the maximum of these values, and desensitizes the text sentence when the feature vocabulary is judged to be a forbidden vocabulary. A weak association relationship indicates that the meanings expressed by the text sentence and the sample sentences have low similarity, so it cannot yet be determined whether the feature vocabulary is a forbidden vocabulary and further judgment is needed. The association characterization value indicates the degree of association between the non-feature words of the text sentence and a sample sentence; if the maximum of the association characterization values is greater than or equal to the preset association characterization comparison threshold, the feature vocabulary is a forbidden vocabulary. Through this process, the text sentence is judged further in cases where it cannot be determined directly whether the feature vocabulary is a forbidden vocabulary, which improves the desensitization effect of the gateway on data.

Specifically, in the present embodiment, the preset standard is selected from the interval [0.5, 0.8].

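Specifically, as shown in FIG. 2, in the step S1, the process of determining whether a feature vocabulary exists in the text sentence based on the comparison result between the vocabulary in the text sentence and the forbidden vocabularies stored in the sample database includes: if a forbidden vocabulary in the sample database has the same pinyin characteristics as a vocabulary in the text sentence, determining that a feature vocabulary exists in the text sentence.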

Specifically, the invention compares the vocabulary in each text sentence with the forbidden vocabularies stored in the sample database to determine whether a feature vocabulary exists in the text sentence. In practice, the conversion of voice data into text data is affected by factors such as the tone of the speech, so the converted text sentences have the same pinyin characteristics as the voice data but may not reproduce the exact words spoken. The comparison therefore identifies the vocabulary that has the same pinyin characteristics as a forbidden vocabulary in the sample database, namely the feature vocabulary, so that text sentences containing a feature vocabulary can subsequently be given specific processing, which improves the desensitization effect of the gateway.

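Specifically, in the step S3, the process of calculating the structure fitting parameter based on the comparison result between the sentence structure of the text sentence in which the feature vocabulary exists and the sentence structures of the extracted sample sentences includes: determining the number of feature sample sentences among the extracted sample sentences, and taking the ratio of that number to the number of extracted sample sentences as the structure fitting parameter N, wherein a feature sample sentence is a sample sentence having the same sentence structure as the text sentence in which the feature vocabulary exists.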

Specifically, as shown in fig. 3, in the step S3, the process of determining the association relationship between the text sentence and the sample sentence based on the structure fitting parameter includes,

comparing the structural fitting parameter N with a preset fitting comparison threshold N0,

if N is more than or equal to N0, judging that the association relationship between the text sentence and the sample sentence is a strong association relationship;

if N is less than N0, judging that the association relationship between the text sentence and the sample sentence is a weak association relationship.

Specifically, in the present embodiment, the fitting comparison threshold N0 is selected from the interval [0.7, 0.9].

Specifically, the invention judges the association relationship between the text sentence and the sample sentences by calculating the structure fitting parameter, which characterizes how similar the sentence structure of the text sentence containing the feature vocabulary is to the sentence structures of the extracted sample sentences. In practice, a higher degree of sentence structure similarity between two sentences means that they have similar grammatical structures and that the meanings they express are more similar, that is, the association relationship between the text sentence and the sample sentences is stronger. The invention therefore classifies the association relationship according to the structure fitting parameter so that a desensitization strategy matched to the strength of the association relationship can be applied, which improves the desensitization effect of the gateway on data.

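Specifically, in the step S4, the process of determining the executed desensitization strategy based on the association relationship between the text sentence and the sample sentences includes: if the association relationship is a strong association relationship, analyzing the semantic association degree between the feature vocabulary and the rest of the sentence, and desensitizing the text sentence when the semantic association degree is smaller than the preset standard; if the association relationship is a weak association relationship, determining the remaining vocabularies in the text sentence other than the feature vocabulary, comparing them with the extracted sample sentences, calculating an association characterization value for each extracted sample sentence, judging whether the feature vocabulary is a forbidden vocabulary based on the maximum of the association characterization values, and desensitizing the text sentence when the feature vocabulary is a forbidden vocabulary.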

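Specifically, the process of calculating the association characterization value according to the semantic association degree between each non-feature word in the text sentence and the sample sentence includes: calculating the semantic association degree between each non-feature word in the text sentence and the sample sentence, and taking the average of these association degrees as the association characterization value.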

Specifically, the degree of association is based on

In detail, referring to fig. 4, in the step S4, the process of determining whether the feature vocabulary is the forbidden vocabulary based on the maximum value of each associated characterization value includes,

determining the maximum value Bm in each association characterization value, comparing the maximum value Bm with a preset association characterization comparison threshold Bm0,

if Bm is more than or equal to Bm0, judging that the characteristic words are forbidden words.

Specifically, in the present embodiment, the association characterization comparison threshold Bm0 is selected from the interval [0.6, 0.8].

Specifically, in the step S4, when the ratio of the number of text sentences to be desensitized to the number of text sentences in the text data exceeds a predetermined ratio, an alarm message is sent to indicate that the received voice data is abnormal.

Specifically, in the present embodiment, the predetermined ratio is selected from the interval [0.1, 0.2].

Specifically, in the step S1, the data amount of the voice data received by the gateway at a single time does not exceed a predetermined data amount threshold.

Specifically, in the present embodiment, the predetermined data amount threshold is selected from within the interval [50, 100] in kilobytes/second.

If the gateway-based data desensitization method of the present invention is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method of the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.

Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will be within the scope of the present invention.

Claims

1. A gateway-based data desensitization method, comprising:

2. The gateway-based data desensitizing method according to claim 1, wherein in said step S1, the process of determining whether feature words exist in a text sentence based on the comparison result of words in said text sentence and a plurality of forbidden words stored in a sample database comprises,

3. The gateway-based data desensitizing method according to claim 1, wherein in said step S3, the process of calculating a structure fitting parameter based on the comparison result of the sentence structure of the text sentence in which the feature vocabulary exists and the sentence structure of the extracted several sample sentences comprises,

4. The gateway-based data desensitizing method according to claim 1, wherein in said step S3, the process of determining the association relationship of said text sentence and sample sentence based on said structure fitting parameters comprises,

5. The gateway-based data desensitizing method according to claim 1, wherein in said step S4, the process of determining the executed desensitizing policy based on the association relationship of said text sentence and sample sentence comprises,

6. The gateway-based data desensitizing method according to claim 1, wherein in said step S4, said process of desensitizing said text sentence comprises deleting voice data corresponding to said text sentence.

7. The gateway-based data desensitization method according to claim 1, wherein in said step S4, the process of calculating an association characterization value according to the semantic association degree of each of said non-feature words in said text sentence with a sample sentence comprises,

8. The gateway-based data desensitization method according to claim 1, wherein in said step S4, said process of determining whether said feature vocabulary is forbidden vocabulary based on a maximum value of each of said associated token values comprises,

9. The gateway-based data desensitizing method according to claim 1, further comprising, in said step S4, sending out an alarm message indicating that the received voice data is abnormal when the ratio of the number of text sentences that need to be desensitized to the number of text sentences in the text data exceeds a predetermined ratio.

10. The gateway-based data desensitization method according to claim 1, wherein in said step S1, the data amount of voice data received by said gateway at a single time does not exceed a predetermined data amount threshold.