CN117496977A - Gateway-based data desensitization method - Google Patents


Info

Publication number
CN117496977A
Authority
CN
China
Prior art keywords
sentence
text
sample
sentences
association
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311444921.2A
Other languages
Chinese (zh)
Other versions
CN117496977B (en)
Inventor
谢雨航
刘明礼
庄恩贵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingan Yun Xin Technology Co ltd
Original Assignee
Beijing Jingan Yun Xin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingan Yun Xin Technology Co ltd
Priority to CN202311444921.2A
Publication of CN117496977A
Application granted
Publication of CN117496977B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3329: Natural language query formulation or dialogue systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/1822: Parsing for meaning understanding
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00: Data switching networks
    • H04L12/66: Arrangements for connecting between networks having differing types of switching systems, e.g. gateways
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Mathematical Physics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the field of data desensitization, and in particular to a gateway-based data desensitization method. Voice data received by a gateway are converted into text data and divided into a plurality of text sentences. The vocabulary in each text sentence is compared with a plurality of forbidden vocabularies stored in a sample database to determine whether a feature vocabulary exists in the text sentence. A plurality of sample sentences containing the sample associated vocabulary are extracted from the sample database. The sentence structure of the text sentence containing the feature vocabulary is analyzed and compared with the sentence structures of the extracted sample sentences, and a structure fitting parameter is calculated to judge the association relationship between the text sentence and the sample sentences. A corresponding desensitization strategy is then executed based on that association relationship. The method thereby takes into account the poor desensitization effect that arises when the actual pronunciation differs in tone from a forbidden vocabulary, and adaptively adjusts the desensitization method to improve the desensitization effect of the gateway.

Description

Gateway-based data desensitization method
Technical Field
The invention relates to the field of data desensitization, in particular to a gateway-based data desensitization method.
Background
With the continuous development of information technology and the wide application of intelligent devices, sensitive information is increasingly easy to acquire and abuse. Technologies that desensitize data at a gateway have emerged in response, protecting private data and avoiding data leakage.
Chinese patent publication No. CN116760588A relates to the field of data desensitization and discloses a data desensitization system and a desensitization method. The data desensitization system comprises a gateway, at least one business micro-service and an authentication center micro-service, wherein the business micro-service and the authentication center micro-service are each in communication connection with the gateway. The gateway is configured to receive a desensitization processing request from a user, query the authentication center micro-service for a desensitization rule according to the request, and judge whether such a rule exists; if so, the gateway executes the desensitization rule and responds to the client; if not, it responds to the client directly. By handling desensitization uniformly at the gateway, each business micro-service is spared the repeated development of desensitization logic, and there is no need to develop two separate sets of logic for calls between micro-services and calls by users, which reduces workload and cost and eases maintenance.
However, the prior art has the following problem:
when voice is desensitized, if the actual pronunciation differs in tone from a forbidden vocabulary, misjudgment easily occurs when the voice is converted into text sentences for desensitization, and the desensitization effect is poor. Existing desensitization methods do not take this factor into account and do not adaptively adjust the desensitization method according to the characteristics of the converted text sentences in order to improve the desensitization effect.
Disclosure of Invention
Therefore, the invention provides a gateway-based data desensitization method to solve the following problems in the prior art: when voice is desensitized, if the actual pronunciation differs in tone from a forbidden vocabulary, misjudgment easily occurs when the voice is converted into text sentences for desensitization, and existing desensitization methods do not adaptively adjust for this factor.
To achieve the above object, the present invention provides a gateway-based data desensitization method, which includes:
Step S1, converting voice data received by a gateway into text data, dividing the text data into a plurality of text sentences, and comparing the vocabulary in each text sentence with a plurality of forbidden vocabularies stored in a sample database to determine whether a feature vocabulary exists in the text sentence;
Step S2, extracting from the sample database a plurality of sample sentences containing a sample associated vocabulary, wherein the sample associated vocabulary is a forbidden vocabulary with the same pinyin features as the feature vocabulary;
Step S3, analyzing the sentence structure of the text sentence in which the feature vocabulary exists, comparing it with the sentence structures of the extracted sample sentences, and calculating a structure fitting parameter to judge the association relationship between the text sentence and the sample sentences;
Step S4, executing a corresponding desensitization strategy based on the association relationship between the text sentence and the sample sentences, comprising:
analyzing the semantic association degree between the feature vocabulary and the rest of the sentence, and desensitizing the text sentence when the semantic association degree is smaller than a preset standard;
or determining the non-feature words in the text sentence, comparing them with the extracted sample sentences, calculating an association characterization value from the association degree between the non-feature words and the sample sentences so as to judge whether the feature vocabulary is a forbidden vocabulary, and desensitizing the text sentence when the feature vocabulary is a forbidden vocabulary.
Further, in step S1, the process of determining whether a feature vocabulary exists in the text sentence, based on the result of comparing the vocabulary in the text sentence with the forbidden vocabularies stored in the sample database, comprises:
if the sample database contains a forbidden vocabulary whose pinyin features are the same as those of a word in the text sentence, determining that a feature vocabulary exists in the text sentence.
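The pinyin comparison in step S1 can be sketched as follows. This is only an illustration under stated assumptions: "same pinyin features" is read here as identical toneless pinyin syllables, and `TOY_PINYIN`, `pinyin_features` and `find_feature_words` are hypothetical names with a hand-written lookup table standing in for a real pronunciation dictionary. The forbidden word used is an arbitrary placeholder.

```python
# Hypothetical toy table mapping characters to toneless pinyin syllables.
# A real gateway would rely on a full pronunciation dictionary instead.
TOY_PINYIN = {
    "密": "mi", "蜜": "mi",
    "码": "ma", "马": "ma",
    "天": "tian", "气": "qi",
}


def pinyin_features(word):
    """Return the toneless pinyin syllables of a word as a tuple.

    Characters missing from the toy table are kept as-is, so they only
    match themselves.
    """
    return tuple(TOY_PINYIN.get(ch, ch) for ch in word)


def find_feature_words(sentence_words, forbidden_words):
    """Map each word that shares pinyin features with a forbidden word
    to the list of forbidden words it matches."""
    index = {}
    for forbidden in forbidden_words:
        index.setdefault(pinyin_features(forbidden), []).append(forbidden)

    matches = {}
    for word in sentence_words:
        hits = index.get(pinyin_features(word))
        if hits:
            matches[word] = hits
    return matches


# Placeholder data: "密码" plays the role of a stored forbidden vocabulary.
result = find_feature_words(["蜜马", "天气"], ["密码"])
```

In this sketch the keys of the returned mapping play the role of the feature vocabulary, and the matched forbidden words play the role of the sample associated vocabulary used in step S2.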
Further, in step S3, the process of calculating the structure fitting parameter, based on the result of comparing the sentence structure of the text sentence in which the feature vocabulary exists with the sentence structures of the extracted sample sentences, comprises:
determining the number of feature sample sentences among the extracted sample sentences, and taking the ratio of the number of feature sample sentences to the number of extracted sample sentences as the structure fitting parameter, wherein a feature sample sentence is a sample sentence with the same sentence structure as the text sentence in which the feature vocabulary exists.
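The ratio described above can be sketched in a few lines. The representation of a sentence structure as a tuple of labels is an assumption made for illustration (the text leaves the parser open), as are the function name and the choice to return 0.0 when no sample sentences were extracted.

```python
def structure_fitting_parameter(text_structure, sample_structures):
    """Ratio of sample sentences whose structure equals the text sentence's.

    Structures are compared for exact equality; here each structure is a
    tuple of labels, which is an illustrative assumption.
    """
    if not sample_structures:
        # Assumption: with no extracted samples there is nothing to fit.
        return 0.0
    feature_samples = sum(1 for s in sample_structures if s == text_structure)
    return feature_samples / len(sample_structures)


text_structure = ("subject", "verb", "object")
sample_structures = [
    ("subject", "verb", "object"),
    ("subject", "verb"),
    ("subject", "verb", "object"),
    ("verb", "object"),
]
n = structure_fitting_parameter(text_structure, sample_structures)  # 2 of 4 match
```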
Further, in step S3, the process of determining the association relationship between the text sentence and the sample sentences based on the structure fitting parameter comprises:
comparing the structure fitting parameter with a preset fitting comparison threshold;
if the structure fitting parameter is greater than or equal to the fitting comparison threshold, judging that the association relationship between the text sentence and the sample sentences is a strong association relationship;
if the structure fitting parameter is smaller than the fitting comparison threshold, judging that the association relationship between the text sentence and the sample sentences is a weak association relationship.
Further, in step S4, the process of determining the desensitization strategy to execute, based on the association relationship between the text sentence and the sample sentences, comprises:
if the association relationship is a strong association relationship, analyzing the semantic association degree between the feature vocabulary and the rest of the sentence, and desensitizing the text sentence when the semantic association degree is smaller than a preset standard;
if the association relationship is a weak association relationship, determining the remaining words in the text sentence other than the feature vocabulary, comparing them with each extracted sample sentence, calculating an association characterization value for each extracted sample sentence, judging whether the feature vocabulary is a forbidden vocabulary based on the maximum of the association characterization values, and desensitizing the text sentence when the feature vocabulary is a forbidden vocabulary.
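The two branches of step S4 can be summarized as a small decision function. This is a sketch rather than the patented implementation: the function name, the string labels "strong" and "weak", and the default `characterization_threshold` of 0.5 are assumptions, while the default `preset_standard` of 0.6 is simply one value from the interval [0.5, 0.8] that the embodiment mentions.

```python
def should_desensitize(relation, semantic_degree, characterization_values,
                       preset_standard=0.6, characterization_threshold=0.5):
    """Decide whether a text sentence containing a feature vocabulary is desensitized.

    relation: "strong" or "weak", the outcome of step S3.
    semantic_degree: association between the feature vocabulary and the rest
        of the sentence (used only in the strong branch).
    characterization_values: one value per extracted sample sentence
        (used only in the weak branch).
    """
    if relation == "strong":
        # A poor fit between the word and its context suggests a forbidden word.
        return semantic_degree < preset_standard
    if relation == "weak":
        # The feature vocabulary is judged forbidden when the best-matching
        # sample sentence reaches the comparison threshold.
        return (bool(characterization_values)
                and max(characterization_values) >= characterization_threshold)
    raise ValueError(f"unknown association relationship: {relation!r}")
```

Keeping the two branches in one function makes it explicit that the semantic check and the characterization check are alternatives selected by the association relationship, not stages that both run.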
Further, in the step S4, the process of desensitizing the text sentence includes deleting the voice data corresponding to the text sentence.
Further, the process of calculating the association characterization value from the semantic association degree between each non-feature word in the text sentence and a sample sentence comprises:
calculating the semantic association degree between each non-feature word in the text sentence and the sample sentence, and taking the average of these association degrees as the association characterization value.
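As a minimal sketch of this averaging step, the function below takes the per-word association measure as a parameter, since the text does not fix one. The names `association_characterization` and `toy_degree` are hypothetical, and `toy_degree` (1.0 if the word occurs in the sample sentence, otherwise 0.0) is only a stand-in for a real semantic measure.

```python
from statistics import mean


def association_characterization(non_feature_words, sample_sentence, degree):
    """Average the association degree of each non-feature word with one sample sentence.

    degree: callable (word, sample_sentence) -> float, supplied by the caller.
    Returns 0.0 when there are no non-feature words (an assumption).
    """
    degrees = [degree(word, sample_sentence) for word in non_feature_words]
    return mean(degrees) if degrees else 0.0


def toy_degree(word, sample_sentence):
    """Stand-in measure: 1.0 if the word appears in the sample sentence."""
    return 1.0 if word in sample_sentence else 0.0


value = association_characterization(
    ["a", "b", "c", "d"],   # non-feature words of the text sentence
    ["a", "c", "x"],        # one extracted sample sentence, already tokenized
    toy_degree,
)  # two of the four words match
```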
Further, in step S4, the process of determining whether the feature vocabulary is a forbidden vocabulary based on the maximum of the association characterization values comprises:
determining the maximum of the association characterization values and comparing it with a preset association characterization comparison threshold;
if the maximum is greater than or equal to the association characterization comparison threshold, judging that the feature vocabulary is a forbidden vocabulary.
Further, in the step S4, an alarm message is sent out when the ratio of the number of text sentences to be desensitized to the number of text sentences in the text data exceeds a predetermined ratio, so as to alarm that the received voice data is abnormal.
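The alarm condition amounts to a ratio check, sketched below. The function name and the default `ratio_threshold` of 0.3 are assumptions; the text only says the ratio is compared with a predetermined value.

```python
def should_alarm(desensitized_count, total_count, ratio_threshold=0.3):
    """Return True when the share of desensitized sentences exceeds the threshold.

    ratio_threshold is a hypothetical default; the source gives no value.
    """
    if total_count == 0:
        # No text sentences were produced, so there is nothing to report.
        return False
    return desensitized_count / total_count > ratio_threshold
```

For example, with 4 of 10 text sentences desensitized the ratio is 0.4, which exceeds the assumed threshold of 0.3 and would raise the alarm.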
Further, in the step S1, the data amount of the voice data received by the gateway at a time does not exceed a predetermined data amount threshold.
Compared with the prior art, the invention has the following advantages. Voice data received by a gateway are converted into text data and divided into a plurality of text sentences; the vocabulary in each text sentence is compared with a plurality of forbidden vocabularies stored in the sample database to determine whether a feature vocabulary exists in the text sentence; a plurality of sample sentences containing the sample associated vocabulary are extracted from the sample database; the sentence structure of the text sentence containing the feature vocabulary is analyzed and compared with the sentence structures of the extracted sample sentences; a structure fitting parameter is calculated to judge the association relationship between the text sentence and the sample sentences; and a corresponding desensitization strategy is executed based on that association relationship. This process takes into account the poor desensitization effect that arises when the actual pronunciation differs in tone from a forbidden vocabulary, and improves the desensitization effect of the gateway on the data.
In particular, the invention compares the vocabulary in a text sentence with a plurality of forbidden vocabularies stored in the sample database to determine whether a feature vocabulary exists in the text sentence. In practice, the conversion of voice data into text data is affected by factors such as the tone of the voice data, so the converted text sentences have the same pinyin features as the voice data. The comparison therefore identifies the words whose pinyin features are the same as those of forbidden vocabularies in the sample database, namely the feature vocabulary, so that text sentences containing a feature vocabulary can subsequently receive specific processing, improving the desensitization effect of the gateway.
In particular, the invention judges the association relationship between the text sentence and the sample sentences by calculating the structure fitting parameter, which characterizes how similar the sentence structure of the text sentence containing the feature vocabulary is to the sentence structures of the extracted sample sentences. In practice, a high degree of structural similarity between two sentences means that they have similar grammatical structures, so the meanings they express are also more similar; that is, the association relationship between the text sentence and the sample sentences is stronger. The invention therefore classifies the association relationship according to the structure fitting parameter and applies a desensitization strategy matched to its strength, improving the desensitization effect of the gateway on the data.
In particular, if the association relationship between the text sentence and the sample sentences is judged to be a strong association relationship, the semantic association degree between the feature vocabulary and the rest of the sentence is analyzed, and the text sentence is desensitized when the semantic association degree is smaller than a preset standard. A strong association relationship indicates that the meanings expressed by the text sentence and the sample sentences are highly similar. In practice, a small semantic association degree between the feature vocabulary and the rest of the sentence means that the collocation is unreasonable, which indicates that the feature vocabulary is a forbidden vocabulary; the text sentence is therefore desensitized when the semantic association degree is smaller than the preset standard.
In particular, if the association relationship between the text sentence and the sample sentences is judged to be a weak association relationship, an association characterization value is calculated between the text sentence and each extracted sample sentence, whether the feature vocabulary is a forbidden vocabulary is judged based on the maximum of these values, and the text sentence is desensitized when the feature vocabulary is judged to be a forbidden vocabulary. A weak association relationship indicates that the meanings expressed by the text sentence and the sample sentences have low similarity, so it cannot yet be determined whether the feature vocabulary is a forbidden vocabulary, and the text sentence needs further judgment. The association characterization value indicates the degree of association between the non-feature words of the text sentence and a sample sentence; if the maximum of the association characterization values is greater than or equal to the preset association characterization comparison threshold, the feature vocabulary is a forbidden vocabulary. Through this process the text sentence is judged further in cases where the status of the feature vocabulary would otherwise be undetermined, improving the desensitization effect of the gateway on the data.
Drawings
FIG. 1 is a schematic diagram of the steps of a gateway-based data desensitization method according to an embodiment of the invention;
FIG. 2 is a flowchart for determining whether a feature vocabulary exists in a text sentence according to an embodiment of the invention;
FIG. 3 is a flowchart for determining the association relationship between a text sentence and the sample sentences according to an embodiment of the invention;
FIG. 4 is a flowchart for determining whether a feature vocabulary is a forbidden vocabulary according to an embodiment of the invention.
Detailed Description
In order that the objects and advantages of the invention will become more apparent, the invention will be further described with reference to the following examples; it should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are merely for explaining the technical principles of the present invention, and are not intended to limit the scope of the present invention.
Referring to FIG. 1 to FIG. 4, which are respectively a schematic diagram of the steps of the gateway-based data desensitization method, a flowchart for determining whether a feature vocabulary exists in a text sentence, a flowchart for determining the association relationship between a text sentence and the sample sentences, and a flowchart for determining whether a feature vocabulary is a forbidden vocabulary, all according to an embodiment of the invention, the gateway-based data desensitization method of the invention includes:
Step S1, converting voice data received by a gateway into text data, dividing the text data into a plurality of text sentences, and comparing the vocabulary in each text sentence with a plurality of forbidden vocabularies stored in a sample database to determine whether a feature vocabulary exists in the text sentence;
Step S2, extracting from the sample database a plurality of sample sentences containing a sample associated vocabulary, wherein the sample associated vocabulary is a forbidden vocabulary with the same pinyin features as the feature vocabulary;
Step S3, analyzing the sentence structure of the text sentence in which the feature vocabulary exists, comparing it with the sentence structures of the extracted sample sentences, and calculating a structure fitting parameter to judge the association relationship between the text sentence and the sample sentences;
Step S4, executing a corresponding desensitization strategy based on the association relationship between the text sentence and the sample sentences, comprising:
analyzing the semantic association degree between the feature vocabulary and the rest of the sentence, and desensitizing the text sentence when the semantic association degree is smaller than a preset standard;
or determining the non-feature words in the text sentence, comparing them with the extracted sample sentences, calculating an association characterization value from the association degree between the non-feature words and the sample sentences so as to judge whether the feature vocabulary is a forbidden vocabulary, and desensitizing the text sentence when the feature vocabulary is a forbidden vocabulary.
Specifically, the way in which the voice data are converted into text data is not limited. For example, a language processing algorithm that can convert voice data into text data and recognize where each sentence ends may be preset in the gateway, so that the voice data received by the gateway are converted into text data and appropriate punctuation marks are inserted automatically at the end of each text sentence. This is prior art and is not described further here.
Specifically, the way in which the text data are divided into text sentences is not limited. For example, a sentence segmentation algorithm may be preset in the gateway to detect the end of each text sentence from the punctuation marks and split the text there; other approaches may also be used, provided they divide the text data into text sentences. This is not described further here.
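A punctuation-based splitter of the kind described can be sketched with the standard `re` module. The set of sentence-ending marks (Chinese and Western full stops, question marks and exclamation marks, with the Western period left out to avoid splitting decimals) and the function name are assumptions made for this example.

```python
import re

# Assumed sentence-ending punctuation; the Western period is deliberately
# excluded so that decimals such as "3.5" are not split.
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]?")


def split_sentences(text):
    """Split text into sentences, keeping each sentence's closing punctuation."""
    return [part.strip() for part in _SENTENCE.findall(text) if part.strip()]


sentences = split_sentences("今天天气很好。你好吗？再见")
```

A trailing fragment without closing punctuation is kept as its own sentence, which matters when the speech-to-text stage fails to punctuate the last utterance.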
Specifically, the way in which the sentence structures of the text sentence and the sample sentences are parsed is not limited. For example, a grammar rule that has been trained in advance to identify sentence structure may be imported into the gateway, where the samples used to train it may be text data with a number of different sentence structures; any approach that can identify the sentence structure of the text sentence and of the sample sentences may be used. This is prior art and is not described further here.
Specifically, the way in which the semantic association degree between the feature vocabulary and the rest of the sentence is computed is not limited. For example, a pre-trained word vector model may be imported into the gateway, the feature vocabulary and the rest of the sentence may each be represented as a vector by that model, and the Euclidean distance between the two vectors may be used as the semantic association degree. Other approaches are also possible: a Word2Vec word vector model represents words as dense vectors, the semantic association degree between words can be measured by the distance between their word vectors, and the semantic association degree between a word and a sentence can be taken as the average of the semantic association degrees between that word and each word in the sentence. This is prior art and is not described further here.
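One way to turn the word-vector description into a number is sketched below, under several explicit assumptions. The rest of the sentence is represented by the mean of its word vectors, the Euclidean distance d between that mean and the feature vocabulary's vector is mapped to 1 / (1 + d) so that a larger value means a closer association (this mapping is this example's choice, not something the text specifies), and `TOY_VECTORS` is a hypothetical two-dimensional embedding table standing in for a pre-trained model.

```python
import math

# Hypothetical 2-D embeddings; a real system would load a pre-trained model.
TOY_VECTORS = {
    "x": (0.0, 0.0),
    "y": (6.0, 0.0),
    "z": (0.0, 8.0),
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return tuple(sum(component) / count for component in zip(*vectors))


def semantic_association(feature_word, rest_words, vectors=TOY_VECTORS):
    """Association between a word and the rest of its sentence, in (0, 1].

    Assumption: distance d is mapped to 1 / (1 + d), so identical vectors
    give 1.0 and distant vectors approach 0.
    """
    rest = mean_vector([vectors[w] for w in rest_words])
    distance = euclidean(vectors[feature_word], rest)
    return 1.0 / (1.0 + distance)


degree = semantic_association("x", ["y", "z"])  # mean of y and z is (3.0, 4.0)
```

With these toy vectors the distance is 5, so the association is 1/6, well below a preset standard chosen from [0.5, 0.8]; under the strong-association branch such a sentence would be desensitized.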
Specifically, the way in which the sample database is built is not limited. For example, forbidden vocabularies may be extracted by machine learning methods from large amounts of data such as social media, news reports and user comments, or in other ways; a person skilled in the art can choose a suitable extraction method for the specific application scenario. This is not described further here.
Specifically, if the association relationship between the text sentence and the sample sentences is judged to be a strong association relationship, the semantic association degree between the feature vocabulary and the rest of the sentence is analyzed, and the text sentence is desensitized when the semantic association degree is smaller than a preset standard. A strong association relationship indicates that the meanings expressed by the text sentence and the sample sentences are highly similar. In practice, a small semantic association degree between the feature vocabulary and the rest of the sentence means that the collocation is unreasonable, which indicates that the feature vocabulary is a forbidden vocabulary; the text sentence is therefore desensitized when the semantic association degree is smaller than the preset standard.
Specifically, if the association relationship between the text sentence and the sample sentences is judged to be a weak association relationship, an association characterization value is calculated between the text sentence and each extracted sample sentence, whether the feature vocabulary is a forbidden vocabulary is judged based on the maximum of these values, and the text sentence is desensitized when the feature vocabulary is judged to be a forbidden vocabulary. A weak association relationship indicates that the meanings expressed by the text sentence and the sample sentences have low similarity, so it cannot yet be determined whether the feature vocabulary is a forbidden vocabulary, and the text sentence needs further judgment. The association characterization value indicates the degree of association between the non-feature words of the text sentence and a sample sentence; if the maximum of the association characterization values is greater than or equal to the preset association characterization comparison threshold, the feature vocabulary is a forbidden vocabulary. Through this process the text sentence is judged further in cases where the status of the feature vocabulary would otherwise be undetermined, improving the desensitization effect of the gateway on the data.
Specifically, in the present embodiment, the preset standard is selected from the interval [0.5, 0.8].
Specifically, as shown in FIG. 2, in step S1, the process of determining whether a feature vocabulary exists in the text sentence, based on the result of comparing the vocabulary in the text sentence with the forbidden vocabularies stored in the sample database, comprises:
if the sample database contains a forbidden vocabulary whose pinyin features are the same as those of a word in the text sentence, determining that a feature vocabulary exists in the text sentence.
Specifically, the invention compares the vocabulary in a text sentence with a plurality of forbidden vocabularies stored in the sample database to determine whether a feature vocabulary exists in the text sentence. In practice, the conversion of voice data into text data is affected by factors such as the tone of the voice data, so the converted text sentences have the same pinyin features as the voice data. The comparison therefore identifies the words whose pinyin features are the same as those of forbidden vocabularies in the sample database, namely the feature vocabulary, so that text sentences containing a feature vocabulary can subsequently receive specific processing, improving the desensitization effect of the gateway.
Specifically, in the step S3, the process of calculating the structure fitting parameter based on the result of comparing the sentence structure of the text sentence in which the feature vocabulary exists with the sentence structures of the extracted sample sentences includes:
determining the number of feature sample sentences among the extracted sample sentences, and determining the ratio of the number of feature sample sentences to the number of extracted sample sentences as the structure fitting parameter, wherein a feature sample sentence is a sample sentence with the same sentence structure as the text sentence in which the feature vocabulary exists.
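The ratio described above can be sketched as follows. The representation of a sentence structure as a tuple of part-of-speech tags, the function name, and the return value of 0.0 when no sample sentences were extracted are assumptions made for illustration; the description does not specify how sentence structures are encoded.

```python
def structure_fitting_parameter(text_structure, sample_structures):
    """Ratio of sample sentences whose structure equals the text sentence's
    structure to the total number of extracted sample sentences."""
    if not sample_structures:
        # Assumption: with no extracted samples there is nothing to fit.
        return 0.0
    matching = sum(1 for s in sample_structures if s == text_structure)
    return matching / len(sample_structures)
```

For example, if two of four extracted sample sentences share the structure of the text sentence, the structure fitting parameter is 0.5.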
Specifically, as shown in fig. 3, in the step S3, the process of determining the association relationship between the text sentence and the sample sentences based on the structure fitting parameter includes:
comparing the structure fitting parameter N with a preset fitting comparison threshold N0;
if N ≥ N0, judging that the association relationship between the text sentence and the sample sentences is a strong association relationship;
if N < N0, judging that the association relationship between the text sentence and the sample sentences is a weak association relationship.
Specifically, in the present embodiment, the fitting comparison threshold N0 is selected from the interval [0.7, 0.9].
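The threshold comparison can be sketched as below. The function name, the string labels, and the default value 0.8 for N0 (one value inside the stated interval [0.7, 0.9]) are assumptions chosen for the example.

```python
def classify_association(n, n0=0.8):
    """Classify the association relationship from the structure fitting
    parameter n and the fitting comparison threshold n0."""
    # N >= N0 gives a strong association relationship, otherwise a weak one.
    return "strong" if n >= n0 else "weak"
```

Note that a value exactly equal to the threshold is classified as a strong association relationship, as in the comparison rule above.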
Specifically, in the invention, the association relationship between the text sentence and the sample sentences is judged by calculating the structure fitting parameter, which characterizes how similar the sentence structure of the text sentence in which the feature vocabulary exists is to the sentence structures of the extracted sample sentences. In practice, a higher similarity between the sentence structures of two sentences means that the two sentences have similar grammatical structures, so the meanings they express are more similar; that is, the association relationship between the text sentence and the sample sentences is stronger. The invention therefore classifies the association relationship between the text sentence and the sample sentences according to the structure fitting parameter, so that a desensitization strategy matched to the strength of the association relationship can be executed, which improves the data desensitization effect of the gateway.
Specifically, in the step S4, the process of determining the desensitization strategy to be executed based on the association relationship between the text sentence and the sample sentences includes:
if the association relationship between the text sentence and the sample sentences is a strong association relationship, analyzing the semantic association degree between the feature vocabulary and the rest of the text sentence, and desensitizing the text sentence when the semantic association degree is smaller than a preset standard;
if the association relationship between the text sentence and the sample sentences is a weak association relationship, determining the remaining vocabularies in the text sentence other than the feature vocabulary, comparing the remaining vocabularies with the extracted sample sentences, calculating the association characterization values corresponding to the extracted sample sentences, judging whether the feature vocabulary is a forbidden vocabulary based on the maximum of the association characterization values, and desensitizing the text sentence when the feature vocabulary is a forbidden vocabulary.
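The two branches of step S4 can be sketched as a single decision function. The scores are taken as precomputed inputs, because the description does not fix how they are obtained; the function name, the string labels, and the default values 0.6 and 0.7 (one value from each of the intervals stated in this embodiment) are assumptions for illustration.

```python
def should_desensitize(relation, semantic_degree, max_characterization,
                       preset_standard=0.6, bm0=0.7):
    """Decide whether a text sentence containing a feature word is desensitized.

    relation: "strong" or "weak" association relationship.
    semantic_degree: semantic association degree used in the strong branch.
    max_characterization: maximum association characterization value used
        in the weak branch.
    """
    if relation == "strong":
        # Strong association: desensitize when the semantic association
        # degree is smaller than the preset standard.
        return semantic_degree < preset_standard
    if relation == "weak":
        # Weak association: the feature word is judged forbidden when the
        # maximum characterization value reaches the comparison threshold.
        return max_characterization >= bm0
    raise ValueError("relation must be 'strong' or 'weak'")
```

Only the score belonging to the selected branch affects the result; the other argument is ignored.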
Specifically, the process of calculating the association characterization value according to the semantic association degree between each non-feature word in the text sentence and the sample sentence includes:
calculating the semantic association degree between each non-feature word in the text sentence and the sample sentence, and determining the average of these association degrees as the association characterization value.
Specifically, the degree of association is based on
Specifically, as shown in fig. 4, in the step S4, the process of determining whether the feature vocabulary is a forbidden vocabulary based on the maximum of the association characterization values includes:
determining the maximum value Bm among the association characterization values, and comparing the maximum value Bm with a preset association characterization comparison threshold Bm0;
if Bm ≥ Bm0, judging that the feature vocabulary is a forbidden vocabulary.
Specifically, in the present embodiment, the association characterization comparison threshold Bm0 is selected from the interval [0.6, 0.8].
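The averaging and maximum comparison can be sketched as follows. The semantic association degrees are passed in as plain numbers, since the way they are computed is not given here; the function names, the value 0.0 for an empty list of degrees, and the default Bm0 of 0.7 (a value inside [0.6, 0.8]) are assumptions for the example.

```python
from statistics import mean


def association_characterization(degrees):
    """Average of the semantic association degrees of the non-feature words
    of the text sentence against one sample sentence."""
    # Assumption: a sentence with no non-feature words contributes 0.0.
    return mean(degrees) if degrees else 0.0


def is_forbidden(degrees_per_sample, bm0=0.7):
    """Judge the feature word forbidden when the maximum association
    characterization value Bm satisfies Bm >= Bm0."""
    values = [association_characterization(d) for d in degrees_per_sample]
    if not values:
        return False
    return max(values) >= bm0
```

Here `degrees_per_sample` holds one list of degrees per extracted sample sentence, so the maximum is taken over the sample sentences and the average over the non-feature words.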
Specifically, in the step S4, when the ratio of the number of text sentences that need to be desensitized to the number of text sentences in the text data exceeds a predetermined ratio, an alarm message is sent to indicate that the received voice data is abnormal.
Specifically, in the present embodiment, the predetermined ratio is selected from the interval [0.1, 0.2].
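The alarm condition can be sketched as a simple ratio check. The function name, the default ratio of 0.15 (a value inside [0.1, 0.2]), and the handling of empty text data are assumptions made for illustration.

```python
def should_alarm(num_desensitized, num_sentences, predetermined_ratio=0.15):
    """Return True when the share of text sentences that need desensitization
    exceeds the predetermined ratio."""
    if num_sentences == 0:
        # Assumption: empty text data never triggers an alarm.
        return False
    return num_desensitized / num_sentences > predetermined_ratio
```

The comparison is strict, matching the wording "exceeds a predetermined ratio".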
Specifically, in the step S1, the data amount of the voice data received by the gateway at a single time does not exceed a predetermined data amount threshold.
Specifically, in the present embodiment, the predetermined data amount threshold is selected from within the interval [50, 100] in kilobytes/second.
If the gateway-based data desensitization method of the present invention is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and comprises instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method of the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will be within the scope of the present invention.

Claims (10)

1. A gateway-based data desensitization method, comprising:
step S1, voice data received by a gateway are converted into text data, the text data are divided into a plurality of text sentences, and vocabularies in the text sentences are compared with a plurality of forbidden vocabularies stored in a sample database to determine whether characteristic vocabularies exist in the text sentences;
step S2, extracting a plurality of sample sentences containing a sample associated vocabulary from the sample database, wherein the sample associated vocabulary is a forbidden vocabulary with the same pinyin features as the feature vocabulary;
step S3, analyzing the sentence structure of the text sentence in which the feature vocabulary exists, and comparing it with the sentence structures of the extracted sample sentences to calculate a structure fitting parameter, so as to judge the association relationship between the text sentence and the sample sentences;
step S4, based on the association relation between the text sentence and the sample sentence, executing a corresponding desensitization strategy, comprising,
analyzing the semantic association degree between the feature vocabulary and the rest of the text sentence, and desensitizing the text sentence when the semantic association degree is smaller than a preset standard;
or determining the non-feature words in the text sentence, comparing the non-feature words with the extracted sample sentences, calculating an association characterization value according to the association degree between the non-feature words in the text sentence and the sample sentences so as to judge whether the feature vocabulary is a forbidden vocabulary, and desensitizing the text sentence when the feature vocabulary is a forbidden vocabulary.
2. The gateway-based data desensitization method according to claim 1, wherein in said step S1, the process of determining whether the feature vocabulary exists in the text sentence based on the result of comparing the vocabulary in said text sentence with the forbidden vocabularies stored in the sample database comprises,
if the sample database contains a forbidden vocabulary whose pinyin features are the same as those of a vocabulary in the text sentence, determining that the feature vocabulary exists in the text sentence.
3. The gateway-based data desensitization method according to claim 1, wherein in said step S3, the process of calculating the structure fitting parameter based on the result of comparing the sentence structure of the text sentence in which the feature vocabulary exists with the sentence structures of the extracted sample sentences comprises,
determining the number of feature sample sentences among the extracted sample sentences, and determining the ratio of the number of feature sample sentences to the number of extracted sample sentences as the structure fitting parameter, wherein a feature sample sentence is a sample sentence with the same sentence structure as the text sentence in which the feature vocabulary exists.
4. The gateway-based data desensitization method according to claim 1, wherein in said step S3, the process of determining the association relationship between said text sentence and the sample sentences based on said structure fitting parameter comprises,
comparing the structure fitting parameter with a preset fitting comparison threshold,
if the structure fitting parameter is greater than or equal to the fitting comparison threshold, judging that the association relationship between the text sentence and the sample sentences is a strong association relationship;
and if the structure fitting parameter is smaller than the fitting comparison threshold, judging that the association relationship between the text sentence and the sample sentences is a weak association relationship.
5. The gateway-based data desensitization method according to claim 1, wherein in said step S4, the process of determining the desensitization strategy to be executed based on the association relationship between said text sentence and the sample sentences comprises,
if the association relationship between the text sentence and the sample sentences is a strong association relationship, analyzing the semantic association degree between the feature vocabulary and the rest of the text sentence, and desensitizing the text sentence when the semantic association degree is smaller than a preset standard;
if the association relationship between the text sentence and the sample sentences is a weak association relationship, determining the remaining vocabularies in the text sentence other than the feature vocabulary, comparing the remaining vocabularies with the extracted sample sentences, calculating the association characterization values corresponding to the extracted sample sentences, judging whether the feature vocabulary is a forbidden vocabulary based on the maximum of the association characterization values, and desensitizing the text sentence when the feature vocabulary is a forbidden vocabulary.
6. The gateway-based data desensitization method according to claim 1, wherein in said step S4, the process of desensitizing said text sentence comprises deleting the voice data corresponding to said text sentence.
7. The gateway-based data desensitization method according to claim 1, wherein in said step S4, the process of calculating the association characterization value according to the semantic association degree between each of said non-feature words in said text sentence and the sample sentence comprises,
calculating the semantic association degree between each non-feature word in the text sentence and the sample sentence, and determining the average of these association degrees as the association characterization value.
8. The gateway-based data desensitization method according to claim 1, wherein in said step S4, the process of determining whether said feature vocabulary is a forbidden vocabulary based on the maximum of said association characterization values comprises,
determining the maximum value among the association characterization values, and comparing the maximum value with a preset association characterization comparison threshold,
and if the maximum value is greater than or equal to the association characterization comparison threshold, judging that the feature vocabulary is a forbidden vocabulary.
9. The gateway-based data desensitization method according to claim 1, further comprising, in said step S4, sending an alarm message to indicate that the received voice data is abnormal when the ratio of the number of text sentences that need to be desensitized to the number of text sentences in the text data exceeds a predetermined ratio.
10. The gateway-based data desensitization method according to claim 1, wherein in said step S1, the data amount of voice data received by said gateway at a single time does not exceed a predetermined data amount threshold.
CN202311444921.2A 2023-11-02 2023-11-02 Gateway-based data desensitization method Active CN117496977B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311444921.2A CN117496977B (en) 2023-11-02 2023-11-02 Gateway-based data desensitization method

Publications (2)

Publication Number Publication Date
CN117496977A true CN117496977A (en) 2024-02-02
CN117496977B CN117496977B (en) 2024-05-03

Family

ID=89684137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311444921.2A Active CN117496977B (en) 2023-11-02 2023-11-02 Gateway-based data desensitization method

Country Status (1)

Country Link
CN (1) CN117496977B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070010993A1 (en) * 2004-12-10 2007-01-11 Bachenko Joan C Method and system for the automatic recognition of deceptive language
CN109448730A (en) * 2018-11-27 2019-03-08 广州广电运通金融电子股份有限公司 A kind of automatic speech quality detecting method, system, device and storage medium
CN111179935A (en) * 2018-11-12 2020-05-19 中移(杭州)信息技术有限公司 Voice quality inspection method and device
CN111681672A (en) * 2020-05-26 2020-09-18 深圳壹账通智能科技有限公司 Voice data detection method and device, computer equipment and storage medium
CN116955610A (en) * 2023-04-27 2023-10-27 腾讯科技(深圳)有限公司 Text data processing method and device and storage medium
CN117423339A (en) * 2023-10-12 2024-01-19 广东保伦电子股份有限公司 Broadcasting terminal based on multipath sound source input

Also Published As

Publication number Publication date
CN117496977B (en) 2024-05-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant