CN110737770A - Text data sensitivity identification method and device, electronic equipment and storage medium - Google Patents

Publication number
CN110737770A
Authority
CN
China
Prior art keywords
text data
processed
sensitive
data
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810719136.6A
Other languages
Chinese (zh)
Other versions
CN110737770B (en)
Inventor
张梦
雍倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810719136.6A
Publication of CN110737770A
Application granted
Publication of CN110737770B
Legal status: Active
Anticipated expiration

Abstract

The application provides a text data sensitivity identification method and device, electronic equipment, and a storage medium, and belongs to the technical field of the Internet. The method comprises: performing topic identification on text data to be processed to determine a first topic type corresponding to the text data to be processed; performing data annotation on the text data to be processed according to a feature set corresponding to the first topic type to determine an annotation label corresponding to the text data to be processed; and training a recognition model by using a training sample set formed by the text data to be processed and the corresponding annotation labels to obtain a sensitivity recognition model.

Description

Text data sensitivity identification method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of the Internet, and in particular to a text data sensitivity identification method and device, electronic equipment, and a storage medium.
Background
Since the beginning of the 21st century, science and technology have advanced rapidly. In an information society where the Internet industry is developing at high speed, the Internet has become an important way for people to acquire knowledge and information. For example, people can look up information and browse news on the Internet, and government agencies and official media can obtain leads and learn about public concerns from information on the Internet. However, given the massive amount of information on the Internet, obtaining valuable and socially significant leads is like looking for a needle in a haystack. Meanwhile, as the Internet's range of application expands, the quality of the information on it has become uneven, and harmful information has appeared that is detrimental to the growth of minors or endangers social stability.
Therefore, screening information on the Internet as required and identifying sensitive data is of great practical significance, both for improving the efficiency of acquiring information from the Internet and for avoiding the negative effects of harmful information on society. In existing text data sensitivity identification technology, the sensitivity of a target text is determined mainly either manually, or by manually building a sensitive word list and then having a machine perform simple matching queries on the target text against that list. However, a manually built sensitive word list does not consider sensitive words together with their context, which easily leads to inaccurate identification results, low efficiency, and wasted human resources.
Disclosure of Invention
The text data sensitivity identification method and device, electronic equipment, and storage medium provided by the application are intended to solve the problems of low accuracy, low efficiency, and wasted human resources in the related art, in which the sensitivity of text data is identified by manually building a sensitive word list.
The text data sensitivity identification method provided by an embodiment of one aspect of the application includes: performing topic identification on text data to be processed to determine a first topic type corresponding to the text data to be processed; performing data annotation on the text data to be processed according to a feature set corresponding to the first topic type to determine an annotation label corresponding to the text data to be processed; and training a recognition model by using a training sample set formed by the text data to be processed and the corresponding annotation labels to obtain a sensitivity recognition model.
The text data sensitivity identification device provided by an embodiment of another aspect of the application comprises: a first determining module, configured to perform topic identification on text data to be processed to determine a first topic type corresponding to the text data to be processed; a second determining module, configured to perform data annotation on the text data to be processed according to a feature set corresponding to the first topic type to determine an annotation label corresponding to the text data to be processed; and a training module, configured to train a recognition model by using a training sample set formed by the text data to be processed and the corresponding annotation labels to obtain a sensitivity recognition model.
An embodiment of a further aspect of the application provides electronic equipment in which a processor, when executing a program, implements the text data sensitivity identification method described above.
An embodiment of a further aspect of the application provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the text data sensitivity identification method described above.
An embodiment of yet another aspect of the application provides a computer program which, when executed by a processor, implements the text data sensitivity identification method of the embodiments of the application.
With the text data sensitivity identification method and device, electronic equipment, computer-readable storage medium, and computer program provided by the embodiments of the application, topic identification can be performed on the text data to be processed to determine the corresponding first topic type; data annotation is performed on the text data to be processed according to the feature set corresponding to the first topic type to determine the corresponding annotation label; and a training sample set formed by the text data to be processed and the corresponding annotation labels is then used to train a recognition model and obtain the sensitivity recognition model.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
Fig. 1 is a schematic flowchart of a text data sensitivity identification method provided in an embodiment of the present application;
Fig. 2 is a schematic flowchart of another text data sensitivity identification method provided in an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a text data sensitivity identification device provided in an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the like or similar elements throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
The embodiment of the application provides a text data sensitivity identification method that addresses the low accuracy, low efficiency, and wasted human resources of the existing approach of identifying the sensitivity of text data by manually building a sensitive word list.
According to the text data sensitivity identification method provided by the embodiment of the application, topic identification can be performed on the text data to be processed to determine the corresponding first topic type; data annotation is performed on the text data to be processed according to the feature set corresponding to the first topic type to determine the corresponding annotation label; and a training sample set formed by the text data to be processed and the corresponding annotation labels is then used to train a recognition model and obtain the sensitivity recognition model.
The text data sensitivity recognition method, apparatus, electronic device, storage medium, and computer program provided in the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a text data sensitivity identification method according to an embodiment of the present application.
As shown in fig. 1, the text data sensitivity recognition method includes the following steps:
Step 101, performing topic identification on the text data to be processed to determine a first topic type corresponding to the text data to be processed.
For example, the first topic type may be a political type, an economic type, a social type, and the like.
It is to be understood that the same word may have different meanings, and therefore different sensitivities, in different contexts. To avoid inaccurate annotation of the text data to be processed for this reason, the text data to be processed may first be classified according to its overall context. The overall context of the text data to be processed is related to its topic type, and therefore, in the embodiment of the present application, topic identification may be performed on the text data to be processed to preliminarily determine its overall context.
It should be noted that, in one possible implementation of the embodiment of the present application, a topic recognition model may be used to perform topic identification on the text data to be processed. To train the topic recognition model, a training sample set may be formed from text data whose topic types are already known, and the topic recognition model is generated through training.
Step 102, performing data annotation on the text data to be processed according to the feature set corresponding to the first topic type to determine an annotation label corresponding to the text data to be processed.
The feature set includes lexical features, semantic features, syntactic features and the like used for data tagging of the text data to be processed.
For example, if the first topic type is "social class", the corresponding feature set may include feature A, feature B, and feature C; if the first topic type is "economic class", the corresponding feature set may include feature B, feature E, and feature D.
The feature sets corresponding to different topic types can be determined according to the policy documents associated with each topic type. For example, the feature set corresponding to the "social class" topic type may be determined from the business logic by which government agencies and official media judge sensitive content. After that business logic is analyzed, the lexical and semantic features corresponding to the "social class" topic type may be determined to include: negative content, a specific event that has not concluded, and an event whose impact may involve other people.
It should be noted that the feature set corresponding to each first topic type may contain a plurality of features. When the text data to be processed matches one or more features in the feature set corresponding to its first topic type, or when the matching degree between the text data to be processed and the feature set is greater than a preset first threshold, the text data to be processed may be annotated as sensitive, that is, its annotation label is "sensitive".
In practical use, the preset first threshold may be determined according to actual conditions and is not limited in the embodiment of the present application; for example, the first threshold may be 60%.
For example, suppose the first topic type corresponding to the text data to be processed is "social class", the feature set corresponding to "social class" has 6 features, and the preset first threshold is 60%. If the text data to be processed matches 4 features in the feature set, its matching degree with the feature set is 4/6, or about 67%, which is greater than the preset first threshold, so the annotation label of the text data to be processed is sensitive.
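The threshold rule in this example can be sketched as follows. This is a minimal illustration, assuming that the matching degree is the fraction of the topic's features that the text matches; the function names and the feature identifiers `f1` to `f6` are hypothetical and do not come from the application.

```python
def matching_degree(matched_features, feature_set):
    """Fraction of the topic's feature set that the text matches (assumed definition)."""
    if not feature_set:
        return 0.0
    return len(set(matched_features) & set(feature_set)) / len(feature_set)


def annotate(matched_features, feature_set, first_threshold=0.6):
    """Label the text "sensitive" when the matching degree exceeds the first threshold."""
    degree = matching_degree(matched_features, feature_set)
    return "sensitive" if degree > first_threshold else "insensitive"


# Hypothetical feature identifiers for the "social class" topic type.
social_features = {"f1", "f2", "f3", "f4", "f5", "f6"}

# 4 of 6 features matched: 4/6 is above the 60% threshold.
label = annotate({"f1", "f2", "f3", "f4"}, social_features)
```

How a text is tested against an individual feature is left out here, since the application does not prescribe a particular matching procedure.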
Further, in one possible implementation of the embodiment of the present application, after the annotation label corresponding to the text data to be processed is determined, the reliability of that annotation label may also be determined according to the matching degree between the text data to be processed and the feature set corresponding to its first topic type.
Specifically, different reliability thresholds can be preset, for example a second threshold and a third threshold, where the second threshold is greater than the first threshold and the third threshold is smaller than the first threshold.
In one possible implementation of the embodiment of the present application, the reliability of the annotation label corresponding to the text data to be processed may be determined according to the following rules. If the matching degree between the text data to be processed and the feature set corresponding to the first topic type is greater than the first threshold and less than the second threshold, the annotation label is sensitive with lower reliability, which may be set to level 1. If the matching degree is greater than the second threshold, the annotation label is sensitive with higher reliability, which may be set to level 2. If the matching degree is less than the first threshold and greater than the third threshold, the annotation label is insensitive with lower reliability. If the matching degree is less than the third threshold, the annotation label is insensitive with higher reliability.
For example, suppose the first topic type of each of the text data to be processed A, B, C, and D is "social class", the feature set corresponding to "social class" has 6 features, the preset first threshold is 60%, the second threshold is 80%, and the third threshold is 20%. Text data A matches 4 features in the feature set, B matches 6, C matches 3, and D matches none, so the matching degrees of A, B, C, and D with the feature set are 67%, 100%, 50%, and 0%, respectively. The matching degree of A is greater than the first threshold and less than the second threshold, so its annotation label is sensitive with lower reliability (level 1). The matching degree of B is greater than the second threshold, so its annotation label is sensitive with higher reliability (level 2). The matching degree of C is less than the first threshold and greater than the third threshold, so its annotation label is insensitive with lower reliability. The matching degree of D is less than the third threshold, so its annotation label is insensitive with higher reliability.
It should be noted that the above examples are only illustrative and should not be construed as limiting the present application. During actual use, the credibility grades can be divided more finely according to actual needs.
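The reliability rules can be illustrated with a short sketch. The function name and the returned strings are hypothetical, and the handling of a matching degree exactly equal to a threshold is an assumption, since the text only describes the strictly greater and strictly smaller cases.

```python
def label_with_reliability(degree, first=0.6, second=0.8, third=0.2):
    """Map a matching degree in [0, 1] to (annotation label, reliability).

    Assumption: a degree exactly equal to a threshold falls into the band below it.
    """
    if not third < first < second:
        raise ValueError("expected third < first < second")
    if degree > second:
        return ("sensitive", "higher")
    if degree > first:
        return ("sensitive", "lower")
    if degree > third:
        return ("insensitive", "lower")
    return ("insensitive", "higher")


# Texts A, B, C, D from the example match 4, 6, 3 and 0 of the 6 features.
results = {name: label_with_reliability(matched / 6)
           for name, matched in {"A": 4, "B": 6, "C": 3, "D": 0}.items()}
```

Adding finer reliability grades would amount to adding more thresholds to this chain of comparisons.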
Optionally, in one possible implementation of the embodiment of the present application, data annotation may be performed on the text data to be processed according to whether it matches all features in the feature set: when the text data to be processed matches all features in the feature set, it is annotated as sensitive; otherwise, it is annotated as insensitive.
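This stricter variant reduces to a subset test. The sketch below is illustrative only; the function name is hypothetical, and treating an empty feature set as "insensitive" is an assumption.

```python
def annotate_all_features(matched_features, feature_set):
    """Label the text "sensitive" only if it matches every feature in the set."""
    feature_set = set(feature_set)
    # Assumption: an empty feature set never yields a "sensitive" label.
    if feature_set and feature_set <= set(matched_features):
        return "sensitive"
    return "insensitive"
```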
Further, the feature set corresponding to the first topic type can be determined according to the sensitive text data corresponding to the first topic type. That is, before step 102, the method can further include:
performing data processing on the sensitive text data corresponding to the first topic type to determine the feature set corresponding to the first topic type.
The sensitive text data refers to text data whose corresponding label tag has been determined to be sensitive.
It should be noted that, in one possible implementation of the embodiment of the present application, before data annotation is performed on the text data to be processed, data processing may be performed on the sensitive text data corresponding to each topic type to determine the sensitive words or other features shared by the sensitive text data of that topic type. The features shared by the sensitive text data of each topic type are then taken as the feature set corresponding to that topic type.
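One simple way to picture this step is to keep, for each topic type, the words that occur in every known sensitive text of that type. The sketch below is only an illustration under stated assumptions: whitespace tokenization stands in for real word segmentation, shared words stand in for the richer lexical, semantic, and syntactic features the application mentions, and the function name and the `min_share` parameter are hypothetical.

```python
from collections import Counter


def common_feature_sets(sensitive_texts_by_topic, min_share=1.0):
    """Return, per topic, the words found in at least `min_share` of its sensitive texts."""
    feature_sets = {}
    for topic, texts in sensitive_texts_by_topic.items():
        doc_freq = Counter()
        for text in texts:
            # Count each word once per text (document frequency).
            doc_freq.update(set(text.lower().split()))
        needed = min_share * len(texts)
        feature_sets[topic] = {w for w, n in doc_freq.items() if n >= needed}
    return feature_sets


# Made-up sample texts, for illustration only.
sets_by_topic = common_feature_sets(
    {"social": ["riot blocked road", "riot near school"]}
)
```

Lowering `min_share` below 1.0 would keep words that are frequent but not universal across the sensitive texts of a topic.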
Step 103, training a recognition model by using a training sample set formed by the text data to be processed and the annotation labels corresponding to the text data to be processed, to obtain the sensitivity recognition model.
In the embodiment of the application, after data annotation is performed on the text data to be processed, the text data to be processed and its annotation labels can together form a training sample set, and a sensitivity recognition model is obtained through training.
Further, the text data to be processed may correspond to multiple topic types, and the feature sets corresponding to different topic types may differ. Therefore, to avoid poor recognition accuracy caused by mixing topic types, a separate sensitivity recognition model may be trained for each topic type. That is, in one possible implementation of the embodiment of the present application, step 103 may include:
training a recognition model by using the training sample set that corresponds to the first topic type and is formed by the text data to be processed and its annotation labels, to obtain a first sensitivity recognition model corresponding to the first topic type.
It can be understood that, when training sample sets are formed from the text data to be processed and its annotation labels, the text data to be processed may be grouped by first topic type, so that the text data to be processed of each first topic type and its annotation labels form a separate training sample set. The first sensitivity recognition model corresponding to each first topic type is then obtained by training on the training sample set corresponding to that topic type.
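The grouping described here can be sketched as below. The sample layout `(text, topic, label)` and the `train_fn` callback are hypothetical placeholders; the application does not specify the model type, so any training routine that accepts a list of `(text, label)` pairs could be plugged in.

```python
from collections import defaultdict


def build_training_sets(samples):
    """Group (text, topic, label) samples into one training set per topic type."""
    sets_by_topic = defaultdict(list)
    for text, topic, label in samples:
        sets_by_topic[topic].append((text, label))
    return dict(sets_by_topic)


def train_per_topic(samples, train_fn):
    """Train one model per topic type; `train_fn` is a caller-supplied trainer."""
    return {topic: train_fn(pairs)
            for topic, pairs in build_training_sets(samples).items()}


# Made-up samples, for illustration only.
samples = [
    ("text one", "social", "sensitive"),
    ("text two", "social", "insensitive"),
    ("text three", "economic", "sensitive"),
]
training_sets = build_training_sets(samples)
```

If reliability levels are also to be learned, the label element of each pair could carry both the annotation label and its reliability.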
It should be noted that, in one possible implementation of the embodiment of the present application, when the training sample set is formed from the text data to be processed and its annotation labels, the reliability of each annotation label may also be included in the training sample set, so that the trained sensitivity recognition model can recognize not only the annotation label corresponding to text data but also the reliability of that label.
Further, in the embodiment of the present application, after the sensitivity recognition model has been trained, it can be used to determine the sensitivity of target text data. That is, in one possible implementation of the embodiment of the present application, after step 103, the method further includes:
acquiring target text data;
performing topic identification on the target text data to determine a second topic type corresponding to the target text data;
and recognizing the target text data by using a second sensitivity recognition model corresponding to the second topic type to determine a sensitivity label of the target text data.
The target text data refers to text data whose sensitivity label is currently to be determined. The second topic type refers to the topic type corresponding to the target text data.
It should be noted that, in the embodiment of the present application, after the target text data is obtained, topic identification may be performed on it in the same way as on the text data to be processed; details are not repeated here.
In one possible implementation of the embodiment of the present application, the second sensitivity recognition model may be selected according to the second topic type corresponding to the target text data, and the selected model is then used to recognize the target text data and determine its sensitivity label.
For example, if topic identification determines that the second topic type corresponding to target text data A is "political type", target text data A may be recognized by using the sensitivity recognition model corresponding to "political type".
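The two-stage recognition flow (identify the topic, then apply that topic's model) can be sketched as follows. Both `identify_topic` and the per-topic models are hypothetical callables standing in for trained models; the keyword-based stand-ins at the bottom exist only to make the example run.

```python
def recognize_sensitivity(text, identify_topic, models_by_topic):
    """Identify the topic type of `text`, then apply that topic's sensitivity model."""
    topic = identify_topic(text)
    model = models_by_topic.get(topic)
    if model is None:
        raise KeyError(f"no sensitivity recognition model for topic {topic!r}")
    return topic, model(text)


# Toy stand-ins for trained models (assumptions, not the application's models).
def toy_topic(text):
    return "political" if "election" in text else "social"


toy_models = {
    "political": lambda text: "sensitive" if "fraud" in text else "insensitive",
    "social": lambda text: "insensitive",
}

outcome = recognize_sensitivity("election fraud claims", toy_topic, toy_models)
```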
Further, in one possible implementation of the embodiment of the present application, when the target text data is recognized by using the sensitivity recognition model, not only its sensitivity label but also the reliability of that label may be determined.
It should be noted that, in the embodiment of the present application, after the sensitivity recognition model is obtained by training on the training sample set, the training samples of the model may be automatically optimized and refined during use, according to the model's recognition results and the emergence of new sensitive vocabulary, so as to maintain the recognition accuracy and precision of the model.
According to the text data sensitivity identification method provided by the embodiment of the application, topic identification can be performed on the text data to be processed to determine the corresponding first topic type; data annotation is performed on the text data to be processed according to the feature set corresponding to the first topic type to determine the corresponding annotation label; and a training sample set formed by the text data to be processed and the corresponding annotation labels is then used to train a recognition model and obtain the sensitivity recognition model.
In one possible implementation of the present application, the original text data obtained from the network may include low-quality noise data, which may affect the recognition performance of the finally trained sensitivity recognition model.
The text data sensitivity identification method provided by the embodiment of the present application is further described below with reference to Fig. 2.
Fig. 2 is a schematic flowchart of another text data sensitivity identification method provided in an embodiment of the present application.
As shown in fig. 2, the text data sensitivity recognition method includes the following steps:
Step 201, performing data cleaning on the acquired text data to obtain candidate text data.
The candidate text data refers to the high-quality text data that remains after the acquired text data has been cleaned.
It should be noted that, in one possible implementation of the present application, to obtain a training sample set for training the sensitivity recognition model, a certain amount of initial text data may be obtained from the network and then filtered to obtain the text data to be processed. Because data on the network is diverse in content and uneven in quality, the original text data obtained from the network may include low-quality noise data, which affects the stability and accuracy of the sensitivity recognition model.
For example, data cleaning may be performed according to preset data cleaning rules, such as a word count of less than 100, an unclear subject, a report, and the like. There may be multiple preset data cleaning rules, and if text data matches one or more of them, it can be determined to be low-quality noise data and removed, leaving the candidate text data.
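Rule-based cleaning of this kind can be sketched as a list of predicates, where a text is discarded as soon as any rule matches. Only the word-count rule is shown; the function names are hypothetical, and whitespace splitting is an assumed stand-in for proper word counting.

```python
def too_short(text, min_words=100):
    """Cleaning rule: the text has fewer than `min_words` whitespace-separated words."""
    return len(text.split()) < min_words


def clean(texts, rules):
    """Keep only the texts that match none of the cleaning rules."""
    return [t for t in texts if not any(rule(t) for rule in rules)]


# Made-up inputs: one text with 120 words and one with 2 words.
long_text = " ".join(["word"] * 120)
candidates = clean(["short text", long_text], [too_short])
```

Other rules, such as a check for an unclear subject, would be added to the `rules` list as further predicates.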
In another possible implementation of the present application, data cleaning may be performed by using a data cleaning model: known high-quality text data is used to form a training sample set, a data cleaning model is obtained by training, and text data is then input into the data cleaning model to determine whether it is high-quality text data, thereby obtaining the candidate text data.
Step 202, matching each sensitive word in a preset sensitive word list against each piece of candidate text data to determine the matching degree between each piece of candidate text data and each sensitive word in the sensitive word list.
Step 203, acquiring the text data to be processed from the candidate text data according to the matching degree between each piece of candidate text data and each sensitive word in the sensitive word list.
It should be noted that, before topic identification is performed on the candidate text data, the sensitivity of each piece of candidate text data may be preliminarily determined according to the preset sensitive word list, and candidate text data that is clearly not sensitive is removed to obtain the text data to be processed.
In one possible implementation of the embodiment of the present application, sensitive words may be extracted from known sensitive text data to form a basic sensitive word list, which is then expanded with synonyms, near-synonyms, and the like to form the preset sensitive word list.
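Expanding the basic list can be pictured as a lookup in a synonym table. The sketch below assumes such a table is already available as a dictionary; the function name and the sample words are hypothetical, and the application does not say where the synonyms come from.

```python
def expand_word_list(base_words, synonyms):
    """Return the base sensitive words plus any synonyms listed for them."""
    expanded = set(base_words)
    for word in base_words:
        expanded.update(synonyms.get(word, ()))
    return expanded


# Made-up words and synonym table, for illustration only.
preset_list = expand_word_list(
    {"riot", "fraud"},
    {"riot": ["unrest", "disturbance"], "scam": ["swindle"]},
)
```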
It should be noted that, when the sensitive words in the preset sensitive word list are matched against the candidate text data, each sensitive word may be matched against the full text of each piece of candidate text data one by one, so as to improve matching accuracy.
In the embodiment of the present application, when the matching degree between a piece of candidate text data and the sensitive words in the sensitive word list is determined, word segmentation may first be performed on the candidate text data to obtain the words it contains, and the matching degree between each segmented word and each sensitive word in the preset sensitive word list is then determined.
It should be noted that, in one possible implementation of the present application, a fourth threshold may be preset for the matching degree between a segmented word in the candidate text data and a sensitive word in the preset sensitive word list; for example, the fourth threshold may be 0.6. A segmented word whose matching degree exceeds the fourth threshold is treated as a suspected sensitive word.
Further, in one possible implementation of the embodiment of the present application, a threshold for the number of suspected sensitive words in the candidate text data may also be preset, and when the number of suspected sensitive words in a piece of candidate text data exceeds this threshold, the candidate text data may be determined to be text data to be processed.
In addition, a fifth threshold can be preset according to actual conditions. When a piece of candidate text data contains a segmented word whose matching degree with a sensitive word in the preset sensitive word list is greater than the fifth threshold, the candidate text data can be determined directly to be text data to be processed, regardless of the number of suspected sensitive words it contains.
For example, assume that the preset fourth threshold is 0.6, the fifth threshold is 0.9, and the threshold for the number of suspected sensitive words is 10. If candidate text data A contains two sensitive words from the preset sensitive word list, the matching degree between candidate text data A and those two sensitive words may be set to 1, and candidate text data A is determined directly to be text data to be processed. If the matching degrees between 20 segmented words in candidate text data B and the sensitive words are all greater than 0.6 and less than 0.7, candidate text data B contains 20 suspected sensitive words, which is more than the threshold for the number of suspected sensitive words, so candidate text data B is also determined to be text data to be processed.
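The selection logic with the fourth threshold, the fifth threshold, and the count threshold can be sketched as follows. The application does not say how the matching degree between a segmented word and a sensitive word is computed, so the sketch assumes the character-level similarity ratio from Python's standard `difflib.SequenceMatcher`; the function names are hypothetical, and the input is assumed to be already segmented into words.

```python
from difflib import SequenceMatcher


def best_match(token, sensitive_words):
    """Highest similarity between `token` and any sensitive word (assumed measure)."""
    return max(
        (SequenceMatcher(None, token, w).ratio() for w in sensitive_words),
        default=0.0,
    )


def is_to_be_processed(tokens, sensitive_words,
                       fourth=0.6, fifth=0.9, count_threshold=10):
    """Decide whether a segmented candidate text should be kept for processing."""
    suspected = 0
    for token in tokens:
        degree = best_match(token, sensitive_words)
        if degree > fifth:
            # A near-exact match keeps the text regardless of the count.
            return True
        if degree > fourth:
            suspected += 1
    return suspected > count_threshold
```

For instance, a text that contains a listed sensitive word verbatim has a matching degree of 1 for that word and is kept immediately, which corresponds to candidate text data A in the example above.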
Step 204, determining the category of the text data to be processed according to the publishing site of the text data to be processed.
Step 205, performing topic identification on the text data to be processed by using a topic identification model corresponding to the category of the text data to be processed.
It can be understood that text data to be processed from different sources may differ in text structure, writing style, and so on, which affects topic identification for the text data. For example, content published by a media website usually has a fixed title structure and a fixed writing format, whereas content published on a site where internet users post, such as a forum (post bar), usually has no regular text structure. Therefore, in the embodiment of the application, the category of the text data to be processed can be determined according to its publishing site, and the topic identification model corresponding to that category is then used to perform topic identification on the text data to be processed, which improves the accuracy of topic identification.
It should be noted that, in one possible implementation of the embodiment of the present application, a certain amount of text data may be obtained from each of several publishing sites, and the text data from the same publishing site is used to form a training sample set, so that a topic identification model corresponding to each publishing site is trained and generated.
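A minimal sketch of training one topic identification model per publishing site is given below. The word-count "model", the function names, and the sample data are all assumptions made for illustration; the application does not specify the model type.

```python
from collections import Counter, defaultdict


def train_topic_model(samples):
    """Toy stand-in for a topic model: per word, count the topics seen with it."""
    word_topics = defaultdict(Counter)
    for text, topic in samples:
        for word in text.lower().split():
            word_topics[word][topic] += 1
    return word_topics


def predict_topic(model, text):
    """Vote over the words of the text; None if no word is known."""
    votes = Counter()
    for word in text.lower().split():
        votes.update(model.get(word, {}))
    return votes.most_common(1)[0][0] if votes else None


def train_per_site(labeled):
    """Group (site, text, topic) samples by site and train one model per site."""
    by_site = defaultdict(list)
    for site, text, topic in labeled:
        by_site[site].append((text, topic))
    return {site: train_topic_model(s) for site, s in by_site.items()}


# Hypothetical labeled samples from two kinds of publishing site.
labeled = [
    ("media", "Markets: stocks rally on earnings", "finance"),
    ("media", "Sports: team wins the final", "sports"),
    ("forum", "lol that match was crazy", "sports"),
    ("forum", "anyone buying stocks rn", "finance"),
]
models = train_per_site(labeled)

# The model is chosen by the publishing site of the text to be processed.
print(predict_topic(models["forum"], "crazy match"))
print(predict_topic(models["media"], "stocks rally"))
```

Keeping a separate model per site lets each one adapt to that site's text structure, which is the motivation stated above.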
According to the text data sensitivity identification method provided by the embodiment of the application, data cleaning is first performed on the acquired text data to obtain candidate text data; the text data to be processed is then obtained from the candidate text data according to the matching degree between the candidate text data and the sensitive words in the sensitive word list; next, the category of the text data to be processed is determined according to its publishing site, and the topic identification model corresponding to that category is used to perform topic identification on the text data to be processed. Extracting the text data to be processed through data cleaning and sensitive word matching reduces the amount of data handled during topic identification and so speeds up the building of the sensitive recognition model, while using the topic identification model that corresponds to the publishing site makes the finally determined topic type of the text data to be processed more accurate.
In order to implement the above embodiments, the present application also proposes a text data sensitivity recognition apparatus.
Fig. 3 is a schematic structural diagram of a text data sensitivity recognition apparatus according to an embodiment of the present application.
As shown in fig. 3, the text data sensitivity recognition apparatus 30 includes:
The first determining module 31 is configured to perform topic identification on the text data to be processed to determine a first topic type corresponding to the text data to be processed.
The second determining module 32 is configured to perform data annotation on the text data to be processed according to the feature set corresponding to the first topic type, so as to determine an annotation tag corresponding to the text data to be processed.
The training module 33 is configured to train a recognition model by using a training sample set formed by the text data to be processed and the annotation tags corresponding to the text data to be processed, so as to obtain a sensitive recognition model.
In practical use, the text data sensitivity recognition device provided by the embodiment of the application can be configured in any electronic equipment to execute the text data sensitivity recognition method.
The text data sensitivity recognition device provided by the embodiment of the application can perform topic identification on text data to be processed to determine a first topic type corresponding to the text data to be processed, perform data annotation on the text data to be processed according to the feature set corresponding to the first topic type to determine an annotation tag corresponding to the text data to be processed, and then train a recognition model by using a training sample set formed by the text data to be processed and the corresponding annotation tags, so as to obtain a sensitive recognition model.
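The cooperation of the three modules can be sketched as follows. Everything here is a toy stand-in chosen for illustration: the keyword-overlap topic identification, the feature sets, the tag names "sensitive" and "normal", and the word-count recognition model are assumptions, not the models used by the application.

```python
from collections import Counter, defaultdict

# Hypothetical topic keywords and per-topic feature sets.
TOPIC_KEYWORDS = {
    "finance": {"stock", "returns", "fund"},
    "health": {"pill", "doctor", "diet"},
}
FEATURE_SETS = {
    "finance": {"guaranteed", "insider"},
    "health": {"miracle", "cure"},
}


def first_determining_module(text):
    """Topic identification: the topic sharing the most keywords with the text."""
    words = set(text.lower().split())
    return max(TOPIC_KEYWORDS, key=lambda t: len(words & TOPIC_KEYWORDS[t]))


def second_determining_module(text, topic):
    """Data annotation using the feature set of the identified topic type."""
    words = set(text.lower().split())
    return "sensitive" if words & FEATURE_SETS[topic] else "normal"


def training_module(samples):
    """Toy recognition model: per word, count the annotation tags seen with it."""
    model = defaultdict(Counter)
    for text, tag in samples:
        for word in set(text.lower().split()):
            model[word][tag] += 1
    return model


def predict(model, text):
    votes = Counter()
    for word in set(text.lower().split()):
        votes.update(model.get(word, {}))
    return votes.most_common(1)[0][0] if votes else "normal"


texts = [
    "guaranteed returns on this stock",
    "the fund reported stock returns",
    "miracle pill replaces your doctor",
    "doctor recommends a balanced diet",
]
tags = [second_determining_module(t, first_determining_module(t)) for t in texts]
model = training_module(zip(texts, tags))
print(tags)
print(predict(model, "guaranteed miracle"))
```

The point of the sketch is the data flow: the annotation tags are produced automatically from the topic-specific feature set, and only then used as training labels.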
In one possible implementation of the present application, the text data sensitivity recognition apparatus 30 is specifically configured to:
matching each sensitive word in a preset sensitive word list with each candidate text data respectively to determine the matching degree between each candidate text data and each sensitive word in the sensitive word list;
and acquiring text data to be processed from the candidate text data according to the matching degree between the candidate text data and the sensitive words in the sensitive word list.
Further, in another possible implementation of the present application, the text data sensitivity recognition apparatus 30 is further configured to:
and performing data cleaning on the acquired text data to acquire the candidate text data.
Further, in another possible implementation of the present application, the text data sensitivity recognition apparatus 30 is further configured to:
and determining the category of the text data to be processed according to the publishing site of the text data to be processed.
Further, in another possible implementation of the present application, the first determining module 31 is specifically configured to:
and performing topic identification on the text data to be processed by using a topic identification model corresponding to the category of the text data to be processed.
Further, in another possible implementation of the present application, the text data sensitivity recognition apparatus 30 is further configured to:
and performing data processing on the sensitive text data corresponding to the first topic type to determine a feature set corresponding to the first topic type.
Further, in another possible implementation of the present application, the training module 33 is specifically configured to:
and training a recognition model by using a training sample set that corresponds to the first topic type and is formed by the text data to be processed and the annotation tags corresponding to the text data to be processed, to obtain a first sensitive recognition model corresponding to the first topic type.
Further, in another possible implementation of the present application, the text data sensitivity recognition apparatus 30 is further configured to:
acquiring target text data;
performing topic identification on the target text data to determine a second topic type corresponding to the target text data;
and recognizing the target text data by using a second sensitive recognition model corresponding to the second topic type to determine a sensitive label of the target text data.
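The recognition flow for target text data listed above amounts to routing each text to the sensitive recognition model of its topic type. A minimal sketch, assuming the per-topic models are available as callables; the names `recognize`, `toy_identify_topic`, and `toy_models` are hypothetical, and the toy models only stand in for trained ones.

```python
from typing import Callable, Dict, Tuple

SensitiveModel = Callable[[str], str]


def recognize(text: str,
              identify_topic: Callable[[str], str],
              models_by_topic: Dict[str, SensitiveModel]) -> Tuple[str, str]:
    """Return (second topic type, sensitive label) for one target text."""
    topic = identify_topic(text)
    model = models_by_topic.get(topic)
    if model is None:
        raise KeyError(f"no sensitive recognition model for topic {topic!r}")
    return topic, model(text)


# Toy stand-ins, for illustration only.
def toy_identify_topic(text: str) -> str:
    return "finance" if "stock" in text.lower() else "health"


toy_models: Dict[str, SensitiveModel] = {
    "finance": lambda t: "sensitive" if "insider" in t.lower() else "normal",
    "health": lambda t: "sensitive" if "miracle" in t.lower() else "normal",
}

print(recognize("Insider tip on this stock", toy_identify_topic, toy_models))
print(recognize("Miracle stock pick", toy_identify_topic, toy_models))
```

In the second call the word "miracle" does not trigger a sensitive label, because the text is routed to the finance model; a word that is sensitive under one topic type need not be sensitive under another.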
It should be noted that the foregoing explanation of the embodiment of the text data sensitivity recognition method shown in fig. 1 and fig. 2 is also applicable to the text data sensitivity recognition apparatus 30 of this embodiment, and is not repeated here.
The text data sensitivity recognition device provided by the embodiment of the application can perform topic identification on text data to be processed to determine a first topic type corresponding to the text data to be processed, perform data annotation on the text data to be processed according to the feature set corresponding to the first topic type to determine an annotation tag corresponding to the text data to be processed, and then train a recognition model by using a training sample set formed by the text data to be processed and the corresponding annotation tags, so as to obtain a sensitive recognition model.
In order to implement the above embodiments, the present application also proposes an electronic device.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
As shown in fig. 4, the electronic device 400 includes:
a memory 410 and a processor 420, and a bus 430 connecting the different components (including the memory 410 and the processor 420), wherein the memory 410 stores a computer program, and when the processor 420 executes the program, the text data sensitivity recognition method according to the embodiment of the present application is implemented.
Bus 430 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an enhanced ISA bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
Electronic device 400 typically includes a variety of electronic device readable media. Such media may be any available media that is accessible by electronic device 400 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 410 may also include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 440 and/or cache memory 450. Electronic device 400 may further include other removable/non-removable, volatile/non-volatile computer system storage media. Storage system 460 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in fig. 4, commonly referred to as a "hard disk drive"). Although not shown in fig. 4, a magnetic disk drive may be provided for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive may be provided for reading from and writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media).
Program/utility 480, having a set (at least one) of program modules 470, may be stored, for example, in memory 410. Such program modules 470 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
Electronic device 400 may also communicate with one or more external devices 490 (e.g., keyboard, pointing device, display 491, etc.), with one or more devices that enable a user to interact with electronic device 400, and/or with any device (e.g., network card, modem, etc.) that enables electronic device 400 to communicate with one or more other computing devices.
The processor 420 executes various functional applications and data processing by executing programs stored in the memory 410.
It should be noted that, for the implementation process and the technical principle of the electronic device of this embodiment, reference is made to the foregoing explanation of the text data sensitivity recognition method according to the embodiment of the present application, and details are not described here again.
The electronic device provided by the embodiment of the application can execute the text data sensitivity identification method described above: it performs topic identification on text data to be processed to determine a first topic type corresponding to the text data to be processed, performs data annotation on the text data to be processed according to the feature set corresponding to the first topic type to determine an annotation tag corresponding to the text data to be processed, and then trains a recognition model by using a training sample set formed by the text data to be processed and the corresponding annotation tags, so as to obtain a sensitive recognition model.
To implement the above embodiments, the present application also proposes a computer-readable storage medium.
The computer readable storage medium stores thereon a computer program, and the computer program is executed by a processor to implement the text data sensitivity recognition method according to the embodiment of the present application.
In order to implement the above embodiments, an embodiment of yet another aspect of the present application provides a computer program which, when executed by a processor, implements the text data sensitivity recognition method according to the embodiments of the present application.
In an alternative implementation, this embodiment may employ any combination of one or more computer-readable media. A computer-readable medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
This application is intended to cover any variations, uses, or adaptations of the application that follow its general principles, including such departures from the present disclosure as come within known or customary practice in the art to which the application pertains. The description and examples are to be regarded as illustrative only, and the true scope and spirit of the application is indicated by the claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A text data sensitivity recognition method, comprising:
performing topic identification on text data to be processed to determine a first topic type corresponding to the text data to be processed;
performing data annotation on the text data to be processed according to a feature set corresponding to the first topic type to determine an annotation tag corresponding to the text data to be processed;
and training a recognition model by using a training sample set formed by the text data to be processed and the annotation tag corresponding to the text data to be processed, to obtain a sensitive recognition model.
2. The method of claim 1, wherein before performing topic identification on the text data to be processed, the method further comprises:
matching each sensitive word in a preset sensitive word list with each candidate text data respectively to determine the matching degree between each candidate text data and each sensitive word in the sensitive word list;
and acquiring text data to be processed from the candidate text data according to the matching degree between the candidate text data and the sensitive words in the sensitive word list.
3. The method of claim 2, wherein before matching each sensitive word in the preset sensitive word list with each candidate text data, the method further comprises:
and performing data cleaning on the acquired text data to acquire the candidate text data.
4. The method of claim 1, wherein before performing topic identification on the text data to be processed, the method further comprises:
determining the category of the text data to be processed according to the publishing site of the text data to be processed;
and wherein performing topic identification on the text data to be processed comprises:
performing topic identification on the text data to be processed by using a topic identification model corresponding to the category of the text data to be processed.
5. The method of any one of claims 1-4, wherein before performing data annotation on the text data to be processed according to the feature set corresponding to the first topic type, the method further comprises:
performing data processing on sensitive text data corresponding to the first topic type to determine the feature set corresponding to the first topic type.
6. The method of any one of claims 1-4, wherein training the recognition model by using the training sample set formed by the text data to be processed and the annotation tag corresponding to the text data to be processed, to obtain the sensitive recognition model, comprises:
training a recognition model by using a training sample set that corresponds to the first topic type and is formed by the text data to be processed and the annotation tag corresponding to the text data to be processed, to obtain a first sensitive recognition model corresponding to the first topic type.
7. The method of claim 6, wherein after obtaining the first sensitive recognition model corresponding to the first topic type, the method further comprises:
acquiring target text data;
performing topic identification on the target text data to determine a second topic type corresponding to the target text data;
and recognizing the target text data by using a second sensitive recognition model corresponding to the second topic type to determine a sensitive label of the target text data.
8. A text data sensitivity recognition apparatus, comprising:
a first determining module, configured to perform topic identification on text data to be processed to determine a first topic type corresponding to the text data to be processed;
a second determining module, configured to perform data annotation on the text data to be processed according to a feature set corresponding to the first topic type, so as to determine an annotation tag corresponding to the text data to be processed;
and a training module, configured to train a recognition model by using a training sample set formed by the text data to be processed and the annotation tag corresponding to the text data to be processed, to obtain a sensitive recognition model.
9. An electronic device, comprising a memory, a processor, and a program stored on the memory and executable on the processor, wherein the processor implements the text data sensitivity recognition method according to any one of claims 1-7 when executing the program.
10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the text data sensitivity recognition method according to any one of claims 1-7.
CN201810719136.6A 2018-07-03 2018-07-03 Text data sensitivity identification method and device, electronic equipment and storage medium Active CN110737770B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810719136.6A CN110737770B (en) 2018-07-03 2018-07-03 Text data sensitivity identification method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110737770A true CN110737770A (en) 2020-01-31
CN110737770B CN110737770B (en) 2023-01-20

Family

ID=69234229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810719136.6A Active CN110737770B (en) 2018-07-03 2018-07-03 Text data sensitivity identification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110737770B (en)

Also Published As

Publication number Publication date
CN110737770B (en) 2023-01-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant