CN110737770A - Text data sensitivity identification method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN110737770A (application CN201810719136.6A)
- Authority
- CN
- China
- Prior art keywords
- text data
- processed
- sensitive
- data
- recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The application provides a text data sensitivity identification method and device, electronic equipment, and a storage medium, and belongs to the technical field of the Internet. The method comprises: performing topic identification on text data to be processed to determine a first topic type corresponding to the text data to be processed; performing data annotation on the text data to be processed according to a feature set corresponding to the first topic type to determine an annotation label corresponding to the text data to be processed; and training an identification model by using a training sample set formed by the text data to be processed and the annotation label corresponding to the text data to be processed to obtain a sensitivity identification model.
Description
Technical Field
The present application relates to the technical field of the Internet, and in particular to a text data sensitivity identification method and device, electronic equipment, and a storage medium.
Background
Since the beginning of the 21st century, science and technology have been advancing rapidly. In an information society where the Internet industry is developing at high speed, the Internet has become an important way for people to acquire knowledge and information. For example, people can look up information and browse news through the Internet, and government agencies and official media can obtain leads and learn about public concerns from information on the Internet. However, given the massive amount of information on the Internet, obtaining valuable leads with social significance is no different from looking for a needle in a haystack. Meanwhile, the expanding reach of the Internet has also made the information on it uneven in quality, giving rise to harmful information that is detrimental to the growth of minors or undermines social stability.
Therefore, screening information on the Internet as required and identifying sensitive data is of great practical significance for improving the efficiency of acquiring information via the Internet and for avoiding the negative social effects of harmful information. In existing text data sensitivity identification techniques, the sensitivity of a target text is determined mainly by manual review, or by manually establishing a sensitive word list and then having a machine perform simple matching queries against the target text based on that list. However, a manually established sensitive word list does not combine sensitive words with their context, which easily leads to inaccurate recognition results, low efficiency, and wasted human resources.
Disclosure of Invention
The text data sensitivity identification method, device, electronic equipment and storage medium provided by the present application are intended to solve the problems in the related art that identifying the sensitivity of text data by manually establishing a sensitive word list suffers from low accuracy, low efficiency, and wasted human resources.
A text data sensitivity identification method provided by an embodiment of one aspect of the present application includes: performing topic identification on text data to be processed to determine a first topic type corresponding to the text data to be processed; performing data annotation on the text data to be processed according to a feature set corresponding to the first topic type to determine an annotation label corresponding to the text data to be processed; and training a recognition model by using a training sample set formed by the text data to be processed and the annotation label corresponding to the text data to be processed to obtain a sensitive recognition model.
A text data sensitivity recognition device provided by an embodiment of another aspect of the present application includes: a first determining module for performing topic identification on text data to be processed to determine a first topic type corresponding to the text data to be processed; a second determining module for performing data annotation on the text data to be processed according to a feature set corresponding to the first topic type to determine an annotation label corresponding to the text data to be processed; and a training module for training a recognition model by using a training sample set formed by the text data to be processed and the annotation label corresponding to the text data to be processed to obtain a sensitive recognition model.
An embodiment of a further aspect of the present application provides an electronic device, in which the processor implements the text data sensitivity recognition method described above when executing the program.
An embodiment of yet another aspect of the present application provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the text data sensitivity recognition method described above.
An embodiment of still another aspect of the present application provides a computer program which, when executed by a processor, implements the text data sensitivity recognition method according to the embodiments of the present application.
According to the text data sensitivity identification method, device, electronic equipment, computer-readable storage medium and computer program provided by the embodiments of the present application, topic identification can be performed on the text data to be processed to determine the first topic type corresponding to the text data to be processed; data annotation is performed on the text data to be processed according to the feature set corresponding to the first topic type to determine the annotation label corresponding to the text data to be processed; and a recognition model is then trained by using the training sample set formed by the text data to be processed and the corresponding annotation label to obtain the sensitive recognition model.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
Fig. 1 is a schematic flowchart of a text data sensitivity recognition method provided in an embodiment of the present application;
Fig. 2 is a schematic flowchart of another text data sensitivity recognition method provided in an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a text data sensitivity recognition apparatus provided in an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the like or similar elements throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
To address the low accuracy, low efficiency, and wasted human resources of the existing method of identifying the sensitivity of text data by manually establishing a sensitive word list, the embodiment of the present application provides a text data sensitivity identification method.
According to the text data sensitivity identification method provided by the embodiment of the application, topic identification can be performed on the text data to be processed to determine the first topic type corresponding to the text data to be processed; data annotation is performed on the text data to be processed according to the feature set corresponding to the first topic type to determine the annotation label corresponding to the text data to be processed; and a recognition model is then trained by using the training sample set formed by the text data to be processed and the corresponding annotation label to obtain the sensitive recognition model.
The text data sensitivity recognition method, apparatus, electronic device, storage medium, and computer program provided in the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a text data sensitivity recognition method according to an embodiment of the present application.
As shown in fig. 1, the text data sensitivity recognition method includes the following steps:
Step 101, performing topic identification on the text data to be processed to determine a first topic type corresponding to the text data to be processed.
The topic type may be, for example, a political type, an economic type, a social type, and the like.
It is to be understood that the same terms may have different meanings in different contexts, and thus the same terms may have different sensitivities in different contexts. In order to avoid inaccurate labeling of the text data to be processed due to different sensitivities of the same word in different contexts, the text data to be processed may be first classified according to the whole context. The whole context of the text data to be processed may be related to the topic type of the text data to be processed, and therefore, in the embodiment of the present application, topic identification may be performed on the text data to be processed to preliminarily determine the whole context of the text data to be processed.
It should be noted that, in one possible implementation of the embodiment of the present application, a topic recognition model may be used to perform topic recognition on the text data to be processed. When training the topic recognition model, a training sample set may be formed from text data whose topic types have already been determined, and the topic recognition model is generated through training.
Step 102, performing data annotation on the text data to be processed according to the feature set corresponding to the first topic type to determine an annotation label corresponding to the text data to be processed.
The feature set includes lexical features, semantic features, syntactic features and the like used for data tagging of the text data to be processed.
For example, when the first topic type is "social class", the corresponding feature set may include feature A, feature B and feature C; when the first topic type is "economic class", the corresponding feature set may include feature B, feature E and feature D.
The feature sets corresponding to different topic types can be determined according to the policy documents associated with the topic types. For example, the feature set corresponding to the "social class" topic may be determined based on the business logic by which government agencies and official media judge sensitive content. For example, after analyzing this business logic, the lexical features and semantic features corresponding to the "social class" topic are determined to include: negative content, a specific event that has not yet concluded, and an event whose impact may involve others.
It should be noted that the feature set corresponding to each first topic type may contain a plurality of features. When the text data to be processed matches one or more features in the feature set corresponding to its first topic type, or when the matching degree between the text data to be processed and the feature set is greater than a preset first threshold, the text data to be processed may be marked as sensitive, that is, the annotation label corresponding to the text data to be processed is sensitive.
In practical use, the preset first threshold may be determined according to the actual situation, which is not limited in the embodiment of the present application. For example, the first threshold may be 60%.
For example, suppose the first topic type corresponding to the text data to be processed is "social class", the feature set corresponding to "social class" has 6 features, and the preset first threshold is 60%. If the text data to be processed matches 4 features in the feature set, its matching degree with the feature set is 67%, which is greater than the preset first threshold, so the annotation label of the text data to be processed is sensitive.
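The threshold rule in this example can be sketched in Python as follows. This is a minimal illustration only: the function names and the feature identifiers `f1` to `f6` are hypothetical, and how a text is determined to match an individual feature is left outside the sketch.

```python
def matching_degree(matched_features, feature_set):
    """Fraction of the topic's feature set that the text matches."""
    if not feature_set:
        return 0.0
    return len(set(matched_features) & set(feature_set)) / len(feature_set)


def annotate(matched_features, feature_set, first_threshold=0.6):
    """Return (label, degree); 'sensitive' when degree exceeds the threshold."""
    degree = matching_degree(matched_features, feature_set)
    label = "sensitive" if degree > first_threshold else "insensitive"
    return label, degree


# Hypothetical feature identifiers for the "social class" topic.
social_features = {"f1", "f2", "f3", "f4", "f5", "f6"}
label, degree = annotate({"f1", "f2", "f3", "f4"}, social_features)
print(label, round(degree, 2))  # sensitive 0.67
```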
Further, in one possible implementation form of the embodiment of the present application, after the annotation label corresponding to the text data to be processed is determined, the confidence of the annotation label to which the text data to be processed belongs may also be determined according to the matching degree between the text data to be processed and the feature set corresponding to its first topic type.
Specifically, different confidence thresholds can be preset. For example, a second threshold and a third threshold are preset, where the second threshold is greater than the first threshold and the third threshold is smaller than the first threshold.
In one possible implementation form of the embodiment of the present application, the confidence of the annotation label corresponding to the text data to be processed may be determined according to the following rules:
if the matching degree between the text data to be processed and the feature set corresponding to the first topic type is greater than the first threshold and less than the second threshold, the annotation label is sensitive with lower confidence, which may be set to level 1;
if the matching degree is greater than the second threshold, the annotation label is sensitive with higher confidence, which may be set to level 2;
if the matching degree is less than the first threshold and greater than the third threshold, the annotation label is insensitive with lower confidence;
if the matching degree is less than the third threshold, the annotation label is insensitive with higher confidence.
For example, suppose the first topic type corresponding to each of the text data to be processed A, B, C and D is "social class", the feature set corresponding to "social class" has 6 features, the preset first threshold is 60%, the second threshold is 80%, and the third threshold is 20%. Text data A matches 4 features in the feature set, B matches 6 features, C matches 3 features, and D matches none, so the matching degrees of A, B, C and D with the feature set are 67%, 100%, 50% and 0%, respectively. The matching degree of A is greater than the first threshold and less than the second threshold, so its annotation label is sensitive with lower confidence, at level 1. The matching degree of B is greater than the second threshold, so its annotation label is sensitive with higher confidence, at level 2. The matching degree of C is less than the first threshold and greater than the third threshold, so its annotation label is insensitive with lower confidence. The matching degree of D is less than the third threshold, so its annotation label is insensitive with higher confidence.
It should be noted that the above examples are only illustrative and should not be construed as limiting the present application. In actual use, the confidence levels can be divided more finely according to actual needs.
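The three-threshold rule can be sketched as a small Python function. This is an illustrative sketch, not the application's implementation: the name `grade` is hypothetical, and because the text does not say how a matching degree exactly equal to a threshold is handled, the sketch assumes it falls into the lower band.

```python
def grade(degree, first=0.6, second=0.8, third=0.2):
    """Map a matching degree to (label, confidence).

    Assumption: a degree exactly equal to a threshold is treated as
    not exceeding it; the source text leaves this case open.
    """
    if not third < first < second:
        raise ValueError("thresholds must satisfy third < first < second")
    if degree > second:
        return "sensitive", "higher"
    if degree > first:
        return "sensitive", "lower"
    if degree > third:
        return "insensitive", "lower"
    return "insensitive", "higher"


# Texts A, B, C, D match 4, 6, 3 and 0 of the 6 features respectively.
results = {name: grade(n / 6) for name, n in {"A": 4, "B": 6, "C": 3, "D": 0}.items()}
print(results)
```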
Optionally, in one possible implementation form of the embodiment of the present application, data annotation may be performed on the text data to be processed according to whether it matches all features in the feature set. That is, when the text data to be processed matches all features in the feature set, it may be annotated as sensitive; otherwise, it is annotated as insensitive.
Further, the feature set corresponding to the first topic type can be determined according to the sensitive text data corresponding to the first topic type. That is, before step 102, the method can further include:
performing data processing on the sensitive text data corresponding to the first topic type to determine the feature set corresponding to the first topic type.
The sensitive text data refers to text data whose corresponding annotation label has been determined to be sensitive.
It should be noted that, in one possible implementation form of the embodiment of the present application, before data annotation is performed on the text data to be processed, data processing may be performed on the sensitive text data corresponding to each topic type to determine the sensitive words or other features common to the sensitive text data of that topic type, and the features common to the sensitive text data of each topic type are then taken as the feature set corresponding to that topic type.
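One simple way to picture the extraction of common features is to keep the words shared by the known sensitive texts of a topic. The sketch below makes several assumptions that are not in the source: features are reduced to plain words, texts are split on whitespace (Chinese text would need a word segmenter instead), and the `min_share` parameter is a hypothetical knob for how many texts must contain a word.

```python
from collections import Counter


def common_features(sensitive_texts, min_share=1.0):
    """Words appearing in at least `min_share` of the sensitive texts."""
    doc_freq = Counter()
    for text in sensitive_texts:
        # Count each word once per text (document frequency).
        doc_freq.update(set(text.lower().split()))
    needed = min_share * len(sensitive_texts)
    return {word for word, n in doc_freq.items() if n >= needed}


# Made-up sample texts for one topic type.
texts = [
    "factory strike escalates downtown",
    "workers strike at factory gate",
    "strike closes factory",
]
features = common_features(texts)
print(sorted(features))  # ['factory', 'strike']
```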
Step 103, training a recognition model by using a training sample set formed by the text data to be processed and the annotation label corresponding to the text data to be processed to obtain the sensitive recognition model.
In the embodiment of the application, after data annotation is performed on the text data to be processed, the text data to be processed and its corresponding annotation label can together form a training sample set, and a sensitive recognition model is obtained through training.
Further, the text data to be processed may correspond to multiple topic types, and the feature sets corresponding to different topic types may differ. Therefore, to avoid poor recognition accuracy of the trained sensitive recognition model caused by mixing topic types, sensitive recognition models may be trained separately for each topic type. That is, in one possible implementation form of the embodiment of the present application, step 103 may include:
training a recognition model by using a training sample set that corresponds to the first topic type and is formed by the text data to be processed and the annotation label corresponding to the text data to be processed, to obtain a first sensitive recognition model corresponding to the first topic type.
It can be understood that, when a training sample set is formed from the text data to be processed and the corresponding annotation labels, the text data to be processed may first be classified according to its first topic type. The text data to be processed under each first topic type and the corresponding annotation labels then form a separate training sample set, and the first sensitive recognition model corresponding to each first topic type can be obtained by training on the training sample set of that topic type.
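The grouping of samples by topic type before training can be sketched as below. The source does not specify the recognition model, so `train_fn` is a placeholder for any training routine, and `majority_label_model` is a deliberately trivial, hypothetical stand-in used only to make the sketch runnable.

```python
from collections import Counter, defaultdict


def build_training_sets(samples):
    """Group (text, topic, label) samples into one training set per topic."""
    by_topic = defaultdict(list)
    for text, topic, label in samples:
        by_topic[topic].append((text, label))
    return dict(by_topic)


def train_per_topic(samples, train_fn):
    """Train one model per topic type using the supplied training routine."""
    return {topic: train_fn(rows) for topic, rows in build_training_sets(samples).items()}


def majority_label_model(rows):
    """Toy stand-in for a real trainer: always predicts the most common label."""
    most_common = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda text: most_common


samples = [
    ("text 1", "social class", "sensitive"),
    ("text 2", "social class", "sensitive"),
    ("text 3", "economic class", "insensitive"),
]
models = train_per_topic(samples, majority_label_model)
print(sorted(models))  # ['economic class', 'social class']
```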
It should be noted that, in one possible implementation form of the embodiment of the present application, when the text data to be processed and its corresponding annotation label are used to form a training sample set, the confidence of the annotation label to which the text data to be processed belongs may also be included in the training sample set together with the text data and its annotation label. In this way, the trained sensitive recognition model can recognize not only the annotation label corresponding to text data, but also the confidence with which the text data belongs to that annotation label.
Further, in the embodiment of the present application, after the sensitive recognition model is trained, it can be used to determine the sensitivity of target text data. That is, in one possible implementation form of the embodiment of the present application, after step 103, the method further includes:
acquiring target text data;
performing theme recognition on the target text data to determine a second theme type corresponding to the target text data;
and identifying the target text data by using a second sensitive identification model corresponding to the second theme type to determine a sensitive label of the target text data.
The target text data refers to the text data of the sensitive label to be determined currently. The second theme type refers to a theme type corresponding to the target text data.
It should be noted that in the embodiment of the present application, after the target text data is obtained, the subject identification may be performed on the target text data, where a method for performing the subject identification on the target text data is the same as a method for performing the subject identification on the text data to be processed, and details are not repeated here.
In one possible implementation form of the embodiment of the present application, a second sensitive recognition model corresponding to the second topic type may be determined according to the second topic type corresponding to the target text data, and the determined second sensitive recognition model is then used to recognize the target text data, so as to determine the sensitive label of the target text data.
For example, if the second topic type corresponding to the target text data a is determined to be "political type" after the topic identification is performed on the target text data a, the target text data may be identified by using the sensitive identification model corresponding to the "political type".
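The two-stage recognition of target text data (topic first, then the model for that topic) can be sketched as a simple dispatch. All names here are hypothetical, and the keyword-based lambdas are toy stand-ins for trained topic and sensitive recognition models.

```python
def identify(text, topic_model, sensitive_models):
    """Determine the topic of a target text, then apply that topic's model."""
    topic = topic_model(text)
    model = sensitive_models.get(topic)
    if model is None:
        raise KeyError(f"no sensitive recognition model for topic {topic!r}")
    return topic, model(text)


# Toy stand-ins for trained models (assumptions for illustration only).
topic_model = lambda t: "political type" if "election" in t else "social type"
sensitive_models = {
    "political type": lambda t: "sensitive" if "fraud" in t else "insensitive",
    "social type": lambda t: "insensitive",
}

result = identify("election fraud alleged", topic_model, sensitive_models)
print(result)  # ('political type', 'sensitive')
```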
Further, in one possible implementation form of the embodiment of the present application, when the target text data is recognized by using the sensitive recognition model, not only the sensitive label of the target text data but also the confidence of the sensitive label to which the target text data belongs may be determined.
It should be noted that, in the embodiment of the present application, after the sensitive recognition model is obtained by training according to the training sample set, the training sample of the sensitive recognition model may be automatically optimized and perfected according to the recognition result of the sensitive recognition model and the appearance of a new sensitive vocabulary in the use process of the sensitive recognition model, so as to ensure the recognition accuracy and precision of the sensitive recognition model.
According to the text data sensitivity identification method provided by the embodiment of the application, topic identification can be performed on the text data to be processed to determine the first topic type corresponding to the text data to be processed; data annotation is performed on the text data to be processed according to the feature set corresponding to the first topic type to determine the annotation label corresponding to the text data to be processed; and a recognition model is then trained by using the training sample set formed by the text data to be processed and the corresponding annotation label to obtain the sensitive recognition model.
In one possible implementation form of the present application, the original text data obtained from the network may include low-quality noise data, which may affect the recognition performance of the finally trained sensitive recognition model.
The text data sensitivity recognition method provided by the embodiment of the present application is further described below with reference to fig. 2.
Fig. 2 is a schematic flowchart of another text data sensitivity identification method provided in an embodiment of the present application.
As shown in fig. 2, the text data sensitivity recognition method includes the following steps:
Step 201, performing data cleaning on the acquired text data to obtain candidate text data.
The candidate text data refers to high-quality text data obtained after the text data is subjected to data cleaning.
It should be noted that, in one possible implementation form of the present application, in order to obtain a training sample set for training the sensitive recognition model, a certain amount of initial text data may be obtained from the network side and then filtered to obtain the text data to be processed. Because data on the network is not only diverse in content but also uneven in quality, the original text data obtained from the network side may include low-quality noise data, which affects the stability and accuracy of the sensitive recognition model. The acquired text data may therefore be cleaned according to preset data cleaning rules.
For example, the preset data cleaning rules may include "fewer than 100 words", "subject is not clear", "report", and the like. There may be multiple preset data cleaning rules; if text data matches one or more of them, it can be determined to be low-quality noise data and removed, so as to obtain the candidate text data.
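Rule-based cleaning of this kind can be sketched as a list of predicates, where a text is dropped if any predicate fires. Only the word-count rule is implemented here, because the text does not say how the other rules are evaluated; the function names are hypothetical and whitespace word counting is an assumption that would not suit unsegmented Chinese text.

```python
def too_short(text, min_words=100):
    """Cleaning rule: the text has fewer than `min_words` words."""
    return len(text.split()) < min_words


def is_noise(text, rules):
    """A text is noise if it matches one or more cleaning rules."""
    return any(rule(text) for rule in rules)


def clean(texts, rules):
    """Remove noise texts and return the remaining candidate text data."""
    return [text for text in texts if not is_noise(text, rules)]


rules = [too_short]  # further rules would be appended here
raw_texts = ["a very short post", " ".join(["word"] * 120)]
candidates = clean(raw_texts, rules)
print(len(candidates))  # 1
```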
In one possible implementation form of the present application, data cleaning may also be performed by using a data cleaning model: known high-quality text data may be used to form a training sample set, a data cleaning model is obtained by training, and text data is then input into the data cleaning model to determine whether it is high-quality text data, so as to obtain the candidate text data.
It should be noted that before performing topic identification on candidate text data, the sensitivity of each candidate text data may be preliminarily determined according to a preset sensitive word list, and candidate texts that obviously do not have sensitivity are removed to obtain text data to be processed.
In one possible implementation form of the embodiment of the present application, sensitive words may be extracted from known sensitive text data to form a basic sensitive word list, and the basic sensitive word list is then expanded with synonyms, near-synonyms, and the like to form the preset sensitive word list.
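The expansion of a basic sensitive word list can be sketched as a set union over a synonym table. The table below is made-up illustrative data and the function name is hypothetical; in practice the synonyms and near-synonyms would come from a thesaurus or similar resource that the text does not name.

```python
def expand_word_list(base_words, synonyms):
    """Add the known synonyms and near-synonyms of each base word to the list."""
    expanded = set(base_words)
    for word in base_words:
        expanded.update(synonyms.get(word, ()))
    return expanded


# Hypothetical synonym table for illustration only.
synonyms = {"riot": ["unrest", "disturbance"], "strike": ["walkout"]}
preset_list = expand_word_list({"riot", "strike", "fraud"}, synonyms)
print(sorted(preset_list))
```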
It should be noted that, when matching each sensitive word in the preset sensitive word list with each candidate text data, each sensitive word in the sensitive word list may be matched with the full text of each candidate text data one by one, so as to improve the matching accuracy.
In the embodiment of the present application, when determining the matching degree between each candidate text data and each sensitive word in the sensitive word list, the candidate text data may be first subjected to word segmentation processing to determine each word included in the candidate text data, and further determine the matching degree between each word in the candidate text data and each sensitive word in the preset sensitive word list.
It should be noted that, in one possible implementation form of the embodiment of the present application, a fourth threshold for the matching degree between a segmented word in the candidate text data and each sensitive word in the preset sensitive word list may be preset, for example 0.6. A segmented word whose matching degree with a sensitive word is greater than the fourth threshold is treated as a suspected sensitive word.
Further, in one possible implementation form of the embodiment of the present application, a threshold for the number of suspected sensitive words in the candidate text data may also be preset, and when the number of suspected sensitive words in the candidate text data exceeds this threshold, the candidate text data may be determined as text data to be processed.
In addition, a fifth threshold can be preset according to actual conditions. When the candidate text data contains a segmented word whose matching degree with a sensitive word in the preset sensitive word list is greater than the fifth threshold, the number of suspected sensitive words in the candidate text data need not be considered, and the candidate text data can be directly determined as text data to be processed.
For example, assume that the preset fourth threshold for the matching degree between a segmented word in the candidate text data and each sensitive word in the preset sensitive word list is 0.6, the fifth threshold is 0.9, and the threshold for the number of suspected sensitive words is 10. If candidate text data A contains two sensitive words from the preset sensitive word list, the matching degree between candidate text data A and these two sensitive words may be set to 1, which is greater than the fifth threshold, so candidate text data A may be directly determined as text data to be processed. If the matching degrees between 20 segmented words in candidate text data B and the sensitive words are all greater than 0.6 and less than 0.7, it can be determined that candidate text data B contains 20 suspected sensitive words, which is greater than the threshold for the number of suspected sensitive words, and therefore candidate text data B can be determined as text data to be processed.
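The screening logic with the fourth threshold, the fifth threshold, and the suspected-word count threshold can be sketched as follows. The text does not define how the matching degree is computed, so this sketch assumes a string-similarity ratio from Python's standard `difflib` as a stand-in; the function and parameter names are likewise hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Assumed matching degree: difflib's similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def should_process(words, sensitive_words, fourth=0.6, fifth=0.9,
                   max_suspected=10, match=similarity):
    """Decide whether candidate text (given as segmented words) is kept."""
    suspected = 0
    for word in words:
        best = max((match(word, s) for s in sensitive_words), default=0.0)
        if best > fifth:
            return True  # one strong match is enough on its own
        if best > fourth:
            suspected += 1  # counts as a suspected sensitive word
    return suspected > max_suspected


# Text A: contains an exact sensitive word (matching degree 1).
keep_a = should_process(["riot", "today"], ["riot"])
# Text B: 20 words with a fixed matching degree of 0.65 each.
keep_b = should_process(["w"] * 20, ["s"], match=lambda a, b: 0.65)
print(keep_a, keep_b)  # True True
```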
Step 204, determining the category of the text data to be processed according to the publishing site of the text data to be processed.
It can be understood that text data to be processed from different sources may differ in text structure, writing style, and so on, which affects topic recognition of the text to be processed. For example, content published by a media website usually has a fixed title structure and a fixed writing format, whereas content published on a site where netizens post, such as an online forum, usually has no regular text structure. Therefore, in the embodiment of the application, the category of the text data to be processed can be determined according to its publishing site, and the topic identification model corresponding to that category is then used to perform topic identification on the text data to be processed, so that the accuracy of topic identification is improved.
It should be noted that, in one possible implementation form of the embodiment of the present application, a certain amount of text data may be obtained from different publishing sites, and the text data from the same publishing site is used to form a training sample set, so as to train and generate the topic identification model corresponding to each publishing site.
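Selecting a topic identification model by publishing site can be sketched as a lookup from the site's host name to a category. The host names, category names, and lambdas below are all hypothetical placeholders; the source does not say how sites are mapped to categories.

```python
from urllib.parse import urlparse

# Hypothetical mapping from publishing site to text category.
SITE_CATEGORIES = {
    "news.example.com": "media",
    "forum.example.com": "forum",
}


def category_of(url, default="other"):
    """Determine the text category from the host name of the publishing site."""
    return SITE_CATEGORIES.get(urlparse(url).netloc.lower(), default)


def recognise_topic(text, url, topic_models):
    """Apply the topic identification model trained for this site category."""
    return topic_models[category_of(url)](text)


# Toy stand-ins for per-category topic identification models.
topic_models = {
    "media": lambda t: "political type",
    "forum": lambda t: "social type",
    "other": lambda t: "economic type",
}
topic = recognise_topic("some post", "https://forum.example.com/t/42", topic_models)
print(topic)  # social type
```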
According to the text data sensitivity identification method provided by the embodiment of the application, firstly, data cleaning is carried out on the obtained text data to obtain candidate text data, then the text data to be processed is obtained from the candidate text data according to the matching degree between the candidate text data and sensitive words in a sensitive word list, further, the category of the text data to be processed is determined according to the publishing site of the text data to be processed, and the subject identification model corresponding to the category of the text data to be processed is adopted to carry out subject identification on the text data to be processed. Therefore, the text data to be processed is extracted by carrying out data cleaning and sensitive word matching processing on the acquired text data, so that the data volume processed in the topic identification process is reduced, the establishment speed of a sensitive identification model is improved, and the topic identification model corresponding to the publishing site is adopted to carry out topic identification on the text data, so that the accuracy of the finally determined topic type to which the text data to be processed belongs is higher.
In order to realize the above embodiment, the present application also proposes a text data sensitivity recognition apparatus.
Fig. 3 is a schematic structural diagram of a text data sensitivity recognition apparatus according to an embodiment of the present application.
As shown in fig. 3, the text data sensitivity recognition apparatus 30 includes:
The first determining module 31 is configured to perform topic identification on the text data to be processed to determine a first topic type corresponding to the text data to be processed.
The second determining module 32 is configured to perform data annotation on the text data to be processed according to the feature set corresponding to the first topic type, so as to determine an annotation label corresponding to the text data to be processed.
The training module 33 is configured to train a recognition model by using a training sample set formed by the text data to be processed and the annotation labels corresponding to the text data to be processed, so as to obtain a sensitive recognition model.
In practical use, the text data sensitivity recognition device provided by the embodiment of the application can be configured in any electronic equipment to execute the text data sensitivity recognition method.
The text data sensitivity recognition device provided by the embodiment of the application can perform topic identification on text data to be processed to determine a first topic type corresponding to the text data to be processed, perform data annotation on the text data to be processed according to the feature set corresponding to the first topic type to determine an annotation label corresponding to the text data to be processed, and then train a recognition model by using a training sample set formed by the text data to be processed and the corresponding annotation labels, so as to obtain a sensitive recognition model.
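The division of work among the three modules can be sketched as a small class. Everything here is hypothetical scaffolding: the class name, the callables passed in, and in particular the annotation rule (a text is labelled sensitive if it contains any feature from the topic's feature set) are assumptions, since this section does not state how the feature set is applied.

```python
class SensitivityRecognitionApparatus:
    """Toy counterpart of apparatus 30 with its three modules."""

    def __init__(self, identify_topic, feature_sets, train_model):
        self.identify_topic = identify_topic  # role of the first determining module
        self.feature_sets = feature_sets      # topic type -> set of feature strings
        self.train_model = train_model        # role of the training module

    def annotate(self, text, topic):
        """Role of the second determining module (assumed containment rule)."""
        features = self.feature_sets.get(topic, set())
        hit = any(feature in text for feature in features)
        return "sensitive" if hit else "non-sensitive"

    def build_model(self, texts):
        """Identify the topic, annotate, then train on (text, label) pairs."""
        samples = []
        for text in texts:
            topic = self.identify_topic(text)
            samples.append((text, self.annotate(text, topic)))
        return self.train_model(samples)


apparatus = SensitivityRecognitionApparatus(
    identify_topic=lambda t: "finance" if "loan" in t else "other",
    feature_sets={"finance": {"illegal"}},
    train_model=dict,  # placeholder "training": memorise the labelled samples
)
model = apparatus.build_model(["illegal loan offer", "loan rates fall", "nice day"])
print(model)
```

Passing the three behaviours in as callables keeps the sketch independent of any particular topic model or classifier.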
In one possible implementation form of the present application, the text data sensitivity recognition apparatus 30 is specifically configured to:
matching each sensitive word in a preset sensitive word list with each candidate text data respectively to determine the matching degree between each candidate text data and each sensitive word in the sensitive word list;
and acquiring text data to be processed from the candidate text data according to the matching degree between the candidate text data and the sensitive words in the sensitive word list.
Further, in another possible implementation form of the present application, the text data sensitivity recognition device 30 is further configured to:
and performing data cleaning on the acquired text data to acquire the candidate text data.
Further, in another possible implementation form of the present application, the text data sensitivity recognition device 30 is further configured to:
and determining the category of the text data to be processed according to the publishing site of the text data to be processed.
Further, in another possible implementation form of the present application, the first determining module 31 is specifically configured to:
and adopting a topic recognition model corresponding to the category of the text data to be processed to perform topic recognition on the text data to be processed.
Further, in another possible implementation form of the present application, the text data sensitivity recognition device 30 is further configured to:
and performing data processing on the sensitive text data corresponding to the first topic type to determine a feature set corresponding to the first topic type.
Further, in another possible implementation form of the present application, the training module 33 is specifically configured to:
and training a recognition model by utilizing a training sample set which corresponds to the first topic type and is formed by the text data to be processed and the annotation labels corresponding to the text data to be processed, to obtain a first sensitive recognition model corresponding to the first topic type.
Further, in another possible implementation form of the present application, the text data sensitivity recognition device 30 is further configured to:
acquiring target text data;
performing theme recognition on the target text data to determine a second theme type corresponding to the target text data;
and identifying the target text data by using a second sensitive identification model corresponding to the second theme type to determine a sensitive label of the target text data.
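The three steps above (acquire target text, identify its topic, apply the sensitive recognition model for that topic) can be sketched as a simple dispatch. The function name, the dictionary of per-topic models, and the fallback label for a topic without a trained model are hypothetical; the toy models below are placeholders, not trained classifiers.

```python
def recognize_sensitivity(target_text, identify_topic, models_by_topic,
                          default_label="unknown"):
    """Route the target text to the sensitive recognition model of its topic.

    Returns (topic, sensitive_label). `default_label` is an assumed fallback
    for topics that have no trained model.
    """
    topic = identify_topic(target_text)   # determine the second topic type
    model = models_by_topic.get(topic)    # second sensitive recognition model
    if model is None:
        return topic, default_label
    return topic, model(target_text)


def toy_identify_topic(text):
    """Placeholder topic identification, for illustration only."""
    return "finance" if "loan" in text else "sports"


toy_models = {
    # Placeholder per-topic model: a keyword check instead of a trained model.
    "finance": lambda text: "sensitive" if "illegal" in text else "non-sensitive",
}

print(recognize_sensitivity("illegal loan offer", toy_identify_topic, toy_models))
print(recognize_sensitivity("match report", toy_identify_topic, toy_models))
```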
It should be noted that the foregoing explanation of the embodiment of the text data sensitivity recognition method shown in fig. 1 and fig. 2 is also applicable to the text data sensitivity recognition apparatus 30 of this embodiment, and is not repeated here.
The text data sensitivity recognition device provided by the embodiment of the application can perform topic identification on text data to be processed to determine a first topic type corresponding to the text data to be processed, perform data annotation on the text data to be processed according to the feature set corresponding to the first topic type to determine an annotation label corresponding to the text data to be processed, and then train a recognition model by using a training sample set formed by the text data to be processed and the corresponding annotation labels, so as to obtain a sensitive recognition model.
In order to implement the above embodiments, the present application also proposes an electronic device.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
As shown in fig. 4, the electronic device 400 includes:
a memory 410 and a processor 420, and a bus 430 connecting the different components (including the memory 410 and the processor 420), wherein the memory 410 stores a computer program, and when the processor 420 executes the program, the text data sensitivity recognition method according to the embodiment of the present application is implemented.
A program/utility 480 having a set (at least one) of program modules 470 may be stored, for example, in the memory 410. Such program modules 470 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
The processor 420 executes various functional applications and data processing by executing programs stored in the memory 410.
It should be noted that, for the implementation process and the technical principle of the electronic device of this embodiment, reference is made to the foregoing explanation of the text data sensitivity recognition method according to the embodiment of the present application, and details are not described here again.
The electronic device provided by the embodiment of the application can execute the text data sensitivity identification method described above: it performs topic identification on text data to be processed to determine a first topic type corresponding to the text data to be processed, performs data annotation on the text data to be processed according to the feature set corresponding to the first topic type to determine an annotation label corresponding to the text data to be processed, and then trains a recognition model by using a training sample set formed by the text data to be processed and the corresponding annotation labels, so as to obtain a sensitive recognition model.
To implement the above embodiments, the present application also proposes a computer-readable storage medium.
A computer program is stored on the computer-readable storage medium, and when executed by a processor, the computer program implements the text data sensitivity recognition method according to the embodiment of the present application.
In order to implement the above embodiments, a further aspect of the embodiments of the present application provides a computer program which, when executed by a processor, implements the text data sensitivity recognition method according to the embodiments of the present application.
In an alternative implementation, this embodiment may use any combination of one or more computer-readable media, such as, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages, such as the "C" language or similar programming languages.
This application is intended to cover any variations, uses, or adaptations of the application that follow the general principles of the application, including such departures from the present disclosure as come within known or customary practice in the art to which the invention pertains and as may be applied to the essential features set forth above. The description and examples are to be regarded as illustrative only, and the true scope and spirit of the application is indicated by the claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.
Claims (10)
1. A text data sensitivity recognition method, comprising:
performing topic recognition on the text data to be processed to determine a first topic type corresponding to the text data to be processed;
according to the feature set corresponding to the first topic type, performing data annotation on the text data to be processed to determine an annotation label corresponding to the text data to be processed;
and training a recognition model by using a training sample set formed by the text data to be processed and the annotation labels corresponding to the text data to be processed to obtain the sensitive recognition model.
2. The method of claim 1, wherein before performing topic identification on the text data to be processed, the method further comprises:
matching each sensitive word in a preset sensitive word list with each candidate text data respectively to determine the matching degree between each candidate text data and each sensitive word in the sensitive word list;
and acquiring text data to be processed from the candidate text data according to the matching degree between the candidate text data and the sensitive words in the sensitive word list.
3. The method as claimed in claim 2, wherein before the matching process of each sensitive word in the preset sensitive word list with each candidate text data, the method further comprises:
and performing data cleaning on the acquired text data to acquire the candidate text data.
4. The method of claim 1, wherein before performing topic identification on the text data to be processed, the method further comprises:
determining the category of the text data to be processed according to the publishing site of the text data to be processed;
and the performing topic identification on the text data to be processed comprises:
adopting a topic recognition model corresponding to the category of the text data to be processed to perform topic recognition on the text data to be processed.
5. The method of any one of claims 1-4, wherein before performing data annotation on the text data to be processed according to the feature set corresponding to the first topic type, the method further comprises:
and performing data processing on the sensitive text data corresponding to the first topic type to determine a feature set corresponding to the first topic type.
6. The method according to any one of claims 1-4, wherein the training the recognition model by using the training sample set formed by the text data to be processed and the annotation labels corresponding to the text data to be processed to obtain the sensitive recognition model comprises:
and training a recognition model by utilizing a training sample set which corresponds to the first topic type and is formed by the text data to be processed and the annotation labels corresponding to the text data to be processed, to obtain a first sensitive recognition model corresponding to the first topic type.
7. The method of claim 6, wherein after obtaining the first sensitive recognition model corresponding to the first topic type, the method further comprises:
acquiring target text data;
performing theme recognition on the target text data to determine a second theme type corresponding to the target text data;
and identifying the target text data by using a second sensitive identification model corresponding to the second theme type to determine a sensitive label of the target text data.
8. A text data sensitivity recognition apparatus, comprising:
a first determining module, configured to perform topic identification on the text data to be processed to determine a first topic type corresponding to the text data to be processed;
a second determining module, configured to perform data annotation on the to-be-processed text data according to the feature set corresponding to the first topic type, so as to determine an annotation label corresponding to the to-be-processed text data;
and a training module, configured to train the recognition model by utilizing a training sample set formed by the text data to be processed and the annotation labels corresponding to the text data to be processed to obtain the sensitive recognition model.
9. An electronic device, comprising a memory, a processor, and a program stored on the memory and executable on the processor, wherein the processor implements the text data sensitivity recognition method according to any one of claims 1-7 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the text data sensitivity recognition method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810719136.6A CN110737770B (en) | 2018-07-03 | 2018-07-03 | Text data sensitivity identification method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110737770A (en) | 2020-01-31 |
CN110737770B (en) | 2023-01-20 |
Family
ID=69234229
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810719136.6A Active CN110737770B (en) | 2018-07-03 | 2018-07-03 | Text data sensitivity identification method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110737770B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN110737770B (en) | 2023-01-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||