CN113836915A - Data processing method, device, equipment and readable storage medium - Google Patents

Data processing method, device, equipment and readable storage medium

Info

Publication number
CN113836915A
Authority
CN
China
Prior art keywords
sensitive
data
text data
target text
characters
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111119263.0A
Other languages
Chinese (zh)
Inventor
刘建华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Puhui Enterprise Management Co Ltd
Original Assignee
Ping An Puhui Enterprise Management Co Ltd
Application filed by Ping An Puhui Enterprise Management Co Ltd
Priority to CN202111119263.0A
Publication of CN113836915A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Character Discrimination (AREA)

Abstract

The embodiments of the application disclose a data processing method, device, equipment and readable storage medium, relating to artificial intelligence technology. The method comprises the following steps: acquiring target text data, and performing character splitting processing on the target text data to obtain N characters contained in the target text data; performing hierarchical matching between the N characters and sensitive characters of X levels in a preset sensitive word bank to obtain a first matching result, and determining whether the N characters contain sensitive data according to the first matching result; if the N characters do not contain sensitive data, performing feature recognition on the target text data, and determining whether the target text data contains sensitive data; and if the target text data contains sensitive data, outputting the sensitive data in the target text data. By adopting the embodiments of the application, the data processing efficiency can be improved, and the quality inspection efficiency can be further improved.

Description

Data processing method, device, equipment and readable storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a data processing method, apparatus, device, and readable storage medium.
Background
With the development of the credit industry, overdue debts often need to be collected. As national laws and industry regulations mature, it is very important that debt collection be reasonable and compliant: if a collection contains non-compliant content, the debtor can use it directly as evidence, causing the creditor's enterprise to suffer reputational and economic loss. Supervision of collection behavior during the collection process is therefore imperative.
In the prior art, manual inspection is generally used to detect whether collection staff commit violations during collection, but this approach is inefficient. How to perform quality inspection on violations by collection staff during collection while improving data processing efficiency, and thereby quality inspection efficiency, is therefore a problem to be solved urgently.
Disclosure of Invention
The embodiment of the application provides a data processing method, a data processing device, data processing equipment and a readable storage medium, which can improve the data processing efficiency and further improve the quality inspection efficiency.
In a first aspect, the present application provides a data processing method, including:
acquiring target text data, and performing character splitting processing on the target text data to obtain N characters contained in the target text data, wherein N is a positive integer;
performing hierarchical matching between the N characters and sensitive characters of X levels in a preset sensitive word bank to obtain a first matching result, and determining whether the N characters contain sensitive data according to the first matching result; the preset sensitive word bank comprises dictionaries of X levels, the dictionaries are used for storing sensitive characters in a level storage mode, and X is a positive integer;
if the N characters do not contain sensitive data, performing feature recognition on the target text data, and determining whether the target text data contains sensitive data;
and if the target text data contains sensitive data, outputting the sensitive data in the target text data.
In a second aspect, the present application provides a data processing apparatus comprising:
the character splitting module is used for acquiring target text data and performing character splitting processing on the target text data to obtain N characters contained in the target text data, wherein N is a positive integer;
the hierarchical matching module is used for performing hierarchical matching between the N characters and sensitive characters of X levels in a preset sensitive word bank to obtain a first matching result, and determining whether the N characters contain sensitive data according to the first matching result; the preset sensitive word bank comprises dictionaries of X levels, wherein the dictionaries are used for storing sensitive characters in a level storage mode, N is a positive integer, and X is a positive integer;
the characteristic identification module is used for carrying out characteristic identification on the target text data if the N characters do not contain sensitive data and determining whether the target text data contain the sensitive data;
and the data output module is used for outputting the sensitive data in the target text data if the target text data contains the sensitive data.
With reference to the second aspect, in a possible implementation manner, the hierarchical matching module is specifically configured to:
carrying out hierarchical matching on the jth character in the N characters and a sensitive character of a first hierarchy, wherein the X hierarchies comprise the first hierarchy, and j is a positive integer;
if the jth character is matched with the sensitive character of the first level, performing level matching on the (j + 1) th character and the sensitive character of the second level until k characters in the N characters are matched with the sensitive characters of the X levels, and determining that the first matching result is that the target text data contains sensitive data, wherein the k characters comprise the jth character and the (j + 1) th character, the X levels comprise the second level, and k is a positive integer.
With reference to the second aspect, in a possible implementation manner, the data processing apparatus further includes a character matching module, configured to:
if the jth character does not match the sensitive character of the first level, performing level matching on the (j + 1)th character and the sensitive character of the first level;
if each character in the N characters is not matched with the sensitive character of the first level, determining that the first matching result is that the target text data does not contain sensitive data; or,
if the jth character is matched with the sensitive character of the first level and the (j + 1)th character is not matched with the sensitive character of the second level, determining that the first matching result is that the target text data does not contain sensitive data.
With reference to the second aspect, in a possible implementation manner, the data processing apparatus further includes a dictionary construction module, configured to:
acquiring target group preset sensitive data in M groups of preset sensitive data, wherein the target group preset sensitive data is any one group in the M groups of preset sensitive data, and M is a positive integer;
performing character splitting processing on the target group preset sensitive data to obtain i sensitive characters contained in the target group preset sensitive data, wherein i is a positive integer;
and acquiring the position of each sensitive character in the i sensitive characters in the target group of preset sensitive data, storing each sensitive character in the i sensitive characters into a dictionary of a corresponding hierarchy based on the position to obtain the preset sensitive word bank, wherein the hierarchy of the dictionary corresponds to the position, and one position corresponds to one hierarchy.
With reference to the second aspect, in a possible implementation manner, the feature identification module is specifically configured to:
performing word splitting processing on the target text data based on a target recognition model to obtain a first word combination contained in the target text data, wherein the first word combination is obtained by combining at least one word;
extracting the features of the first word combination to obtain a feature vector of the first word combination;
matching the feature vector of the first word combination with the reference feature vector in the target recognition model to obtain a second matching result;
and determining whether the target text data contains sensitive data or not based on the second matching result.
With reference to the second aspect, in a possible implementation manner, the feature identification module is specifically configured to:
if the second matching result indicates that the target text data does not contain sensitive data, performing word splitting processing on the first word combination to obtain a second word combination, wherein the second word combination comprises at least one of single words, idioms and colloquialisms;
extracting the features of the second word combination to obtain a feature vector of the second word combination;
and matching the feature vector of the second word combination with the reference feature vector in the target recognition model to obtain a third matching result, and determining whether the target text data contains sensitive data or not based on the third matching result.
With reference to the second aspect, in a possible implementation manner, the data processing apparatus further includes: a model training module to:
obtaining sample text data and a sample label, wherein the sample label is used for indicating sensitive data information in the sample text data;
performing word splitting processing on the sample text data based on the initial recognition model to obtain a first sample word combination contained in the sample text data, wherein the first sample word combination is obtained by combining at least one word;
extracting the features of the first sample word combination to obtain a sample feature vector of the first sample word combination;
matching the sample feature vector of the first sample word combination with the reference feature vector in the initial recognition model to obtain a first sample matching result;
and determining a loss function of the initial recognition model based on the first sample matching result and the sample label, and training the initial recognition model based on the loss function to obtain a target recognition model.
In a third aspect, the present application provides a computer device comprising: a processor, a memory, a network interface;
the processor is connected with a memory and a network interface, wherein the network interface is used for providing a data communication function, the memory is used for storing a computer program, and the processor is used for calling the computer program so as to enable a computer device comprising the processor to execute the data processing method.
In a fourth aspect, the present application provides a computer-readable storage medium having stored therein a computer program adapted to be loaded and executed by a processor, so as to cause a computer device having the processor to execute the above-mentioned data processing method.
In a fifth aspect, the present application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the data processing method provided in the various alternatives in the first aspect of the present application.
In the embodiments of the application, target text data is acquired and subjected to character splitting processing to obtain the N characters contained in the target text data; the N characters are hierarchically matched against sensitive characters of X levels in a preset sensitive word bank to obtain a first matching result, and whether the N characters contain sensitive data is determined according to the first matching result; if the N characters do not contain sensitive data, feature recognition is performed on the target text data to determine whether the target text data contains sensitive data; and if the target text data contains sensitive data, the sensitive data in the target text data is output. Because character splitting and hierarchical matching identify sensitive data by character matching alone, this identification mode involves no complex processing logic and has high identification efficiency. If the first identification determines that the target text data contains sensitive data, no subsequent processing is needed, which improves data processing efficiency. If the first identification does not find sensitive data in the target text data, feature extraction and a second identification can be performed on the target text data, which improves the accuracy of data processing. Because two different identification modes are applied to the target text data, the accuracy of data processing, and in turn the accuracy of quality inspection, can be improved.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a data processing method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of constructing a preset sensitive word bank according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of recognizing target text data based on a target recognition model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of splitting target text data according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of another data processing method provided in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, and includes both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises computer vision technology, voice processing technology, natural language processing technology, machine learning/deep learning and the like.
Among them, Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
The application relates to natural language processing technology in artificial intelligence. Natural language processing technology can be used to perform character splitting processing on target text data to obtain the N characters contained in the target text data; the N characters are then hierarchically matched against sensitive characters of X levels in a preset sensitive word bank to obtain a first matching result, and whether the N characters contain sensitive data is determined according to the first matching result. Further, if the N characters do not contain sensitive data, feature recognition can be performed on the target text data by using natural language processing technology to determine whether the target text data contains sensitive data; and if the target text data contains sensitive data, the sensitive data in the target text data is output. By applying two different identification methods to the target text data, the accuracy of data processing can be improved, and the accuracy of quality inspection on the target text data is further improved.
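The two-stage flow described above can be sketched as follows. This is a minimal illustration and not the implementation of the application: the names `two_stage_check`, `Detector`, `toy_lexicon` and `toy_features` are hypothetical, and the two stand-in detectors use plain substring checks only so that the control flow can be run.

```python
from typing import Callable, List, Optional

# A detector takes text and returns the sensitive fragment it found, or None.
# (Hypothetical type alias, introduced for this sketch only.)
Detector = Callable[[str], Optional[str]]


def two_stage_check(text: str,
                    lexicon_pass: Detector,
                    feature_pass: Detector) -> Optional[str]:
    """Run the lexicon pass first; fall back to feature recognition."""
    hit = lexicon_pass(text)
    if hit is not None:
        return hit  # sensitive data found: the second pass is skipped
    return feature_pass(text)  # may still return None (no sensitive data)


# Stand-in detectors, purely for illustration (assumptions, not the patent's).
calls: List[str] = []


def toy_lexicon(text: str) -> Optional[str]:
    calls.append("lexicon")
    return "abc" if "abc" in text else None


def toy_features(text: str) -> Optional[str]:
    calls.append("features")
    return "xyz" if "xyz" in text else None
```

Calling `two_stage_check("..abc..", toy_lexicon, toy_features)` invokes only the first detector, while a text that the first detector does not flag is passed on to the second.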
Referring to FIG. 1, FIG. 1 is a schematic flowchart of a data processing method provided in an embodiment of the present application. The data processing method may be applied to a computer device. The computer device may be an electronic device, including but not limited to a mobile phone, a tablet computer, a desktop computer, a notebook computer, a palmtop computer, a vehicle-mounted device, an Augmented Reality/Virtual Reality (AR/VR) device, a head-mounted display, a wearable device, a smart speaker, a digital camera, a camera, and other Mobile Internet Devices (MID) with network access capability; it may also be an independent server, a server cluster consisting of a plurality of servers, or a cloud computing center. As shown in FIG. 1, the data processing method includes, but is not limited to, the following steps:
S101, acquiring target text data, and performing character splitting processing on the target text data to obtain N characters contained in the target text data.
The technical solution of the present application is applicable to scenarios in which text data is recognized to determine whether it contains sensitive data. For example, it can be applied to a debt collection scenario: a collector sends text data to a debtor through a collection system in order to collect a debt, and by recognizing the text data sent by the collector to the debtor, it can be determined whether that text data contains sensitive data. Quality inspection of the collector's collection behavior is thereby achieved, which avoids the situation in which the collector sends non-compliant data to the debtor and the debtor uses that data as evidence, causing reputational and economic loss to the creditor and the collection institution. Optionally, the technical solution of the present application may also recognize text data in other scenarios, for example recognizing statements published by a user on a network and determining whether sensitive data exists so as to regulate user behavior in that scenario, which is not limited in the embodiments of the present application.
In the embodiments of the application, the computer device can acquire the target text data and perform character splitting processing on it to obtain the N characters contained in the target text data, where N is a positive integer. The target text data may be a text message sent by a collector of the collection agency to a client (e.g., a terminal used by the debtor) through the collection system, so that the computer device (e.g., the collection system) may obtain the target text data sent by each collector, perform quality inspection on it, and determine whether it contains sensitive data.
Optionally, if the collector conducts collection by voice, the computer device may acquire the voice data and recognize it based on a voice recognition technology to obtain the text data corresponding to the voice data, that is, the target text data. The target text data is then processed to determine whether it contains sensitive data, so that sensitive data in voice collection is also subject to quality inspection.
A character in the embodiments of the present application may refer to a single written character. The target text data may be a text composed of a plurality of such characters, and after the target text data is subjected to character splitting processing, the N single characters contained in the target text data are obtained. For example, the target text data is a sentence meaning "if you do not repay the debt, we will come to your door to collect"; character splitting processing yields 14 characters, glossed one by one as "you", "no longer", "no", "paid", "still", "debt", "obligation", "i", "people", "right", "up", "gate", "ask" and "get".
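Character splitting as described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `split_characters` is hypothetical, the sample string is an illustrative four-character Chinese phrase chosen for this sketch rather than quoted from the application, and whitespace and punctuation are kept as characters because the text does not say how they are handled.

```python
from typing import List


def split_characters(text: str) -> List[str]:
    """Split text into single characters (Unicode code points).

    Python strings iterate code point by code point, so each CJK
    character becomes one list element. Whitespace and punctuation
    are kept as-is (an assumption of this sketch).
    """
    return list(text)


# Illustrative input (assumption): a four-character phrase.
sample = "上门催收"
chars = split_characters(sample)
n = len(chars)  # N, the number of characters obtained
```

Here `n` is the N of step S101, and `chars` is the list passed on to hierarchical matching.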
S102, performing hierarchical matching between the N characters and sensitive characters of X levels in a preset sensitive word bank to obtain a first matching result, and determining whether the N characters contain sensitive data according to the first matching result.
In the embodiments of the application, the computer device may perform hierarchical matching between the N characters and sensitive characters of X levels in a preset sensitive word bank to obtain a first matching result, and determine whether the N characters contain sensitive data according to the first matching result. The preset sensitive word bank comprises dictionaries of X levels, the dictionaries are used for storing sensitive characters in a level storage mode, and X is a positive integer. The first matching result indicates either that the target text data contains sensitive data or that it does not. If the N characters include characters that match the sensitive characters of the X levels in the preset sensitive word bank, the first matching result is that the target text data contains sensitive data. If the N characters include no characters that match the sensitive characters of the X levels in the preset sensitive word bank, the first matching result is that the target text data does not contain sensitive data.
Optionally, the computer device may also construct the preset sensitive word bank in advance for subsequent use. Specifically, the computer device may obtain target group preset sensitive data from M groups of preset sensitive data, where the target group preset sensitive data is any one of the M groups and M is a positive integer; perform character splitting processing on the target group preset sensitive data to obtain the i sensitive characters it contains, where i is a positive integer; and obtain the position of each of the i sensitive characters in the target group preset sensitive data and store each sensitive character into the dictionary of the corresponding level based on that position, thereby obtaining the preset sensitive word bank, where the levels of the dictionary correspond to the positions and one position corresponds to one level.
In particular implementations, the computer device can predetermine which data is sensitive data; sensitive data can include violation data, abuse data, and the like. For example, the computer device acquires M groups of preset sensitive data that include the target group preset sensitive data; the other groups in the M groups are processed in the same way as the target group. Suppose the target group preset sensitive data is w and w is composed of 3 sensitive characters. Character splitting processing is performed on w to obtain the 3 sensitive characters w1, w2 and w3. The position of w1 in w, namely the first position, is obtained, and w1 is stored in the dictionary of the level corresponding to the first position (namely, the dictionary of the first level). The position of w2 in w, namely the second position, is obtained, and w2 is stored in the dictionary of the level corresponding to the second position (namely, the dictionary of the second level). The position of w3 in w, namely the third position, is obtained, and w3 is stored in the dictionary of the level corresponding to the third position (namely, the dictionary of the third level), thereby obtaining the preset sensitive word bank. Since the positions of the sensitive characters are determined by the number of sensitive characters in the target group preset sensitive data, and the levels of the dictionary are determined by those positions, the number of dictionary levels equals the number of sensitive characters in the target group preset sensitive data.
That is, for example, if the number of the sensitive characters in the target group preset sensitive data is 5, the positions of the sensitive characters 1 to 5 in the target group preset sensitive data are the first position to the fifth position, respectively, and the corresponding dictionary has 5 levels, which are the first level to the fifth level, respectively.
As shown in FIG. 2, which is a schematic diagram of constructing a preset sensitive word bank provided in an embodiment of the present application, M is 2; that is, there are 2 groups of preset sensitive data, "go home" and "visit collection", each consisting of four characters. For the preset sensitive data "go home", the character at the first position is stored in the dictionary of the first level based on the first position, the character at the second position is stored in the dictionary of the second level based on the second position, the character at the third position is stored in the dictionary of the third level based on the third position, and the character at the fourth position is stored in the dictionary of the fourth level based on the fourth position. Correspondingly, for the preset sensitive data "visit collection", the character "up" at the first position is stored in the dictionary of the first level, the character "door" at the second position is stored in the dictionary of the second level, the character "urge" at the third position is stored in the dictionary of the third level, and the character "receive" at the fourth position is stored in the dictionary of the fourth level, thereby obtaining the preset sensitive word bank.
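The construction of the level dictionaries described above can be sketched as follows. This is a hedged illustration, not the implementation of the application: the function name `build_level_dictionaries` is hypothetical, each level dictionary is modeled as a Python `set` (the text does not specify the container), and Latin letters stand in for the sensitive characters.

```python
from typing import Iterable, List, Set


def build_level_dictionaries(phrases: Iterable[str]) -> List[Set[str]]:
    """Store the character at position p of every phrase in level p.

    levels[0] is the first-level dictionary, levels[1] the second, etc.
    One position corresponds to one level, as in the description.
    """
    levels: List[Set[str]] = []
    for phrase in phrases:
        for position, ch in enumerate(phrase):
            if position == len(levels):
                levels.append(set())  # open a new level on first use
            levels[position].add(ch)
    return levels


# M = 2 groups of four placeholder characters each (illustrative only).
levels = build_level_dictionaries(["abcd", "wxyz"])
# levels == [{"a", "w"}, {"b", "x"}, {"c", "y"}, {"d", "z"}]
```

Note that, stored this way, a level records which characters may appear at a position but not which phrase each character came from; a stricter variant would need extra bookkeeping that the description does not mention.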
Optionally, the computer device may perform hierarchical matching between the N characters and the sensitive characters of X levels in the preset sensitive word bank, obtain the first matching result, and determine whether the N characters contain sensitive data as follows: the computer device performs level matching on the jth character of the N characters and the sensitive character of the first level, where the X levels comprise the first level and j is a positive integer; if the jth character matches the sensitive character of the first level, level matching is performed on the (j + 1)th character and the sensitive character of the second level, and so on, until k characters of the N characters match the sensitive characters of the X levels, at which point the first matching result is determined to be that the target text data contains sensitive data, where the k characters comprise the jth character and the (j + 1)th character, the X levels comprise the second level, and k is a positive integer less than or equal to N.
For example, N is equal to 2, j is equal to 1, the preset sensitive lexicon includes 2 levels of dictionaries, the computer device performs level matching on a first character of the 2 characters and a sensitive character of a first level, performs level matching on a second character and a sensitive character of a second level if the first character matches the sensitive character of the first level, and determines that the first matching result is that the target text data includes sensitive data if the second character matches the sensitive character of the second level. Or N is equal to 4, j is equal to 1, the preset sensitive word bank comprises dictionaries of 4 levels, the computer device carries out level matching on a first character in the 4 characters and a sensitive character of a first level, if the first character is matched with the sensitive character of the first level, the computer device carries out level matching on a second character and a sensitive character of a second level, if the second character is matched with the sensitive character of the second level, the computer device carries out level matching on a third character and a sensitive character of a third level, if the third character is matched with the sensitive character of the third level, the computer device carries out level matching on a fourth character and a sensitive character of a fourth level, and if the fourth character is matched with the sensitive character of the fourth level, the first matching result is determined that the target text data contains sensitive data.
In a possible implementation manner, if the jth character of the N characters does not match a sensitive character of the first level, the (j+1)th character is hierarchically matched with the sensitive characters of the first level; if the (j+1)th character does not match a sensitive character of the first level, the (j+2)th character is hierarchically matched with the sensitive characters of the first level, and so on, until a pth character matching a sensitive character of the first level is found. The (p+1)th character is then hierarchically matched with the sensitive characters of the second level, so as to determine whether sensitive data exists in the target text data, where p is a positive integer.
For example, N is equal to 4, j is equal to 1, the preset sensitive lexicon includes dictionaries of 2 levels, the computer device performs level matching on a first character in the 4 characters and a sensitive character of a first level, if the first character does not match the sensitive character of the first level, performs level matching on a second character and the sensitive character of the first level, if the second character does not match the sensitive character of the first level, performs level matching on a third character and the sensitive character of the first level, if the third character matches the sensitive character of the first level, performs level matching on a fourth character and the sensitive character of a second level, and if the fourth character matches the sensitive character of the second level, determines that a first matching result is that the target text data includes sensitive data.
In another possible implementation, if the jth character does not match a sensitive character of the first level, the (j+1)th character is hierarchically matched with the sensitive characters of the first level; if none of the N characters matches a sensitive character of the first level, the first matching result is determined to be that the target text data does not contain sensitive data; or, if the jth character matches a sensitive character of the first level and the (j+1)th character does not match a sensitive character of the second level, the first matching result is determined to be that the target text data does not contain sensitive data. That is, each of the N characters is hierarchically matched with the sensitive characters of the first level in the preset sensitive word library, and if none of the N characters matches a sensitive character of the first level, the first matching result is determined to be that the target text data does not contain sensitive data. If one of the N characters matches a sensitive character of the first level in the preset sensitive word library but the characters following it do not match the sensitive characters of the remaining X-1 levels, the first matching result is likewise determined to be that the target text data does not contain sensitive data.
In a specific implementation, the computer device performs character splitting processing on the target text data to obtain the N characters contained in the target text data, for example as a character string t. The computer device may define a hierarchical matching method f(h, dct), where h is the input character index, that is, the position of a character in a piece of text (for example, the position of the first character in the text is 0 and that of the second character is 1), and dct is the preset sensitive word library. The character at index h is looked up and matched against dct; when it matches, method f is called recursively with the input parameters h+1 and dct[t[h]]. If the last sensitive character of an entry in the preset sensitive word library is matched, it is determined that the target text data contains sensitive data; otherwise, it is determined that the target text data does not contain sensitive data.
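A minimal sketch of the recursive method f(h, dct) is given below. It is an illustration under stated assumptions rather than the implementation of the application: the word library is assumed to be a nested dictionary with an end marker, the string t is passed explicitly as an extra argument, and the helper names `build_lexicon`, `END`, and `contains_sensitive` are hypothetical. The outer loop corresponds to moving on to the next character when the current one does not match the first level.

```python
END = "__end__"  # assumed marker: a complete sensitive phrase ends at this node


def build_lexicon(phrases):
    """Hypothetical helper: nested dicts, one level per character position."""
    root = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def f(t, h, dct):
    """Return True if a sensitive phrase is matched starting from index h of t."""
    if END in dct:
        return True            # last sensitive character of an entry was matched
    if h >= len(t) or t[h] not in dct:
        return False           # text exhausted or character absent at this level
    return f(t, h + 1, dct[t[h]])  # descend one level with the next character


def contains_sensitive(t, lexicon):
    """Try every start position until a first-level match leads to a full entry."""
    return any(f(t, start, lexicon) for start in range(len(t)))


lexicon = build_lexicon(["bad", "evil"])
flagged = contains_sensitive("so bad", lexicon)
clean = contains_sensitive("so good", lexicon)
```

Here `flagged` is true because "bad" is matched level by level starting at index 3, while `clean` is false because no start position reaches the end of an entry.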
In the embodiment of the application, the N characters contained in the target text data are hierarchically matched with the X hierarchical sensitive characters in the preset sensitive word bank layer by layer, so that whether the target text data contains the sensitive data can be determined.
S103, if the N characters do not contain sensitive data, performing feature recognition on the target text data, and determining whether the target text data contains the sensitive data.
In this embodiment of the application, if it is determined by the above method that the N characters do not include sensitive data, the computer device may further perform feature recognition on the target text data and determine again whether the target text data includes sensitive data. That is, the computer device first performs character splitting processing on the target text data and performs sensitive character matching on the split characters; if no sensitive data is found, the computer device performs a secondary recognition on the target text data by feature recognition. Character splitting and matching can quickly determine whether sensitive data exists in the target text data, and if the target text data contains sensitive data, no subsequent processing is needed, which improves data processing efficiency. If the sensitive data in the target text data is not recognized in the first recognition, further feature recognition on the target text data improves the accuracy of text recognition and therefore the accuracy of data processing.
In the embodiment of the application, the computer device may perform feature recognition on the target text data based on the target recognition model; specifically, it may perform word splitting processing on the target text data and determine whether the target text data contains sensitive data based on a first word combination obtained after splitting. Further, the computer device can also perform word splitting processing on the first word combination and determine whether the target text data contains sensitive data based on a second word combination obtained after splitting. Because the target text data is split into a first word combination formed by combining a plurality of words, the first word combination is highly associated with the target text data and is closer to the content the target text data is meant to express, so that judging the first word combination for sensitive data is in effect a judgment made in combination with the context of the target text data, and the judgment result is more accurate. Further, if it is determined according to the first word combination that the target text data does not contain sensitive data, word splitting processing can be performed on the first word combination to obtain a second word combination, which is equivalent to further judging whether the target text data contains sensitive data on the basis of the first word combination, so that the accuracy of determining the sensitive data, the data processing accuracy, and the quality inspection accuracy are all further improved. The specific manner of performing feature recognition on the target text data based on the target recognition model and determining whether the target text data includes sensitive data is described in detail in the embodiment corresponding to fig. 3 and is not repeated here.
S104, if the target text data contains sensitive data, outputting the sensitive data in the target text data.
In the embodiment of the application, if the target text data contains sensitive data, the computer device outputs the sensitive data in the target text data. For example, in a scenario where a collector collects a debt, if the target text data includes sensitive data, this indicates that the collector has committed a violation; the sensitive data may be output and sent to manual quality inspection, so that the collector concerned can be managed subsequently. Optionally, the computer device may further obtain and output historical sensitive data information associated with the collector, where the historical sensitive data information may include the content of the sensitive data previously sent by the collector and the number of times such data was sent, so that the relevant manager can manage the collector accordingly. Alternatively, if the target text data does not contain sensitive data, the computer device may send the target text data to a client used by the debtor for collection.
In the embodiment of the application, through acquiring target text data, performing character splitting processing on the target text data to obtain N characters contained in the target text data; carrying out hierarchical matching on the N characters and X hierarchical sensitive characters in a preset sensitive word bank to obtain a first matching result, and determining whether the N characters contain sensitive data or not according to the first matching result; if the N characters do not contain sensitive data, performing feature recognition on the target text data, and determining whether the target text data contains the sensitive data; and if the target text data contains sensitive data, outputting the sensitive data in the target text data. In the embodiment of the application, because the character splitting processing and the matching are based on the character matching mode to identify whether the sensitive data exists in the target text data, the identification mode does not relate to complex processing logic, and the identification efficiency is high. If the target text data is determined to contain the sensitive data, subsequent processing is not needed, and the data processing efficiency can be improved. If the sensitive data in the target text data are not recognized in the first text recognition, the target text data can be subjected to feature extraction and secondary recognition, and the accuracy of data processing can be improved. Because two different identification modes are adopted to identify and process the target text data, the accuracy of data processing can be improved, and the accuracy of quality inspection is further improved.
Further, if it is determined in step S102 that the target text data does not include the sensitive data, the computer device may further perform feature recognition on the target text data to determine whether the target text data includes the sensitive data, so as to improve the data processing accuracy. Referring to fig. 3, fig. 3 is a schematic flowchart illustrating a process of recognizing target text data based on a target recognition model according to an embodiment of the present application. The method may be applied to a computer device; as shown in fig. 3, the data processing method includes, but is not limited to, the following steps:
S201, performing word splitting processing on the target text data based on the target recognition model to obtain a first word combination contained in the target text data.
The first word combination is formed by combining at least one word. A plurality of first word combinations can be obtained by performing word splitting processing on the target text data, and each first word combination comprises one or more words.
S202, feature extraction is carried out on the first word combination to obtain a feature vector of the first word combination.
Wherein the feature vector of the first word combination is used for reflecting the semantics of each word in the first word combination and the context between every two words.
S203, matching the feature vector of the first word combination against the reference feature vectors in the target recognition model to obtain a second matching result.
The target recognition model comprises one or more reference feature vectors. The feature vector of the first word combination is matched against the reference feature vectors in the target recognition model to obtain the second matching result, which indicates either that the target text data contains sensitive data or that the target text data does not contain sensitive data. If the feature vector of the first word combination matches any reference feature vector in the target recognition model, the second matching result indicates that the target text data contains sensitive data. If the feature vector of the first word combination matches none of the reference feature vectors in the target recognition model, the second matching result indicates that the target text data does not contain sensitive data.
S204, determining whether the target text data contains sensitive data based on the second matching result.
If the second matching result is that the feature vector of the first word combination matches any reference feature vector in the target recognition model, it is determined that the target text data contains sensitive data. If the second matching result is that the feature vector of the first word combination matches none of the reference feature vectors in the target recognition model, it is determined that the target text data does not contain sensitive data.
S205, if the second matching result indicates that the target text data does not contain sensitive data, performing word splitting processing on the first word combination to obtain a second word combination.
If, after the target text data is split into first word combinations and feature extraction is performed, it is determined that the target text data does not contain sensitive data, the computer device can further split the first word combination and determine whether the split word combination contains sensitive data. The second word combination comprises at least one of a single character, an idiom, and a common expression; that is, the second word combination is a word of the minimum unit, which cannot be split further, and the second word combination comprises one word.
S206, performing feature extraction on the second word combination to obtain a feature vector of the second word combination.
The feature vector of the second word combination is used for reflecting the semantics of the words in the second word combination, namely the content expressed by the second word combination.
S207, matching the feature vector of the second word combination against the reference feature vectors in the target recognition model to obtain a third matching result, and determining whether the target text data contains sensitive data based on the third matching result.
In this embodiment of the application, the computer device may match the feature vector of the second word combination against the reference feature vectors in the target recognition model to obtain a third matching result, and determine whether the target text data includes sensitive data based on the third matching result. The third matching result indicates either that the target text data contains sensitive data or that the target text data does not contain sensitive data. If the feature vector of the second word combination matches any reference feature vector in the target recognition model, the third matching result indicates that the target text data contains sensitive data. If the feature vector of the second word combination matches none of the reference feature vectors in the target recognition model, the third matching result indicates that the target text data does not contain sensitive data.
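The matching of a feature vector against the reference feature vectors in steps S203 and S207 can be illustrated with the sketch below. The application does not specify the feature extractor or the matching criterion, so everything here is an assumption made for illustration: a toy bag-of-words count vector stands in for the model's feature extraction (it does not capture context), cosine similarity with a threshold of 0.8 stands in for "matching", and the function names are hypothetical.

```python
import math
from collections import Counter


def extract_features(words):
    """Toy stand-in for feature extraction: bag-of-words counts (assumption)."""
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def matches_reference(feature, references, threshold=0.8):
    """Matching result: True if the feature matches ANY reference vector."""
    return any(cosine(feature, ref) >= threshold for ref in references)


# Hypothetical reference vector for one sensitive word combination.
references = [extract_features(["visit", "home", "collect"])]
hit = matches_reference(extract_features(["visit", "home", "collect"]), references)
miss = matches_reference(extract_features(["thank", "you"]), references)
```

The "any reference" rule mirrors the text: one matching reference vector is enough for the result to indicate sensitive data, and only a miss against every reference vector indicates none.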
As shown in fig. 4, which is a schematic diagram of splitting target text data according to an embodiment of the present application, the target text data includes 36 characters, and the target recognition model splits the target text data into a text composed of 16 characters and a text composed of 20 characters; that is, the first word combinations are the text composed of 16 characters and the text composed of 20 characters. Further, the computer device performs feature extraction on each of the two first word combinations to obtain their feature vectors, namely the feature vector of the text composed of 16 characters and the feature vector of the text composed of 20 characters, and matches each of the two feature vectors against the reference feature vectors in the target recognition model to obtain two second matching results. If either of the two second matching results indicates that the target text data contains sensitive data, it is determined that the target text data contains sensitive data. If both second matching results indicate that the target text data does not contain sensitive data, it is determined that the target text data does not contain sensitive data.
Further, the computer device can also perform word splitting processing on the text composed of 16 characters to obtain two texts composed of 8 characters each, perform feature extraction on each of them, and match the extracted feature vectors against the reference feature vectors in the target recognition model. If either of the matching results indicates that the target text data contains sensitive data, it is determined that the target text data contains sensitive data. If both matching results indicate that the target text data does not contain sensitive data, it is determined that the target text data does not contain sensitive data. Further, the computer device performs word splitting processing on a text composed of 8 characters to obtain two texts composed of 4 characters each, that is, two second word combinations, and extracts the feature vectors of the two second word combinations to match against the reference feature vectors in the target recognition model, obtaining third matching results. If either of the third matching results indicates that the target text data contains sensitive data, it is determined that the target text data contains sensitive data. If both third matching results indicate that the target text data does not contain sensitive data, it is determined that the target text data does not contain sensitive data.
Further, since the second word combination 'e' on the left consists of 4 characters that form an idiom, i.e. a minimum character unit, it does not need to be split. The second word combination on the right is split: the text composed of 4 characters is split into two texts composed of 2 characters each, and the feature vectors of the two texts are extracted and matched against the reference feature vectors in the target recognition model to obtain two matching results. If either of the two matching results indicates that the target text data contains sensitive data, it is determined that the target text data contains sensitive data. If both matching results indicate that the target text data does not contain sensitive data, it is determined that the target text data does not contain sensitive data. Further, since the word combination 'n' on the left consists of 2 characters that form a word, i.e. a minimum character unit, it does not need to be split. The 2 characters on the right are split into two texts each composed of a single character, and the feature vectors of the two single-character texts are extracted and matched against the reference feature vectors in the target recognition model to obtain two matching results; if both matching results indicate that the target text data does not contain sensitive data, it is determined that the target text data does not contain sensitive data. Since the single character 'o' and the single character 'u' each consist of 1 character, they are minimum character units and do not need to be split.
By the method, 36 characters included in the target text data can be split step by step until the target text data is determined to include sensitive data, or the target text data is split into a plurality of minimum character units, and the target text data is determined not to include the sensitive data.
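The step-by-step splitting with early termination can be sketched as follows. This is a simplified illustration under stated assumptions: in the application the target recognition model decides where to split, whereas here a hypothetical `split` callable that halves the text is used by default; `is_sensitive` stands in for feature extraction plus matching; `minimal_units` stands in for the idioms and words that must not be split; and the traversal is depth-first for brevity.

```python
def halve(text):
    """Assumed default splitter: cut the text into two halves."""
    mid = len(text) // 2
    return text[:mid], text[mid:]


def contains_sensitive(text, is_sensitive, minimal_units, split=halve):
    """Split text step by step until a hit is found or only minimal units remain."""
    if is_sensitive(text):
        return True                      # hit: stop splitting immediately
    if len(text) <= 1 or text in minimal_units:
        return False                     # minimal character unit: cannot be split
    # any() short-circuits, so later parts are skipped after the first hit.
    return any(
        contains_sensitive(part, is_sensitive, minimal_units, split)
        for part in split(text)
    )


checked = []


def toy_matcher(piece):
    """Hypothetical matcher that records what it was asked about."""
    checked.append(piece)
    return piece == "XY"


found = contains_sensitive("abcdXYef", toy_matcher, minimal_units={"ab"})
```

In this run the piece "ab" is treated as a minimal unit and never split, and the search stops as soon as "XY" is matched, so "ef" is never examined.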
In the embodiment of the application, the target text data is split into word combinations each consisting of a plurality of words, the features of the word combinations are extracted for matching, and the word combinations are then split into minimum character units such as common expressions, idioms, words, and single characters. In this way the target text data can be analyzed in combination with its context and also analyzed after being split layer by layer, which improves quality inspection accuracy. It can be understood that, when feature vectors are extracted and matched after the target text data is split, if the matching result corresponding to the feature vector of any split text indicates that the target text data contains sensitive data, the subsequent splitting, feature extraction, and matching can be ended, which improves quality inspection efficiency.
Optionally, before using the target recognition model to recognize the target text data, the computer device may train the target recognition model in advance, so that recognition of the target text data based on the target recognition model is more accurate. Specifically, the computer device may obtain sample text data and a sample label, where the sample label is used for indicating sensitive data information in the sample text data; perform word splitting processing on the sample text data based on an initial recognition model to obtain a first sample word combination contained in the sample text data, where the first sample word combination is formed by combining at least one word; perform feature extraction on the first sample word combination to obtain a sample feature vector of the first sample word combination; match the sample feature vector of the first sample word combination against the reference feature vectors in the initial recognition model to obtain a first sample matching result; and determine a loss function of the initial recognition model based on the first sample matching result and the sample label, and train the initial recognition model based on the loss function to obtain the target recognition model.
In the embodiment of the application, a sample label of the sample text data is obtained, where the sample label indicates that the sample text data contains sensitive data or does not contain sensitive data. The sample text data is recognized based on the initial recognition model to obtain a first sample matching result, and the initial recognition model is trained based on the difference between the first sample matching result and the sample label. Training continues until, when feature recognition is performed on the sample text data based on the trained initial recognition model, the degree of agreement between the obtained first sample matching result and the sample label is higher than a loss threshold; the initial recognition model at that moment is stored as the target recognition model. When the target recognition model is subsequently used to recognize the target text data, the accuracy of model recognition can thus be improved.
Optionally, if the first sample matching result indicates that the sample text data does not contain sensitive data, the computer device may further perform word splitting processing on the first sample word combination to obtain a second sample word combination, where the second sample word combination includes at least one of a single character, a word, an idiom, and a common expression; perform feature extraction on the second sample word combination to obtain a feature vector of the second sample word combination; match the feature vector of the second sample word combination against the reference feature vectors in the initial recognition model to obtain a second sample matching result; and determine a loss function of the initial recognition model based on the second sample matching result and the sample label, and train the initial recognition model based on the loss function to obtain the target recognition model. By further splitting the first sample word combination into the second sample word combination and training the initial recognition model with the word combinations obtained after splitting, the accuracy of model recognition can be improved when the trained model is used to recognize target text data.
In the embodiment of the application, the computer device can determine whether the target text data contains sensitive data based on the first word combination obtained after splitting; if the target text data does not contain sensitive data, the computer device can further perform word splitting processing on the first word combination and determine whether the target text data contains sensitive data based on a second word combination obtained after splitting. Because the target text data is split into a first word combination formed by combining a plurality of words, the first word combination is highly associated with the target text data and is closer to the content the target text data is meant to express, so that judging the first word combination for sensitive data is in effect a judgment made in combination with the context of the target text data, and the judgment result is more accurate. Further, if it is determined according to the first word combination that the target text data does not contain sensitive data, word splitting processing can be performed on the first word combination to obtain a second word combination, which is equivalent to further judging whether the target text data contains sensitive data on the basis of the first word combination, so that the accuracy of determining the sensitive data, the data processing accuracy, and the quality inspection accuracy are all further improved.
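The training procedure can be illustrated with the toy sketch below. It is not the model of the application: as an assumption, a single learned weight vector `ref` plays the role of a reference feature vector, the matching result is a sigmoid of the dot product, the loss is binary cross-entropy, and training stops once the agreement between matching results and sample labels reaches a threshold. The sample sentences, labels, and all function names are hypothetical.

```python
import math


def featurize(text, vocab):
    """Toy feature extraction: word counts over a fixed vocabulary (assumption)."""
    words = text.split()
    return [float(words.count(w)) for w in vocab]


def score(ref, bias, x):
    """Matching score in (0, 1): sigmoid of the dot product with the reference."""
    z = sum(r * xi for r, xi in zip(ref, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def agreement(ref, bias, samples, labels, vocab):
    """Fraction of samples whose matching result equals the sample label."""
    hits = sum(
        (score(ref, bias, featurize(s, vocab)) >= 0.5) == bool(y)
        for s, y in zip(samples, labels)
    )
    return hits / len(samples)


def train(samples, labels, vocab, lr=0.5, threshold=0.99, max_epochs=500):
    ref, bias = [0.0] * len(vocab), 0.0
    for _ in range(max_epochs):
        for s, y in zip(samples, labels):
            x = featurize(s, vocab)
            grad = score(ref, bias, x) - y   # gradient of cross-entropy w.r.t. z
            ref = [r - lr * grad * xi for r, xi in zip(ref, x)]
            bias -= lr * grad
        if agreement(ref, bias, samples, labels, vocab) >= threshold:
            break                            # agreement exceeds the threshold
    return ref, bias


samples = ["we will visit your home", "please repay on time",
           "visit your family", "thank you for paying"]
labels = [1, 0, 1, 0]                        # 1 = contains sensitive data
vocab = sorted({w for s in samples for w in s.split()})
ref, bias = train(samples, labels, vocab)
final_agreement = agreement(ref, bias, samples, labels, vocab)
```

The stopping rule corresponds to storing the model once the agreement between matching results and sample labels is high enough; a real model would be evaluated on held-out data rather than on the training samples.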
Further, referring to fig. 5, fig. 5 is a schematic flowchart of another data processing method provided in the embodiment of the present application, where the data processing method may be applied to a computer device, as shown in fig. 5, the data processing method includes, but is not limited to, the following steps:
S301, acquiring target text data, and performing character splitting processing on the target text data to obtain N characters contained in the target text data.
S302, carrying out hierarchical matching on the N characters and X hierarchical sensitive characters in a preset sensitive word bank to obtain a first matching result.
S303, determining whether the N characters contain sensitive data according to the first matching result.
In the embodiment of the present application, if yes, that is, if N characters include sensitive data, step S306 is executed; if not, that is, the N characters do not contain sensitive data, step S304 is executed.
S304, carrying out feature recognition on the target text data, and determining whether the target text data contains sensitive data.
In the embodiment of the present application, if yes, that is, if the target text data contains sensitive data, step S306 is executed; if not, that is, the target text data does not contain sensitive data, step S305 is executed.
S305, sending the target text data to the client.
For example, in a scenario where a collector collects a debt, the client may be a terminal used by the debtor. As another example, in a scenario where a user's speech published on a network is recognized, the client may be a terminal used by the party receiving the user's speech.
S306, outputting the sensitive data in the target text data.
In this embodiment of the application, specific implementation manners of steps S301 to S306 may refer to implementation manners of steps S101 to S104 in the embodiment corresponding to fig. 1, and are not described herein again.
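The control flow of steps S301 to S306 can be sketched as follows. This is a minimal illustration with hypothetical names: the two recognizers are passed in as callables that return the detected sensitive data or `None`, and the two outputs (client delivery and quality-inspection output) are likewise injected, since the application does not prescribe their interfaces.

```python
def process(text, char_match, feature_recognize, send_to_client, output_sensitive):
    """Two-stage check: fast character matching first, feature recognition second."""
    hit = char_match(text)                 # S301-S303: split and match by level
    if hit is None:
        hit = feature_recognize(text)      # S304: only runs if stage one found nothing
    if hit is None:
        send_to_client(text)               # S305: clean text is delivered
        return "sent"
    output_sensitive(hit)                  # S306: sensitive data is output
    return "flagged"


sent, flagged, stage_two_calls = [], [], []


def toy_char_match(text):
    """Hypothetical stage one: exact phrase lookup."""
    return "visit collection" if "visit collection" in text else None


def toy_feature_recognize(text):
    """Hypothetical stage two: records its calls and flags one paraphrase."""
    stage_two_calls.append(text)
    return "come to your house" if "come to your house" in text else None


r1 = process("we will start visit collection", toy_char_match,
             toy_feature_recognize, sent.append, flagged.append)
r2 = process("we may come to your house", toy_char_match,
             toy_feature_recognize, sent.append, flagged.append)
r3 = process("please repay on time", toy_char_match,
             toy_feature_recognize, sent.append, flagged.append)
```

The first message is flagged by character matching alone, so the second stage is never invoked for it, which reflects the efficiency argument made for the two-stage design.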
In the embodiment of the application, because the character splitting processing and the matching are based on the character matching mode to identify whether the sensitive data exists in the target text data, the identification mode does not relate to complex processing logic, and the identification efficiency is high. If the target text data is determined to contain the sensitive data, subsequent processing is not needed, and the data processing efficiency can be improved. If the sensitive data in the target text data are not recognized in the first text recognition, the target text data can be subjected to feature extraction and secondary recognition, and the accuracy of data processing can be improved. Because two different identification modes are adopted to identify and process the target text data, the accuracy of data processing can be improved, and the accuracy of quality inspection is further improved.
The method of the embodiments of the present application is described above, and the apparatus of the embodiments of the present application is described below.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. The data processing apparatus may be a computer program (including program code) running in a computer device; for example, the data processing apparatus may be application software. The data processing apparatus can be used to execute the corresponding steps of the data processing method provided by the embodiments of the present application. The data processing apparatus 60 includes:
the character splitting module 61 is configured to obtain target text data, perform character splitting processing on the target text data, and obtain N characters included in the target text data;
the hierarchical matching module 62 is configured to perform hierarchical matching on the N characters and sensitive characters of X hierarchical levels in a preset sensitive word bank to obtain a first matching result, and determine whether the N characters include sensitive data according to the first matching result; the preset sensitive word bank comprises X levels of dictionaries, wherein the dictionaries are used for storing sensitive characters in a level storage mode, N is a positive integer, and X is a positive integer;
a feature recognition module 63, configured to perform feature recognition on the target text data if the N characters do not include sensitive data, and determine whether the target text data includes the sensitive data;
and a data output module 64, configured to output the sensitive data in the target text data if the target text data includes the sensitive data.
Optionally, the hierarchy matching module 62 is specifically configured to:
carrying out hierarchical matching on the jth character in the N characters and a sensitive character of a first hierarchy, wherein the X hierarchies comprise the first hierarchy, and j is a positive integer;
if the jth character is matched with the sensitive character of the first level, performing level matching on the (j + 1) th character and the sensitive character of the second level until k characters in the N characters are matched with the sensitive characters of the X levels, and determining that the first matching result is that the target text data contains sensitive data, wherein the k characters comprise the jth character and the (j + 1) th character, the X levels comprise the second level, and k is a positive integer.
Optionally, the data processing apparatus 60 further comprises a character matching module 65 for:
if the jth character does not match the sensitive character of the first level, performing level matching on the (j+1)th character and the sensitive character of the first level;
if none of the N characters matches the sensitive character of the first level, determining that the first matching result is that the target text data does not contain sensitive data; or,
if the jth character matches the sensitive character of the first level and the (j+1)th character does not match the sensitive character of the second level, determining that the first matching result is that the target text data does not contain sensitive data.
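The level-by-level matching performed by modules 62 and 65 can be sketched as follows. This is an illustration under stated assumptions: the sensitive word bank is modeled as nested Python dictionaries in which nesting depth plays the role of the level, the `END` marker and the sample word bank `LEXICON` are hypothetical, and a failed attempt starting at the jth character is treated as a reason to retry from the (j+1)th character rather than to stop scanning.

```python
from typing import Dict, List, Optional

END = "__end__"  # hypothetical marker: a complete sensitive entry ends here

# Hypothetical word bank holding "bad" and "ugly"; depth corresponds to level.
LEXICON: Dict = {
    "b": {"a": {"d": {END: True}}},
    "u": {"g": {"l": {"y": {END: True}}}},
}


def find_sensitive(chars: List[str], lexicon: Dict) -> Optional[str]:
    """Return the first sensitive string found in chars, or None."""
    n = len(chars)
    for j in range(n):                 # try each character against the first level
        node = lexicon
        k = j
        while k < n and chars[k] in node:
            node = node[chars[k]]      # matched this level, descend to the next
            k += 1
            if END in node:            # every level of one entry has matched
                return "".join(chars[j:k])
        # No complete entry starts at position j; continue with j + 1.
    return None
```

If no character matches a first-level entry, the loop falls through and returns `None`, which corresponds to the result that the target text data does not contain sensitive data.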
Optionally, the data processing apparatus 60 further comprises a dictionary construction module 66 for:
acquiring target group preset sensitive data in M groups of preset sensitive data, wherein the target group preset sensitive data is any one group in the M groups of preset sensitive data, and M is a positive integer;
performing character splitting processing on the target group preset sensitive data to obtain i sensitive characters contained in the target group preset sensitive data, wherein i is a positive integer;
and acquiring the position of each sensitive character in the i sensitive characters in the target group of preset sensitive data, storing each sensitive character in the i sensitive characters into a dictionary of a corresponding hierarchy based on the position to obtain the preset sensitive word bank, wherein the hierarchy of the dictionary corresponds to the position, and one position corresponds to one hierarchy.
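A minimal sketch of how the dictionary construction module 66 might store characters by position is given below. The nested-dictionary layout, the `END` marker, and the helper `characters_at_level` are assumptions made for illustration; the application only states that one character position corresponds to one level.

```python
from typing import Dict, Iterable, Set

END = "__end__"  # hypothetical marker for the end of one group of sensitive data


def build_lexicon(groups: Iterable[str]) -> Dict:
    """Store each character of each group at the level given by its position."""
    root: Dict = {}
    for group in groups:               # one of the M groups of preset sensitive data
        node = root
        for ch in group:               # character splitting: i sensitive characters
            node = node.setdefault(ch, {})   # position p is stored at level p
        node[END] = True
    return root


def characters_at_level(lexicon: Dict, level: int) -> Set[str]:
    """Collect the sensitive characters stored at a given 1-based level."""
    nodes = [lexicon]
    for _ in range(level - 1):
        nodes = [child for node in nodes
                 for key, child in node.items() if key != END]
    return {key for node in nodes for key in node if key != END}
```

For example, building from the groups "abc", "abd" and "xy" places "a" and "x" at the first level, "b" and "y" at the second, and "c" and "d" at the third.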
Optionally, the feature recognition module 63 is specifically configured to:
performing word splitting processing on the target text data based on a target recognition model to obtain a first word combination contained in the target text data, wherein the first word combination is formed by combining at least one word;
extracting the features of the first word combination to obtain a feature vector of the first word combination;
matching the feature vector of the first word combination with the reference feature vector in the target recognition model to obtain a second matching result;
and determining whether the target text data contains sensitive data or not based on the second matching result.
Optionally, the feature recognition module 63 is specifically configured to:
if the second matching result indicates that the target text data does not contain sensitive data, performing word splitting processing on the first word combination to obtain a second word combination, wherein the second word combination comprises at least one of single words, idioms and colloquialisms;
extracting the features of the second word combination to obtain a feature vector of the second word combination;
and matching the feature vector of the second word combination with the reference feature vector in the target recognition model to obtain a third matching result, and determining whether the target text data contains sensitive data or not based on the third matching result.
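The coarse-then-fine recognition carried out by the feature recognition module 63 can be sketched as below. The application does not specify the splitting rule, the feature extractor or the matching criterion, so everything concrete here is an assumption: comma-separated phrases stand in for the first word combination, whitespace-separated words for the second word combination, a character-count vector for the feature vector, and cosine similarity against a threshold for matching with the reference feature vector.

```python
import math
from typing import List, Optional

DIM = 128  # assumed vector size; ASCII characters map to distinct slots


def feature_vector(unit: str) -> List[float]:
    """Character-count features for one phrase or word (illustrative only)."""
    vec = [0.0] * DIM
    for ch in unit:
        vec[ord(ch) % DIM] += 1.0
    return vec


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def any_match(units: List[str], refs: List[List[float]], threshold: float) -> bool:
    return any(cosine(feature_vector(u), r) >= threshold
               for u in units for r in refs)


def recognize(text: str, refs: List[List[float]],
              threshold: float = 0.95) -> Optional[str]:
    """Return the granularity at which a match occurred, or None."""
    phrases = [p.strip() for p in text.split(",") if p.strip()]
    if any_match(phrases, refs, threshold):    # second matching result
        return "phrase"
    words = [w for p in phrases for w in p.split()]
    if any_match(words, refs, threshold):      # third matching result
        return "word"
    return None
```

With a single reference vector built from the word "idiot", the text "you are an idiot, sir" is missed at phrase granularity and caught only after the finer split, which is the situation the second word combination is meant to handle.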
Optionally, the data processing apparatus 60 further includes: a model training module 67 for:
obtaining sample text data and a sample label, wherein the sample label is used for indicating sensitive data information in the sample text data;
performing word splitting processing on the sample text data based on the initial recognition model to obtain a first sample word combination contained in the sample text data, wherein the first sample word combination is formed by combining at least one word;
extracting the features of the first sample word combination to obtain a sample feature vector of the first sample word combination;
matching the sample feature vector of the first sample word combination with the reference feature vector in the initial recognition model to obtain a first sample matching result;
and determining a loss function of the initial recognition model based on the first sample matching result and the sample label, and training the initial recognition model based on the loss function to obtain a target recognition model.
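The training procedure of the model training module 67 can be illustrated with a small stand-in model. The application does not name a model type or a loss function; the sketch below assumes a logistic model whose weight vector plays the role of the reference feature vector, binary cross-entropy as the loss, plain gradient descent as the training rule, and normalized character counts as features. All function names are hypothetical.

```python
import math
from typing import List, Tuple

DIM = 128  # assumed feature size


def featurize(text: str) -> List[float]:
    """Unit-length character-count features over whitespace-split words."""
    vec = [0.0] * DIM
    for word in text.split():          # word splitting (assumed: whitespace)
        for ch in word:
            vec[ord(ch) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def predict(weights: List[float], bias: float, x: List[float]) -> float:
    """Score a sample against the weights (the sample matching result)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def loss(weights: List[float], bias: float,
         data: List[Tuple[List[float], int]]) -> float:
    """Mean binary cross-entropy between matching results and sample labels."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = predict(weights, bias, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def train(samples: List[Tuple[str, int]], epochs: int = 200, lr: float = 0.5):
    """Gradient descent from an all-zero initial model; returns weights, bias."""
    data = [(featurize(t), y) for t, y in samples]
    weights, bias = [0.0] * DIM, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * DIM, 0.0
        for x, y in data:
            err = predict(weights, bias, x) - y   # gradient of the loss w.r.t. z
            for i, xi in enumerate(x):
                gw[i] += err * xi
            gb += err
        weights = [w - lr * g / len(data) for w, g in zip(weights, gw)]
        bias -= lr * gb / len(data)
    return weights, bias
```

Here the sample label is 1 when the sample text contains sensitive data and 0 otherwise; the loss before training can be compared with the loss after training to check that the model has moved toward the labels.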
It should be noted that, for the content that is not mentioned in the embodiment corresponding to fig. 6, reference may be made to the description of the method embodiment, and details are not described here again.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure. As shown in fig. 7, the computer device 70 may include a processor 701, a network interface 704 and a memory 705, and may further include a user interface 703 and at least one communication bus 702. The communication bus 702 is used to enable connection and communication between these components. The user interface 703 may include a display screen (Display) and a keyboard (Keyboard); optionally, the user interface 703 may also include a standard wired interface and a standard wireless interface. The network interface 704 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 705 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory). Optionally, the memory 705 may be at least one storage device located remotely from the processor 701. As shown in fig. 7, the memory 705, which is a kind of computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 70 shown in fig. 7, the network interface 704 may provide a network communication function; the user interface 703 is mainly used as an interface for providing input to the user; and processor 701 may be used to invoke a device control application stored in memory 705 to implement:
acquiring target text data, and performing character splitting processing on the target text data to obtain N characters contained in the target text data, wherein N is a positive integer;
performing level matching on the N characters and sensitive characters of X levels in a preset sensitive word stock to obtain a first matching result, and determining whether the N characters contain sensitive data or not according to the first matching result; the preset sensitive word bank comprises dictionaries of X levels, the dictionaries are used for storing sensitive characters in a level storage mode, and X is a positive integer;
if the N characters do not contain sensitive data, performing feature recognition on the target text data, and determining whether the target text data contains sensitive data;
and if the target text data contains sensitive data, outputting the sensitive data in the target text data.
It should be understood that the computer device 70 described in this embodiment may carry out the data processing method described in the embodiments corresponding to fig. 1, fig. 3, and fig. 5, and may also implement the data processing apparatus described in the embodiment corresponding to fig. 6, which is not described herein again. In addition, the beneficial effects, being the same as those of the method, are not described in detail again.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method according to the foregoing embodiments. The computer may be a part of the above-mentioned computer device, such as the processor 701 described above. By way of example, the program instructions may be executed on one computer device, or on multiple computer devices located at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network, which may comprise a blockchain network.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure describes only preferred embodiments of the present application and is not to be construed as limiting the scope of the present application; equivalent variations and modifications made to the present application still fall within its scope.

Claims (10)

1. A data processing method, comprising:
acquiring target text data, and performing character splitting processing on the target text data to obtain N characters contained in the target text data, wherein N is a positive integer;
performing hierarchical matching on the N characters and sensitive characters of X levels in a preset sensitive word bank to obtain a first matching result, and determining whether the N characters contain sensitive data according to the first matching result; the preset sensitive word bank comprises dictionaries of X levels, the dictionaries are used for storing sensitive characters in a level storage mode, and X is a positive integer;
if the N characters do not contain sensitive data, performing feature recognition on the target text data, and determining whether the target text data contains sensitive data;
and if the target text data contains sensitive data, outputting the sensitive data in the target text data.
2. The method of claim 1, wherein the performing hierarchical matching on the N characters and sensitive characters at X levels in a preset sensitive lexicon to obtain a first matching result, and determining whether the N characters contain sensitive data according to the first matching result comprises:
performing hierarchical matching on a jth character of the N characters and a sensitive character of a first hierarchy, wherein the X hierarchies comprise the first hierarchy, and j is a positive integer;
if the jth character is matched with the sensitive character of the first level, performing level matching on the (j+1)th character and the sensitive character of the second level until k characters in the N characters are matched with the sensitive characters of the X levels, and determining that the first matching result is that the target text data contains sensitive data, wherein the k characters comprise the jth character and the (j+1)th character, the X levels comprise the second level, and k is a positive integer smaller than or equal to N.
3. The method of claim 2, further comprising:
if the jth character does not match the sensitive character of the first level, performing level matching on the (j+1)th character and the sensitive character of the first level;
if none of the N characters matches the sensitive character of the first level, determining that the first matching result is that the target text data does not contain sensitive data; or,
if the jth character matches the sensitive character of the first level and the (j+1)th character does not match the sensitive character of the second level, determining that the first matching result is that the target text data does not contain sensitive data.
4. The method according to any one of claims 1-3, further comprising:
acquiring target group preset sensitive data in M groups of preset sensitive data, wherein the target group preset sensitive data is any one group in the M groups of preset sensitive data, and M is a positive integer;
performing character splitting processing on the preset sensitive data of the target group to obtain i sensitive characters contained in the preset sensitive data of the target group, wherein i is a positive integer;
and acquiring the position of each sensitive character in the i sensitive characters in the preset sensitive data of the target group, and storing each sensitive character in the i sensitive characters into a dictionary of a corresponding hierarchy based on the position to obtain the preset sensitive word bank, wherein the hierarchy of the dictionary corresponds to the position, and one position corresponds to one hierarchy.
5. The method of claim 1, wherein the performing feature recognition on the target text data and determining whether the target text data contains sensitive data comprises:
performing word splitting processing on the target text data based on a target recognition model to obtain a first word combination contained in the target text data, wherein the first word combination is formed by combining at least one word;
extracting features of the first word combination to obtain a feature vector of the first word combination;
matching the feature vector of the first word combination with the reference feature vector in the target recognition model to obtain a second matching result;
and determining whether the target text data contains sensitive data or not based on the second matching result.
6. The method of claim 5, wherein determining whether the target text data includes sensitive data based on the second matching result comprises:
if the second matching result indicates that the target text data does not contain sensitive data, performing word splitting processing on the first word combination to obtain a second word combination, wherein the second word combination comprises at least one of single words, idioms and colloquialisms;
extracting features of the second word combination to obtain a feature vector of the second word combination;
and matching the feature vector of the second word combination with the reference feature vector in the target recognition model to obtain a third matching result, and determining whether the target text data contains sensitive data or not based on the third matching result.
7. The method of claim 1, further comprising:
obtaining sample text data and a sample label, wherein the sample label is used for indicating sensitive data information in the sample text data;
performing word splitting processing on the sample text data based on an initial recognition model to obtain a first sample word combination contained in the sample text data, wherein the first sample word combination is formed by combining at least one word;
performing feature extraction on the first sample word combination to obtain a sample feature vector of the first sample word combination;
matching the sample feature vector of the first sample word combination with the reference feature vector in the initial recognition model to obtain a first sample matching result;
and determining a loss function of the initial recognition model based on the first sample matching result and the sample label, and training the initial recognition model based on the loss function to obtain a target recognition model.
8. A data processing apparatus, comprising:
the character splitting module is used for acquiring target text data and performing character splitting processing on the target text data to obtain N characters contained in the target text data, wherein N is a positive integer;
the hierarchical matching module is used for carrying out hierarchical matching on the N characters and sensitive characters of X hierarchies in a preset sensitive word stock to obtain a first matching result, and determining whether the N characters contain sensitive data or not according to the first matching result; the preset sensitive word bank comprises dictionaries of X levels, the dictionaries are used for storing sensitive characters in a level storage mode, N is a positive integer, and X is a positive integer;
the feature recognition module is used for performing feature recognition on the target text data if the N characters do not contain sensitive data, and determining whether the target text data contain the sensitive data;
and the data output module is used for outputting the sensitive data in the target text data if the target text data contains the sensitive data.
9. A computer device, comprising: a processor, a memory, and a network interface;
the processor is coupled to the memory and the network interface, wherein the network interface is configured to provide data communication functionality, the memory is configured to store program code, and the processor is configured to invoke the program code to cause the computer device to perform the method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the method of any of claims 1-7.
CN202111119263.0A 2021-09-23 2021-09-23 Data processing method, device, equipment and readable storage medium Pending CN113836915A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111119263.0A CN113836915A (en) 2021-09-23 2021-09-23 Data processing method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111119263.0A CN113836915A (en) 2021-09-23 2021-09-23 Data processing method, device, equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN113836915A true CN113836915A (en) 2021-12-24

Family

ID=78969601

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111119263.0A Pending CN113836915A (en) 2021-09-23 2021-09-23 Data processing method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113836915A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7950062B1 (en) * 2006-08-15 2011-05-24 Trend Micro Incorporated Fingerprinting based entity extraction
CN110674247A (en) * 2019-09-23 2020-01-10 广州虎牙科技有限公司 Barrage information intercepting method and device, storage medium and equipment
CN112052364A (en) * 2020-09-27 2020-12-08 深圳前海微众银行股份有限公司 Sensitive information detection method, device, equipment and computer readable storage medium
CN112328732A (en) * 2020-10-22 2021-02-05 上海艾融软件股份有限公司 Sensitive word detection method and device and sensitive word tree construction method and device
CN112861526A (en) * 2019-11-27 2021-05-28 上海鱼泡泡信息科技有限公司 Sensitive word matching method and device, computer equipment and storage medium
CN113076748A (en) * 2021-04-16 2021-07-06 平安国际智慧城市科技股份有限公司 Method, device and equipment for processing bullet screen sensitive words and storage medium
US11093641B1 (en) * 2018-12-13 2021-08-17 Amazon Technologies, Inc. Anonymizing sensitive data in logic problems for input to a constraint solver

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination