CN114065748A

CN114065748A - Method and device for identifying risk data and electronic equipment

Info

Publication number: CN114065748A
Application number: CN202111224817.3A
Authority: CN
Inventors: 李晔; 刘涛; 薛贤巨
Original assignee: Beijing Qilu Information Technology Co Ltd
Current assignee: Beijing Qilu Information Technology Co Ltd
Priority date: 2021-10-20
Filing date: 2021-10-20
Publication date: 2022-02-18

Abstract

The invention relates to the field of computer information processing and provides a method, a device, equipment and a medium for identifying risk data. It addresses the defect that, when risk data is identified in a specific industry service, sensitive words related to the data category are easily confused with sensitive words from adjacent industry services.

Description

Method and device for identifying risk data and electronic equipment

Technical Field

The present invention relates to the field of computer information processing, and in particular, to a method, an apparatus, an electronic device, and a computer-readable medium for identifying risk data.

Background

With the development of society and networks, more and more information processing across industries is carried out over networks, and the volume of data on those networks is growing rapidly. Data and information processing carried out over networks often carries various data risks. Business processes in different industries and fields, and even within a single field, have their own specific risk characteristics, and the users associated with the data may themselves pose risks.

More effective means are needed to identify or evaluate risk data in large-scale network data and to identify the objects that give rise to the risk. In the prior art, collected manual labels are combined with supervised classification models, such as traditional naive Bayes, which are trained on the manually labeled data to identify risky data and the objects causing the risk. However, such data and objects share the textual characteristic of containing many category-related sensitive words. In actual risk control, many occupations, industries and businesses that lie outside the target industry but use the same sensitive words in their own special lines of work are easily mis-hit, so identification accuracy is low. Because industries are complex and closely related to their neighbors, the text alone cannot distinguish them. This increases the time and labor needed to eventually exclude risk data that does not belong to the target industry, and makes evaluation and identification inefficient and cumbersome.

Therefore, there is a need to improve the existing scheme for recognizing risk data to simplify the algorithm, improve the recognition efficiency and reduce the cost.

Disclosure of Invention

In view of the above drawbacks of the prior art, the present invention provides a method, an apparatus, an electronic device and a computer-readable medium for identifying risk data, so as to solve the technical problem of how to improve identification efficiency; further, the technical problem of how to mine and evaluate specific risk data more accurately and automatically by making full use of text information; and further, the technical problem of mining and identifying risk objects in various specific industries, finding as many risk objects as possible while excluding as many mis-hit objects as possible. The evaluation process is thereby simplified, the identification cost reduced, and the identification efficiency improved.

In order to solve the above technical problem, a first aspect of the present invention provides a method for identifying risk data, including: obtaining a first text data set according to the user characteristics of the user to be evaluated; extracting text information of the first text data set to obtain a third text data set; constructing text feature variables using the third text data set; inputting the text feature variables into a pre-acquired specific risk data evaluation model for calculation to obtain corresponding model scores; and determining, based on the model score, whether the corresponding user characteristic belongs to specific risk data, so as to identify whether the corresponding user is a risk user; wherein constructing the text feature variables using the third text data set includes a step of filtering the obfuscated semantic data in the third text data set.

According to a preferred embodiment of the present invention, extracting text information of the first text data set to obtain a third text data set specifically includes: performing word segmentation on the first text data set using a word segmentation algorithm, splitting each passage of text into a sequence of words to obtain a second text data set; and processing the words in the second text data set to retain valid data (such as segmented words), the third text data set being formed from the valid data retained after processing.

According to a preferred embodiment of the invention, the word segmentation algorithm is jieba word segmentation: word segmentation and part-of-speech tagging are performed using the jieba word segmentation software, and part of the segmentation results is retained through part-of-speech screening.

According to a preferred embodiment of the present invention, forming the third text data set according to the valid data retained in the processed second text data set specifically includes: filtering the words in the second text data set using a word filtering method to obtain the retained valid word data, and constructing the third text data set from the valid word data (for example, all segmentation results retained without screening, or the partial segmentation results retained after screening); wherein the word filtering method comprises stop word filtering and/or low-frequency word filtering.

According to a preferred embodiment of the present invention, constructing the text feature variables using the third text data set further includes: presetting a sensitive word lexicon corresponding to the specific risk data, in which the sensitive words of the specific risk data are preset, counting the sensitive words of common specific risk data in the third text data set, and taking the counting result as a model variable; and/or correcting wrongly written characters in the third text data set; and/or merging synonyms in the third text data set; and/or constructing word vectors for the third text data set.

According to a preferred embodiment of the invention, the specific risk data evaluation model is an XGBoost model or a logistic regression model.

According to a preferred embodiment of the present invention, the filtering the obfuscated semantic data in the third text data set specifically includes: judging whether the information in the third text data set belongs to the confusion semantic data in the special industry, and if the information belongs to the confusion semantic data in the special industry, filtering the information; and/or the specific risk data comprises black intermediary data.

A second aspect of the present invention provides an apparatus for identifying risk data, comprising: a first text data set acquisition module for acquiring a first text data set according to the user characteristics of the user to be evaluated; a text information extraction module for extracting the text information of the first text data set to obtain a third text data set; a variable construction module for constructing text feature variables using the third text data set; a score calculation module for inputting the text feature variables into a pre-acquired specific risk data evaluation model for calculation to obtain corresponding model scores; and a decision identification module for determining, based on the model score, whether the corresponding user characteristic belongs to specific risk data, so as to identify whether the corresponding user is a risk user; wherein the variable construction module comprises a confusion semantic data filtering module for filtering the confusion semantic data in the third text data set.

A third aspect of the present invention provides an electronic device, comprising: a processor; and a memory storing computer executable instructions that, when executed, cause the processor to perform the method of the aforementioned first aspect.

A fourth aspect of the present invention proposes a computer readable medium, wherein said computer readable medium stores one or more programs which, when executed by a processor, implement the method of the aforementioned first aspect.

In one embodiment of the invention, interference in identifying large-scale network information data is reduced to a greater extent: text information is used as fully as possible, and manual identification is combined with model identification to improve automatic mining and identification, so that mis-hit objects are excluded as far as possible while as many risk objects or risk users as possible are found, improving the efficiency of big data mining and identification. Further, the text information is processed to judge whether it belongs to a special-industry occupation, and the judgment result enters the model as a variable. At the same time, sensitive-word text that does not belong to the special-industry occupation is fully used, and natural language processing is used to find the difference between special-industry occupation text and other text, which improves identification accuracy. Furthermore, by making full use of more accurate text information, the objects that generate risk data are mined and identified automatically and efficiently, mis-hit objects are excluded as far as possible, and as many risk objects as possible are found, so the method is suitable for mining and identifying risk objects in various specific industries. The evaluation process is thereby simplified, identification cost reduced, identification efficiency improved, and adaptability across industries improved.

Drawings

In order to make the technical problems solved by the present invention, the technical means adopted and the technical effects obtained more clear, the following will describe in detail the embodiments of the present invention with reference to the accompanying drawings. It should be noted, however, that the drawings described below are only illustrations of exemplary embodiments of the invention, from which other embodiments can be derived by those skilled in the art without inventive step.

FIG. 1 is a primary flow diagram of one embodiment of a method of identifying risk data according to the present invention.

FIG. 2 is a schematic diagram of the main steps that may be taken to construct model variables according to an embodiment of the method of the present invention.

FIG. 3 is a functional module architecture block diagram of an embodiment of an apparatus for identifying risk data according to the present invention.

FIG. 4 is a block diagram of an exemplary embodiment of an electronic device according to the present invention.

FIG. 5 is a schematic diagram of one embodiment of a computer-readable medium of the present invention.

Detailed Description

Exemplary embodiments of the present invention will now be described more fully with reference to the accompanying drawings. The exemplary embodiments may be embodied in many specific forms and should not be construed as limited to the embodiments set forth herein. Rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the invention to those skilled in the art.

The structures, properties, effects or other characteristics described in a certain embodiment may be combined in any suitable manner in one or more other embodiments, while still complying with the technical idea of the invention.

In describing particular embodiments, specific details of structures, properties, effects, or other features are set forth in order to provide a thorough understanding of the embodiments by one skilled in the art. However, it is not excluded that a person skilled in the art may implement the invention in a specific case without the above-described structures, performances, effects or other features.

The flow chart in the drawings is only an exemplary flow demonstration, and does not represent that all the contents, operations and steps in the flow chart are necessarily included in the scheme of the invention, nor does it represent that the execution is necessarily performed in the order shown in the drawings. For example, some operations/steps in the flowcharts may be divided, some operations/steps may be combined or partially combined, and the like, and the execution order shown in the flowcharts may be changed according to actual situations without departing from the gist of the present invention.

The block diagrams in the figures generally represent functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

The same reference numerals denote the same or similar elements, components, or parts throughout the drawings, and thus a repetitive description thereof may be omitted hereinafter. It will be further understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, or sections, these elements, components, or sections should not be limited by these terms. That is, these terms are used only to distinguish one from another. For example, a first device may also be referred to as a second device without departing from the spirit of the present invention. Furthermore, the term "and/or" is intended to include all combinations of any one or more of the listed items.

In one embodiment of the invention, interference in identifying large-scale network information data is reduced to a greater extent. For example, a first text data set is obtained according to the user characteristics of a user to be evaluated, a third text data set is obtained by extracting text information from the first text data set, and text feature variables are then constructed using the third text data set. In particular, the confusion semantic data in the third text data set is filtered first; that is, texts are distinguished using the text characteristics of the industry or field. The text information is thereby used as fully as possible to obtain the variables that are input into the evaluation model, which evaluates a model score for the corresponding user characteristics, so that manual identification and labeling are combined with model identification. Whether the corresponding user characteristics belong to specific risk data is then judged based on the model score, so as to identify whether the corresponding user is a risk user, that is, an object generating risk data. The identification process is thereby simplified, the efficiency and accuracy of automatic mining and identification are improved, mis-hit safe data is excluded as far as possible while as much risk data as possible is identified, and the method can be adapted to the specific data risk conditions of more industries and fields.

[ example 1 ]

FIG. 1 is a main flow diagram of an embodiment of a method according to the invention. As shown in FIG. 1, the method of the present invention comprises at least the following steps:

S101: Acquiring user characteristics of a user to be evaluated to obtain a first text data set.

The users include individuals, teams and the like who conduct business or transactions on the network and thereby generate various behavior and/or status data.

The user characteristics may include at least one of information entered by the user/object, information retrieved from the user terminal, and information extracted from third party data that matches the user.

Specifically, the user characteristics may further include behavior information or status information of the user.

Take industries such as online transactions or online financial lending as an example. If a user of a network platform in such an industry provides intermediary services in the lending business, the user characteristics include the various behavior or status information that the intermediary service provider leaves on the network. Status information includes, for example, the assets, names and accounts registered at each time stage; behavior information includes the communication, consumption, asset and geographic information generated while conducting online transactions and loan intermediary services at different time stages. The user characteristics in such industries may therefore include at least one of credit information, asset information, consumption information, communication information, geographic information, terminal information of the user, and the like.

The third-party data may be data in a third-party data system, and the third-party data is relative to a platform for evaluating the risk of the user, and a specific form of the third-party data is not particularly limited herein.

Further, for the intermediary user in the loan industry example, the asset information in the user characteristics may include static asset information, such as assets, liabilities and owner's equity, or dynamic asset information, such as capital turnover and periodic income. The communication information may include keywords in chat logs, such as the keyword "pay back the money" sent by the other party in a dialog box. The terminal information of the user may include the model of the terminal, which may reflect the user's willingness to spend and to save. Other specific forms of the user characteristics are not illustrated or limited here.

Further, the obtained user features may be formed into a first text data set according to the feature type, the corresponding user, and the relationship between the user and other users, and so on.

S102: extracting text information of the first text data set.

In one embodiment, various desired text information can be extracted from the first text data formed from various information or data of the user characteristics.

In a specific example, during extraction, a word segmentation algorithm may be used to segment words from the first text data set, and any or each segment in the text data in the set is segmented into a sequence of words, so as to obtain a second text data set after word segmentation. Then, the second text data set can be further filtered by adopting a word filtering method to obtain a third text data set, wherein the word filtering method at least comprises stop word filtering.

Further, in the processing of filtering the second text data set by using a word filtering method to obtain a third text data set, the word filtering method may further include low-frequency word filtering.

Stop word filtering: a stop word dictionary Stop_Dict, compiled from expert knowledge, is used to filter out every word w in the segmentation result for which w ∈ Stop_Dict. Specifically, when determining text attributes, some words are referred to as stop words, such as "what", "is", "this" and "that". Function words, also called structural words or empty words, are words that have no complete lexical meaning but carry grammatical meaning or a grammatical function; common examples are pronouns, numerals, prepositions, conjunctions and interjections. When the text attributes are determined, the stop words and function words are removed, and punctuation marks are removed as well.

In this embodiment, word filtering, including stop word filtering and low-frequency word filtering, is applied mainly to the results of word segmentation. Word segmentation is the process of recombining a continuous character sequence into a sequence of words according to a given specification. For text data, various word segmentation algorithms can be used, such as algorithms based on string matching, rule-based algorithms, or neural-network-based algorithms.

The low-frequency words can be used to represent words with low daily use frequency, or words with statistical frequency lower than a certain threshold.
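
The two filters described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the contents of `STOP_DICT`, the `min_freq` threshold, and the sample documents are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical stop-word dictionary; the source only says that
# Stop_Dict is compiled from expert knowledge.
STOP_DICT = {"what", "is", "this", "that", ",", "."}

def filter_words(docs, stop_dict=STOP_DICT, min_freq=2):
    """Drop stop words, then drop words whose corpus frequency is below min_freq."""
    kept = [[w for w in doc if w not in stop_dict] for doc in docs]
    freq = Counter(w for doc in kept for w in doc)
    return [[w for w in doc if freq[w] >= min_freq] for doc in kept]

# Already-segmented toy documents (assumed data).
docs = [
    ["this", "is", "loan", "intermediary", "fee", "."],
    ["that", "loan", "fee", "rare_word"],
]
filtered = filter_words(docs)
print(filtered)
```

Here the frequency threshold is applied to counts over the whole corpus after stop-word removal; a threshold on everyday usage frequency, which the text also allows, would need an external frequency list instead.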

In a specific embodiment, the jieba word segmentation software can be used to perform word segmentation and part-of-speech tagging, and the segmentation results are screened so that only part of them is retained, for example words tagged nz (other proper noun), n (noun), v (verb), j (abbreviation), vn (verbal noun), nt (organization name), nl (noun idiom), ng (noun morpheme).

For example, the first text data set of user A contains "Zhang San obtained a loan of fifty thousand from the Industrial and Commercial Bank, with an intermediary fee of two thousand", "Li Si borrowed one hundred thousand from Paipaidai" and "Wang Wu applied for a Bank of China credit card". Using the jieba word segmentation method, the words of user A can be obtained: "Zhang San", "Industrial and Commercial Bank", "loan", "fifty thousand", "intermediary fee", "two thousand", "Li Si", "Paipaidai", "borrow", "one hundred thousand", "Wang Wu", "apply", "Bank of China" and "credit card". Among them, "Zhang San", "Li Si" and "Wang Wu" are personal names; "Industrial and Commercial Bank", "Bank of China" and "credit card" are "bank related"; "Paipaidai" is a "network credit platform"; "borrow" and "loan" are verbs; and "intermediary fee" is "intermediary related". The data is cleaned by screening on part of speech (nouns, verbs and the like), so that only valid data (part of the segmentation results) is retained. This reduces the computational load on the server, improves model prediction efficiency and reduces interfering noise.
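
Part-of-speech screening of this kind might look like the sketch below. To stay runnable without third-party packages it works on hand-tagged (word, tag) pairs; the English tokens and their tags are illustrative stand-ins, not real jieba output. With jieba installed, `jieba.posseg.cut(text)` yields pairs exposing `word` and `flag` attributes that could be fed into the same function.

```python
# Part-of-speech tags to keep, following the tag list given in the text.
KEEP_FLAGS = {"nz", "n", "v", "j", "vn", "nt", "nl", "ng"}

def screen_by_pos(tagged_words, keep_flags=KEEP_FLAGS):
    """Keep only the words whose part-of-speech tag is in keep_flags."""
    return [word for word, flag in tagged_words if flag in keep_flags]

# Hand-tagged stand-ins (assumption): "nr" marks a person name and
# "m" a numeral, so both are dropped by the screen.
tagged = [
    ("Zhang San", "nr"),
    ("Industrial and Commercial Bank", "nt"),
    ("loan", "v"),
    ("fifty thousand", "m"),
    ("intermediary fee", "n"),
]
valid_words = screen_by_pos(tagged)
print(valid_words)
```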

Then, after the extracted text information has been formed into the second text data set by word segmentation and similar means, the valid text data retained after filtering and screening the segmented words forms the third text data set.

S103: Constructing text feature variables using the third text data set.

In one embodiment, a sensitive word lexicon corresponding to the specific risk data is preset, in which the sensitive words of the specific risk data are preset; the sensitive words of common specific risk data in the third text data set are counted, and the counting result is used as a model variable. A specific example is shown in FIG. 2:

S1031: Storing, in the text feature variables, a sensitive-word count variable for the corresponding category of the specific risk data.

In one embodiment, the stored count variable may count sensitive words associated with an occupation (or domain-specific) category of a particular industry. Taking the aforementioned intermediary user in the loan industry as an example, a text feature variable may be provided that stores a count of intermediary-type sensitive words. Further, a sensitive word lexicon or database corresponding to the specific risk data is preset, and the sensitive words of the specific risk data are preset in it. For example, a sensitive word lexicon for the intermediary occupation is preset; the lexicon contains intermediary sensitive words, and occurrences of the words in the lexicon are stored as the count variable. A specific example is as follows:

A black intermediary sensitive word lexicon is set up. Common black intermediary sensitive words preset in the lexicon may be:

Intermediary related: intermediary + (card/finance), office + card, generation + office + credit, package office, etc.;

Loan related: debit, credit, black house, white house, credit + (credit/credit), high profit, etc.;

Credit card related: cash-out, large volume, high volume, amount, pos/p0s, ps machine, cash register, (swiping/maintaining) + card, Huabei, Baitiao;

Bank related: credit/application card, bank names, such as Minsheng Bank, Bank of Beijing, etc.;

Network loan platform related: names of network loan platforms, such as Paipaidai, calf investment, etc.

In a particular embodiment, "+" means that the two words must both appear in the same note, and "/" means that only one of the two words needs to appear.
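
One way to evaluate rules written in this "+" and "/" notation is sketched below. The parser, the function names and the sample notes are assumptions made for illustration; the source does not describe how the rules are evaluated. Matching is by substring, which suits unsegmented Chinese text but can over-match in English (for example "pos" inside "deposit"), so the sample notes are chosen to avoid that.

```python
def parse_rule(rule):
    """Parse 'a + (b/c)' into [['a'], ['b', 'c']]: a list of groups of alternatives."""
    groups = []
    for part in rule.split("+"):
        part = part.strip().strip("()")
        groups.append([alt.strip() for alt in part.split("/")])
    return groups

def rule_matches(rule, note):
    """A rule matches when every '+' group has at least one '/' alternative in the note."""
    return all(any(alt in note for alt in group) for group in parse_rule(rule))

rule = "intermediary + (card/finance)"
print(parse_rule(rule))
print(rule_matches(rule, "the intermediary handles card applications"))
print(rule_matches(rule, "intermediary fee two thousand"))
print(rule_matches("pos/p0s", "rent a p0s machine"))
```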

The sensitive words of common specific risk data in the third text data set, such as the black intermediary sensitive words in the preceding example, are counted, and the counting result is used as a model variable.

Taking the data of the user a as an example, the number of the "bank related" sensitive words is 3, the number of the "network credit platform" sensitive words is 1, the number of the "intermediary related" sensitive words is 1, the number of the "loan related" sensitive words is 2, and the like.
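
The per-category counts for user A can be reproduced with a small lookup table. The lexicon below is a hypothetical English stand-in that covers only the tokens of this example; a real lexicon would hold the preset sensitive words of each category.

```python
from collections import Counter

# Hypothetical token-to-category lexicon (assumption, for illustration only).
LEXICON = {
    "Industrial and Commercial Bank": "bank related",
    "Bank of China": "bank related",
    "credit card": "bank related",
    "Paipaidai": "network credit platform",
    "intermediary fee": "intermediary related",
    "loan": "loan related",
    "borrow": "loan related",
}

def count_sensitive(tokens, lexicon=LEXICON):
    """Count how many tokens fall into each sensitive-word category."""
    return dict(Counter(lexicon[t] for t in tokens if t in lexicon))

tokens_a = [
    "Zhang San", "Industrial and Commercial Bank", "loan", "fifty thousand",
    "intermediary fee", "two thousand", "Li Si", "Paipaidai", "borrow",
    "one hundred thousand", "Wang Wu", "apply", "Bank of China", "credit card",
]
counts_a = count_sensitive(tokens_a)
print(counts_a)
```

Each category count then becomes one model variable for the user.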

S1032: correcting wrongly written words in the third set of text data.

In one embodiment, because text information is prone to wrongly written characters, correcting the common ones further reduces model noise. Careful manual correction and/or machine-checked correction may be employed here.

S1033: merging the synonyms in the third text data set.

In one embodiment, synonyms are collated and merged. This reduces the number of variables, improves the accuracy of the variables, and further improves the efficiency of obtaining the model variables from the third text data set.
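
Steps S1032 and S1033 can both be viewed as token normalization. The sketch below applies a correction table and then a synonym table; both tables are hypothetical, since the source does not list concrete corrections or synonym groups.

```python
# Hypothetical tables (assumptions for illustration).
TYPO_FIXES = {"p0s": "pos", "lona": "loan"}
SYNONYM_TO_CANONICAL = {"borrow": "loan", "lend": "loan", "agent": "intermediary"}

def normalize(tokens):
    """Correct known miswritten tokens first, then map synonyms to one canonical word."""
    corrected = [TYPO_FIXES.get(t, t) for t in tokens]
    return [SYNONYM_TO_CANONICAL.get(t, t) for t in corrected]

tokens = ["lona", "borrow", "agent", "p0s", "loan"]
normalized = normalize(tokens)
print(normalized)
print(len(set(tokens)), "distinct tokens before,", len(set(normalized)), "after")
```

The drop in distinct tokens is the reduction in the number of variables that the text refers to.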

S1034: Filtering the obfuscated semantic data in the third text data set.

In one embodiment, it is determined whether the information in the third text data set belongs to the confusion semantic data in the special industry, and if the information belongs to the confusion semantic data in the special industry, the information is filtered. Therefore, the method can be suitable for identifying risk data corresponding to different specific industries and fields or occupations in the industries and the fields, and is favorable for accurately identifying the corresponding fields and improving the identification efficiency.

Take identifying risk users in the intermediary occupation of the loan industry, that is, black intermediary risk data, as an example. Some special-industry occupations use wording similar to the sensitive words of black intermediaries, so mis-hits occur easily. A newly added variable can therefore be used to judge whether the information in the third text data set belongs to the confusion semantic data of a special-industry occupation, so that the model can identify more accurately whether the user is an intermediary.

For example, the third text data set may also include a lot of dirty data, such as repayment notices imported directly from short messages. Such billing data contains loan-related words but is in fact an official notification message.

Illustratively, the first text data set of user B contains the following text: "beibeibei @163 | > Huabei will automatically repay 98.73 yuan today from your balance, savings card or Yu'e Bao; you may log in to Alipay - Huabei to check; ignore if already repaid [Huabei]". The data is segmented by the jieba word segmentation method to obtain a number of words. This billing data contains the loan-intermediary-related word "repayment", but it is loan-related text that is not intermediary data. Therefore, keywords obtained by jieba word segmentation, such as "[Huabei]", need to be analyzed with a keyword filtering technique, and this piece of data is filtered out by that technique, for example by combining positive and negative regular-expression matching, so as to ensure that the data features remain discriminative and robust.

Using the keyword filtering technique, semantically confusing data in the third text data set is filtered out from a semantic perspective, which improves the accuracy of subsequent identification and the efficiency of model evaluation.
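
A keyword filter of this kind could combine one pattern that must match with one that must not. The sketch below is only one possible reading of the step: both regular expressions, the English sample records and the function names are assumptions, not patterns taken from the source.

```python
import re

# Hypothetical patterns. OFFICIAL looks for a bracketed sender tag or
# automatic-repayment wording typical of official notices; INTERMEDIARY
# looks for cues that the text really concerns an intermediary.
OFFICIAL = re.compile(r"\[\s*(Huabei|Alipay)\s*\]|automatically repa", re.IGNORECASE)
INTERMEDIARY = re.compile(r"intermediary fee|on your behalf", re.IGNORECASE)

def is_confusing(text):
    """True when the text looks like an official notice and shows no intermediary cue."""
    return bool(OFFICIAL.search(text)) and not INTERMEDIARY.search(text)

def filter_confusing(records):
    """Remove records judged to be semantically confusing."""
    return [r for r in records if not is_confusing(r)]

records = [
    "Huabei will automatically repay 98.73 yuan today from your balance [Huabei]",
    "Zhang San obtained a loan, intermediary fee two thousand",
]
kept = filter_confusing(records)
print(kept)
```

The official notice is dropped even though it mentions repayment, while the record carrying an intermediary cue is kept for feature construction.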

S1035: Constructing word vectors for the third text data set.

In one embodiment, word vectors may be constructed for the words in the third text data set, for example using modes such as CBOW (Continuous Bag-of-Words) and Skip-Gram to convert text information into low-dimensional continuous values. All the text information is thereby fully used, which improves the accuracy of subsequent identification and the efficiency of model evaluation.
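
The difference between the two modes lies in how training examples are framed. The sketch below only generates those examples from a token sequence; it does not train any vectors, and the window size and sample tokens are assumptions.

```python
def _context(tokens, i, window):
    """Words within `window` positions of index i, excluding position i itself."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]

def skipgram_pairs(tokens, window=1):
    """(center, context) pairs: Skip-Gram predicts each context word from the center word."""
    return [(c, ctx) for i, c in enumerate(tokens) for ctx in _context(tokens, i, window)]

def cbow_pairs(tokens, window=1):
    """(context_words, center) pairs: CBOW predicts the center word from its context."""
    return [(_context(tokens, i, window), c) for i, c in enumerate(tokens)]

tokens = ["loan", "intermediary", "fee"]
print(skipgram_pairs(tokens))
print(cbow_pairs(tokens))
```

A training library would then fit an embedding table so that these predictions succeed, and the learned rows serve as the low-dimensional word vectors.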

In addition to the preferred schemes above, the variable construction steps may be combined in other ways.

S104: Inputting the text feature variables into a pre-acquired specific risk data evaluation model for calculation to obtain corresponding model scores.

Take identifying risk users among network users, such as black intermediaries, as an example. The text feature variables constructed using the third text data set are input into a pre-acquired risk data evaluation model, such as a black intermediary evaluation model, and a model score is calculated, for example a score indicating whether the user characteristics corresponding to the input variables are black intermediary risk data.

In one embodiment, the risk data evaluation model may be a model that predicts the probability that the user characteristics or user data of the user to be evaluated are risky, and outputs a score indicating how likely the corresponding user is to be a risk user.

Taking the above example of identifying whether a network user is a black intermediary, the evaluation model is used to predict the probability that the user to be evaluated is a black loan intermediary and to output a black intermediary score.

Further, obtaining the evaluation model may mean obtaining at least one specific risk data evaluation model. Specifically, the evaluation model is obtained by taking existing risk data and the corresponding risk objects (risk users) as positive samples and normal data and the corresponding normal objects (normal users) as negative samples, generating a training set, and training with a machine learning algorithm. A normal user is a risk-free user on the whitelist.

Taking the evaluation model for evaluating black intermediaries as an example, this specifically includes acquiring at least one black intermediary evaluation model. The evaluation model is obtained by building a training data set from known black intermediary data and normal user data, with the black intermediary library as positive samples and the normal user library as negative samples, and training with a machine learning algorithm.

In a particular embodiment, the evaluation model is an XGBoost model.

In a particular embodiment, the evaluation model is a logistic regression model.
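For the logistic regression case, the way a model score can be derived from text feature variables may be sketched as below. The weights, bias, and feature names are hypothetical values chosen for illustration; a real model would learn them from the training set.

```python
import math
from typing import Dict


def logistic_score(
    features: Dict[str, float],
    weights: Dict[str, float],
    bias: float,
) -> float:
    """Return a probability-like score in (0, 1) from a weighted sum of features."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    # Numerically stable sigmoid: avoid overflow for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical learned parameters and one user's text feature variables.
weights = {"sensitive_word_count": 0.8, "obfuscated_term_count": -0.5}
user_features = {"sensitive_word_count": 4.0, "obfuscated_term_count": 1.0}
score = logistic_score(user_features, weights, bias=-2.0)
```

A higher count of sensitive words raises the score here, while the negative weight on the hypothetical `obfuscated_term_count` lowers it.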

In a specific embodiment, the text feature variables constructed from the third text data set are input into the XGBoost model; the quality of the model is assessed using the AUC and KS metrics; the grouped score of each variable is determined according to the intermediary scoring logic; and the accuracy of the model result is verified on a test set. The result is a probability, or probability score, that the specific risk data corresponding to the user to be evaluated is at risk; for example, the data of a loan intermediary to be identified or evaluated on the network is scored with the probability that it is a black loan intermediary. When the evaluation is finished, the model's evaluation score, such as the probability score, is output.
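The two quality metrics named above can be computed from model scores and true labels as in the following sketch. It uses plain Python rather than any particular library, and the example scores and labels are made up for illustration. AUC is taken here as the fraction of (positive, negative) pairs ranked correctly, and KS as the largest gap between the true positive rate and the false positive rate over all score thresholds.

```python
from typing import List, Sequence


def _split(scores: Sequence[float], labels: Sequence[int]):
    """Separate scores into positive-class and negative-class lists."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both positive and negative samples are required")
    return pos, neg


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos, neg = _split(scores, labels)
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ks_statistic(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Maximum |TPR - FPR| over all score thresholds."""
    pos, neg = _split(scores, labels)
    best = 0.0
    for threshold in sorted(set(scores)):
        tpr = sum(1 for s in pos if s >= threshold) / len(pos)
        fpr = sum(1 for s in neg if s >= threshold) / len(neg)
        best = max(best, abs(tpr - fpr))
    return best


# Made-up test-set scores: label 1 = black intermediary, 0 = normal user.
test_scores: List[float] = [0.9, 0.4, 0.6, 0.1]
test_labels: List[int] = [1, 1, 0, 0]
model_auc = auc(test_scores, test_labels)
model_ks = ks_statistic(test_scores, test_labels)
```

The pairwise AUC loop is quadratic in the number of samples, which is acceptable for a small illustration but would normally be replaced by a rank-based computation on large test sets.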

S105: determining whether the respective user characteristic belongs to particular risk data based on the model score to identify whether the respective user is a risky user.

In one embodiment, when the model score is a probability score, the determination is made by checking whether the probability score exceeds a preset threshold; if it does, the likelihood that the data is risk data is high. Taking the identification of black intermediaries among loan intermediaries as an example: if the probability is lower than a preset high risk threshold, it is determined that the risk data corresponding to the user feature calculated by the evaluation model is not high risk, and the corresponding user can be identified as low risk. For example, when the probability that the user to be evaluated is a black intermediary, namely the risk probability, does not exceed the preset high risk threshold, the user is not treated as a black intermediary: credit may be granted, platform permissions for the service may be provided, the service may be provided to the user, and the user may be allowed to perform various network processing on the platform.

If the probability is higher than a preset high risk threshold, that is, it is determined that the risk data corresponding to the user feature calculated by the evaluation model has a high risk, it may be identified that the corresponding user has a high risk, for example: when the probability that the user to be evaluated is a black intermediary, namely the risk probability, exceeds a preset high risk threshold, a platform providing services can refuse to give credit to the user to be evaluated, does not provide corresponding services, does not allow the user to perform various network processing on the platform, and/or identifies high risk users/objects, and warns or reminds other users/objects which may perform network interactive processing with the user on the network.

Further, if the probability exceeds a preset high risk threshold value and the user to be evaluated can be identified as a risk user, the probability or the possibility that the user to be evaluated is a risk user (such as a black agency) can be further determined by using a manual investigation system through a platform providing services. For example: the platform can comprehensively judge whether the user to be evaluated is a risk user (such as a black intermediary) in a network search mode, a WeChat search mode and the like. If the user is determined to be a non-risk user (such as a non-black intermediary), the credit is passed normally; if the user is judged to be a risk user (such as a black agency), the credit is refused and the control processing is carried out.
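The threshold decision and the optional manual investigation described in step S105 can be sketched as follows. The names `Decision`, `decide`, and `manual_review`, and the threshold value 0.8, are assumptions made for illustration; the original does not specify a threshold or an interface for the manual investigation system.

```python
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    PASS = "pass"      # grant credit and platform access
    REJECT = "reject"  # refuse credit and apply control processing


def decide(
    probability: float,
    high_risk_threshold: float,
    manual_review: Optional[Callable[[], bool]] = None,
) -> Decision:
    """Map a model probability score to a credit decision.

    manual_review, if given, returns True when a human investigator
    confirms that the user is a risk user (for example, a black intermediary).
    """
    if probability <= high_risk_threshold:
        return Decision.PASS
    # Above the threshold: optionally let a manual investigation overturn the model.
    if manual_review is not None and not manual_review():
        return Decision.PASS
    return Decision.REJECT


low_risk = decide(0.2, high_risk_threshold=0.8)
high_risk = decide(0.9, high_risk_threshold=0.8)
cleared_by_review = decide(0.9, high_risk_threshold=0.8, manual_review=lambda: False)
```

Manual review is invoked only for users above the threshold, so the model's output narrows the set of cases that require human investigation.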

According to the embodiment of the invention, the text information related to the object to be recognized is processed so that it suits the corresponding industry field, interference is eliminated, and noise is reduced as far as possible; the text information is then used as fully as possible to distinguish the different risks of the industry. This simplifies the recognition processing, reduces the cost of manually recognizing and distinguishing risk data and risk users across different industries, and improves both recognition accuracy and recognition efficiency. In addition, by combining manual review with model evaluation, the platform uses model prediction to surface suspected risk objects in a special industry, such as black intermediary clients/users, and then confirms them more accurately with manual review. This reduces the rate at which risk-free clients/users, such as non-black intermediaries, are mistakenly rejected, and effectively reduces network data risk, for example group fraud committed by black intermediaries, thereby avoiding losses to legitimate users on the platform and ensuring network data security.

[ example 2 ]

Similarly, one embodiment of a corresponding apparatus for evaluating a loan black agency corresponds to the method. As shown in fig. 3, the apparatus according to an embodiment of the present invention may specifically include:

A first text data set obtaining module 201, configured to obtain a user characteristic of a user to be evaluated, so as to obtain a first text data set. The functions and processing manners of the specific modules may be described with reference to step S101.

A text information extraction module 202, configured to extract text information of the first text data set to obtain a third text data set. The functions and processing manners of the specific modules may be described with reference to step S102.

A variable construction module 203, configured to construct a text feature variable using the third text data set. The variable construction module 203 includes a sensitive word module 2031 (such as an intermediary industry sensitive word), an incorrect word error correction module 2032, a synonym merging module 2033, an obfuscated semantic data filtering module 2034 for filtering obfuscated semantic data in the third text data set, and a word vector construction module 2035. The functions and processing manners of the specific modules may be described with reference to steps S103, S1031, S1032, S1033, S1034, and S1035.

A score calculating module 204, configured to input the text feature variables into a pre-acquired specific risk data evaluation model for calculation, so as to obtain a corresponding model score. The functions and processing manners of the specific modules may be described with reference to step S104.

A decision identification module 205, configured to determine whether the corresponding user feature belongs to specific risk data based on the model score to identify whether the corresponding user is a risk user; the functions and processing manners of the specific modules may be described with reference to step S105.

The specific functional correspondence of each module may refer to specific steps and contents of S101-S105, which are not described herein again.
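How the five modules chain together can be sketched as a simple pipeline. Everything in this sketch is a hypothetical stand-in: the class name `RiskIdentificationDevice`, the whitespace split used in place of a real word segmentation step, the two-word sensitive lexicon, and the scoring and threshold rules are chosen only to make the data flow between modules 201 to 205 visible.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RiskIdentificationDevice:
    acquire: Callable[[str], List[str]]                        # module 201 (S101)
    extract: Callable[[List[str]], List[str]]                  # module 202 (S102)
    build_variables: Callable[[List[str]], Dict[str, float]]   # module 203 (S103)
    score: Callable[[Dict[str, float]], float]                 # module 204 (S104)
    decide: Callable[[float], bool]                            # module 205 (S105)

    def run(self, user_id: str) -> bool:
        """Return True when the user is identified as a risk user."""
        first_text_data_set = self.acquire(user_id)
        third_text_data_set = self.extract(first_text_data_set)
        variables = self.build_variables(third_text_data_set)
        model_score = self.score(variables)
        return self.decide(model_score)


# Stand-in modules for illustration only.
SENSITIVE_WORDS = {"loan", "agency"}

device = RiskIdentificationDevice(
    acquire=lambda user_id: ["loan agency fast approval"],
    extract=lambda texts: [word for text in texts for word in text.split()],
    build_variables=lambda words: {
        "sensitive_word_count": float(sum(w in SENSITIVE_WORDS for w in words))
    },
    score=lambda v: min(1.0, v["sensitive_word_count"] / 4.0),
    decide=lambda s: s > 0.4,
)
is_risk_user = device.run("user-001")
```

Passing each module in as a callable keeps the stages independently replaceable, which mirrors the way the apparatus description maps each module to one method step.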

[ example 3 ]

In the following, embodiments of the electronic device of the present invention are described, which may be regarded as an implementation in physical form for the above-described embodiments of the method and apparatus of the present invention. Details described in the embodiments of the electronic device of the invention should be considered supplementary to the embodiments of the method or apparatus described above; for details which are not disclosed in embodiments of the electronic device of the invention, reference may be made to the above-described embodiments of the method or the apparatus.

Fig. 4 is a block diagram of an exemplary embodiment of an electronic device according to the present invention. The electronic device shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 4, the electronic device 200 of the exemplary embodiment takes the form of a general-purpose data processing device. The components of the electronic device 200 may include, but are not limited to: at least one processing unit 210, at least one memory unit 220, a bus 230 connecting different system components (including the memory unit 220 and the processing unit 210), a display unit 240, and the like.

The storage unit 220 stores a computer readable program, which may be a code of a source program or a read-only program. The program may be executed by the processing unit 210 such that the processing unit 210 performs the steps of various embodiments of the present invention. For example, the processing unit 210 may perform the steps as shown in fig. 1.

The memory unit 220 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 2201 and/or a cache memory unit 2202, and may further include a read only memory unit (ROM) 2203. The storage unit 220 may also include a program/utility 2204 having a set (at least one) of program modules 2205, such program modules 2205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 230 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 200 may also communicate with one or more external devices 300 (e.g., keyboard, display, network device, bluetooth device, etc.), enable a user to interact with the electronic device 200 via the external devices 300, and/or enable the electronic device 200 to communicate with one or more other data processing devices (e.g., router, modem, etc.). Such communication may occur via input/output (I/O) interfaces 250, and may also occur via network adapter 260 with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network such as the Internet). The network adapter 260 may communicate with other modules of the electronic device 200 via the bus 230. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in the electronic device 200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

[ example 4 ]

FIG. 5 is a schematic diagram of one computer-readable medium embodiment of the present invention. As shown in fig. 5, the computer program may be stored on one or more computer readable media. The computer readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. The computer program, when executed by one or more data processing devices, enables the computer-readable medium to implement the above-described methods of the present invention.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments of the present invention described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a computer-readable storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to make a data processing device (which can be a personal computer, a server, or a network device, etc.) execute the above-mentioned method according to the present invention.

The computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

In summary, the present invention can be implemented as a method, an apparatus, an electronic device, or a computer-readable medium executing a computer program. Some or all of the functions of the present invention may be implemented in practice using a general purpose data processing device such as a microprocessor or a Digital Signal Processor (DSP).

While the foregoing embodiments have described the objects, aspects and advantages of the present invention in further detail, it should be understood that the present invention is not inherently related to any particular computer, virtual machine or electronic device, and various general-purpose machines may be used to implement the present invention. The invention is not limited to the specific embodiments described above; all modifications, changes and equivalents that come within the spirit and scope of the invention are intended to be covered.

Claims

1. A method of identifying risk data, comprising:

obtaining a first text data set according to the user characteristics of the user to be evaluated;

extracting text information of the first text data set to obtain a third text data set;

constructing text feature variables using the third text data set;

inputting the text feature variables into a pre-acquired specific risk data evaluation model for calculation to obtain corresponding model scores;

determining whether the respective user characteristic belongs to particular risk data based on the model score to identify whether the respective user is a risk user;

the constructing text feature variables using the third text data set includes: a step of filtering the obfuscated semantic data in the third set of text data.

2. The method according to claim 1, wherein extracting text information of the first text data set to obtain a third text data set specifically includes:

performing word segmentation on the first text data set by adopting a word segmentation algorithm, which segments a passage of text into a sequence of words, to obtain a segmented second text data set, and forming the third text data set from the valid data retained in the second text data set after processing.

3. The method of claim 2, comprising:

the word segmentation algorithm is jieba word segmentation;

performing word segmentation and part-of-speech tagging by using jieba word segmentation software; and

filtering by part of speech to retain part of the word segmentation results.

4. The method according to any one of claims 2 to 3, wherein forming the third text data set from the valid data retained in the processed second text data set specifically comprises:

filtering the second text data set by adopting a word filtering method to obtain reserved effective word segmentation data so as to construct a third text data set;

wherein the word filtering method comprises stop word filtering and/or low-frequency word filtering.

5. The method of any of claims 1 to 4, wherein the constructing text feature variables using the third set of text data further comprises:

presetting a sensitive word lexicon corresponding to specific risk data, wherein sensitive words of the specific risk data are preset in the sensitive word lexicon corresponding to the specific risk data, counting the sensitive words of common specific risk data in the third text data set, and taking a counting result as a model variable;

and/or,

correcting wrongly written words in the third set of text data;

and/or,

merging the similar words in the third text data set;

and/or,

constructing a word vector for the third text data set.

6. The method according to any one of claims 1 to 5, wherein the specific risk data evaluation model is an XGBoost model or a logistic regression model.

7. The method according to any one of claims 1 to 6,

the filtering the obfuscated semantic data in the third text data set specifically includes: judging whether information in the third text data set is obfuscated semantic data of the special industry, and if so, filtering out the information;

and/or,

the specific risk data includes black intermediary data.

8. An apparatus for identifying risk data, comprising:

the first text data set acquisition module is used for acquiring a first text data set according to the user characteristics of the user to be evaluated;

the text information extraction module is used for extracting the text information of the first text data set to obtain a third text data set;

a variable construction module for constructing text characteristic variables by using the third text data set;

the score calculating module is used for inputting the text feature variables into a pre-acquired specific risk data evaluation model for calculation to obtain corresponding model scores;

a decision identification module for determining whether the corresponding user characteristic belongs to specific risk data based on the model score to identify whether the corresponding user is a risk user;

wherein the variable construction module comprises: an obfuscated semantic data filtering module for filtering the obfuscated semantic data in the third text data set.

9. An electronic device, comprising:

a processor; and

a memory storing computer-executable instructions that, when executed, cause the processor to perform the steps of the method of any of claims 1 to 7.

10. A computer readable medium, wherein the computer readable medium stores one or more programs which, when executed by a processor, implement the method of any of claims 1-7.