CN113011875A - Text processing method and device, computer equipment and storage medium

Info

Publication number: CN113011875A
Application number: CN202110038717.5A
Authority: CN
Inventors: 赵薇
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-01-12
Filing date: 2021-01-12
Publication date: 2021-06-22
Anticipated expiration: 2041-01-12
Also published as: CN113011875B

Abstract

The embodiment of the application discloses a text processing method and device, computer equipment and a storage medium. The text processing method comprises the following steps: acquiring a target transaction text of a target transaction account; performing word segmentation processing on the target transaction text by adopting a preset word group set to obtain one or more target word groups; the preset phrase set is obtained by identifying and processing a text of a target type; performing text recognition processing on one or more target phrases to obtain the type of a target transaction text; and if the type of the target transaction text is the target type, determining the target transaction account as the transaction account of the target type. By the method and the device, efficiency and accuracy of identifying the transaction account types can be improved.

Description

Text processing method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a text processing method and apparatus, a computer device, and a storage medium.

Background

The type of each transaction account among a massive number of transaction accounts can be identified to ensure the security of the transaction system. At present, the transaction behaviors under a transaction account are reviewed manually, based on past experience, to determine the type of the transaction account.

Because the number of transaction accounts is huge and each transaction account involves a large number of transaction behaviors, identifying account types purely by hand is inefficient. Manual identification is also highly subjective, so its accuracy is low.

Disclosure of Invention

The embodiment of the application provides a text processing method and device, computer equipment and a storage medium, and can improve the efficiency and accuracy of identifying the type of a transaction account.

An embodiment of the present application provides a text processing method, including:

acquiring a target transaction text of a target transaction account;

performing word segmentation processing on the target transaction text by adopting a preset word group set to obtain one or more target word groups; the preset phrase set is obtained by identifying and processing a text of a target type;

performing text recognition processing on the one or more target phrases to obtain the type of the target transaction text;

and if the type of the target transaction text is the target type, determining the target transaction account as the transaction account of the target type.

An embodiment of the present application provides a text processing apparatus in one aspect, including:

the acquisition module is used for acquiring a target transaction text of a target transaction account;

the word segmentation module is used for carrying out word segmentation processing on the target transaction text by adopting a preset word group set to obtain one or more target word groups; the preset phrase set is obtained by identifying and processing a text of a target type;

the first identification module is used for performing text identification processing on the one or more target phrases to obtain the type of the target transaction text;

and the determining module is used for determining the target transaction account as the transaction account of the target type if the type of the target transaction text is the target type.

An aspect of the embodiments of the present application provides a computer device, including a memory and a processor, where the memory stores a computer program, and when the computer program is executed by the processor, the processor is caused to execute the method in the foregoing embodiments.

An aspect of the embodiments of the present application provides a computer storage medium, in which a computer program is stored, where the computer program includes program instructions, and when the program instructions are executed by a processor, the method in the foregoing embodiments is performed.

An aspect of the embodiments of the present application provides a computer program product or a computer program, where the computer program product or the computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium, and when the computer instructions are executed by a processor of a computer device, the computer instructions perform the methods in the embodiments described above.

According to the application, the type of a transaction account is identified automatically by the terminal from the transaction text under the account, without manual participation, which can improve the efficiency and accuracy of transaction account identification. Furthermore, determining the type of a transaction account from its transaction text enriches the available ways of identifying transaction accounts. In the transaction text recognition process, the transaction text is segmented using the preset phrase set determined by recognizing text of the target type. Compared with a conventional phrase set, this helps ensure segmentation accuracy, and the more accurate segmentation result in turn further improves the accuracy of account recognition.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a system architecture diagram of text processing provided by an embodiment of the present invention;

FIGS. 2a-2d are schematic diagrams of a text processing scenario provided by an embodiment of the present application;

FIG. 3 is a flow chart of text processing provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of an evaluation effect of a model provided in an embodiment of the present application;

FIG. 5 is a schematic flowchart of a process for updating a thesaurus according to an embodiment of the present application;

FIG. 6 is a schematic flowchart illustrating a process of identifying whether an account is a pyramid-selling account according to an embodiment of the present application;

FIG. 7 is a system architecture diagram of a blockchain according to an embodiment of the present application;

FIG. 8 is a flow chart illustrating text processing according to an embodiment of the present disclosure;

FIG. 9 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application;

FIG. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a text processing technology, a natural language processing technology, machine learning/deep learning and the like.

The scheme provided by the application relates to text processing technology and machine learning/deep learning in the field of artificial intelligence. A text recognition model and a transaction classification model can be trained through machine learning/deep learning. After the text to be recognized is segmented, these models can determine, based on the resulting phrase set, the probability that the text belongs to the target type. Whether the transaction account belongs to the target type is then judged based on this probability, and a business operation is executed based on the judgment result.

FIG. 1 is a system architecture diagram of text processing according to an embodiment of the present invention. The server 10f establishes a connection with a user terminal cluster through the switch 10e and the communication bus 10d. The user terminal cluster may include user terminal 10a, user terminal 10b, ..., and user terminal 10c. The database 10g stores the transaction text of a number of transaction accounts. For the transaction text of a transaction account, the server 10f performs word segmentation on the transaction text by using a preset phrase set to obtain one or more phrases, where the preset phrase set is obtained by performing recognition processing on text of a target type. The server then performs text recognition processing on the one or more phrases to obtain the type of the transaction text; if the type of the transaction text is the target type, the transaction account is determined to be a transaction account of the target type.

The terminal device 10a, the terminal device 10b, the terminal device 10c, and the like shown in fig. 1 may be an intelligent device having a display function, such as a mobile phone, a tablet computer, a notebook computer, a palm computer, a Mobile Internet Device (MID), a wearable device, and the like. The terminal device cluster and the server 10f may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.

The application can be applied to a pyramid-selling account recognition system (or a money laundering account recognition system, or a gambling account recognition system). When it must be determined whether an account is involved in pyramid selling (or money laundering, or gambling), the transaction text of the account is extracted from a transaction pool and recognized using the scheme of the application to determine the type of the account, where the type indicates whether the account is involved in pyramid selling (or money laundering, or gambling). If the account is recognized as being involved in pyramid selling (or money laundering, or gambling), the account may be reported and subsequently sanctioned, for example, by restricting the account from conducting transactions.

In the text processing method, the transaction text of the account to be identified and related data can be stored on a blockchain. Relying on the integrity and transparency of the blockchain, the reliability of the transaction text can be ensured, which improves the reliability of the identification result for the account to be identified. Subsequently, if the type of the transaction account to be identified is determined to be the target type, the account can also be stored on the blockchain, which prevents the identification result from being maliciously tampered with and makes it traceable.

In the following, FIGS. 2a-2d are taken as an example to describe how to identify whether an account is a pyramid-selling account, that is, an account of the pyramid-selling type. Please refer to FIGS. 2a-2d, which are schematic diagrams of a text processing scenario provided in the present application. As shown in FIG. 2a, the interface 20a is the primary interface of the pyramid-selling account recognition system, and it lists both recognized transaction accounts and transaction accounts to be recognized. The processing state of a recognized transaction account is "processed"; the processing state of a transaction account to be recognized is "unprocessed", and the recognition result of an unprocessed transaction account is empty. The auditor can select the transaction account to be recognized this time. As shown in FIG. 2a, assume that the auditor selects the transaction account "5870" in order to identify whether it is a pyramid-selling account.

Of course, in addition to manual selection of the transaction account to be recognized, the system may automatically pull a transaction account to be recognized from the transaction account pool and determine whether the pulled account is a pyramid-selling account.

When the auditor selects the transaction account "5870", as shown in FIG. 2b, an interface 20b is displayed, which contains the transaction text in the transaction data of the transaction account "5870". As can be seen from FIG. 2b, the transaction text mainly includes text information involved in transactions, such as the nickname of the transaction account, the nickname of the transaction counterparty, and transaction remarks. When the auditor clicks the "start" button in the interface 20b, as shown in FIG. 2c, the pyramid-selling account recognition system extracts the transaction text 20c of the transaction account "5870". The system then invokes the preset lexicon 20d to segment the transaction text 20c into a phrase set 20e. The preset lexicon 20d comprises conventional vocabulary and vocabulary specific to the pyramid-selling field. The latter is obtained by recognizing texts of the pyramid-selling type, which can be pyramid-selling news crawled from the internet or popular-science articles about pyramid selling crawled from anti-pyramid-selling websites. Recognizing such texts makes it possible to discover new pyramid-selling words and vocabulary specific to the field, so that the transaction text can be segmented more accurately based on the preset lexicon 20d.

After the pyramid-selling account recognition system obtains the phrase set 20e of the transaction text 20c, it recognizes whether the transaction text 20c is a pyramid-selling text based on the phrase set 20e and an artificial intelligence model. The specific recognition process is as follows: the system converts each phrase in the phrase set 20e into a word vector, combines all the word vectors into a word vector matrix, and performs convolution and pooling operations on the matrix to extract hidden features of the transaction text 20c. The hidden features are then passed through a fully connected layer, which outputs the probability that the transaction text 20c belongs to the pyramid-selling type and the probability that it does not; in general, these two probabilities sum to 1.

Assuming the artificial intelligence model outputs a probability of 0.8 that the transaction text 20c belongs to the pyramid-selling type and a probability of 0.2 that it does not, the transaction text 20c of the transaction account "5870" can be determined to be a pyramid-selling-type transaction text. On this basis, the pyramid-selling account recognition system may determine that the transaction account "5870" is a pyramid-selling-type transaction account.
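The recognition flow described above (word vectors, word vector matrix, convolution, pooling, full connection, two-class probability output) can be illustrated with a minimal pure-Python sketch. All names and numbers here are assumptions for illustration only: the toy embedding table, the single convolution filter, the dense-layer weights, and the function name `classify` are hypothetical and are not disclosed by the application.

```python
import math

# Toy 2-dimensional word vectors (assumed values, for illustration only).
EMBEDDINGS = {
    "recruit": [1.0, 0.5],
    "downline": [0.9, 0.8],
    "dinner": [-0.7, 0.1],
    "rent": [-0.6, -0.4],
}
UNKNOWN = [0.0, 0.0]

# One convolution filter covering 2 consecutive phrases (rows) x 2 dimensions.
FILTER = [[0.6, 0.4], [0.5, 0.5]]
# Fully connected layer: one pooled feature -> 2 logits (target, not target).
DENSE_W = [2.0, -2.0]
DENSE_B = [-0.5, 0.5]


def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(phrases):
    """Return (p_target, p_not_target) for a list of segmented phrases."""
    # Word vector matrix: one row per phrase.
    matrix = [EMBEDDINGS.get(p, UNKNOWN) for p in phrases]
    window = len(FILTER)
    while len(matrix) < window:  # pad very short texts
        matrix.append(UNKNOWN)
    # Convolution over consecutive rows, followed by ReLU.
    feature_map = []
    for i in range(len(matrix) - window + 1):
        s = sum(FILTER[r][c] * matrix[i + r][c]
                for r in range(window) for c in range(len(UNKNOWN)))
        feature_map.append(max(0.0, s))
    pooled = max(feature_map)  # max pooling
    logits = [w * pooled + b for w, b in zip(DENSE_W, DENSE_B)]
    p_target, p_other = softmax(logits)
    return p_target, p_other
```

The account-level decision then reduces to comparing `p_target` with a threshold chosen by the operator; a real model would use trained embeddings and many filters rather than the fixed toy weights above.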

As shown in FIG. 2d, since the transaction account "5870" has been determined to be a pyramid-selling account, the primary interface now displays the processing state of the transaction account "5870" as "processed" and its recognition result as "yes", meaning that the transaction account "5870" is a pyramid-selling account.

Since the transaction account "5870" has been recognized as a pyramid-selling account, restrictions may subsequently be imposed on it, for example, prohibiting the transaction account "5870" from making any transfer transactions, or even prohibiting it from holding social sessions with other transaction accounts.

For the specific process of obtaining a target transaction account (such as the transaction account "5870" in the foregoing embodiment) and a target transaction text (such as the transaction text 20c in the foregoing embodiment), and of performing word segmentation on the target transaction text by using a preset phrase set (such as the preset lexicon 20d in the foregoing embodiment) to obtain one or more target phrases (such as the phrase set 20e in the foregoing embodiment), refer to the embodiments corresponding to FIGS. 3 to 8 below.

Referring to FIG. 3, which is a schematic flowchart of text processing provided in an embodiment of the present application. Because the text processing of the present application uses an artificial intelligence model to recognize the text type, the following steps are described with a server, which has higher performance, as the execution subject. The pyramid-selling account recognition system in the above embodiment may run on the server. The text processing includes the following steps:

Step S101, acquiring a target transaction text of the target transaction account.

Specifically, when the server receives a detection request for a target transaction account (the transaction account "5870" in the embodiment corresponding to FIGS. 2a-2d above), the server extracts the transaction text of the target transaction account (referred to as the target transaction text) from the transaction data pool. The target transaction text may include the nickname of the target transaction account, the nickname of the transaction counterparty, transaction remarks, and the like.

To detect transaction accounts in as close to real time as possible, the server may, once per hour, extract the transaction accounts on which a transaction occurred within the past hour, take each extracted account as a target transaction account, and generate a detection request for it.
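As a minimal sketch of this hourly pull, the function below selects the accounts with at least one transaction in the hour before a given time. The record layout `(account_id, timestamp)` and the function name are hypothetical; the application does not describe the schema of the transaction data pool.

```python
from datetime import datetime, timedelta


def accounts_active_in_last_hour(transactions, now):
    """Return the sorted account ids that transacted within the past hour.

    transactions: iterable of (account_id, timestamp) pairs (assumed schema).
    now: the time at which the hourly job runs.
    """
    cutoff = now - timedelta(hours=1)
    return sorted({acct for acct, ts in transactions if cutoff <= ts <= now})
```

Each returned account would then become a target transaction account for which a detection request is generated.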

Of course, instead of the server automatically pulling the target transaction account for detection, the server may also generate a detection request for the target transaction account in response to a user operation, for example, the user manually selects the target transaction account for current detection, and then the server generates the detection request for the target transaction account.

The target transaction text may be the transaction text of the target transaction account from the last month or the last three months, or all transaction text since the target transaction account was created.

Step S102, a preset phrase set is used for carrying out word segmentation processing on the target transaction text to obtain one or more target phrases; the preset phrase set is obtained by identifying and processing a text of a target type.

Specifically, the server performs word segmentation processing on the obtained target transaction text based on the preset phrase set to divide the target transaction text into one or more phrases, each of which is referred to as a target phrase (such as the phrase set 20e in the embodiment corresponding to FIGS. 2a-2d). The preset phrase set comprises conventional phrases as well as phrases specific to the target type.

The preset phrase set is obtained by recognizing text of the target type. When the application is used to identify pyramid-selling accounts, the target type can be the pyramid-selling type, that is, the preset phrase set is obtained by recognizing text of the pyramid-selling type; when the application is used to identify money laundering accounts, the target type can be the money laundering type, that is, the preset phrase set is obtained by recognizing text of the money laundering type; when the application is used to identify gambling accounts, the target type can be the gambling type, that is, the preset phrase set is obtained by recognizing text of the gambling type.

The text of the target type may include target-type news crawled from the network, target-type text crawled from specialized websites (for example, when the target type is the pyramid-selling type, popular-science articles about pyramid selling crawled from dedicated anti-pyramid-selling websites), and target-type complaint text.

The word segmentation processing of the target transaction text based on the preset word group set can be specifically realized by adopting a jieba word segmentation algorithm, and the specific process of the jieba word segmentation algorithm is described as follows:

The first step is to construct the prefix words of the preset phrase set. Each preset phrase in the preset phrase set carries a word frequency, which is either the frequency of the preset phrase in the text of the target type or a default value. All preset phrases in the set are read and the prefix words of each preset phrase are constructed. For a prefix word that itself exists in the preset phrase set, its word frequency is the word frequency it originally carries; for a prefix word that does not exist in the preset phrase set, its word frequency is set to 0. The word frequencies are used later to calculate the probability of each path.

For example, the prefix words of a preset phrase are its leading substrings of increasing length, ending with the complete phrase itself: for a phrase whose characters are written "ABCD", the prefix words are "A", "AB", "ABC" and "ABCD".
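The prefix-word construction described in the first step can be sketched as follows. The function name `build_prefix_dict` and the input format (a dict from phrase to word frequency) are assumptions made for illustration.

```python
def build_prefix_dict(phrase_freqs):
    """Build a prefix dictionary from {preset phrase: word frequency}.

    Every preset phrase keeps its own frequency; every proper prefix that
    is not itself a preset phrase is recorded with frequency 0.
    Returns (prefix_freq, total), where total is the sum of all frequencies.
    """
    prefix_freq = {}
    total = 0
    for phrase, freq in phrase_freqs.items():
        prefix_freq[phrase] = freq
        total += freq
        for end in range(1, len(phrase)):
            prefix = phrase[:end]
            if prefix not in phrase_freqs:
                prefix_freq.setdefault(prefix, 0)
    return prefix_freq, total
```

For instance, with the preset phrases "abcd" (frequency 5) and "ab" (frequency 3), the prefixes "a" and "abc" receive frequency 0 while "ab" keeps its own frequency of 3.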

The second step is to construct a directed acyclic graph (DAG) of the target transaction text. For each character in the target transaction text, all phrases that may start at that character are traversed; if such a phrase exists in the preset phrase set, its end index is recorded. Because a character may take part in several possible words, the recorded possibilities form a directed acyclic graph. The probability of each segmentation path is then calculated from the graph, and the path with the maximum probability is taken as the best segmentation. jieba uses a dict structure to represent the DAG: the final DAG is stored as a dictionary of the form { k: [ k, j ], m: [ m, p, q ], … }, where k and m are positions of characters in the target transaction text, and the list corresponding to k stores the end positions of the possible words that begin at position k in the target transaction text.
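A minimal sketch of the DAG construction, written against a prefix dictionary of the kind described in the first step (phrase or prefix mapped to a frequency, with 0 meaning "prefix only"). The function name `build_dag` is an assumption; the output uses the { start: [end, ...] } layout described above.

```python
def build_dag(text, prefix_freq):
    """Map each start index k to the end indices of words starting at k."""
    dag = {}
    n = len(text)
    for k in range(n):
        ends = []
        i = k
        frag = text[k]
        # Extend the fragment while it is still a known prefix.
        while i < n and frag in prefix_freq:
            if prefix_freq[frag] > 0:  # a real phrase, not just a prefix
                ends.append(i)
            i += 1
            frag = text[k:i + 1]
        if not ends:  # fall back to the single character
            ends.append(k)
        dag[k] = ends
    return dag
```

With the prefix dictionary {"a": 0, "ab": 3, "abc": 2, "c": 1} and the text "abcd", position 0 can end a word at index 1 ("ab") or index 2 ("abc"), while positions 1, 2 and 3 each form a single-character word.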

The third step is to determine the maximum-probability path by dynamic programming. The main function for calculating the maximum-probability path in jieba is calc(self, sentence, DAG, route), which computes the maximum-probability path from the constructed directed acyclic graph. The function is a bottom-up dynamic program: starting from the last character (N-1) of the target transaction text, it traverses each character position (idx) in reverse order and computes the log-probability score of the sub-sentence [idx ~ N-1]. The case with the highest log-probability score is then saved in route as a tuple (log-probability, end position of the word). In this function, the log-probability of a word is computed from the logarithm of the word frequency of the hit prefix word relative to the logarithm of the total word frequency; working with logarithms effectively prevents floating-point underflow.
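The reverse-order dynamic program can be sketched as below. This is an illustration under stated assumptions, not the library's code: each word is scored as log(frequency) minus log(total frequency), words with missing or zero frequency are treated as frequency 1, and the names `max_prob_route` and `cut` are hypothetical. The `dag` argument uses the { start: [end, ...] } layout described in the second step.

```python
import math


def max_prob_route(text, dag, prefix_freq, total):
    """Return {idx: (best log-probability of text[idx:], end index of word)}."""
    n = len(text)
    log_total = math.log(total)
    route = {n: (0.0, 0)}  # empty suffix has log-probability 0
    for idx in range(n - 1, -1, -1):
        route[idx] = max(
            (math.log(prefix_freq.get(text[idx:end + 1]) or 1) - log_total
             + route[end + 1][0], end)
            for end in dag[idx]
        )
    return route


def cut(text, dag, prefix_freq, total):
    """Follow the best route from left to right to obtain the phrases."""
    route = max_prob_route(text, dag, prefix_freq, total)
    words, idx = [], 0
    while idx < len(text):
        end = route[idx][1]
        words.append(text[idx:end + 1])
        idx = end + 1
    return words
```

Summing logarithms instead of multiplying raw probabilities keeps the scores in a safe numeric range even for long texts, which is the underflow protection mentioned above.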

At this point, one or more target phrases of the target transaction text may be obtained.

The following explains the generation process of the preset phrase set:

The server obtains a text of the target type, identifies in the text a new word set belonging to the target type, obtains an original word stock, and adds the identified new word set to the original word stock to obtain the preset phrase set. The new word set consists of new words that are specific to the target type and do not appear in the original word stock. Continuously adding new word sets to the original word stock keeps the preset phrase set up to date, which helps ensure word segmentation accuracy.

The update period of the preset phrase set can be one month or one week; that is, new target-type text is crawled once per update period so that a new word set is identified and a new preset phrase set is obtained.

The following describes a specific process of determining a new word set of a target type based on a text of the target type:

The server can divide the target type text into a plurality of character sequences according to the n-gram principle, each character sequence containing n characters. For example, if the text is AAABBC and n is 2, the server may divide the text into 3 character sequences: AA, AB and BC.
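The splitting step can be sketched with a small helper. The example in the text (AAABBC giving AA, AB, BC) corresponds to consecutive, non-overlapping chunks of n characters; a sliding window that advances one character at a time is a common n-gram variant. The `step` parameter and the function name are assumptions introduced to show both behaviours.

```python
def split_into_sequences(text, n, step=None):
    """Split text into character sequences of length n.

    step=None (default) advances by n, giving non-overlapping chunks as in
    the example above; step=1 gives the usual sliding-window n-grams.
    """
    step = n if step is None else step
    return [text[i:i + n] for i in range(0, len(text) - n + 1, step)]
```

With `step=1`, the same text yields the overlapping sequences AA, AA, AB, BB and BC, which gives the later frequency, solidity and freedom statistics more candidates to evaluate.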

The server determines the phrase evaluation index of each character sequence. The phrase evaluation index comprises the phrase frequency, the phrase solidity and the phrase freedom, and is used to evaluate whether a character sequence can be judged to be a phrase.

The server selects, according to the phrase evaluation index of each character sequence, the character sequences that can be judged to be phrases, and combines the selected character sequences into a first phrase set. The server then deduplicates the first phrase set against the original word stock; each character sequence remaining after deduplication is called a second phrase, and all second phrases are combined into a second phrase set. The second phrase set therefore contains only new words that the original word stock does not have. To further screen out new words of the target type, the server also needs to select the new word set belonging to the target type from the second phrases according to the source of the text corresponding to each second phrase.

The following takes one character sequence, called the target character sequence, as an example to describe how to calculate its phrase evaluation index and how to determine, according to that index, whether the target character sequence belongs to the first phrase set:

The server counts the phrase frequency of the target character sequence in the text of the target type.

The server calculates the degree of solidity (called phrase solidity) of the target character sequence, which measures how tightly the characters in a character sequence are bound together. For example, the solidity of character sequences such as "colored glaze" and "apple" is very high, while the solidity of a character sequence such as "glory" is low. The solidity of the target character sequence is calculated as follows: the target character sequence is first divided into all of its combination pairs, for example, 'abcd' can be divided into ('a', 'bcd'), ('ab', 'cd') and ('abc', 'd'); then the solidity $D(s_1, s_2) = P(s_1 s_2) / (P(s_1) \times P(s_2))$ is calculated for each combination pair; finally, the lowest solidity among the combination pairs is taken as the solidity of the target character sequence. Here $P(x)$ represents the phrase frequency of x in the text.

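Taking the word "cinema" as an example, the specific calculation formula is as follows:

$$C(\text{cinema}) = \min_{(s_1, s_2)} \frac{P(\text{cinema})}{P(s_1) \times P(s_2)} \quad (1)$$

where C(cinema) represents the solidity of "cinema", P(cinema) represents the frequency of occurrence of the word in the text, and $(s_1, s_2)$ ranges over the combination pairs into which "cinema" can be divided.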
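The solidity calculation can be sketched as follows, assuming the frequencies P(x) of the sequence and of every fragment are already available in a dict. The function name `solidity` and the dict-based input are assumptions for illustration.

```python
def solidity(sequence, prob):
    """Minimum of P(s1s2) / (P(s1) * P(s2)) over all two-way splits.

    sequence: a character sequence of length >= 2.
    prob: dict mapping the sequence and each of its fragments to its
          frequency in the text (assumed to be precomputed).
    """
    return min(
        prob[sequence] / (prob[sequence[:i]] * prob[sequence[i:]])
        for i in range(1, len(sequence))
    )
```

For a three-character sequence "abc" there are two combination pairs, ("a", "bc") and ("ab", "c"); taking the minimum means a sequence is only considered solid if no split looks like a chance pairing of two independent fragments.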

The server calculates the degree of freedom (called phrase degree of freedom) of the target character sequence, which measures the degree to which the character sequence can be used independently. For example, a fragment of the word "chocolate" can have a solidity as high as that of "chocolate" itself, but the character adjacent to it on the right is almost always the same, so its degree of freedom is almost zero and it cannot be used as a word on its own. A character sequence that can stand as a word by itself should have a rich variety of adjacent characters. The calculation formula of the degree of freedom is as follows:

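$$F(w) = \min\{H_L(w), H_R(w)\} \quad (2)$$

where $F(w)$ is the degree of freedom of the character sequence w, and $H_L(w)$ and $H_R(w)$ are the left and right neighbor entropy of the character sequence w, respectively. The calculation formula of the information entropy is as follows:

$$H(w) = -\sum_{x} p(x) \log p(x) \quad (3)$$

where x ranges over the distinct characters adjacent to w on the given side (left for $H_L$, right for $H_R$), and $p(x)$ is the proportion of occurrences of w in which x is the adjacent character on that side.

Thus, the phrase frequency, the phrase solidity and the phrase freedom of the target character sequence are determined.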
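The degree of freedom can be sketched as the smaller of the left and right neighbor entropies, computed over every occurrence of the sequence in the text. The function names and the choice to return 0.0 when a sequence has no neighbor on one side are assumptions for illustration.

```python
import math
from collections import Counter


def neighbor_entropy(neighbors):
    """Shannon entropy (natural log) of a list of adjacent characters."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def freedom(text, sequence):
    """min(left entropy, right entropy) of sequence within text."""
    left, right = [], []
    start = text.find(sequence)
    while start != -1:
        if start > 0:
            left.append(text[start - 1])
        end = start + len(sequence)
        if end < len(text):
            right.append(text[end])
        start = text.find(sequence, start + 1)
    if not left or not right:
        return 0.0  # assumption: no neighbor on one side means no freedom
    return min(neighbor_entropy(left), neighbor_entropy(right))
```

A sequence that always has the same character next to it on one side gets an entropy of zero on that side, and therefore a degree of freedom of zero, which matches the "chocolate" fragment discussed above.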

If the phrase frequency of the target character sequence is greater than a preset frequency threshold, the phrase solidity of the target character sequence is greater than a preset solidity threshold, and the phrase freedom of the target character sequence is greater than a preset freedom threshold, determining that the target character sequence belongs to a first phrase set; otherwise, if the phrase frequency of the target character sequence is not greater than the preset frequency threshold, or the phrase solidity of the target character sequence is not greater than the preset solidity threshold, or the phrase degree of freedom of the target character sequence is not greater than the preset degree of freedom threshold, it may be determined that the target character sequence does not belong to the first phrase set.

In short, the target character sequence is determined to belong to the first phrase set only if its phrase frequency, phrase solidity and phrase freedom are all greater than their respective thresholds.

Optionally, in addition to determining whether the target character sequence belongs to the first phrase set by using the above-mentioned strategy, the following strategy may also be used to determine whether the target character sequence belongs to the first phrase set:

if the phrase frequency of the target character sequence is greater than a preset frequency threshold, or the phrase solidity of the target character sequence is greater than a preset solidity threshold, or the phrase freedom of the target character sequence is greater than a preset freedom threshold, determining that the target character sequence belongs to a first phrase set; otherwise, if the phrase frequency of the target character sequence is not greater than the preset frequency threshold, the phrase solidity of the target character sequence is not greater than the preset solidity threshold, and the phrase freedom of the target character sequence is not greater than the preset freedom threshold, it may be determined that the target character sequence does not belong to the first phrase set.

In short, the target character sequence is determined to belong to the first phrase set as long as at least one of its phrase frequency, phrase solidity and phrase freedom is greater than the corresponding threshold.

The server deduplicates the first phrase set against the original word stock to obtain a second phrase set. The specific process is as follows: the server determines the intersection between the original word stock and the first phrase set, deletes the phrases in that intersection from the first phrase set, calls the phrases remaining in the first phrase set second phrases, and combines all the second phrases into the second phrase set.
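The two selection strategies and the deduplication step can be sketched as follows. The function names, the `require_all` flag and the tuple layout `(frequency, solidity, freedom)` are hypothetical choices for this illustration, not names from the application.

```python
def select_first_phrase_set(metrics, thresholds, require_all=True):
    """metrics: {sequence: (frequency, solidity, freedom)}; thresholds: matching 3-tuple.

    require_all=True  -> every index must exceed its threshold (first strategy).
    require_all=False -> at least one index must exceed its threshold (optional strategy).
    """
    combine = all if require_all else any
    return {
        seq for seq, values in metrics.items()
        if combine(v > t for v, t in zip(values, thresholds))
    }

def deduplicate(first_phrase_set, original_lexicon):
    """Delete the phrases that already exist in the original word stock."""
    intersection = first_phrase_set & original_lexicon
    return first_phrase_set - intersection
```

The first strategy is stricter and yields fewer candidate phrases; the optional strategy trades precision for coverage.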

The following takes one second phrase (called the target second phrase) as an example to describe how to determine, according to the text source of the target second phrase, whether it belongs to the new word set of the target type:

The text source of the target second phrase is the source of the aforementioned text of the target type in which the phrase was found. The text source may be a first text source or a second text source, and the two are divided according to the text application scenario. For example, text belonging to the first text source is news of the target type crawled from the network, or text crawled from a website dedicated to the target type; text belonging to the second text source is complaint text about the target type, or the like.

If the text source corresponding to the target second phrase is the first text source, the target second phrase can be directly determined to belong to a new word set; if the text source of the target second phrase is a second text source, acquiring a transaction text of each transaction account in a transaction account set, wherein the transaction account set comprises a plurality of transaction accounts, the types of the transaction accounts in the transaction account set are determined, and the types of the transaction accounts are either of the target type or not of the target type.

The server selects, from the transaction account set, the transaction accounts whose transaction text contains the target second phrase, called to-be-determined transaction accounts, and then selects from them the to-be-determined transaction accounts that belong to the target type. The server calculates the ratio of the number of to-be-determined transaction accounts belonging to the target type to the number of all to-be-determined transaction accounts; if the ratio is greater than a preset threshold, the target second phrase is determined to belong to the new word set.

In short, whether the target second phrase belongs to the new word set is inferred in reverse from the transaction texts of transaction accounts whose types are already known.
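A minimal sketch of this decision is given below. The function name, the string labels `"first"` and `"second"`, the default threshold value and the handling of the case where no account contains the phrase are all assumptions made for illustration.

```python
def belongs_to_new_word_set(phrase, source, accounts, ratio_threshold=0.8):
    """accounts: list of (transaction_text, is_target_type) pairs whose types are already known."""
    if source == "first":
        # Phrases from the first text source are accepted directly.
        return True
    # To-be-determined accounts: those whose transaction text contains the phrase.
    pending = [is_target for text, is_target in accounts if phrase in text]
    if not pending:
        return False  # assumption: with no matching account there is no supporting evidence
    # Ratio of target-type accounts among all to-be-determined accounts.
    return sum(pending) / len(pending) > ratio_threshold
```

For example, if two of the three accounts whose text contains the phrase are of the target type, the ratio is about 0.67, so the phrase is accepted with a threshold of 0.6 and rejected with a threshold of 0.8.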

Step S103, performing text recognition processing on the one or more target phrases to obtain the type of the target transaction text.

Specifically, a word vector model is called, each target word group after the target transaction text is segmented is converted into a word vector, and all the word vectors are combined into a word matrix. For example, the dimension of the word vector is 1 × m, and n target word groups are obtained after the target transaction text is segmented, so the dimension of the word matrix is: n × m.

The output of the trained text classification model is the probability that the transaction text belongs to the target type and the probability that it does not. The server calls the convolution layer in the trained text classification model to perform a convolution operation on the combined word matrix and obtain convolution features, and calls the pooling layer in the text classification model to perform a pooling operation on the convolution features and obtain pooled features of the target transaction text. It then calls the fully connected layer in the text classification model to perform full-connection processing on the pooled features and obtain the probability that the target transaction text belongs to the target type (called the first probability). If the first probability is greater than or equal to a first probability threshold, the type of the target transaction text is determined to be the target type; otherwise, if the first probability is smaller than the first probability threshold, the type of the target transaction text is determined not to be the target type.
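The forward pass described above (convolution, pooling, full connection, threshold) can be sketched in plain Python as follows. This is only a toy illustration: the function names, the ReLU activation, max-over-time pooling, and the use of a sigmoid over a single logit (equivalent to a two-class softmax) are assumptions, and the weights would in practice come from training.

```python
import math

def conv1d(word_matrix, kernel, bias=0.0):
    """Slide a (k x m) kernel over an (n x m) word matrix, with n >= k; apply ReLU."""
    k = len(kernel)
    out = []
    for i in range(len(word_matrix) - k + 1):
        s = bias
        for r in range(k):
            s += sum(x * w for x, w in zip(word_matrix[i + r], kernel[r]))
        out.append(max(0.0, s))
    return out

def classify(word_matrix, kernels, fc_weights, fc_bias, threshold=0.5):
    """Return (first_probability, is_target_type) for one transaction text."""
    # One pooled feature per kernel: the maximum of its convolution features.
    pooled = [max(conv1d(word_matrix, kern)) for kern in kernels]
    # Fully connected layer reduced to a single logit for the target type.
    logit = sum(p * w for p, w in zip(pooled, fc_weights)) + fc_bias
    first_probability = 1.0 / (1.0 + math.exp(-logit))
    return first_probability, first_probability >= threshold
```

The probability that the text does not belong to the target type is simply one minus the returned first probability in this two-class setting.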

Step S104, if the type of the target transaction text is the target type, determining the target transaction account as the transaction account of the target type.

Specifically, if the type of the target transaction text is the target type, the server may directly determine the target transaction account as the transaction account belonging to the target type.

Besides directly determining the type of the target transaction account, the following strategy can be adopted to determine the type of the target transaction account:

If the type of the target transaction text is the target type, the server acquires the transaction flow of the target transaction account (called the target transaction flow). The target transaction flow includes the transaction time (called the target transaction time), the transaction resource data volume (called the target transaction resource data volume) and the like, and may cover the most recent month or three months of the target transaction account, or all transactions since the account was created.

The server generates transaction features from the target transaction time and the target transaction resource data volume and inputs them into a trained transaction classification model, whose output is the probability that the transaction flow belongs to the target type and the probability that it does not. The probability output by the transaction classification model that the target transaction flow is of the target type is called the second probability; if the sum of the first probability and the second probability is not less than a preset second probability threshold, the target transaction account is determined to be a transaction account belonging to the target type.

Optionally, if the second probability is not less than the second probability threshold, the target transaction account is determined to be a transaction account belonging to the target type.

In short, whether the target transaction account is of the target type is decided jointly from the target transaction text and the target transaction flow of the account. Judging the type of the target transaction account from transaction data of multiple dimensions can improve the identification accuracy.
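The joint decision can be sketched as a small function. The function name and the default threshold values are hypothetical; only the rule itself (text check first, then the sum of the two probabilities against the second probability threshold) comes from the description above.

```python
def is_target_account(first_probability, second_probability,
                      first_threshold=0.5, second_threshold=1.2):
    """Combine the text probability and the transaction-flow probability."""
    if first_probability < first_threshold:
        # The transaction text is not of the target type, so the flow is not consulted.
        return False
    # The account is of the target type if the summed probabilities reach the second threshold.
    return first_probability + second_probability >= second_threshold
```

With the assumed defaults, probabilities of 0.9 and 0.4 give a positive result, while 0.6 and 0.5 do not.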

Referring to fig. 4, fig. 4 is a schematic diagram of a model evaluation effect provided by an embodiment of the present application, where the dotted line in fig. 4 represents precision and the solid line represents recall. As can be seen from fig. 4, both the precision and the recall on the test data set are relatively high. Furthermore, on the test data set the accuracy of the present application reached 0.991, the KS (Kolmogorov-Smirnov) statistic reached 0.964, and the AUC (area under the ROC curve) reached 0.996. In addition, when high-scoring accounts were spot-checked against actual platform-wide traffic, the accuracy of the method was also good.

Referring to fig. 5, fig. 5 is a schematic flow chart of updating a word stock provided in an embodiment of the present application. This embodiment mainly describes the process of adding new pyramid-selling words to the word stock in order to update it, and updating the word stock includes the following steps:

Step S201, the flow starts.

Step S202, pulling the text data.

Specifically, the server may pull text from dedicated pyramid-selling websites, crawl pyramid-selling news from the web, and pull pyramid-selling complaint text. The pulled text data may correspond to the text belonging to the target type in the present application.

Step S203, preprocessing the text data and filtering meaningless words or symbols in the text data.

Step S204, discovering new words in the text data.

Specifically, the text data is divided into a plurality of character sequences, and 3 word-formation evaluation indexes are calculated for each character sequence: word frequency, solidity and degree of freedom. Phrases are screened out of the text data according to the 3 word-formation evaluation indexes and then deduplicated against the existing word stock to obtain a batch of undetermined new words. An undetermined new word may correspond to the second phrase in the present application.

The specific process of calculating the word frequency, the degree of solidity and the degree of freedom of each character sequence can be referred to step S102 in the corresponding embodiment of fig. 3.

After a batch of undetermined new words is found, a qualitative risk judgment is made on them. Only undetermined new words that satisfy the qualitative risk judgment condition can be treated as new words of the pyramid-selling type and added to the word stock.

Step S205, if the text where the undetermined new word is located is news crawled from the internet or text pulled from a dedicated pyramid-selling website, it can be determined that the undetermined new word is a new word of the pyramid-selling type.

Step S206, if the text where the undetermined new word is located is pyramid-selling complaint text, recalling a batch of accounts whose transaction texts contain the undetermined new word.

Step S207, determining whether the undetermined new word is a new word of the pyramid-selling type according to the malicious concentration.

Specifically, the accounts of the pyramid-selling type are selected from the set of recalled accounts (the selection may be manual). The malicious concentration of the undetermined new word is then calculated (malicious concentration = number of recalled pyramid-selling accounts / total number of recalled accounts), and whether the undetermined new word is a new word of the pyramid-selling type is determined against a certain threshold.

Step S208, updating the word stock.

Specifically, the new words of the pyramid-selling type selected by the 2 methods above are added to the word stock to form a new word stock; the selected new words of the pyramid-selling type may correspond to the phrases in the new word set in the present application. The new word stock may be used for model training or for audit decisions.

In step S209, the flow ends.

Referring to fig. 6, fig. 6 is a schematic flowchart of identifying whether an account is a pyramid-selling account according to an embodiment of the present application, where the identification process includes the following steps:

Step S301, the flow starts.

Step S302, extracting the transaction text of the account to be identified.

Specifically, the transaction text of the account to be identified over the last month is obtained. The transaction text includes data such as transfer text, red packet text and nickname text, and serves as the model feature of the account to be identified.

Step S303, preprocessing the transaction text.

Specifically, neutral words and stop words in the transaction text are filtered out, for example neutral words such as "happy birthday", "happy hundred years" and "blessing", and stop words such as "ground" and "get".

Step S304, loading the updated word stock, and segmenting the preprocessed transaction text based on the updated word stock so as to divide the transaction text into a plurality of phrases. Because the updated word stock contains the new pyramid-selling words, the word segmentation result of the transaction text is more accurate.

The updated word stock is the word stock to which the new words of the pyramid-selling type were added in the embodiment corresponding to fig. 5.
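The effect of the updated word stock on segmentation can be illustrated with a small sketch. The application does not state which segmentation algorithm is used; greedy forward maximum matching is assumed here purely for illustration, and the function name `segment` and the parameter `max_len` are hypothetical.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching; unmatched characters become single-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon entry (or one character) fits.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens
```

Under this sketch, adding a new entry to the lexicon changes how the same text is split: with the lexicon {"ab", "cd"} the text "abcde" is split into "ab", "cd", "e", whereas after adding "abc" it is split into "abc", "d", "e".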

Step S305, performing one-hot encoding on each phrase obtained by word segmentation.

After one-hot encoding, each phrase corresponds to a one-hot vector in which exactly one element is 1 and all other elements are 0.

Step S306, reducing the dimension of each one-hot vector to convert it into a word vector.
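Steps S305 and S306 can be sketched as follows. How the dimension reduction is performed is not specified in the text; multiplying the one-hot vector by a learned projection (embedding) matrix is a common choice and is assumed here, and the function names and the toy matrix are hypothetical.

```python
def one_hot(index, vocab_size):
    """Vector of length vocab_size with a single 1 at position index."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def reduce_dimension(one_hot_vec, projection):
    """Multiply a 1 x V one-hot vector by a V x m projection matrix to get a 1 x m word vector."""
    m = len(projection[0])
    return [
        sum(one_hot_vec[v] * projection[v][j] for v in range(len(projection)))
        for j in range(m)
    ]
```

Because only one element of the one-hot vector is non-zero, the product simply selects the corresponding row of the projection matrix, which is why the reduction is usually implemented as a table lookup.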

Step S307, inputting the word vectors of all the phrases into a trained TextCNN (text convolutional neural network) model, and performing a convolution operation on the word vectors of all the phrases by the convolution layer in the TextCNN model to obtain convolution features.

The TextCNN model may correspond to the text classification model in the present application.

Step S308, performing a pooling operation on the convolution features by the pooling layer in the TextCNN model to obtain pooled features.

Step S309, fully connecting the pooled features by the fully connected layer in the TextCNN model to obtain fully connected features.

Optionally, the convolution layer in the TextCNN model performs a convolution operation on the word vectors of all the phrases to obtain convolution features, the pooling layer in the TextCNN model performs a pooling operation on the word vectors of all the phrases to obtain pooled features, and the convolution features and the pooled features are fully connected to obtain fully connected features.

Step S310, determining, by the normalization layer, a probability score according to the fully connected features.

Specifically, the probability score represents the probability that the transaction text is of the pyramid-selling type.

In addition to the TextCNN model, other text classification models such as fastText can be used to determine the probability that the transaction text is of the pyramid-selling type.

Step S311, determining, based on the transaction flow of the account to be identified and the trained transaction model, the probability that the transaction flow is of the pyramid-selling type.

Whether the account to be identified is a pyramid-selling account is determined by combining the probability that its transaction text is of the pyramid-selling type with the probability that its transaction flow is of the pyramid-selling type. For example, if the sum of the two probabilities is greater than a threshold, the account to be identified is determined to be a pyramid-selling account.

In step S312, the process ends.

After this set of models was deployed on the near-real-time platform of the risk-control decision engine, more than 10,000 malicious pyramid-selling accounts could be identified across the platform every day, which effectively combats the pyramid-selling risk in platform transactions and reduces the number of misjudged cases and the customer complaint rate of pyramid-selling crackdowns.

Referring to fig. 7, fig. 7 is a system architecture diagram of a blockchain according to an embodiment of the present invention. The server in the foregoing embodiment may be node 1, or node 2, or node 3, or node 4 in fig. 7, and all the nodes may be combined into a blockchain system, and each node includes a hardware layer, an intermediate layer, an operating system layer, and an application layer. As can be seen from fig. 7, the blockchain data stored by each node in the blockchain system is the same. It will be appreciated that the nodes may comprise computer devices. The following embodiments are described with a target blockchain node as an execution subject, where the target blockchain node is any one of a plurality of nodes in a blockchain system, and the target blockchain node may correspond to a server in the foregoing embodiments.

Referring to fig. 8, fig. 8 is a schematic flowchart of a text processing method provided in an embodiment of the present application. This embodiment mainly describes combining the identification of a transaction account with blockchain technology, and the text processing method includes the following steps S401 to S405:

Step S401, when a detection request for a target transaction account is detected, determining a first block corresponding to a block height in a blockchain, and reading the original transaction text of the target transaction account from the first block.

Specifically, when the target blockchain node detects a detection request for the target transaction account, it extracts the block height carried by the detection request. The target blockchain node acquires the blockchain and extracts from it the block corresponding to the block height, called the first block. The first block stores the original transaction text of the target transaction account, and the target blockchain node extracts that text from the block body of the first block.

The original transaction text may include a nickname for the target transaction account number, a nickname for the counterparty of the transaction, a transaction remark text, etc.

Step S402, a filtering word bank is obtained, and the original transaction text is filtered according to the filtering word bank to obtain the target transaction text.

Specifically, the target blockchain node acquires a filtering word bank and filters the original transaction text according to it to obtain the target transaction text of the target transaction account. The filtering word bank contains neutral words and stop words; for example, words such as "happy birthday", "conjugal felicity" and "blessing" are neutral words, and words such as "what", "place", "get", "and" and "or" are stop words. Both the neutral words and the stop words are thus filtered out of the original transaction text.
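A minimal sketch of this filtering step is shown below. The function name and the simple substring-removal approach are assumptions for illustration; a production system might instead filter after word segmentation.

```python
def filter_text(original_text, filter_words):
    """Remove every entry of the filtering word bank from the original transaction text."""
    text = original_text
    # Remove longer entries first so that a short entry cannot break up a longer one.
    for word in sorted(filter_words, key=len, reverse=True):
        text = text.replace(word, "")
    return text
```

The text that remains after filtering is the target transaction text passed on to word segmentation.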

Step S403, a preset phrase set is adopted to perform word segmentation processing on the target transaction text to obtain one or more target phrases; the preset phrase set is obtained after the text of the target type is identified.

Step S404, performing text recognition processing on the one or more target phrases to obtain a type of the target transaction text, and if the type of the target transaction text is the target type, determining the target transaction account as a transaction account of the target type.

The specific processes of step S403 to step S404 may refer to step S102 to step S104 in the corresponding embodiment of fig. 3.

Step S405, packaging the target transaction account number into a second block, and storing the second block in the block chain.

Specifically, if the type of the target transaction account is determined to be the target type, the target blockchain node stores the target transaction account in a block body, calculates the Merkle root of the target transaction account, and obtains the hash value of the last block of the current blockchain. The target blockchain node stores the Merkle root of the target transaction account, the hash value of the last block of the current blockchain, and the current timestamp in a block header. The target blockchain node then combines the block header and the block body storing the target transaction account into a second block, stores the second block in the blockchain that it maintains, and broadcasts the second block to the other nodes so that each of them adds the second block to the blockchain it maintains, thereby keeping the blockchains maintained by all nodes synchronized.

When the target type is the pyramid-selling type, a node that subsequently needs to crack down on pyramid-selling transaction accounts can read the second block from the blockchain and read the target transaction account from the second block, so as to crack down on the target transaction account.

Therefore, by virtue of the integrity and tamper-resistance of the blockchain, the original transaction text acquired by the target blockchain node is guaranteed to be credible and untampered, so the type of the target transaction account identified from that text is also credible, and the security and accuracy of the identification process can be ensured.
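The packaging of the second block can be sketched as follows. The field names, the use of SHA-256, and the convention of pairing an odd node with itself when building the Merkle tree are assumptions of this illustration; the application does not specify them. Broadcasting to other nodes is outside the scope of the sketch.

```python
import hashlib
import time

def sha256_hex(data: str) -> str:
    return hashlib.sha256(data.encode("utf-8")).hexdigest()

def merkle_root(items):
    """Hash the leaves pairwise upward; requires at least one item."""
    level = [sha256_hex(item) for item in items]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # assumed convention: pair an odd node with itself
        level = [sha256_hex(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def build_second_block(accounts, previous_block_hash, timestamp=None):
    """Combine a block header and a block body holding the target transaction accounts."""
    header = {
        "merkle_root": merkle_root(accounts),
        "previous_hash": previous_block_hash,  # hash of the last block of the current chain
        "timestamp": time.time() if timestamp is None else timestamp,
    }
    return {"header": header, "body": list(accounts)}
```

Because the header carries the hash of the previous block, any later change to an earlier block would invalidate the link, which is the property the embodiment relies on.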

Further, please refer to fig. 9, which is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application. As shown in fig. 9, the text processing apparatus 1 can be applied to a server or a target blockchain node in the above-described embodiments corresponding to fig. 3 to 8. Specifically, the text processing apparatus 1 may be a computer program (including program code) running in a computer device, for example, the text processing apparatus 1 is an application software; the text processing device 1 can be used for executing the corresponding steps in the method provided by the embodiment of the application.

The text processing apparatus 1 may include: an obtaining module 11, a word segmentation module 12, a first recognition module 13 and a determination module 14.

The obtaining module 11 is configured to acquire a target transaction text of a target transaction account;

the word segmentation module 12 is configured to perform word segmentation processing on the target transaction text by using a preset word group set to obtain one or more target word groups; the preset phrase set is obtained by identifying and processing a text of a target type;

the first recognition module 13 is configured to perform text recognition processing on the one or more target phrases to obtain a type of the target transaction text;

and a determining module 14, configured to determine the target transaction account number as a target type transaction account number if the type of the target transaction text is the target type.

In one possible embodiment, the text processing apparatus 1 may further include: a second recognition module 15 and an updating module 16.

The obtaining module 11 is further configured to obtain a text of a target type;

the second recognition module 15 is configured to recognize a new word set belonging to the target type in the text;

and the updating module 16 is configured to obtain an original word bank, and add the new word set to the original word bank to obtain the preset phrase set.

In a possible implementation, the second recognition module 15, when being configured to recognize a set of new words in the text that belong to the target type, is specifically configured to:

dividing the text into a plurality of character sequences, and identifying each character sequence to obtain a phrase evaluation index of each character sequence;

selecting a first phrase set from the plurality of character sequences according to the phrase evaluation index of each character sequence;

carrying out duplication removal processing on the first phrase set to obtain a second phrase set; the second phrase set comprises a plurality of second phrases;

and determining a new word set belonging to the target type from the plurality of second word groups according to the text source of each second word group.

In a possible implementation manner, the target second phrase is any one of a plurality of second phrases, a text source of the target second phrase is a first text source or a second text source, and the first text source and the second text source are divided according to a text application scenario;

the second recognition module 15 is specifically configured to, when determining whether the target second phrase belongs to the new word set according to the text source of the target second phrase:

if the text source of the target second phrase is the first text source, determining that the target second phrase belongs to the new word set.

In a possible embodiment, the second recognition module 15 is further configured to:

if the text source of the target second phrase is a second text source, acquiring a transaction text of each transaction account in the transaction account set;

selecting, from the transaction account set, a plurality of to-be-determined transaction accounts whose transaction texts contain the target second phrase, and selecting, from the plurality of to-be-determined transaction accounts, the to-be-determined transaction accounts whose type is the target type;

and if the ratio of the number of the transaction account numbers to be determined belonging to the target type to the number of the transaction account numbers to be determined is greater than a threshold value, determining that the target second phrase belongs to the new word set.

In one possible implementation, the target character sequence is any one of a plurality of character sequences, and the phrase evaluation index of the target character sequence comprises a phrase frequency, a phrase solidity and a phrase freedom;

the second recognition module 15 is specifically configured to, when determining whether the target character sequence belongs to the first phrase set according to the phrase evaluation index of the target character sequence:

if the phrase frequency of the target character sequence is greater than the frequency threshold, the phrase solidification degree is greater than the solidification degree threshold, and the phrase freedom degree is greater than the freedom degree threshold, determining that the target character sequence belongs to a first phrase set; or,

and if the phrase frequency of the target character sequence is greater than the frequency threshold, or the phrase solidification degree is greater than the solidification degree threshold, or the phrase freedom degree is greater than the freedom degree threshold, determining that the target character sequence belongs to the first phrase set.

In a possible implementation manner, when the second recognition module 15 is configured to perform deduplication processing on the first phrase set to obtain a second phrase set, it is specifically configured to:

determining an intersection between the original word stock and the first phrase set;

and deleting the phrases corresponding to the intersection in the first phrase set, and combining the remaining phrases in the first phrase set into the second phrase set.

In a possible implementation manner, when the first recognition module 13 performs text recognition processing on the one or more target phrases to obtain the type of the target transaction text, the first recognition module is specifically configured to:

converting each target phrase into a word vector respectively, and combining all the word vectors into a word matrix;

calling a text classification model to identify the word matrix to obtain a first probability that the type of the target transaction text is the target type;

if the first probability is not less than a first probability threshold, determining that the type of the target transaction text is a target type.

In a possible implementation manner, when the determining module 14 is configured to determine the target transaction account number as a transaction account number belonging to a target type if the type of the target transaction text is the target type, specifically, to:

if the type of the target transaction text is a target type, acquiring a target transaction flow of the target transaction account, wherein the target transaction flow comprises a target transaction resource data volume and a target transaction time;

calling a transaction classification model to identify the target transaction resource data volume and the target transaction time, and obtaining a second probability that the type of the target transaction flow is a target type;

and if the sum of the first probability and the second probability is not less than a second probability threshold, determining the target transaction account as the transaction account belonging to the target type.

In a possible implementation manner, the obtaining module 11, when configured to obtain a target transaction text of a target transaction account, is specifically configured to:

when a detection request of a target transaction account is detected, determining a first block corresponding to the height of the block in a block chain, and reading an original transaction text of the target transaction account in the first block; the detection request includes the block height;

acquiring a filtering word bank, and filtering the original transaction text according to the filtering word bank to obtain the target transaction text;

the text processing apparatus 1 further comprises: an encapsulating module 17.

The encapsulating module 17 is configured to package the target transaction account into a second block, and store the second block in the blockchain.

According to an embodiment of the present invention, the steps involved in the methods shown in fig. 3-8 may be performed by the modules in the text processing apparatus shown in fig. 9. For example, steps S101-S104 shown in fig. 3 may be performed by the obtaining module 11, the word segmentation module 12, the first recognition module 13, the determination module 14, the second recognition module 15, and the updating module 16 shown in fig. 9, respectively; as another example, steps S401, S405 shown in fig. 8 may be performed by the obtaining module 11 and the encapsulating module 17 shown in fig. 9.

Further, please refer to fig. 10, which is a schematic structural diagram of a computer device according to an embodiment of the present application. The server or target blockchain node in the corresponding embodiments of fig. 3-8 described above may be a computer device 1000. As shown in fig. 10, the computer device 1000 may include: a user interface 1002, a processor 1004, an encoder 1006, and a memory 1008. Signal receiver 1016 is used to receive or transmit data via cellular interface 1010, WIFI interface 1012. The encoder 1006 encodes the received data into a computer-processed data format. The memory 1008 has stored therein a computer program by which the processor 1004 is arranged to perform the steps of any of the method embodiments described above. The memory 1008 may include volatile memory (e.g., dynamic random access memory DRAM) and may also include non-volatile memory (e.g., one time programmable read only memory OTPROM). In some instances, the memory 1008 can further include memory located remotely from the processor 1004, which can be connected to the computer device 1000 via a network. The user interface 1002 may include: a keyboard 1018, and a display 1020.

In the computer device 1000 shown in fig. 10, the processor 1004 may be configured to call the computer program stored in the memory 1008 to implement:

acquiring a target transaction text of a target transaction account;

In one embodiment, the processor 1004 further performs the following steps:

acquiring a text of a target type, and identifying a new word set belonging to the target type in the text;

and acquiring an original word stock, and adding the new word set to the original word stock to obtain the preset word group set.
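The effect of adding the new word set to the original word stock can be illustrated with a minimal sketch. The application does not name a segmentation algorithm, so forward maximum matching is used here purely as an assumption; the function names and the sample words are hypothetical.

```python
def build_phrase_set(original_lexicon, new_words):
    """Merge the original word stock with the discovered new words."""
    return set(original_lexicon) | set(new_words)


def segment(text, phrase_set, max_len=None):
    """Forward maximum matching: at each position take the longest known phrase.

    Falls back to a single character when no phrase in the set matches.
    """
    if max_len is None:
        max_len = max((len(p) for p in phrase_set), default=1)
    max_len = max(1, max_len)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in phrase_set:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical data: "下注" stands in for a new word found in target-type text.
original_lexicon = {"转账", "红包"}
new_words = {"下注"}
preset_phrase_set = build_phrase_set(original_lexicon, new_words)

before = segment("转账下注红包", original_lexicon)
after = segment("转账下注红包", preset_phrase_set)
```

With only the original word stock, the unknown word is split into single characters; with the merged set it survives as one phrase, which is what allows the later recognition step to see type-specific vocabulary.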

In one embodiment, the processor 1004, when performing the step of identifying the new word set belonging to the target type in the text, specifically performs the following steps:

In one embodiment, the target second phrase is any one of a plurality of second phrases, a text source of the target second phrase is a first text source or a second text source, and the first text source and the second text source are divided according to a text application scenario;

when the processor 1004 performs the step of judging, according to the text source of the target second phrase, whether the target second phrase belongs to the new word set, the following steps are specifically executed:

In one embodiment, the processor 1004 further performs the following steps:

In one embodiment, the target character sequence is any one of a plurality of character sequences, and the phrase evaluation index of the target character sequence comprises a phrase frequency, a phrase solidity and a phrase freedom;

when the processor 1004 performs the step of judging, according to the phrase evaluation index of the target character sequence, whether the target character sequence belongs to the first phrase set, the following steps are specifically executed:

In an embodiment, when the processor 1004 performs the deduplication processing on the first phrase set to obtain the second phrase set, the following steps are specifically performed:

In an embodiment, when the processor 1004 performs the step of performing text recognition processing on the one or more target phrases to obtain the type of the target transaction text, the following steps are specifically executed:

In an embodiment, when the processor 1004 performs the step of determining the target transaction account as a transaction account of the target type if the type of the target transaction text is the target type, the following steps are specifically executed:

In one embodiment, when the processor 1004 executes the step of obtaining the target transaction text of the target transaction account, the following steps are specifically executed:

the processor 1004 further performs the following steps:

encapsulating the target transaction account into a second block, and storing the second block in the blockchain.

It should be understood that the computer device 1000 described in this embodiment of the present application may perform the text processing method described in the embodiments corresponding to fig. 3 to fig. 8, and may also implement the text processing apparatus 1 described in the embodiment corresponding to fig. 9, which is not described herein again. The beneficial effects of the same method are likewise not described again.

It should further be noted that an embodiment of the present application also provides a computer storage medium. The computer storage medium stores the aforementioned computer program executed by the text processing apparatus 1, and the computer program includes program instructions. When the processor executes the program instructions, the text processing method described in the embodiments corresponding to fig. 3 to 8 can be performed, so details are not repeated here. The beneficial effects of the same method are likewise not described again. For technical details not disclosed in the computer storage medium embodiments of the present application, refer to the description of the method embodiments of the present application. By way of example, the program instructions may be deployed to be executed on one computer device, on multiple computer devices at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network; the multiple computer devices distributed across the multiple sites and interconnected by the communication network may form a blockchain network.

According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device can execute the method in the embodiments corresponding to fig. 3 to fig. 8; the detailed description is therefore not repeated here.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above disclosure merely illustrates preferred embodiments of the present application and is not to be construed as limiting its scope; equivalent variations and modifications made to the present application therefore still fall within the scope of the present application.

Claims

1. A method of text processing, comprising:

acquiring a target transaction text of a target transaction account;

2. The method of claim 1, further comprising:

3. The method of claim 2, wherein the identifying a set of new words in the text that belong to the target type comprises:

4. The method of claim 3, wherein the target second phrase is any one of a plurality of second phrases, the text source of the target second phrase is the first text source or the second text source, and the first text source and the second text source are divided according to the text application scenario;

the judging, according to the text source of the target second phrase, whether the target second phrase belongs to the new word set comprises:

5. The method of claim 4, further comprising:

6. The method of claim 3, wherein the target character sequence is any one of a plurality of character sequences, and the phrase evaluation index of the target character sequence includes a phrase frequency, a phrase solidity and a phrase freedom;

the judging, according to the phrase evaluation index of the target character sequence, whether the target character sequence belongs to the first phrase set comprises:

7. The method of claim 3, wherein the performing de-duplication processing on the first set of phrases to obtain a second set of phrases comprises:

8. The method of claim 1, wherein said performing text recognition processing on the one or more target phrases to obtain the type of the target transaction text comprises:

9. The method of claim 8, wherein the determining the target transaction account as a transaction account of the target type if the type of the target transaction text is the target type comprises:

10. The method of claim 1, wherein the acquiring a target transaction text of a target transaction account comprises:

the method further comprises:

11. A text processing apparatus, comprising:

12. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the method according to any one of claims 1-10.

13. A computer storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions that, when executed by a processor, cause a computer device having the processor to perform the method of any one of claims 1-10.