CN113011875B - Text processing method, text processing device, computer equipment and storage medium - Google Patents

Info

Publication number
CN113011875B
Authority
CN
China
Prior art keywords
target
text
transaction
phrase
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110038717.5A
Other languages
Chinese (zh)
Other versions
CN113011875A (en
Inventor
赵薇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110038717.5A
Publication of CN113011875A
Application granted
Publication of CN113011875B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00 Payment architectures, schemes or protocols
    • G06Q20/38 Payment protocols; Details thereof
    • G06Q20/382 Payment protocols; Details thereof insuring higher security of transaction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Accounting & Taxation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Business, Economics & Management (AREA)
  • Computer Security & Cryptography (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application disclose a text processing method, a text processing device, computer equipment and a storage medium. The text processing method comprises the following steps: acquiring a target transaction text of a target transaction account; performing word segmentation on the target transaction text using a preset phrase set to obtain one or more target phrases, the preset phrase set being obtained by performing recognition processing on text of a target type; performing text recognition processing on the one or more target phrases to obtain the type of the target transaction text; and, if the type of the target transaction text is the target type, determining the target transaction account to be a transaction account of the target type. The method can improve the efficiency and accuracy of identifying the type of a transaction account.

Description

Text processing method, text processing device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a text processing method, a text processing device, a computer device, and a storage medium.
Background
Identifying the type of each transaction account among a massive number of transaction accounts helps ensure the security of a transaction system. Currently, the transaction behavior under a transaction account is checked manually, based on past experience, to determine the type of the transaction account.
Because the number of transaction accounts is huge and each transaction account contains a large number of transaction behaviors, identifying account types purely by hand is inefficient; manual identification is also strongly affected by subjective factors, so its accuracy is low.
Disclosure of Invention
The embodiment of the application provides a text processing method, a text processing device, computer equipment and a storage medium, which can improve the efficiency and accuracy of identifying the type of a transaction account.
In one aspect, an embodiment of the present application provides a text processing method, including:
acquiring a target transaction text of a target transaction account;
performing word segmentation on the target transaction text using a preset phrase set to obtain one or more target phrases, the preset phrase set being obtained by performing recognition processing on text of the target type;
performing text recognition processing on the one or more target phrases to obtain the type of the target transaction text;
and if the type of the target transaction text is a target type, determining the target transaction account as the transaction account of the target type.
An aspect of an embodiment of the present application provides a text processing apparatus, including:
an acquisition module, used for acquiring a target transaction text of a target transaction account;
a word segmentation module, used for performing word segmentation on the target transaction text using a preset phrase set to obtain one or more target phrases, the preset phrase set being obtained by performing recognition processing on text of the target type;
a first recognition module, used for performing text recognition processing on the one or more target phrases to obtain the type of the target transaction text;
and a determining module, used for determining the target transaction account to be a transaction account of the target type if the type of the target transaction text is the target type.
In one aspect, a computer device is provided, where the computer device includes a memory and a processor, where the memory stores a computer program, and where the computer program when executed by the processor causes the processor to perform the method in the foregoing embodiments.
In one aspect, the embodiments of the present application provide a computer storage medium storing a computer program, where the computer program includes program instructions that, when executed by a processor, perform the method in the foregoing embodiments.
In one aspect, the present application provides a computer program product or a computer program, where the computer program product or the computer program includes computer instructions, where the computer instructions are stored in a computer readable storage medium, and where the computer instructions, when executed by a processor of a computer device, perform the method in the above embodiments.
In the present application, the terminal automatically identifies the type of a transaction account from the transaction text under that account, without manual participation, which can improve the efficiency and accuracy of identifying transaction accounts. Furthermore, determining the type of a transaction account from its transaction text enriches the ways in which transaction accounts can be identified. In the transaction text recognition process, the transaction text is segmented using a preset phrase set determined by recognizing text of the target type; compared with a conventional phrase set, this helps ensure the accuracy of word segmentation, and a more accurate segmentation result can in turn further improve the accuracy of account identification.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a system architecture diagram for text processing according to an embodiment of the present invention;
FIGS. 2 a-2 d are schematic diagrams of a text processing scenario provided by examples of the present application;
FIG. 3 is a schematic flow chart of text processing according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a model evaluation effect according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of updating the word stock according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of identifying whether an account is a pyramid-selling account according to an embodiment of the present application;
FIG. 7 is a system architecture diagram of a blockchain provided in an embodiment of the present application;
FIG. 8 is a schematic flow chart of text processing according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a text processing device according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
Artificial intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, text processing, natural language processing, and machine learning/deep learning.
The scheme provided by the present application belongs to the text processing technology and the machine learning/deep learning technology in the field of artificial intelligence. In the present application, a text recognition model and a transaction classification model are trained through machine learning/deep learning; based on the phrase set obtained by segmenting a text to be recognized, these models can recognize the probability that the text belongs to the target type. Whether the transaction account belongs to the target type is then judged from that probability, and a business operation is executed based on the judgment result.
Fig. 1 is a system architecture diagram for text processing according to an embodiment of the present invention. The server 10f establishes a connection with a user terminal cluster through the switch 10e and the communication bus 10d; the user terminal cluster may include user terminal 10a, user terminal 10b and user terminal 10c. The database 10g stores the transaction text of a plurality of transaction accounts. For the transaction text of a transaction account, the server 10f performs word segmentation on the transaction text using a preset phrase set to obtain one or more phrases, where the preset phrase set is obtained by performing recognition processing on text of the target type. The server then performs text recognition processing on the one or more phrases obtained by segmentation to obtain the type of the transaction text of the transaction account; if the type of the transaction text is the target type, the transaction account is determined to be a transaction account of the target type.
The terminal devices 10a, 10b, 10c, etc. shown in fig. 1 may be smart devices having a display function, such as mobile phones, tablet computers, notebook computers, palm computers, mobile internet devices (MIDs), wearable devices, etc. The terminal device cluster and the server 10f may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
The method can be applied to a pyramid-selling account identification system (or a money laundering account identification system or a gambling account identification system). When it is necessary to identify whether an account is involved in pyramid selling (or money laundering or gambling), the transaction text of the account is extracted from a transaction pool and recognized using the scheme of the present application to determine the type of the account, which indicates whether the account is involved in pyramid selling (or money laundering or gambling). If the account is identified as being involved in pyramid selling (or money laundering or gambling), the account may be reported and subsequently cracked down on, for example by restricting the account from conducting transactions.
According to the text processing method, transaction texts and the like related to the account to be identified can be stored on the blockchain, and the credibility of the transaction texts can be ensured depending on the complete and transparent characteristics of the blockchain, so that the credibility of the identification result of the account to be identified is improved; subsequently, if the type of the transaction account to be identified is judged to be the target type, the account to be identified can be stored on the blockchain, so that the identification result is prevented from being tampered maliciously, and the identification result has traceability.
The following takes fig. 2a-2d as an example to describe how to identify whether an account is a pyramid-selling account, that is, an account of the pyramid-selling type. Please refer to fig. 2a-2d, which are schematic diagrams of a text processing scenario provided in the examples of the present application. As shown in fig. 2a, the interface 20a is the main interface of the pyramid-selling account identification system, which contains the transaction accounts that have already been identified and the transaction accounts to be identified. The processing state of an identified transaction account is "processed"; the processing state of a transaction account to be identified is "unprocessed", and its identification result is blank. The auditor may select the transaction account to be identified this time. As shown in fig. 2a, assume that the auditor selects transaction account number "5870" in order to identify whether this account is a pyramid-selling account.
Of course, instead of an auditor manually selecting the transaction account to be identified, the system may automatically pull a transaction account to be identified from the transaction account pool and determine whether the pulled transaction account is a pyramid-selling account.
After the auditor selects transaction account number "5870", as shown in fig. 2b, an interface 20b is displayed, which includes the transaction text in the transaction data of transaction account number "5870". As can be seen from fig. 2b, the transaction text mainly comprises the nickname of the transaction account, the nickname of the counterparty, transaction remarks, and other text information involved in the transaction process. After the auditor clicks the "Start" button in the interface 20b, the pyramid-selling account identification system extracts the transaction text 20c of transaction account number "5870", as shown in fig. 2c. The pyramid-selling account identification system invokes the preset word stock 20d to segment the transaction text 20c into a phrase set 20e. The preset word stock 20d includes conventional words and words specific to the pyramid-selling field; the latter are obtained by recognizing text belonging to the pyramid-selling type, which may be pyramid-selling news crawled from the network, educational articles crawled from anti-pyramid-selling websites, and the like. By recognizing text belonging to the pyramid-selling type, new pyramid-selling words and words specific to the pyramid-selling field can be discovered, so the transaction text can be segmented more accurately based on the preset word stock 20d.
After the pyramid-selling account identification system obtains the phrase set 20e of the transaction text 20c, it identifies whether the transaction text 20c is a pyramid-selling text based on the phrase set 20e and an artificial intelligence model. The specific identification process is as follows: the pyramid-selling account identification system converts each phrase in the phrase set 20e into a word vector, combines all the word vectors into a word vector matrix, and performs convolution and pooling operations on the word vector matrix to extract hidden features of the transaction text 20c. The hidden features of the transaction text 20c are then passed through a fully connected layer, which outputs the probability that the transaction text 20c belongs to the pyramid-selling type and the probability that it does not; the two probabilities in the output sum to 1.
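The identification process above (word vectors, convolution, pooling, full connection, two probabilities summing to 1) can be illustrated with a minimal forward-pass sketch. Everything in it is an assumption made for illustration: the vectors and weights are random stand-ins for trained parameters, and the function names, kernel width and dimensions are hypothetical rather than taken from the patent.

```python
import math
import random

def embed(phrases, dim=4, seed=0):
    """Map each phrase to a pseudo-random vector (stand-in for trained word vectors)."""
    rng = random.Random(seed)
    table, matrix = {}, []
    for phrase in phrases:
        if phrase not in table:
            table[phrase] = [rng.uniform(-1, 1) for _ in range(dim)]
        matrix.append(table[phrase])
    return matrix  # word vector matrix: one row per phrase

def conv_and_pool(matrix, kernels, width=2):
    """Convolve each kernel over `width` consecutive rows, apply ReLU, max-pool over positions.

    Assumes at least one phrase; short inputs are padded with zero rows.
    """
    dim = len(matrix[0])
    rows = matrix + [[0.0] * dim] * max(0, width - len(matrix))
    features = []
    for kernel in kernels:
        activations = []
        for start in range(len(rows) - width + 1):
            window = [v for row in rows[start:start + width] for v in row]
            activations.append(max(0.0, sum(w * v for w, v in zip(kernel, window))))
        features.append(max(activations))
    return features  # hidden features of the text

def fully_connected_softmax(features, weights, biases):
    """Fully connected layer followed by softmax, so the outputs sum to 1."""
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(phrases, dim=4, n_kernels=3, width=2, seed=0):
    """Return (P(text is target type), P(text is not target type)) with untrained weights."""
    rng = random.Random(seed + 1)
    matrix = embed(phrases, dim, seed)
    kernels = [[rng.uniform(-1, 1) for _ in range(dim * width)] for _ in range(n_kernels)]
    weights = [[rng.uniform(-1, 1) for _ in range(n_kernels)] for _ in range(2)]
    hidden = conv_and_pool(matrix, kernels, width)
    p_target, p_other = fully_connected_softmax(hidden, weights, [0.0, 0.0])
    return p_target, p_other
```

Because the weights are random, the probabilities themselves carry no meaning; the sketch only shows the data flow and the property that the two outputs sum to 1.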
Assuming the artificial intelligence model recognizes that the probability that the transaction text 20c belongs to the pyramid-selling type is 0.8 and the probability that it does not is 0.2, the transaction text 20c of transaction account number "5870" can be determined to be a pyramid-selling-type transaction text. Based on this, the pyramid-selling account identification system may determine that transaction account number "5870" is a pyramid-selling-type transaction account.
As shown in fig. 2d, since transaction account number "5870" is determined to be a pyramid-selling account, the main interface displays the processing state of transaction account number "5870" as "processed" and its identification result as "yes", that is, transaction account number "5870" is a pyramid-selling account.
Since transaction account number "5870" is identified as a pyramid-selling account, transaction restrictions may subsequently be imposed on it, for example prohibiting transaction account number "5870" from conducting any transfer transactions, or even prohibiting it from conducting social sessions with other transaction accounts.
For the specific process of obtaining the target transaction account number (e.g. transaction account number "5870" in the above embodiment) and the target transaction text (e.g. the transaction text 20c in the above embodiment), and of segmenting the target transaction text using the preset phrase set (e.g. the preset word stock 20d in the above embodiment) to obtain one or more target phrases (e.g. the phrase set 20e in the above embodiment), refer to the embodiments corresponding to fig. 3-8 below.
Please refer to fig. 3, which is a schematic flow chart of text processing provided in an embodiment of the present application. Since the text processing of the present application involves identifying a text type with an artificial intelligence model, the following steps are described with a server with better performance as the execution subject; the pyramid-selling account identification system in the above embodiment may be deployed on the server. The text processing includes the following steps:
step S101, obtaining a target transaction text of the target transaction account.
Specifically, when the server receives a detection request for a target transaction account number (e.g., transaction account number "5870" in the embodiment of fig. 2 a-2 d described above), the server extracts transaction text for the target transaction account number (referred to as target transaction text) from the transaction data pool, wherein the target transaction text may include a nickname for the target transaction account number, a nickname for a transaction partner, a transaction remark, and the like.
In order to detect transaction accounts in as close to real time as possible, the server may work in units of hours: it extracts the transaction accounts under which transaction behavior occurred within the past hour, takes each extracted transaction account as a target transaction account, and generates a detection request for it.
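A minimal sketch of this hourly extraction is shown below. It assumes, purely for illustration, that transactions are available as (account number, timestamp) pairs; the function name and data layout are hypothetical and are not specified by the patent.

```python
from datetime import datetime, timedelta

def accounts_active_in_last_hour(transactions, now):
    """Return the account numbers that have at least one transaction in (now - 1h, now]."""
    cutoff = now - timedelta(hours=1)
    # A set removes duplicates when one account trades several times within the hour.
    return sorted({account for account, ts in transactions if cutoff < ts <= now})

now = datetime(2021, 1, 12, 10, 0)
transactions = [
    ("5870", datetime(2021, 1, 12, 9, 30)),   # inside the window
    ("5870", datetime(2021, 1, 12, 9, 45)),   # same account again
    ("1234", datetime(2021, 1, 12, 8, 59)),   # older than one hour
    ("4321", datetime(2021, 1, 12, 10, 0)),   # exactly at `now`
]
targets = accounts_active_in_last_hour(transactions, now)
```

Each account number in `targets` would then be treated as a target transaction account for which a detection request is generated.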
Of course, instead of the server automatically pulling the target transaction account number for detection, the server may also generate a detection request for the target transaction account number in response to a user operation, e.g., the user manually selects the target transaction account number for current detection, and the server generates a detection request for the target transaction account number.
The target transaction text may be the transaction text of the target transaction account for the last month, or for the last three months, or all the transaction text since the target transaction account was created.
Step S102, word segmentation processing is carried out on the target transaction text by using a preset phrase set to obtain one or more target phrases; the preset phrase set is obtained by identifying and processing the text of the target type.
Specifically, the server performs word segmentation processing on the obtained target transaction text based on the preset phrase set, so as to divide the target transaction text into one or more phrases, where the divided one or more phrases are all referred to as target phrases (e.g. phrase set 20e in the corresponding embodiment of fig. 2 a-2 d). The preset phrase set comprises some conventional phrases and phrases belonging to the target type.
The preset phrase set is obtained by performing recognition processing on text of the target type. When the method is applied to identifying pyramid-selling accounts, the target type may be the pyramid-selling type, that is, the preset phrase set is obtained by recognizing text of the pyramid-selling type; when the method is applied to identifying money laundering accounts, the target type may be the money laundering type, that is, the preset phrase set is obtained by recognizing text of the money laundering type; when the method is applied to identifying gambling accounts, the target type may be the gambling type, that is, the preset phrase set is obtained by recognizing text of the gambling type.
The text of the target type may include news of the target type crawled from the network, text of the target type crawled from specialized websites (for example, when the target type is the pyramid-selling type, educational articles crawled from websites dedicated to combating pyramid selling), and complaint text of the target type.
The word segmentation of the target transaction text based on the preset phrase set may be implemented with the jieba word segmentation algorithm, whose specific process is described below:
The first step is to construct the prefix dictionary of the preset phrase set. Each preset phrase in the preset phrase set also carries a word frequency, which is either the frequency of the preset phrase in the text of the target type or a default value. All preset phrases in the preset phrase set are read and the prefix words of each preset phrase are constructed. A prefix word that is itself in the preset phrase set keeps the word frequency it originally carries; a prefix word that is not in the preset phrase set is given a word frequency of 0. The word frequencies of the prefix words are used later to calculate the probability of each path.
For example, if the preset phrase is "well learned" (a multi-character phrase in the source language), its prefix words are its leading character subsequences, rendered here as "good" and "good learning".
The second step is to construct a directed acyclic graph (DAG) of the target transaction text. For each character in the target transaction text, all phrases that may start at that character are traversed, and the end index of every phrase that exists in the preset phrase set is recorded. The possible words starting at each character together form a directed acyclic graph; the probability of each segmentation path is calculated on this graph, and the path with the maximum probability is taken as the best segmentation. jieba represents the DAG with a dictionary structure of the form {k: [k, j], m: [m, p, q], …}, where k and m are positions of characters in the target transaction text and the list corresponding to k stores the end positions of the possible words that begin at position k in the target transaction text.
The third step is dynamic programming, which determines the maximum-probability path. The main function for calculating the maximum-probability path in jieba segmentation is calc(self, sentence, DAG, route), which calculates the maximum-probability path from the constructed directed acyclic graph. The function is a bottom-up dynamic program: starting from the last character (N-1) of the target transaction text, it traverses each character position (idx) of the target transaction text in reverse order and computes the log-probability score of the sub-sentence [idx, N-1]. The case with the highest log-probability score is then saved in route as a tuple (log-probability, end position of the word). In the function, the log-probability is computed from the logarithm of the word frequency of each hit prefix word relative to the logarithm of the sum of word frequencies; computing with logarithms effectively prevents numerical underflow.
Thus, one or more target phrases of the target transaction text can be obtained.
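The three steps above can be sketched in a few self-contained functions. This is a simplified illustration modelled on the prefix-dictionary, DAG and dynamic-programming idea, not jieba's actual source; the function names and the toy phrase set are assumptions, and jieba's optional HMM handling of unknown words is omitted.

```python
from math import log

def build_prefix_dict(phrase_freqs):
    """Step 1: keep each phrase's own frequency; prefixes not in the set get frequency 0."""
    freq, total = {}, 0
    for word, f in phrase_freqs.items():
        freq[word] = f
        total += f
        for end in range(1, len(word)):
            freq.setdefault(word[:end], 0)
    return freq, total

def build_dag(text, freq):
    """Step 2: for each start position k, list the end positions of words found in the set."""
    dag, n = {}, len(text)
    for k in range(n):
        ends, i, frag = [], k, text[k]
        while i < n and frag in freq:
            if freq[frag] > 0:          # a real phrase, not just a prefix
                ends.append(i)
            i += 1
            frag = text[k:i + 1]
        dag[k] = ends or [k]            # fall back to the single character
    return dag

def calc_route(text, dag, freq, total):
    """Step 3: right-to-left dynamic programming over log-probabilities."""
    n = len(text)
    log_total = log(total or 1)
    route = {n: (0.0, 0)}
    for idx in range(n - 1, -1, -1):
        route[idx] = max(
            (log(freq.get(text[idx:x + 1]) or 1) - log_total + route[x + 1][0], x)
            for x in dag[idx]
        )
    return route

def cut(text, phrase_freqs):
    """Segment `text` by following the maximum-probability path."""
    freq, total = build_prefix_dict(phrase_freqs)
    route = calc_route(text, build_dag(text, freq), freq, total)
    words, idx = [], 0
    while idx < len(text):
        end = route[idx][1] + 1
        words.append(text[idx:end])
        idx = end
    return words

# Toy phrase set with Latin letters standing in for characters.
toy_phrases = {"ab": 5, "abc": 1, "cd": 8, "d": 2}
segments = cut("abcd", toy_phrases)
```

In this toy example position 0 has two candidate words, "ab" and "abc"; the path "ab" + "cd" has the higher total log-probability, so it is chosen over "abc" + "d".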
The generation process of the preset phrase set is further described below:
A text of the target type is acquired, a new word set belonging to the target type is identified in the text, an original word stock is acquired, and the identified new word set is added to the original word stock to obtain the preset phrase set. The new word set consists of words that are specific to the target type and do not appear in the original word stock; continuously adding such new words to the original word stock keeps the preset phrase set up to date, which helps ensure word segmentation accuracy.
The update period of the preset phrase set may be one month or one week; that is, new text of the target type is crawled once per update period, a new word set is identified from it, and a new preset phrase set is obtained.
The following describes a specific procedure of how to determine a new word set of a target type based on the text of the target type:
The server may divide the text of the target type into a plurality of character sequences according to the n-gram principle, each character sequence containing n characters. For example, if the text is AAABBC and n=2, the server may divide the text into 3 character sequences: AA, AB, BC.
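The division in the example can be reproduced with a short sketch. The function name is hypothetical, and the `sliding` option is an added assumption: it shows the overlapping-window variant that is also commonly used for n-grams, which the example in the text does not use.

```python
def split_into_sequences(text, n, sliding=False):
    """Split `text` into character sequences of length n.

    By default the chunks do not overlap, which matches the AAABBC example;
    with sliding=True a window of length n is moved one character at a time.
    """
    step = 1 if sliding else n
    return [text[i:i + n] for i in range(0, len(text) - n + 1, step)]

chunks = split_into_sequences("AAABBC", 2)
windows = split_into_sequences("AAABBC", 2, sliding=True)
```

Here `chunks` holds the three sequences from the example, while `windows` additionally contains the overlapping sequences.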
The server respectively determines phrase evaluation indexes of each character sequence, wherein the phrase evaluation indexes comprise: phrase frequency, phrase solidification degree and phrase freedom degree, wherein phrase evaluation index is used for evaluating whether a character sequence can be judged as a phrase.
According to the phrase evaluation index of each character sequence, the server selects from all character sequences those that can be judged to be phrases, and combines the selected character sequences into a first phrase set. The server then de-duplicates the first phrase set against the original word stock; each character sequence remaining after de-duplication is called a second phrase, and all second phrases are combined into a second phrase set. The second phrases are therefore new words that are not in the original word stock. To further screen out new words of the target type, the server also needs to select the new word set belonging to the target type from the second phrase set according to the text source of the text corresponding to each second phrase.
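The selection and de-duplication described above can be sketched as follows. The threshold values, the tuple layout of the evaluation indexes and the function name are all assumptions made for illustration; the patent text here does not state concrete thresholds, and the final screening by text source is left out.

```python
def select_new_phrases(indexes, original_word_stock,
                       min_freq=5, min_solid=10.0, min_free=1.0):
    """Build the first and second phrase sets.

    indexes maps a character sequence to (frequency, solidification degree,
    degree of freedom). The thresholds are hypothetical placeholders.
    """
    # First phrase set: sequences whose indexes all reach the thresholds.
    first = [seq for seq, (freq, solid, free) in indexes.items()
             if freq >= min_freq and solid >= min_solid and free >= min_free]
    # Second phrase set: drop anything already in the original word stock.
    second = [seq for seq in first if seq not in original_word_stock]
    return first, second

indexes = {
    "AB": (12, 35.0, 2.1),   # passes every threshold, already known
    "BC": (9, 20.0, 1.5),    # passes every threshold, new
    "CD": (30, 2.0, 2.5),    # solidification degree too low
    "DE": (2, 50.0, 3.0),    # too rare
}
first_set, second_set = select_new_phrases(indexes, {"AB"})
```

With these made-up numbers, "AB" and "BC" form the first phrase set and only "BC" survives de-duplication as a second phrase.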
In the following, a character sequence is taken as an example and referred to as a target character sequence to describe how to calculate the phrase evaluation index of the target character sequence, and how to determine whether the target character sequence belongs to the first phrase set according to the phrase evaluation index of the target character sequence.
The server counts phrase frequencies of the target character sequence in the text of the target type.
The server calculates the degree of solidification of the target character sequence (called the phrase solidification degree), which measures how tightly the characters within a character sequence are bound together. For example, the degree of solidification of character sequences such as "colored glaze" and "apple" is very high, whereas that of a character sequence such as "glory" is relatively low. The degree of solidification of the target character sequence is calculated as follows: the target character sequence is first split into different combination pairs, for example, 'abcd' can be split into ('a', 'bcd'), ('ab', 'cd'), and ('abc', 'd'); then the degree of solidification of each combination pair is calculated as $D(s_1, s_2) = P(s_1 s_2) / (P(s_1) \times P(s_2))$; finally, the smallest of the degrees of solidification of the combination pairs is taken as the degree of solidification of the target character sequence. Here $P(x)$ represents the phrase frequency of x in the text.
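Taking the word "cinema" as an example, the specific calculation formula is as follows:
$$C(\text{cinema}) = \min_{(s_1, s_2)} \frac{p(\text{cinema})}{p(s_1)\, p(s_2)} \qquad (1)$$
where C(cinema) represents the degree of solidification of "cinema", p(cinema) represents the frequency of occurrence of the word in the text, and $(s_1, s_2)$ ranges over the combination pairs into which the word can be split.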
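The solidification calculation can be sketched as follows. This is a hedged illustration: the helper names are hypothetical, and P(x) is taken here to be the relative frequency of x among all positions of the same length in the text, which is one possible reading of "phrase frequency" and is an assumption.

```python
from typing import List, Tuple


def relative_frequency(text: str, seq: str) -> float:
    """Share of length-len(seq) positions in text at which seq occurs (assumed P(x))."""
    positions = len(text) - len(seq) + 1
    if not seq or positions <= 0:
        return 0.0
    hits = sum(1 for i in range(positions) if text.startswith(seq, i))
    return hits / positions


def combination_pairs(seq: str) -> List[Tuple[str, str]]:
    """All splits of seq into two contiguous, non-empty parts."""
    return [(seq[:i], seq[i:]) for i in range(1, len(seq))]


def solidification(text: str, seq: str) -> float:
    """Minimum over all combination pairs of P(s1 s2) / (P(s1) * P(s2))."""
    if len(seq) < 2:
        raise ValueError("a sequence needs at least two characters to be split")
    p_whole = relative_frequency(text, seq)
    if p_whole == 0.0:
        return 0.0  # the sequence never occurs, so it cannot be a phrase
    # If the whole sequence occurs, each part occurs too, so no division by zero.
    return min(
        p_whole / (relative_frequency(text, s1) * relative_frequency(text, s2))
        for s1, s2 in combination_pairs(seq)
    )
```

Taking the minimum means that one weakly bound split is enough to give the whole sequence a low degree of solidification.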
The server calculates the degree of freedom of the target character sequence (called the phrase degree of freedom), which measures the extent to which the character sequence can be used independently. For example, a fragment of the word "chocolate" has a degree of solidification as high as that of "chocolate" itself, but its right-adjacent character is always the same, so its degree of freedom is almost zero and the fragment cannot form a word on its own. A character sequence that can stand alone as a word should have rich and varied adjacent characters. The degree of freedom is calculated as follows:
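$$F(w) = \min\{H_L(w), H_R(w)\} \qquad (2)$$
where $F(w)$ is the degree of freedom of the character sequence $w$, and $H_L(w)$ and $H_R(w)$ are the information entropies of the left-adjacent and right-adjacent characters of $w$, respectively. The information entropy is calculated as follows:
$$H(w) = -\sum_{x} p(x) \log p(x) \qquad (3)$$
where $x$ ranges over the characters adjacent to $w$ on the side in question, and $p(x)$ is the proportion of occurrences of $w$ in which $x$ is that adjacent character.
Thus, the phrase frequency, phrase solidification degree and phrase degree of freedom of the target character sequence are determined.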
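The degree of freedom can be sketched as follows, assuming the standard Shannon entropy with the natural logarithm over the characters immediately to the left and right of each occurrence. The function names are hypothetical, and occurrences at the very start or end of the text simply contribute no neighbor on that side.

```python
import math
from collections import Counter


def neighbor_entropy(neighbors: Counter) -> float:
    """Shannon entropy (natural log) of the observed adjacent characters."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log(c / total) for c in neighbors.values())


def degree_of_freedom(text: str, seq: str) -> float:
    """min(left-neighbor entropy, right-neighbor entropy) of seq in text."""
    if not seq:
        raise ValueError("seq must be non-empty")
    left, right = Counter(), Counter()
    start = text.find(seq)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(seq)
        if end < len(text):
            right[text[end]] += 1
        start = text.find(seq, start + 1)  # also counts overlapping occurrences
    return min(neighbor_entropy(left), neighbor_entropy(right))
```

A sequence whose neighbor on one side never varies gets an entropy of zero on that side, and therefore a degree of freedom of zero, which matches the "chocolate" fragment example.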
If the phrase frequency of the target character sequence is greater than a preset frequency threshold, the phrase solidification degree of the target character sequence is greater than a preset solidification degree threshold, and the phrase freedom degree of the target character sequence is greater than a preset freedom degree threshold, the target character sequence can be determined to belong to the first phrase set; otherwise, if the phrase frequency of the target character sequence is not greater than the preset frequency threshold, or the phrase solidification degree of the target character sequence is not greater than the preset solidification degree threshold, or the phrase freedom degree of the target character sequence is not greater than the preset freedom degree threshold, it can be determined that the target character sequence does not belong to the first phrase set.
In general, it is determined that the target character sequence belongs to the first phrase set only if the phrase frequency, the phrase solidification degree, and the phrase degree of freedom of the target character sequence are all greater than a threshold value.
Alternatively, in addition to using the above strategy to determine whether the target character sequence belongs to the first phrase set, the following strategy may be used to determine whether the target character sequence belongs to the first phrase set:
If the phrase frequency of the target character sequence is greater than a preset frequency threshold, or the phrase solidification degree of the target character sequence is greater than a preset solidification degree threshold, or the phrase freedom degree of the target character sequence is greater than a preset freedom degree threshold, determining that the target character sequence belongs to the first phrase set; otherwise, if the phrase frequency of the target character sequence is not greater than the preset frequency threshold, the phrase solidification degree of the target character sequence is not greater than the preset solidification degree threshold, and the phrase degree of freedom of the target character sequence is not greater than the preset degree of freedom threshold, it can be determined that the target character sequence does not belong to the first phrase set.
In general, it can be determined that the target character sequence belongs to the first phrase set as long as at least one of the phrase frequency, the phrase solidification degree, and the phrase degree of freedom of the target character sequence is greater than a threshold value.
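The two selection strategies differ only in whether all three thresholds or at least one of them must be exceeded. A minimal sketch, with hypothetical class and parameter names and arbitrary example thresholds:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhraseIndex:
    frequency: float
    solidification: float
    freedom: float


@dataclass(frozen=True)
class Thresholds:
    frequency: float
    solidification: float
    freedom: float


def belongs_to_first_phrase_set(index: PhraseIndex, thresholds: Thresholds,
                                require_all: bool = True) -> bool:
    """require_all=True: every index must exceed its threshold (first strategy).
    require_all=False: at least one index must exceed its threshold (second strategy)."""
    checks = (
        index.frequency > thresholds.frequency,
        index.solidification > thresholds.solidification,
        index.freedom > thresholds.freedom,
    )
    return all(checks) if require_all else any(checks)


limits = Thresholds(frequency=3, solidification=1.5, freedom=0.5)
candidate = PhraseIndex(frequency=5, solidification=2.0, freedom=0.1)
strict = belongs_to_first_phrase_set(candidate, limits)                      # all three
loose = belongs_to_first_phrase_set(candidate, limits, require_all=False)   # any one
```

The strict variant favors precision of the first phrase set, while the loose variant admits more candidates and leaves more work to the later screening steps.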
The server deduplicates the first phrase set against the original word stock to obtain the second phrase set as follows: the server determines the intersection between the original word stock and the first phrase set, deletes the phrases in the intersection from the first phrase set, refers to the phrases remaining in the first phrase set as second phrases, and combines all the second phrases into the second phrase set.
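With both collections held as sets, the deduplication is a set intersection followed by a set difference. A small sketch with a hypothetical function name and made-up phrases:

```python
from typing import Set


def deduplicate(first_phrase_set: Set[str], original_word_stock: Set[str]) -> Set[str]:
    """Remove from the first phrase set every phrase already in the original word stock."""
    intersection = first_phrase_set & original_word_stock
    return first_phrase_set - intersection


second_phrase_set = deduplicate({"transfer", "level fee", "team bonus"},
                                {"transfer", "payment"})
```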
Taking a second phrase (referred to as a target second phrase) as an example, how to determine whether the target second phrase belongs to a new word set of a target type according to the text source of the target second phrase is described below:
the text source of the target second phrase may be considered as the text source of the text belonging to the target type in the foregoing description, and the text source may be a first text source or a second text source, the first text source and the second text source being divided according to the text application scenario. For example, the text belonging to the first text source is news of the target type crawled from the network, or text of the target type crawled from a dedicated website; text belonging to the second text source is complaint text about the target type, etc.
If the text source corresponding to the target second phrase is the first text source, the target second phrase can be directly determined to belong to the new word set; if the text source of the target second phrase is the second text source, acquiring the transaction text of each transaction account in the transaction account set, wherein the transaction account set comprises a plurality of transaction accounts, the types of the transaction accounts in the transaction account set are determined, and the types of the transaction accounts are either of a target type or a non-target type.
From the transaction account set, the server selects the transaction accounts whose transaction text contains the target second phrase; the selected transaction accounts are called transaction accounts to be determined. From the plurality of transaction accounts to be determined, the server then selects those belonging to the target type. The server calculates the ratio of the number of transaction accounts to be determined that belong to the target type to the number of all transaction accounts to be determined; if the ratio is greater than a preset threshold, the target second phrase is determined to belong to the new word set.
In short, whether the target second phrase belongs to the new word set is inferred backward from the transaction texts of other transaction accounts whose types have already been determined.
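The source-dependent screening can be sketched as follows. The source labels, function name, and data layout (one transaction text string and one boolean type label per account) are assumptions made for illustration only.

```python
from typing import Mapping

FIRST_SOURCE = "first"    # e.g. crawled news or dedicated-website text
SECOND_SOURCE = "second"  # e.g. complaint text


def belongs_to_new_word_set(phrase: str, source: str,
                            account_texts: Mapping[str, str],
                            account_is_target: Mapping[str, bool],
                            ratio_threshold: float) -> bool:
    """Decide whether a second phrase is a new word of the target type."""
    if source == FIRST_SOURCE:
        return True  # phrases from the first text source are accepted directly
    # Accounts whose transaction text contains the phrase ("to be determined").
    pending = [acc for acc, text in account_texts.items() if phrase in text]
    if not pending:
        return False  # no evidence at all; treated as not belonging (assumption)
    target_count = sum(1 for acc in pending if account_is_target[acc])
    return target_count / len(pending) > ratio_threshold


texts = {"a1": "join team level fee", "a2": "level fee refund", "a3": "birthday gift"}
labels = {"a1": True, "a2": False, "a3": False}
accepted = belongs_to_new_word_set("level fee", SECOND_SOURCE, texts, labels, 0.4)
```

In the example, two accounts contain the phrase and one of them is of the target type, so the ratio is 0.5 and the phrase is accepted for a threshold of 0.4.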
And step S103, performing text recognition processing on the one or more target phrases to obtain the type of the target transaction text.
Specifically, a word vector model is called, each target phrase after word segmentation of the target transaction text is respectively converted into a word vector, and all the word vectors are combined into a word matrix. For example, the dimension of the word vector is 1×m, and n target phrases are obtained after the target transaction text is segmented, so the dimension of the word matrix is: n×m.
The output of the trained text classification model is the probability that a transaction text belongs to the target type and the probability that it does not. A convolution layer in the trained text classification model is called to perform a convolution operation on the combined word matrix to obtain convolution features; a pooling layer in the text classification model is called to perform a pooling operation on the convolution features to obtain the pooling features of the target transaction text; and a fully connected layer in the text classification model is called to perform full-connection processing on the pooling features to obtain the probability that the target transaction text belongs to the target type (called the first probability). If the first probability is greater than or equal to a first probability threshold, the type of the target transaction text is determined to be the target type; otherwise, if the first probability is smaller than the first probability threshold, the type of the target transaction text is determined not to be the target type.
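The forward pass through the convolution, pooling and fully connected layers can be illustrated with a deliberately tiny pure-Python sketch. Several details are assumptions not stated in the text: ReLU activation, max-over-time pooling, a softmax over two outputs, and the hand-picked weights, which stand in for trained parameters.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def convolve(word_matrix: Matrix, kernel: Matrix, bias: float = 0.0) -> List[float]:
    """Slide a (k x m) kernel over the (n x m) word matrix, with ReLU (assumed)."""
    k = len(kernel)
    features = []
    for start in range(len(word_matrix) - k + 1):
        total = bias
        for r in range(k):
            total += sum(x * w for x, w in zip(word_matrix[start + r], kernel[r]))
        features.append(max(0.0, total))
    return features


def first_probability(word_matrix: Matrix, kernels: Sequence[Matrix],
                      fc_weights: Matrix, fc_bias: Sequence[float]) -> float:
    """Probability that the text is of the target type (output 0 of the softmax)."""
    # Max-over-time pooling: one pooled value per kernel (assumed pooling type).
    pooled = [max(convolve(word_matrix, k), default=0.0) for k in kernels]
    logits = [sum(p * w for p, w in zip(pooled, row)) + b
              for row, b in zip(fc_weights, fc_bias)]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]  # numerically stable softmax
    return exps[0] / sum(exps)


word_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # n=3 phrases, m=2 dimensions
kernels = [[[1.0, 0.0], [0.0, 1.0]]]                 # one kernel spanning 2 phrases
p1 = first_probability(word_matrix, kernels, [[1.0], [-1.0]], [0.0, 0.0])
is_target_text = p1 >= 0.5                           # first probability threshold
```

In practice the weights come from training and the layers would be provided by a deep-learning framework; the sketch only shows how an n×m word matrix is reduced to a single probability.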
Step S104, if the type of the target transaction text is a target type, determining the target transaction account as the transaction account of the target type.
Specifically, if the type of the target transaction text is a target type, the server may directly determine the target transaction account number as a transaction account number belonging to the target type.
In addition to directly determining the type of the target transaction account, the type of the target transaction account may also be determined using the following policies:
if the type of the target transaction text is the target type, the server acquires the transaction flow of the target transaction account (called the target transaction flow). The target transaction flow includes the transaction time (called the target transaction time), the transaction resource data amount (called the target transaction resource data amount), and the like. The transaction flow may be that of the target transaction account over the last month or the last three months, or all of the transaction flow since the target transaction account was created.
Transaction features are generated from the target transaction time and the target transaction resource data amount and are input into a trained transaction classification model, whose output is the probability that a transaction flow belongs to the target type and the probability that it does not. The transaction classification model outputs the probability that the target transaction flow is of the target type (called the second probability); if the sum of the first probability and the second probability is not smaller than a preset second probability threshold, the target transaction account is determined to be a transaction account belonging to the target type.
Optionally, if the second probability is not less than the second probability threshold, determining the target transaction account as the transaction account belonging to the target type.
In summary, whether the type of the target transaction account is the target type is determined with reference to both the target transaction text and the target transaction flow of the target transaction account. Judging the type of the target transaction account from transaction data of multiple dimensions can improve the accuracy of identifying the target transaction account.
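The combined decision can be sketched as follows. The function and parameter names are hypothetical, the two probabilities are assumed to come from the text classification model and the transaction classification model respectively, and the threshold values are arbitrary examples.

```python
def is_target_account(first_probability: float, second_probability: float,
                      first_threshold: float, second_threshold: float,
                      use_sum: bool = True) -> bool:
    """Combine the text-based and flow-based probabilities.

    use_sum=True : accept if first + second >= second_threshold.
    use_sum=False: accept if second >= second_threshold (optional variant).
    In both cases the text itself must first be judged to be of the target type.
    """
    if first_probability < first_threshold:
        return False
    score = first_probability + second_probability if use_sum else second_probability
    return score >= second_threshold


by_sum = is_target_account(0.9, 0.7, first_threshold=0.5, second_threshold=1.5)
by_flow_only = is_target_account(0.9, 0.7, 0.5, 0.6, use_sum=False)
```

Note that the second probability threshold lives on a different scale in the two variants: up to 2 when the probabilities are summed, and up to 1 when only the second probability is compared.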
Referring to fig. 4, fig. 4 is a schematic diagram of a model evaluation effect provided in an embodiment of the present application, where the dotted line in fig. 4 indicates precision and the solid line indicates recall. As can be seen from fig. 4, both the precision and the recall of the present application on the test dataset are relatively high. Furthermore, on the test dataset the accuracy of the present application is 0.991, the KS (Kolmogorov-Smirnov) value is 0.964, and the AUC (area under the ROC curve) reaches 0.996. The accuracy of the method is also good when it is verified on actual platform-wide traffic and when high-scoring accounts are spot-checked.
Referring to fig. 5, fig. 5 is a schematic flow chart of a word stock update provided in an embodiment of the present application, in which the embodiment mainly describes a process of adding some new words to the word stock to update the word stock, and the word stock update includes the following steps:
in step S201, the flow starts.
Step S202, pull text data.
Specifically, the server may pull text from dedicated pyramid-selling websites, crawl pyramid-selling news on the network, and pull complaint text about pyramid selling. The pulled text data may correspond to the text belonging to the target type in the present application.
Step S203, preprocessing is performed on the text data, and nonsensical words or symbols in the text data are filtered.
Step S204, a new word in the text data is found.
Specifically, the text data is divided into a plurality of character sequences, and 3 word-formation evaluation indexes are calculated for each character sequence: word frequency, degree of solidification, and degree of freedom. According to these 3 indexes, the phrases in the text data can be screened out, and after deduplication against the existing word stock, a batch of pending new words is obtained. A pending new word may correspond to a second phrase in the present application.
The specific process of calculating the word frequency, the degree of solidification and the degree of freedom of each character sequence can be referred to as step S102 in the corresponding embodiment of fig. 3.
After a batch of pending new words is found, a qualitative risk judgment is performed on them. Only a pending new word that satisfies the qualitative risk judgment condition can be treated as a new word belonging to the pyramid-selling type and added to the word stock.
Step S205, if the text in which the pending new word is located is news crawled from the network or text pulled from a dedicated pyramid-selling website, the pending new word may be determined to be a new word belonging to the pyramid-selling type.
Step S206, if the text in which the pending new word is located is a pyramid-selling complaint text, a batch of accounts whose transaction texts contain the pending new word is recalled.
Step S207, determining whether the pending new word is a new word belonging to the pyramid-selling type.
Specifically, the accounts belonging to the pyramid-selling type are screened out from the batch of recalled accounts (manual screening may be adopted here). The malicious concentration of the pending new word is calculated (malicious concentration = number of recalled pyramid-selling accounts / total number of recalled accounts), and whether the pending new word is a pyramid-selling new word is determined according to a threshold.
Step S208, updating the word stock.
Specifically, the new words belonging to the pyramid-selling type that were screened out by the above 2 methods are added to the word stock to form a new word stock. These screened new words may correspond to the phrases in the new word set in the present application. The new word stock may be used in model training or audit decisions.
Step S209, the flow ends.
Referring to fig. 6, fig. 6 is a flowchart of identifying whether an account is a pyramid-selling account according to an embodiment of the present application. The identification process involves the following steps:
in step S301, the flow starts.
Step S302, extracting transaction text of an account to be identified.
Specifically, the transaction text of the account to be identified is acquired, including data such as transfer text, red packet text, and nickname text; the transaction text serves as the model feature of the account to be identified.
Step S303, preprocessing transaction text.
Specifically, neutral words and stop words in the transaction text are filtered out, for example neutral words such as "happy birthday", "happy and happy century", and "blessing", and stop words such as "earth" and "get".
Step S304, loading the updated word stock, and segmenting the preprocessed transaction text into a plurality of phrases based on the updated word stock. Because the updated word stock contains new pyramid-selling words, the word segmentation result of the transaction text is more accurate.
The updated word stock is the word stock to which the new words belonging to the pyramid-selling type were added in the embodiment corresponding to fig. 5.
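The effect of the updated word stock on segmentation can be illustrated with forward maximum matching. This particular algorithm is an assumption chosen for brevity; the text does not state which dictionary-based segmentation method is used, and the function name and sample strings are hypothetical.

```python
from typing import List, Set


def segment(text: str, word_stock: Set[str]) -> List[str]:
    """Forward maximum matching: at each position take the longest word-stock entry,
    falling back to a single character when nothing matches."""
    max_len = max(1, max((len(w) for w in word_stock), default=1))
    result, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in word_stock:
                result.append(piece)
                i += size
                break
    return result


before = segment("abcd", {"ab"})        # original word stock
after = segment("abcd", {"ab", "cd"})   # word stock updated with the new word "cd"
```

Before the update the unknown span is broken into single characters; after the update it is kept together as one phrase, which is the accuracy gain the step describes.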
Step S305, performing one-hot encoding on each phrase obtained by word segmentation.
After one-hot encoding, each phrase corresponds to a one-hot vector in which exactly one element is 1 and all other elements are 0.
Step S306, reducing the dimension of each one-hot vector to convert it into a word vector.
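Steps S305 and S306 (one-hot encoding followed by dimension reduction) can be sketched as a multiplication of the one-hot vector by an embedding matrix, which simply selects one row. The vocabulary, the matrix values, and the function names are made up for illustration; in practice the matrix would be learned.

```python
from typing import List, Sequence


def one_hot(index: int, size: int) -> List[float]:
    """Vector of length size with a single 1 at position index."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    return [1.0 if i == index else 0.0 for i in range(size)]


def reduce_dimension(vector: Sequence[float],
                     embedding: Sequence[Sequence[float]]) -> List[float]:
    """(1 x V) one-hot vector times (V x m) embedding matrix gives a (1 x m) word vector."""
    m = len(embedding[0])
    return [sum(vector[v] * embedding[v][j] for v in range(len(embedding)))
            for j in range(m)]


vocabulary = ["join", "level", "fee"]                 # V = 3 phrases
embedding = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]      # m = 2 dimensions
encoded = one_hot(vocabulary.index("level"), len(vocabulary))
word_vector = reduce_dimension(encoded, embedding)
```

Because only one element of the one-hot vector is non-zero, real implementations usually replace the multiplication with a direct row lookup.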
Step S307, the word vectors of all phrases are input into a trained TextCNN (text convolutional neural network) model, and a convolution layer in the TextCNN model performs a convolution operation on the word vectors of all phrases to obtain convolution features.
The TextCNN model may correspond to the text classification model in the present application.
Step S308, a pooling layer in the TextCNN model performs a pooling operation on the convolution features to obtain pooling features.
Step S309, a fully connected layer in the TextCNN model fully connects the pooling features to obtain fully connected features.
Optionally, a convolution layer in the TextCNN model performs a convolution operation on the word vectors of all phrases to obtain convolution features, a pooling layer in the TextCNN model performs a pooling operation on the word vectors of all phrases to obtain pooling features, and the convolution features and the pooling features are fully connected to obtain fully connected features.
In step S310, the normalization layer determines a probability score according to the fully connected features.
Specifically, the probability score represents the probability that the transaction text is of the pyramid-selling type.
In addition to the TextCNN model, other text classification models such as fastText may be used to determine the probability that the transaction text is of the pyramid-selling type.
Step S311, determining the probability that the transaction flow is of the pyramid-selling type based on the transaction flow of the account to be identified and a trained transaction model.
Whether the account to be identified is a pyramid-selling account is determined by combining the probability that its transaction text is of the pyramid-selling type with the probability that its transaction flow is of the pyramid-selling type. For example, if the sum of the two probabilities is greater than a threshold, the account to be identified is determined to be a pyramid-selling account.
In step S312, the flow ends.
After this set of models is deployed on the quasi-real-time platform of the risk-control decision engine, more than 10,000 malicious pyramid-selling accounts can be identified across the platform every day, so that malicious pyramid-selling risks in platform transactions are effectively cracked down on, and the number of misjudged cases and the customer complaint rate of pyramid-selling crackdowns are reduced.
Referring to fig. 7, fig. 7 is a system architecture diagram of a blockchain provided in an embodiment of the present application. The server in the foregoing embodiment may be node 1, or node 2, or node 3, or node 4 in fig. 7, all of which may be combined into a blockchain system, each of which includes a hardware layer, a middle layer, an operating system layer, and an application layer. As can be seen in FIG. 7, the blockchain data stored by each node in the blockchain system is the same. It will be appreciated that the nodes may comprise computer devices. The following embodiments describe a target blockchain node as an execution body, where the target blockchain node is any node of multiple nodes in the blockchain system, and the target blockchain node may correspond to the server in the foregoing embodiments.
Referring to fig. 8, fig. 8 is a schematic flow chart of text processing provided in the embodiment of the present application, where the embodiment mainly describes the combination of recognition of a transaction account and blockchain technology, and the text processing includes the following steps S401 to S405:
Step S401, when a detection request for a target transaction account is detected, determining a first block in the blockchain corresponding to the block height, and reading the original transaction text of the target transaction account from the first block.
Specifically, when the target blockchain node detects the detection request for the target transaction account, it extracts the block height carried by the detection request. The target blockchain node acquires the blockchain and extracts from it the block corresponding to the block height, which is called the first block. The first block stores the original transaction text of the target transaction account. The target blockchain node extracts the original transaction text of the target transaction account from the block body of the first block.
The original transaction text may include a nickname of the target transaction account number, a nickname of the transaction partner, transaction remark text, and the like.
Step S402, a filtering word stock is obtained, and filtering processing is carried out on the original transaction text according to the filtering word stock, so that the target transaction text is obtained.
Specifically, the target blockchain node acquires a filtering word stock and filters the original transaction text according to it to obtain the target transaction text of the target transaction account. The filtering word stock includes neutral words and stop words; for example, words such as "happy birthday", "happy and happy", and "blessing" are neutral words, and words such as "ground", "get", "and", and "or" are stop words. Thus, both the neutral words and the stop words in the original transaction text are filtered out.
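The filtering step can be sketched as plain substring removal. This is a simplifying assumption that suits unsegmented text; for space-delimited languages a token-level filter would avoid removing a stop word from inside a longer word. The function name and sample strings are hypothetical.

```python
from typing import Iterable


def filter_text(original_text: str, filter_words: Iterable[str]) -> str:
    """Remove every filter word from the text and tidy up leftover whitespace."""
    # Longer entries first, so a short entry cannot break up a longer one.
    for word in sorted(set(filter_words), key=len, reverse=True):
        if word:
            original_text = original_text.replace(word, "")
    return " ".join(original_text.split())


filter_word_stock = {"happy birthday", "blessing"}
target_text = filter_text("happy birthday upgrade fee", filter_word_stock)
```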
Step S403, word segmentation processing is carried out on the target transaction text by adopting a preset phrase set to obtain one or more target phrases; the preset phrase set is obtained after the identification processing of the text of the target type.
Step S404, performing text recognition processing on the one or more target phrases to obtain a type of the target transaction text, and if the type of the target transaction text is a target type, determining the target transaction account as a transaction account of the target type.
The specific process of step S403 to step S404 may be referred to as step S102 to step S104 in the corresponding embodiment of fig. 3.
Step S405, encapsulating the target transaction account number into a second block, and storing the second block in the blockchain.
Specifically, if the type of the target transaction account is determined to be the target type, the target blockchain node stores the target transaction account belonging to the target type into a block body, calculates the Merkle root of the target transaction account, acquires the hash value of the last block of the current blockchain, and stores the Merkle root of the target transaction account, the hash value of the last block of the current blockchain, and the current timestamp into a block header. The target blockchain node combines the block header and the block body storing the target transaction account into a second block, stores the second block in the blockchain maintained by the target blockchain node, and broadcasts the second block to the other nodes so that each of them adds the second block to the blockchain it maintains and the blockchains maintained by all nodes are synchronized.
When the target type is the pyramid-selling type, a node that needs to crack down on pyramid-selling transaction accounts can subsequently read the second block from the blockchain and read the target transaction account from the second block, so as to crack down on the target transaction account.
As can be seen from the above, relying on the integrity and tamper-resistance of the blockchain, the original transaction text obtained by the target blockchain node can be trusted not to have been tampered with, so the type of the target transaction account identified based on the original transaction text is also trustworthy, which ensures the security and accuracy of the process of identifying the target transaction account.
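The encapsulation of the second block can be sketched as follows. The hash function (SHA-256), the pairing rule of the Merkle tree (the last node is duplicated on odd levels), the dictionary layout of the block, and all names are assumptions for illustration; consensus and broadcasting are omitted.

```python
import hashlib
import json
import time
from typing import Dict, List, Optional


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def merkle_root(items: List[str]) -> str:
    """Merkle root over the items; an odd level duplicates its last node (assumed rule)."""
    if not items:
        return sha256_hex(b"")
    level = [sha256_hex(item.encode("utf-8")) for item in items]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [sha256_hex((level[i] + level[i + 1]).encode("utf-8"))
                 for i in range(0, len(level), 2)]
    return level[0]


def build_block(accounts: List[str], previous_hash: str,
                timestamp: Optional[float] = None) -> Dict:
    """Block header (Merkle root, previous block hash, timestamp) plus block body."""
    header = {
        "merkle_root": merkle_root(accounts),
        "previous_hash": previous_hash,
        "timestamp": time.time() if timestamp is None else timestamp,
    }
    return {"header": header, "body": list(accounts)}


def block_hash(block: Dict) -> str:
    """Hash of the header, used as previous_hash by the next block."""
    return sha256_hex(json.dumps(block["header"], sort_keys=True).encode("utf-8"))


chain = [build_block([], previous_hash="0" * 64, timestamp=0)]  # placeholder genesis
second_block = build_block(["account-123"], previous_hash=block_hash(chain[-1]))
chain.append(second_block)
```

Because each header commits to the previous block's hash and to the Merkle root of its own body, changing a stored account would change the hashes of that block and of every later block, which is what makes tampering detectable.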
Further, please refer to fig. 9, which is a schematic diagram of a text processing apparatus according to an embodiment of the present application. As shown in fig. 9, the text processing apparatus 1 may be applied to a server or a target blockchain node in the above-described embodiments corresponding to fig. 3 to 8. In particular, the text processing device 1 may be a computer program (comprising program code) running in a computer apparatus, for example the text processing device 1 is an application software; the text processing device 1 may be used to perform the respective steps in the method provided by the embodiments of the present application.
The text processing apparatus 1 may include: an acquisition module 11, a word segmentation module 12, a first recognition module 13 and a determination module 14.
The acquisition module 11 is configured to acquire a target transaction text of a target transaction account;
the word segmentation module 12 is configured to perform word segmentation processing on the target transaction text by using a preset phrase set to obtain one or more target phrases; the preset phrase set is obtained after the identification processing of the text of the target type;
the first recognition module 13 is configured to perform text recognition processing on the one or more target phrases to obtain a type of the target transaction text;
and the determining module 14 is configured to determine the target transaction account number as a transaction account number of a target type if the type of the target transaction text is the target type.
In a possible embodiment, the text processing device 1 may further include: a second identification module 15 and an update module 16.
The acquiring module 11 is further configured to acquire a text of a target type;
a second recognition module 15, configured to recognize a new word set belonging to the target type in the text;
and the updating module 16 is configured to obtain an original word stock, and add the new word set to the original word stock to obtain the preset phrase set.
In a possible implementation manner, the second identifying module 15 is specifically configured to, when used for identifying a new word set belonging to the target type in the text:
dividing the text into a plurality of character sequences, and carrying out recognition processing on each character sequence to obtain a phrase evaluation index of each character sequence;
selecting a first phrase set from a plurality of character sequences according to phrase evaluation indexes of each character sequence;
performing de-duplication treatment on the first phrase set to obtain a second phrase set; the second phrase set includes a plurality of second phrases;
and determining a new word set belonging to the target type from the plurality of second word groups according to the text source of each second word group.
In one possible implementation, the target second phrase is any one of a plurality of second phrases, the text source of the target second phrase is a first text source or a second text source, and the first text source and the second text source are divided according to the text application scene;
the second recognition module 15 is specifically configured to, when determining whether the target second phrase belongs to the new word set according to the text source of the target second phrase:
And if the text source of the target second phrase is the first text source, determining that the target second phrase belongs to the new word set.
In a possible embodiment, the second identification module 15 is further configured to:
if the text source of the target second phrase is the second text source, acquiring the transaction text of each transaction account in the transaction account set;
selecting, from the transaction account set, a plurality of transaction accounts to be determined whose transaction text contains the target second phrase, and selecting the transaction accounts to be determined that belong to the target type from the plurality of transaction accounts to be determined;
and if the ratio of the number of transaction accounts to be determined that belong to the target type to the number of all transaction accounts to be determined is greater than a threshold value, determining that the target second phrase belongs to the new word set.
In one possible implementation, the target character sequence is any one of a plurality of character sequences, and the phrase evaluation index of the target character sequence includes phrase frequency, phrase solidification degree and phrase freedom degree;
the second recognition module 15 is specifically configured to, when determining whether the target character sequence belongs to the first phrase set according to the phrase evaluation index of the target character sequence:
If the phrase frequency of the target character sequence is greater than the frequency threshold value, the phrase solidification degree is greater than the solidification degree threshold value, and the phrase freedom degree is greater than the freedom degree threshold value, determining that the target character sequence belongs to the first phrase set; or,
if the phrase frequency of the target character sequence is greater than the frequency threshold, or the phrase solidification degree is greater than the solidification degree threshold, or the phrase freedom degree is greater than the freedom degree threshold, determining that the target character sequence belongs to the first phrase set.
In a possible implementation manner, the second recognition module 15 is specifically configured to, when performing de-duplication processing on the first phrase set to obtain a second phrase set:
determining an intersection between the original word stock and the first phrase set;
deleting the phrase corresponding to the intersection in the first phrase set, and combining the remaining phrases in the first phrase set into the second phrase set.
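The de-duplication step maps directly onto set operations. A minimal sketch, assuming both the first phrase set and the original word stock are held as in-memory sets of strings (the function name is hypothetical):

```python
from typing import Set


def deduplicate(first_phrase_set: Set[str], original_word_stock: Set[str]) -> Set[str]:
    """Return the second phrase set: candidates not already in the original word stock."""
    # Phrases that already exist in the original word stock.
    overlap = first_phrase_set & original_word_stock
    # Delete the overlap; the remaining phrases form the second phrase set.
    return first_phrase_set - overlap


second = deduplicate({"transfer", "lucky draw", "refund"},
                     {"transfer", "refund", "payment"})
print(second)
```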
In a possible implementation manner, when performing text recognition processing on the one or more target phrases to obtain the type of the target transaction text, the first recognition module 13 is specifically configured to perform:
each target phrase is respectively converted into word vectors, and all the word vectors are combined into a word matrix;
Invoking a text classification model to identify the word matrix, and obtaining a first probability that the type of the target transaction text is a target type;
and if the first probability is not smaller than a first probability threshold, determining that the type of the target transaction text is a target type.
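The flow from target phrases to the first probability can be illustrated with a deliberately small stand-in model. The application does not specify the text classification model in this passage, so the mean-pooled logistic scorer below, the embedding table, and all names are assumptions used only to show the data flow.

```python
import math
from typing import Dict, List

Vector = List[float]


def build_word_matrix(phrases: List[str], embeddings: Dict[str, Vector], dim: int) -> List[Vector]:
    """Convert each target phrase into a word vector and stack the vectors into a matrix."""
    zero = [0.0] * dim  # out-of-vocabulary phrases map to a zero vector (assumption)
    return [embeddings.get(p, zero) for p in phrases]


def first_probability(word_matrix: List[Vector], weights: Vector, bias: float = 0.0) -> float:
    """Stand-in classifier: mean-pool the rows, then apply a logistic function."""
    rows = len(word_matrix)
    if rows:
        pooled = [sum(col) / rows for col in zip(*word_matrix)]
    else:
        pooled = [0.0] * len(weights)
    score = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def is_target_type(first_prob: float, first_threshold: float) -> bool:
    # "Not smaller than" the first probability threshold means >=.
    return first_prob >= first_threshold


embeddings = {"lucky": [1.0, 0.0], "draw": [0.0, 1.0]}
matrix = build_word_matrix(["lucky", "draw", "fee"], embeddings, dim=2)
p1 = first_probability(matrix, weights=[3.0, 3.0])
print(round(p1, 4), is_target_type(p1, 0.8))
```

In practice the stand-in scorer would be replaced by a trained model; only the shape of the input (one row per target phrase) and the threshold comparison follow the text.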
In one possible implementation manner, when determining the target transaction account as a transaction account of the target type if the type of the target transaction text is the target type, the determining module 14 is specifically configured to perform:
if the type of the target transaction text is the target type, acquiring a target transaction flow record of the target transaction account, wherein the target transaction flow record comprises a target transaction resource data amount and a target transaction time;
invoking a transaction classification model to identify the target transaction resource data amount and the target transaction time, and obtaining a second probability that the type of the target transaction flow record is the target type;
and if the sum of the first probability and the second probability is not smaller than a second probability threshold, determining the target transaction account as the transaction account belonging to the target type.
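The two-stage decision can be sketched as follows, assuming the first and second probabilities have already been produced by the text classification model and the transaction classification model. The function name and the sample thresholds are hypothetical. Because the decision uses the sum of two probabilities, the second threshold lies on a 0 to 2 scale.

```python
def classify_account(first_prob: float,
                     second_prob: float,
                     first_threshold: float,
                     second_threshold: float) -> bool:
    """Return True if the account is determined to be of the target type."""
    # Stage 1: the transaction text itself must be of the target type.
    if first_prob < first_threshold:
        return False
    # Stage 2: text evidence and transaction evidence are added together.
    return first_prob + second_prob >= second_threshold


print(classify_account(0.9, 0.5, first_threshold=0.8, second_threshold=1.2))
print(classify_account(0.9, 0.2, first_threshold=0.8, second_threshold=1.2))
```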
In one possible implementation manner, the obtaining module 11 is specifically configured to, when used for obtaining the target transaction text of the target transaction account number:
When a detection request of a target transaction account is detected, determining a first block corresponding to the block height in a block chain, and reading an original transaction text of the target transaction account in the first block; the detection request includes the block height;
obtaining a filtered word stock, and filtering the original transaction text according to the filtered word stock to obtain the target transaction text;
the text processing apparatus 1 further comprises a packaging module 17.
The packaging module 17 is configured to package the target transaction account number into a second block, and store the second block in the blockchain.
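The read, filter, and write-back flow described above can be illustrated with an in-memory stand-in for the blockchain. Everything here is an assumption made for illustration: the `Block` layout, the use of a Python list as the chain, and the interpretation of the filtered word stock as a set of words to strip from the original text by naive substring replacement.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Block:
    height: int
    payload: Dict[str, str] = field(default_factory=dict)  # account -> text or label


def read_original_text(chain: List[Block], block_height: int, account: str) -> str:
    """Locate the first block by the height carried in the detection request."""
    first_block = next(b for b in chain if b.height == block_height)
    return first_block.payload.get(account, "")


def filter_text(original_text: str, filter_word_stock: Set[str]) -> str:
    """Strip filter words; longer words go first so shorter ones cannot split them."""
    for word in sorted(filter_word_stock, key=len, reverse=True):
        original_text = original_text.replace(word, "")
    return " ".join(original_text.split())  # collapse the leftover whitespace


def append_result_block(chain: List[Block], account: str, label: str) -> Block:
    """Package the identified account into a second block and append it to the chain."""
    next_height = chain[-1].height + 1 if chain else 0
    second_block = Block(height=next_height, payload={account: label})
    chain.append(second_block)
    return second_block


chain = [Block(0, {"acct-1": "please transfer the lucky draw fee"})]
text = filter_text(read_original_text(chain, 0, "acct-1"), {"please", "the"})
print(text)
append_result_block(chain, "acct-1", "target-type")
print(len(chain))
```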
According to one embodiment of the invention, the steps involved in the methods shown in fig. 3-8 may be performed by the modules in the text processing device shown in fig. 9. For example, steps S101 to S104 shown in fig. 3 may be performed by the acquisition module 11, the word segmentation module 12, the first recognition module 13, the determination module 14, the second recognition module 15, and the update module 16 shown in fig. 9, respectively; as another example, steps S401, S405 shown in fig. 8 may be performed by the acquisition module 11 and the packaging module 17 shown in fig. 9.
In the present application, the terminal automatically identifies the type of a transaction account from the transaction text under that account, without manual participation, which can improve both the efficiency and the accuracy of transaction account identification. Furthermore, determining the type of the transaction account from its transaction text enriches the ways in which transaction accounts can be identified. During transaction text recognition, the transaction text is segmented using the preset phrase set, which is obtained by recognizing text of the target type; compared with a conventional phrase set, this helps ensure the accuracy of word segmentation, and a more accurate segmentation result in turn further improves the accuracy of account identification.
Further, please refer to fig. 10, which is a schematic structural diagram of a computer device according to an embodiment of the present application. The server or target blockchain node in the corresponding embodiments of fig. 3-8 described above may be the computer device 1000. As shown in fig. 10, the computer device 1000 may include: a user interface 1002, a processor 1004, an encoder 1006, and a memory 1008. Signal receiver 1016 is used to receive or transmit data via cellular interface 1010, WIFI interface 1012, a. The encoder 1006 encodes the received data into a computer-processed data format. The memory 1008 has stored therein a computer program, by which the processor 1004 is arranged to perform the steps of any of the method embodiments described above. The memory 1008 may include volatile memory (e.g., dynamic random access memory, DRAM) and may also include non-volatile memory (e.g., one-time programmable read only memory, OTPROM). In some examples, memory 1008 may further include memory located remotely from processor 1004, which may be connected to computer device 1000 via a network. The user interface 1002 may include: a keyboard 1018 and a display 1020.
In the computer device 1000 shown in fig. 10, the processor 1004 may be configured to invoke the computer program stored in the memory 1008 to implement:
acquiring a target transaction text of a target transaction account;
word segmentation processing is carried out on the target transaction text by adopting a preset phrase set to obtain one or more target phrases; the preset phrase set is obtained after the identification processing of the text of the target type;
performing text recognition processing on the one or more target phrases to obtain the type of the target transaction text;
and if the type of the target transaction text is a target type, determining the target transaction account as the transaction account of the target type.
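As an illustration of the word segmentation step above, the following sketch uses forward maximum matching, a common dictionary-based technique. The application does not state in this passage which segmentation algorithm is used, so that choice, the function name `segment`, and the sample phrases are assumptions.

```python
from typing import List, Optional, Set


def segment(text: str, phrase_set: Set[str], max_len: Optional[int] = None) -> List[str]:
    """Forward maximum matching: at each position take the longest phrase in the set."""
    if max_len is None:
        max_len = max((len(p) for p in phrase_set), default=1)
    result: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it until a phrase matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in phrase_set:
                result.append(candidate)
                i += size
                break
    return result


original_word_stock = {"lucky", "draw", "fee"}
preset_phrase_set = original_word_stock | {"luckydraw"}  # original stock plus a new word
print(segment("luckydrawfee", original_word_stock))
print(segment("luckydrawfee", preset_phrase_set))
```

The two calls differ only in the phrase set, which shows why adding target-type new words to the original word stock changes the phrases handed to the text recognition step.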
In one embodiment, the processor 1004 also performs the steps of:
acquiring a text of a target type, and identifying a new word set belonging to the target type in the text;
and obtaining an original word stock, and adding the new word set to the original word stock to obtain the preset phrase set.
In one embodiment, the processor 1004, when identifying the new word set belonging to the target type in the text, specifically performs the steps of:
dividing the text into a plurality of character sequences, and carrying out recognition processing on each character sequence to obtain a phrase evaluation index of each character sequence;
Selecting a first phrase set from a plurality of character sequences according to phrase evaluation indexes of each character sequence;
performing de-duplication treatment on the first phrase set to obtain a second phrase set; the second phrase set includes a plurality of second phrases;
and determining a new word set belonging to the target type from the plurality of second word groups according to the text source of each second word group.
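The phrase evaluation indexes used in the first step can be computed in several ways, and this passage does not give formulas. The sketch below uses definitions common in new-word discovery and should be read as an assumption: frequency is the raw count, solidification is the smallest ratio between the probability of the sequence and the product of the probabilities of its two parts, and freedom is the smaller of the left and right neighbour entropies. The function name and the `max_len` parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Tuple


def phrase_indexes(text: str, max_len: int = 4) -> Dict[str, Tuple[int, float, float]]:
    """Map each character sequence of length 2..max_len to (frequency, solidification, freedom)."""
    n = len(text)
    counts: Counter = Counter()
    left = defaultdict(Counter)   # sequence -> counts of the character just before it
    right = defaultdict(Counter)  # sequence -> counts of the character just after it
    for size in range(1, max_len + 1):
        for i in range(n - size + 1):
            seq = text[i:i + size]
            counts[seq] += 1
            if i > 0:
                left[seq][text[i - 1]] += 1
            if i + size < n:
                right[seq][text[i + size]] += 1

    def entropy(neighbours: Counter) -> float:
        total = sum(neighbours.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log(c / total) for c in neighbours.values())

    result: Dict[str, Tuple[int, float, float]] = {}
    for seq, freq in counts.items():
        if len(seq) < 2:
            continue
        p_seq = freq / n
        # Weakest internal binding over every way of splitting the sequence in two.
        solidification = min(
            p_seq / ((counts[seq[:k]] / n) * (counts[seq[k:]] / n))
            for k in range(1, len(seq))
        )
        freedom = min(entropy(left[seq]), entropy(right[seq]))
        result[seq] = (freq, solidification, freedom)
    return result


freq, solid, free = phrase_indexes("abxabyabz")["ab"]
print(freq, round(solid, 3), round(free, 3))
```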
In one embodiment, the target second phrase is any one of a plurality of second phrases, the text source of the target second phrase is a first text source or a second text source, and the first text source and the second text source are divided according to the text application scene;
the processor 1004, when determining whether the target second phrase belongs to the new word set according to the text source of the target second phrase, specifically executes the following step:
and if the text source of the target second phrase is the first text source, determining that the target second phrase belongs to the new word set.
In one embodiment, the processor 1004 also performs the steps of:
if the text source of the target second phrase is the second text source, acquiring the transaction text of each transaction account in the transaction account set;
Selecting a plurality of to-be-determined transaction accounts of which the transaction text contains the target second phrase from the transaction account set, and selecting a to-be-determined transaction account belonging to a target type from the plurality of to-be-determined accounts;
and if the ratio of the number of to-be-determined transaction accounts that belong to the target type to the number of the plurality of to-be-determined transaction accounts is greater than a threshold value, determining that the target second phrase belongs to the new word set.
In one embodiment, the target character sequence is any one of a plurality of character sequences, and the phrase evaluation index of the target character sequence comprises phrase frequency, phrase solidification degree and phrase freedom degree;
the processor 1004, when determining whether the target character sequence belongs to the first phrase set according to the phrase evaluation index of the target character sequence, specifically executes the following steps:
if the phrase frequency of the target character sequence is greater than the frequency threshold value, the phrase solidification degree is greater than the solidification degree threshold value, and the phrase freedom degree is greater than the freedom degree threshold value, determining that the target character sequence belongs to the first phrase set; or,
if the phrase frequency of the target character sequence is greater than the frequency threshold, or the phrase solidification degree is greater than the solidification degree threshold, or the phrase freedom degree is greater than the freedom degree threshold, determining that the target character sequence belongs to the first phrase set.
In one embodiment, the processor 1004, when performing the deduplication processing on the first phrase set to obtain the second phrase set, specifically performs the following steps:
determining an intersection between the original word stock and the first phrase set;
deleting the phrase corresponding to the intersection in the first phrase set, and combining the remaining phrases in the first phrase set into the second phrase set.
In one embodiment, when the processor 1004 performs text recognition processing on the one or more target phrases to obtain the type of the target transaction text, the following steps are specifically performed:
each target phrase is respectively converted into word vectors, and all the word vectors are combined into a word matrix;
invoking a text classification model to identify the word matrix, and obtaining a first probability that the type of the target transaction text is a target type;
and if the first probability is not smaller than a first probability threshold, determining that the type of the target transaction text is a target type.
In one embodiment, the processor 1004, when determining the target transaction account as a transaction account of the target type if the type of the target transaction text is the target type, specifically executes the following steps:
if the type of the target transaction text is the target type, acquiring a target transaction flow record of the target transaction account, wherein the target transaction flow record comprises a target transaction resource data amount and a target transaction time;
invoking a transaction classification model to identify the target transaction resource data amount and the target transaction time, and obtaining a second probability that the type of the target transaction flow record is the target type;
and if the sum of the first probability and the second probability is not smaller than a second probability threshold, determining the target transaction account as the transaction account belonging to the target type.
In one embodiment, the processor 1004, when acquiring the target transaction text of the target transaction account, specifically performs the following steps:
when a detection request of a target transaction account is detected, determining a first block corresponding to the block height in a block chain, and reading an original transaction text of the target transaction account in the first block; the detection request includes the block height;
obtaining a filtered word stock, and filtering the original transaction text according to the filtered word stock to obtain the target transaction text;
the processor 1004 also performs the steps of:
and packaging the target transaction account number into a second block, and storing the second block in the blockchain.
It should be understood that the computer device 1000 described in the embodiments of the present application may perform the text processing method described in the embodiments corresponding to fig. 3 to 8, and may also implement the text processing apparatus 1 described in the embodiment corresponding to fig. 9; details are not repeated here. The beneficial effects are the same as those of the method and are likewise not repeated.
Furthermore, it should be noted that the embodiments of the present application further provide a computer storage medium storing the aforementioned computer program executed by the text processing apparatus 1. The computer program includes program instructions which, when executed by a processor, can perform the text processing method described in the embodiments corresponding to fig. 3 to 8; details are therefore not repeated here. The beneficial effects are the same as those of the method and are likewise not repeated. For technical details not disclosed in the embodiments of the computer storage medium of the present application, please refer to the description of the method embodiments of the present application. As an example, the program instructions may be deployed to be executed on one computer device, on multiple computer devices at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network; the multiple computer devices distributed across multiple sites and interconnected by a communication network may form a blockchain network.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device can perform the method in the embodiment corresponding to fig. 3 to 8, which will not be described in detail herein.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program, which may be stored on a computer-readable storage medium and which, when executed, may carry out the steps of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The foregoing discloses only preferred embodiments of the present application and is not intended to limit the scope of its claims; equivalent variations made in accordance with the claims of the present application still fall within the scope of the present application.

Claims (9)

1. A text processing method, comprising:
periodically acquiring a text of a target type, and identifying phrases in the text to obtain a first phrase set;
obtaining an original word stock, and performing duplication removal processing on the first phrase set according to the original word stock to obtain a second phrase set comprising a plurality of second phrases;
determining a text source of each second phrase, wherein the text sources comprise a first text source and a second text source, and the first text source and the second text source are divided according to the reliability of a text application scene;
if the text source of the target second phrase is the first text source, determining that the target second phrase belongs to a new word set of the target type, wherein the target second phrase is any second phrase in the second phrase set;
if the text source of the target second phrase is the second text source, acquiring a transaction text of each transaction account in a transaction account set, selecting a plurality of transaction accounts which contain the target second phrase and belong to a target type from the transaction account set, and if the ratio of the number of the plurality of transaction accounts to the number of all transaction accounts in the transaction account set is greater than a threshold value, determining that the target second phrase belongs to the new word set; wherein the type of each transaction account in the transaction account set is determined;
Adding the new word set to the original word stock to obtain a preset phrase set;
acquiring a target transaction text of a target transaction account;
word segmentation processing is carried out on the target transaction text by adopting a preset phrase set to obtain one or more target phrases;
performing text recognition processing on the one or more target phrases to obtain the type of the target transaction text, wherein the text recognition processing comprises the following steps: calculating a first probability that the type of the target transaction text is a target type, and if the first probability is not smaller than a first probability threshold, determining that the type of the target transaction text is the target type;
if the type of the target transaction text is the target type, acquiring a target transaction flow record of the target transaction account, wherein the target transaction flow record comprises a target transaction resource data amount and a target transaction time;
invoking a transaction classification model to identify the target transaction resource data amount and the target transaction time, and obtaining a second probability that the type of the target transaction flow record is the target type;
and if the sum of the first probability and the second probability is not smaller than a second probability threshold, determining the target transaction account as the transaction account of the target type.
2. The method of claim 1, wherein the identifying the phrase in the text to obtain the first set of phrases comprises:
dividing the text into a plurality of character sequences, and carrying out recognition processing on each character sequence to obtain a phrase evaluation index of each character sequence;
and selecting a first phrase set from the plurality of character sequences according to the phrase evaluation index of each character sequence.
3. The method of claim 2, wherein the target character sequence is any one of a plurality of character sequences, and the phrase evaluation index of the target character sequence includes phrase frequency, phrase solidification degree, and phrase freedom degree;
the method for judging whether the target character sequence belongs to the first phrase set according to the phrase evaluation index of the target character sequence comprises the following steps:
if the phrase frequency of the target character sequence is greater than the frequency threshold value, the phrase solidification degree is greater than the solidification degree threshold value, and the phrase freedom degree is greater than the freedom degree threshold value, determining that the target character sequence belongs to the first phrase set; or,
if the phrase frequency of the target character sequence is greater than the frequency threshold, or the phrase solidification degree is greater than the solidification degree threshold, or the phrase freedom degree is greater than the freedom degree threshold, determining that the target character sequence belongs to the first phrase set.
4. The method of claim 1, wherein the performing the de-duplication process on the first phrase set according to the original word stock to obtain a second phrase set including a plurality of second phrases includes:
determining an intersection between the original word stock and the first phrase set;
deleting the phrases corresponding to the intersection set from the first phrase set, and combining the remaining phrases in the first phrase set into the second phrase set.
5. The method of claim 1, wherein the calculating the first probability that the type of the target transaction text is a target type comprises:
each target phrase is respectively converted into word vectors, and all the word vectors are combined into a word matrix;
and calling a text classification model to identify the word matrix, and obtaining a first probability that the type of the target transaction text is the target type.
6. The method of claim 1, wherein the obtaining the target transaction text of the target transaction account number comprises:
when a detection request of a target transaction account is detected, determining a first block corresponding to the block height in a block chain, and reading an original transaction text of the target transaction account in the first block; the detection request includes the block height;
Obtaining a filtered word stock, and filtering the original transaction text according to the filtered word stock to obtain the target transaction text;
the method further comprises:
and packaging the target transaction account number into a second block, and storing the second block in the blockchain.
7. A text processing apparatus, comprising:
the acquisition module is used for periodically acquiring a text of a target type, identifying phrases in the text, obtaining a first phrase set and acquiring a target transaction text of a target transaction account;
the word segmentation module is used for carrying out word segmentation processing on the target transaction text by adopting a preset phrase set to obtain one or more target phrases;
the first recognition module is used for carrying out text recognition processing on the one or more target phrases to obtain the type of the target transaction text;
the first recognition module is specifically configured to, when performing text recognition processing on the one or more target phrases to obtain a type of the target transaction text: calculating a first probability that the type of the target transaction text is a target type, and if the first probability is not smaller than a first probability threshold, determining that the type of the target transaction text is the target type;
the determining module is used for acquiring a target transaction flow record of the target transaction account if the type of the target transaction text is the target type, wherein the target transaction flow record comprises a target transaction resource data amount and a target transaction time; invoking a transaction classification model to identify the target transaction resource data amount and the target transaction time, and obtaining a second probability that the type of the target transaction flow record is the target type; and if the sum of the first probability and the second probability is not smaller than a second probability threshold, determining the target transaction account as a transaction account of the target type;
the second recognition module is used for carrying out duplication removal processing on the first phrase set according to the original word stock to obtain a second phrase set comprising a plurality of second phrases; determining a text source of each second phrase, wherein the text sources comprise a first text source and a second text source, and the first text source and the second text source are divided according to the reliability of a text application scene; if the text source of the target second phrase is the first text source, determining that the target second phrase belongs to a new word set of the target type, wherein the target second phrase is any second phrase in the second phrase set; if the text source of the target second phrase is the second text source, acquiring a transaction text of each transaction account in a transaction account set, selecting a plurality of transaction accounts which contain the target second phrase and belong to a target type from the transaction account set, and if the ratio of the number of the plurality of transaction accounts to the number of all transaction accounts in the transaction account set is greater than a threshold value, determining that the target second phrase belongs to the new word set; wherein the type of each transaction account in the transaction account set is determined;
And the updating module is used for acquiring the original word stock and adding the new word set to the original word stock to obtain the preset phrase set.
8. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1-6.
9. A computer storage medium storing a computer program comprising program instructions which, when executed by a processor, cause a computer device having the processor to perform the method of any of claims 1-6.
CN202110038717.5A 2021-01-12 2021-01-12 Text processing method, text processing device, computer equipment and storage medium Active CN113011875B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110038717.5A CN113011875B (en) 2021-01-12 2021-01-12 Text processing method, text processing device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110038717.5A CN113011875B (en) 2021-01-12 2021-01-12 Text processing method, text processing device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113011875A CN113011875A (en) 2021-06-22
CN113011875B true CN113011875B (en) 2024-03-29

Family

ID=76384527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110038717.5A Active CN113011875B (en) 2021-01-12 2021-01-12 Text processing method, text processing device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113011875B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115147227B (en) * 2022-08-29 2022-12-27 支付宝(杭州)信息技术有限公司 Transaction risk detection method, device and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110457708A (en) * 2019-08-16 2019-11-15 腾讯科技(深圳)有限公司 Vocabulary mining method, apparatus, server and storage medium based on artificial intelligence
CN110598157A (en) * 2019-09-20 2019-12-20 北京字节跳动网络技术有限公司 Target information identification method, device, equipment and storage medium
CN111429219A (en) * 2020-03-25 2020-07-17 京东数字科技控股有限公司 Data confirmation method, device, equipment and storage medium
CN111553167A (en) * 2020-04-28 2020-08-18 腾讯科技(深圳)有限公司 Text type identification method and device and storage medium
CN111666502A (en) * 2020-07-08 2020-09-15 腾讯科技(深圳)有限公司 Abnormal user identification method and device based on deep learning and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9880997B2 (en) * 2014-07-23 2018-01-30 Accenture Global Services Limited Inferring type classifications from natural language text


Also Published As

Publication number Publication date
CN113011875A (en) 2021-06-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40045927; Country of ref document: HK)
GR01 Patent grant