CN115358340A

CN115358340A - Credit collection short message discrimination method, system, equipment and storage medium

Info

Publication number: CN115358340A
Application number: CN202211047111.9A
Authority: CN
Inventors: 邓超; 胡栩喆
Original assignee: Lianyang Guorong Shanghai Technology Co ltd
Current assignee: Lianyang Guorong Shanghai Technology Co ltd
Priority date: 2022-08-30
Filing date: 2022-08-30
Publication date: 2022-11-18

Abstract

The embodiment of the invention discloses a credit collection short message discrimination method, system, equipment and storage medium. A sample library is established by labeling short message text samples; the samples are then segmented into words and vectorized to obtain text word vectors; each text word vector is aligned with its corresponding label and used as training data to train a classification model; finally, the classification model is used to predict and judge the short message texts to be judged. By training the classification model with a machine learning classification algorithm and using it to predict on text, the embodiment streamlines the tedious manual analysis and template construction process, avoids frequent template modification, improves the text matching efficiency for identifying credit collection short messages, and improves classification accuracy.

Description

Credit collection short message discrimination method, system, equipment and storage medium

Technical Field

The embodiment of the invention relates to the field of machine learning, and in particular to a credit collection short message discrimination method, system, equipment and storage medium.

Background

In a credit business scenario, after initiating a credit application a user continually receives notification messages sent by banks or financial institutions. Among these short messages, a collection message sent because the user has not repaid on time is called a credit collection short message, and such messages are an important reference for judging the user's credit rating. Screening credit collection messages out of the large volume of miscellaneous notification messages therefore has important practical significance and technical value.

In existing text classification techniques, when text is classified by keyword matching, the screening effect gradually degrades because of negation words, text substitution, synonym substitution and similar factors, and classification efficiency is low. Keyword matching also requires engineers to perform complicated text analysis and to update the text matching templates continually; this work is difficult, voluminous and tedious, which lowers work efficiency and sharply raises labor costs.

Existing text representation models use a bag-of-words model to represent text. Such a model cannot express the features of a text well, so sentence meaning is represented inaccurately. Because short messages sent by financial institutions are highly similar to one another, the text vectors obtained from a bag-of-words model often have low discriminative power, and the subsequent classification model trains poorly.

Therefore, extracting samples, constructing a sample library with wide coverage, applying an optimized text representation method and training a classification model have practical significance and business value for screening and classifying credit collection texts.

Disclosure of Invention

Therefore, the embodiment of the invention provides a credit collection short message discrimination method, system, equipment and storage medium, so as to solve the problems in the prior art of low discrimination and matching efficiency and poor classification accuracy for credit collection short messages.

In order to achieve the above object, the embodiments of the present invention provide the following technical solutions:

according to a first aspect of the embodiments of the present invention, there is provided a credit collection short message discrimination method, including:

labeling the first short message text sample to obtain a second short message text sample, and establishing a sample library;

performing word segmentation processing on the second short message text sample to obtain a third short message text sample;

vectorizing the third short message text sample to obtain a corresponding text word vector;

aligning each text word vector with its corresponding label and using the aligned pairs as training data to train a classification model;

and processing the short message text to be judged, inputting it into the classification model for prediction, and obtaining a judgment result.

Further, labeling the first short message text sample includes:

for the first short message text sample, labeling credit collection short message texts as 1 and non-credit-collection short message texts as 0.

Further, performing word segmentation processing on the second short message text sample to obtain a third short message text sample, including:

performing first word segmentation processing on the second short message text sample by using a stop word library and a user-defined word library to obtain a first word segmentation result, wherein the first word segmentation result comprises at least one first phrase, and the first phrases are separated by spaces;

calculating a first TF-IDF value of each first phrase in the first word segmentation result;

judging whether the first TF-IDF value exceeds a first preset word segmentation threshold;

if the first TF-IDF value exceeds the first preset word segmentation threshold, adding the first phrase to the stop word library as a stop word;

judging, according to the first word segmentation result, whether any user-defined word has not been segmented out;

if a user-defined word has not been segmented out in the first word segmentation result, adding the user-defined word to the user-defined word library and increasing its word segmentation weight;

and performing second word segmentation processing on the second short message text sample by using the updated stop word library and user-defined word library to obtain the third short message text sample.

Further, performing word segmentation processing on the second short message text sample to obtain a third short message text sample further includes:

if the first TF-IDF value does not exceed the first preset word segmentation threshold, judging, according to the first word segmentation result, whether any user-defined word has not been segmented out;

and if no user-defined word remains unsegmented in the first word segmentation result, directly using the first word segmentation result to obtain the third short message text sample.
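The TF-IDF thresholding step above can be sketched as follows. This is only an illustration under stated assumptions: the texts are already segmented into token lists, TF-IDF is computed per document with an unsmoothed logarithmic IDF, and a phrase is added to the stop word library as soon as its value exceeds the threshold in any one document. The function names (`tfidf_per_document`, `update_stop_words`, `resegment`) and the threshold value are hypothetical, and `resegment` simply filters tokens instead of re-running a real word segmentation tool.

```python
import math
from collections import Counter


def tfidf_per_document(docs):
    """docs: list of token lists. Returns one {term: tf-idf} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def update_stop_words(docs, stop_words, threshold):
    """Add every phrase whose TF-IDF exceeds the threshold, as the text describes."""
    updated = set(stop_words)
    for doc_scores in tfidf_per_document(docs):
        for term, value in doc_scores.items():
            if value > threshold:
                updated.add(term)
    return updated


def resegment(docs, stop_words):
    """Simplified second pass: drop stop words from the first segmentation result."""
    return [[t for t in tokens if t not in stop_words] for tokens in docs]
```

A call such as `update_stop_words(first_result, stop_words, threshold=0.4)` followed by `resegment(first_result, updated)` mirrors the first and second segmentation passes; the handling of user-defined words depends on the segmentation tool and is not shown.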

Further, vectorizing the third short message text sample to obtain a corresponding text word vector, including:

obtaining a first matrix by using the third short message text sample;

constructing a first central word matrix and a first context matrix according to the total word segmentation number and the word vector dimension of the third short message text sample;

performing a first matrix multiplication operation by using the first matrix and the first central word matrix to obtain a second central word matrix;

performing a second matrix multiplication operation by using the second central word matrix and the first context matrix to obtain a first inner product matrix;

normalizing the first inner product matrix, and adjusting the first central word matrix and the first context matrix by using the normalization result to obtain a first vectorization model;

inputting the third short message text sample into the first vectorization model to obtain a first word vector for each segmented word of each text;

and summing the first word vectors of each text and averaging the sum to obtain the text word vector.
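The matrix steps above resemble a skip-gram style word embedding trained with a softmax over inner products, although the text does not name a specific algorithm. The sketch below assumes that reading: the first matrix is a one-hot encoding (so multiplying it by the central word matrix is a row lookup), the normalization is a softmax, and the adjustment is plain stochastic gradient descent. All function names and hyperparameters are hypothetical.

```python
import math
import random


def train_word_vectors(sentences, dim=8, window=1, lr=0.1, epochs=20, seed=0):
    """Train central word vectors; returns (word -> row index, central word matrix)."""
    rng = random.Random(seed)
    vocab = sorted({w for sent in sentences for w in sent})
    index = {w: i for i, w in enumerate(vocab)}
    size = len(vocab)
    # Central word matrix (size x dim) and context matrix (dim x size).
    center = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(size)]
    context = [[rng.uniform(-0.5, 0.5) for _ in range(size)] for _ in range(dim)]
    for _ in range(epochs):
        for sent in sentences:
            ids = [index[w] for w in sent]
            for pos, c in enumerate(ids):
                lo, hi = max(0, pos - window), min(len(ids), pos + window + 1)
                for ctx_pos in range(lo, hi):
                    if ctx_pos == pos:
                        continue
                    target = ids[ctx_pos]
                    h = center[c][:]  # one-hot row times central word matrix
                    # Inner products of the central word vector with every context column.
                    scores = [sum(h[d] * context[d][j] for d in range(dim))
                              for j in range(size)]
                    top = max(scores)
                    exps = [math.exp(s - top) for s in scores]
                    total = sum(exps)
                    err = [e / total for e in exps]  # softmax normalization
                    err[target] -= 1.0               # gradient of cross-entropy loss
                    grad_h = [sum(err[j] * context[d][j] for j in range(size))
                              for d in range(dim)]
                    for d in range(dim):
                        for j in range(size):
                            context[d][j] -= lr * err[j] * h[d]
                        center[c][d] -= lr * grad_h[d]
    return index, center


def text_vector(tokens, index, center):
    """Average the word vectors of a text; unknown words are skipped."""
    rows = [center[index[t]] for t in tokens if t in index]
    dim = len(center[0])
    if not rows:
        return [0.0] * dim
    return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
```

The full softmax is affordable only for a small vocabulary; a production system would more likely rely on an existing embedding library.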

Further, processing the short message text to be judged and inputting it into the classification model for prediction to obtain a judgment result includes:

performing word segmentation processing on the short message text to be judged to obtain a word segmentation result to be judged; vectorizing the word segmentation result to be judged to obtain a text vector to be judged;

inputting the text vector to be judged into the classification model, and predicting whether the short message text to be judged is a credit collection short message text;

if the short message text to be judged is a credit collection short message text, the judgment result is 1;

and if the short message text to be judged is a non-credit-collection short message text, the judgment result is 0.
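The text does not specify which classification algorithm is used. As a minimal sketch, the example below assumes logistic regression trained by batch gradient descent as a stand-in; it pairs each text vector with its label, trains, and returns 1 or 0 for a new vector. The function names, learning rate and epoch count are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_classifier(vectors, labels, lr=0.5, epochs=200):
    """Fit weights on (vector, label) pairs; labels are 1 (collection) or 0."""
    if len(vectors) != len(labels):
        raise ValueError("each text vector needs exactly one label")
    dim = len(vectors[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(vectors)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(vectors, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for d in range(dim):
                grad_w[d] += err * x[d]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 for a credit collection text vector, otherwise 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0
```

Any other binary classifier could take the place of `train_classifier` without changing the surrounding steps.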

Further, performing word segmentation processing on the short message text to be judged to obtain a word segmentation result to be judged includes:

performing third word segmentation processing on the short message text to be judged by using the stop word library and the user-defined word library to obtain a third word segmentation result, wherein the third word segmentation result comprises at least one second phrase, and the second phrases are separated by spaces;

calculating a second TF-IDF value of each second phrase in the third word segmentation result;

judging whether the second TF-IDF value exceeds a second preset word segmentation threshold;

if the second TF-IDF value exceeds the second preset word segmentation threshold, adding the second phrase to the stop word library as a stop word, and then judging, according to the third word segmentation result, whether any user-defined word has not been segmented out;

if the second TF-IDF value does not exceed the second preset word segmentation threshold, directly judging, according to the third word segmentation result, whether any user-defined word has not been segmented out;

if the third word segmentation result contains a user-defined word that has not been segmented out, adding the user-defined word to the user-defined word library and increasing its word segmentation weight, and performing fourth word segmentation processing on the short message text to be judged by using the updated stop word library and user-defined word library to obtain the word segmentation result to be judged;

and if the third word segmentation result contains no user-defined word that has not been segmented out, directly using the third word segmentation result as the word segmentation result to be judged.

Further, vectorizing the word segmentation result to be distinguished to obtain a text vector to be distinguished, including:

obtaining a second matrix by using the word segmentation result to be distinguished;

constructing a third central word matrix and a second context matrix according to the total word segmentation number and the word vector dimension of the word segmentation result to be judged;

performing a third matrix multiplication operation by using the second matrix and the third central word matrix to obtain a fourth central word matrix;

performing a fourth matrix multiplication operation by using the fourth central word matrix and the second context matrix to obtain a second inner product matrix;

normalizing the second inner product matrix, and adjusting the third central word matrix and the second context matrix by using the normalization result to obtain a second vectorization model;

inputting the word segmentation result to be judged into the second vectorization model to obtain a second word vector for each segmented word of each text;

and summing the second word vectors of each text and averaging the sum to obtain the text vector to be judged.

Further, before labeling the first short message text sample to obtain a second short message text sample, the method further includes:

screening financial short message texts from all short message texts through regular matching;

and carrying out duplicate removal processing on the financial short message text according to the text similarity to obtain the first short message text sample.

Further, by regular matching, financial short message texts are screened from all the short message texts, including:

obtaining financial short message characteristics according to the contents of all short message texts;

matching the financial short message characteristics from an existing database to obtain a corresponding characteristic short message text;

carrying out data cleaning processing on the characteristic short message text;

analyzing the linguistic characteristics and structural characteristics of the text by using the cleaned data, and extracting keywords of different types of short message texts;

and performing regular matching according to the keywords, and screening out the financial short message text.
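The final screening step can be illustrated with Python's `re` module. The keyword list below is purely hypothetical (the text does not list the extracted keywords), and the cleaning step is reduced to whitespace removal as an assumption.

```python
import re

# Hypothetical keywords: repayment, loan, overdue, credit card, bill.
FINANCIAL_KEYWORDS = ["还款", "贷款", "逾期", "信用卡", "账单"]
FINANCIAL_PATTERN = re.compile("|".join(re.escape(k) for k in FINANCIAL_KEYWORDS))


def clean_text(text):
    """Minimal data cleaning: remove all whitespace characters."""
    return re.sub(r"\s+", "", text)


def screen_financial(messages):
    """Keep only cleaned messages that match at least one financial keyword."""
    kept = []
    for message in messages:
        cleaned = clean_text(message)
        if FINANCIAL_PATTERN.search(cleaned):
            kept.append(cleaned)
    return kept
```

Compiling a single alternation pattern keeps the screening to one pass per message; the keyword list would in practice come from the analysis of linguistic and structural features described above.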

Further, the removing the duplicate of the financial short message text according to the text similarity to obtain the first short message text sample includes:

splitting the financial short message text at single-character granularity into at least one characteristic character to form a text character string;

calculating a corresponding hash value for each characteristic character in the text character string to obtain a binary numeric string;

weighting the numeric string with the occurrence frequency of the characteristic character as the weight to obtain a weighted numeric string;

accumulating the weighted numeric strings position by position to form a weighted accumulated numeric string;

performing dimensionality reduction on the weighted accumulated numeric string by setting each position whose value is greater than 0 to 1 and every other position to 0, to obtain a SimHash value;

partitioning the SimHash value, and building an index in a HashMap data structure in key-value form;

calculating, based on the SimHash value, the similarity between the financial short message text to be stored and the financial short message texts already stored in the corresponding HashMap partition;

judging whether the similarity reaches a preset threshold;

if the similarity does not reach the preset threshold, retaining the financial short message text to be stored as the first short message text sample;

and if the similarity reaches the preset threshold, discarding the financial short message text to be stored.
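The fingerprinting steps above can be sketched as follows. The sketch assumes the usual SimHash sign convention (a 1 bit contributes the positive weight, a 0 bit the negative weight), a 64-bit fingerprint, and MD5 truncated to 64 bits as the per-character hash; the text does not name a hash function, so that choice and the function names are hypothetical.

```python
import hashlib
from collections import Counter


def char_hash(ch, bits=64):
    """Hash one character to a `bits`-bit integer (bits: multiple of 8, at most 128)."""
    digest = hashlib.md5(ch.encode("utf-8")).digest()
    return int.from_bytes(digest[:bits // 8], "big")


def simhash(text, bits=64):
    """Character-level SimHash with character frequency as the weight."""
    totals = [0] * bits
    for ch, weight in Counter(text).items():
        h = char_hash(ch, bits)
        for i in range(bits):
            # Bit set: add the weight; bit clear: subtract it.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:  # positions above 0 become 1, all others 0
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions at which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Because the weights are per-character counts, the fingerprint ignores character order; two messages that differ only in a few characters are therefore likely to have a small Hamming distance.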

According to a second aspect of the embodiments of the present invention, there is provided a credit collection short message discrimination system, including:

the matching module is used for screening financial short message texts from all the short message texts through regular matching;

the duplication removing module is used for carrying out duplication removing processing on the financial short message text according to the text similarity to obtain a first short message text sample;

the sample library construction module is used for marking the first short message text sample to obtain a second short message text sample and establishing a sample library;

the word segmentation module is used for carrying out word segmentation processing on the second short message text sample to obtain a third short message text sample;

the vectorization module is used for vectorizing the third short message text sample to obtain a corresponding text word vector;

the training module is used for aligning each text word vector with the corresponding label and then using the aligned text word vector as training data to obtain a classification model;

and the judging module is used for processing data of the short message text to be judged, inputting the data into the classification model for prediction, and obtaining a judging result.

carrying out data cleaning processing on the characteristic short message text;

Further, the method for performing deduplication processing on the financial short message text according to the text similarity to obtain a first short message text sample includes:

taking the occurrence frequency of the characteristic character as weight, and carrying out weighting processing on the numeric string to obtain a weighted numeric string;

performing dimensionality reduction on the weighted and accumulated digit string, taking the digit with each digit greater than 0 as 1, and taking the rest digits as 0 to obtain a SimHash value;

judging whether the similarity reaches a preset threshold value or not;

Further, labeling the first short message text sample to obtain a second short message text sample, and establishing a sample library, including:

for the first short message text sample, labeling credit collection short message texts as 1 and non-credit-collection short message texts as 0 to obtain a second short message text sample;

and constructing a sample library according to the second short message text sample.

Further, performing word segmentation processing on the second short message text sample to obtain a third short message text sample includes:

performing first word segmentation processing on the second short message text sample by using a stop word library and a user-defined word library to obtain a first word segmentation result, wherein the first word segmentation result comprises at least one first phrase, and the first phrases are separated by spaces;

and if no user-defined word remains unsegmented in the first word segmentation result, directly using the first word segmentation result to obtain the third short message text sample.

if a user-defined word has not been segmented out in the first word segmentation result, adding the user-defined word to the user-defined word library and increasing its word segmentation weight;

obtaining a first matrix by using the third short message text sample;

and summing the first word vectors of each text and averaging the sum to obtain the text word vector.

performing word segmentation processing on the short message text to be judged to obtain a word segmentation result to be judged;

vectorizing the word segmentation result to be judged to obtain a text vector to be judged;

inputting the text vector to be judged into the classification model, and predicting whether the short message text to be judged is a credit collection short message text;

performing third word segmentation processing on the short message text to be judged by using the stop word library and the user-defined word library to obtain a third word segmentation result, wherein the third word segmentation result comprises at least one second phrase, and the second phrases are separated by spaces;

if the second TF-IDF value does not exceed a second preset word segmentation threshold, directly judging, according to the third word segmentation result, whether any user-defined word has not been segmented out;

if a user-defined word has not been segmented out in the third word segmentation result, adding the user-defined word to the user-defined word library and increasing its word segmentation weight; performing fourth word segmentation processing on the short message text to be judged by using the updated stop word library and user-defined word library to obtain the word segmentation result to be judged;

inputting the word segmentation result to be judged into the second vectorization model to obtain a second word vector for each segmented word of each text;

and summing the second word vectors of each text and averaging the sum to obtain the text vector to be judged.

According to a third aspect of the embodiments of the present invention, there is provided credit collection short message discrimination equipment, including: a processor and a memory;

the memory is used to store one or more program instructions;

the processor is configured to run the one or more program instructions to perform the steps of the credit collection short message discrimination method according to any one of the above.

According to a fourth aspect of the embodiments of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the credit collection short message discrimination method according to any one of the above.

The embodiment of the invention has the following advantages:

the embodiment of the invention discloses a credit collection short message discrimination method, system, equipment and storage medium. A sample library is established by labeling short message text samples; the samples are then segmented into words and vectorized to obtain text word vectors; each text word vector is aligned with its corresponding label and used as training data to train a classification model; finally, the classification model is used to predict and judge the short message texts to be judged. By training the classification model with a machine learning classification algorithm and using it to predict on text, the embodiment streamlines the tedious manual analysis and template construction process, avoids frequent template modification, improves the text matching efficiency for identifying credit collection short messages, and improves classification accuracy.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.

The structures, proportions, sizes and the like shown in this specification are provided only to accompany the content disclosed in the specification so that those skilled in the art can understand and read it. They are not intended to limit the conditions under which the present invention can be implemented and therefore carry no substantive technical significance. Any modification of structure, change of proportional relationship or adjustment of size that does not affect the effects and purposes achievable by the present invention still falls within the scope covered by the technical content disclosed herein.

Fig. 1 is a schematic diagram of the logical structure of a credit collection short message discrimination system according to an embodiment of the present invention;

Fig. 2 is a schematic flow chart of a credit collection short message discrimination method according to an embodiment of the present invention;

Fig. 3 is a schematic flow chart of screening financial short message texts from all short message texts according to an embodiment of the present invention;

Fig. 4 is a schematic flow chart of performing deduplication processing on the financial short message text according to text similarity to obtain a first short message text sample according to an embodiment of the present invention;

Fig. 5 is a schematic flow chart of performing word segmentation processing on a second short message text sample to obtain a third short message text sample according to an embodiment of the present invention;

Fig. 6 is a schematic flow chart of vectorizing a third short message text sample to obtain a corresponding text word vector according to an embodiment of the present invention;

Fig. 7 is a schematic flow chart of processing a short message text to be judged and inputting it into a classification model for prediction to obtain a judgment result according to an embodiment of the present invention;

Fig. 8 is a schematic flow chart of performing word segmentation processing on a short message text to be judged to obtain a word segmentation result to be judged according to an embodiment of the present invention;

Fig. 9 is a schematic flow chart of vectorizing a word segmentation result to be judged to obtain a text vector to be judged according to an embodiment of the present invention.

Detailed Description

The present invention is described in terms of specific embodiments, and other advantages and benefits of the present invention will become apparent to those skilled in the art from the following disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, an embodiment of the present invention provides a credit collection short message determination system, which specifically includes: the system comprises a matching module 1, a duplication eliminating module 2, a sample base constructing module 3, a word segmentation module 4, a vectorization module 5, a training module 6 and a discrimination module 7.

Further, the matching module 1 is used for screening financial texts from all short message texts through regular matching; the duplication elimination module 2 is used for carrying out duplication elimination processing on the financial short message text according to the text similarity to obtain a first short message text sample; the sample library construction module 3 is used for labeling the first short message text sample to obtain a second short message text sample and establishing a sample library; the word segmentation module 4 is used for performing word segmentation processing on the second short message text sample to obtain a third short message text sample; the vectorization module 5 is used for vectorizing the third short message text sample to obtain a corresponding text word vector; the training module 6 is used for aligning each text word vector with the corresponding label and then using the aligned text word vector as training data to obtain a classification model; the judging module 7 is used for processing data of the short message text to be judged, inputting the data into the classification model for prediction, and obtaining a judging result.

The embodiment of the invention discloses a credit collection short message discrimination system. A sample library is established by labeling short message text samples; the samples are then segmented into words and vectorized to obtain text word vectors; each text word vector is aligned with its corresponding label and used as training data to train a classification model; finally, the classification model is used to predict and judge the short message texts to be judged. By training the classification model with a machine learning classification algorithm and using it for classification prediction on text, the embodiment streamlines the tedious manual analysis and template construction process, avoids frequent template modification, improves the text matching efficiency for identifying credit collection short messages, and improves classification accuracy.

Corresponding to the credit collection short message discrimination system, the embodiment of the invention also discloses a credit collection short message discrimination method, which is described in detail below in combination with the system described above.

Referring to fig. 2, the following describes the specific steps of the credit collection short message discrimination method provided by the embodiment of the present invention.

The matching module 1 screens financial short message texts from all short message texts through regular matching.

Referring to fig. 3, the above steps specifically include: obtaining financial short message characteristics according to the contents of all short message texts; matching the financial short message characteristics from the existing database to obtain corresponding characteristic short message texts; carrying out data cleaning processing on the characteristic short message text; analyzing the linguistic characteristics and structural characteristics of the text by using the cleaned data, and extracting keywords of different types of short message texts; and performing regular matching according to the keywords, and screening out the financial short message text.

This screening step reduces the subsequent manual workload and the volume of training set data, and thereby improves model training efficiency.

And the duplication eliminating module 2 is used for carrying out duplication eliminating processing on the financial short message text according to the text similarity to obtain a first short message text sample.

Referring to fig. 4, the above steps specifically include: splitting the financial short message text, at the granularity of a single character, into at least one feature character to form a text character string; mapping each feature character in the text character string through a hash function to obtain its hash value as an n-bit binary digit string, where n is commonly 32, 64 or 128; weighting each digit string using the occurrence frequency of the feature character as the weight, where a position holding 1 contributes the positive weight and a position holding 0 contributes the negative weight, to obtain a weighted digit string; accumulating the weighted digit strings position by position to form a weighted accumulated digit string; performing dimensionality reduction on the weighted accumulated digit string by setting each position whose value is greater than 0 to 1 and every other position to 0, to obtain a SimHash value, where SimHash is a locality-sensitive hashing algorithm and the generated SimHash value represents the original content to a certain extent; partitioning by the SimHash value and building an index in a HashMap data structure in key-value pair form; calculating, based on the SimHash value, the similarity between the financial short message text to be stored and the financial short message texts already stored in the corresponding HashMap partition, where the Hamming distance between SimHash values is taken as the similarity between texts, and the Hamming distance is the number of positions at which two binary strings of equal length differ; and judging whether the similarity reaches a preset threshold, where the preset threshold is a preset Hamming distance. If the similarity does not reach the preset threshold, the financial short message text to be stored is retained as a first short message text sample; if the similarity reaches the preset threshold, the financial short message text to be stored is discarded.
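The SimHash de-duplication steps can be sketched as follows. This is a minimal illustration, not the patent's implementation: the choice of BLAKE2b as the hash function, the band-based partitioning scheme, the names `simhash`, `hamming` and `deduplicate`, and the default `max_distance` of 3 are all assumptions. "Reaching the threshold" is interpreted here as the Hamming distance being at most the preset distance.

```python
import hashlib
from collections import Counter


def simhash(text: str, bits: int = 64) -> int:
    """n-bit SimHash over single characters, weighted by character frequency."""
    weights = Counter(text)      # occurrence count of each feature character
    acc = [0] * bits             # weighted accumulated digit string
    for ch, w in weights.items():
        # Assumption: BLAKE2b truncated to `bits` bits stands in for the hash function.
        digest = hashlib.blake2b(ch.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            acc[i] += w if (h >> i) & 1 else -w   # bit 1: +weight, bit 0: -weight
    # Dimensionality reduction: positive positions become 1, all others 0.
    return sum(1 << i for i in range(bits) if acc[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions at which a and b differ."""
    return bin(a ^ b).count("1")


def deduplicate(texts, max_distance: int = 3, bits: int = 64, bands: int = 4):
    """Keep a text only if no stored text in its partitions is within max_distance.

    Assumption: the SimHash value is cut into `bands` equal slices and each slice
    is a hash-map key. If bands > max_distance, two values within max_distance
    must agree exactly on at least one slice, so no near-duplicate is missed.
    """
    index = {}   # (band number, band bits) -> stored SimHash values
    kept = []
    width = bits // bands
    mask = (1 << width) - 1
    for text in texts:
        sh = simhash(text, bits)
        keys = [(b, (sh >> (b * width)) & mask) for b in range(bands)]
        candidates = {c for k in keys for c in index.get(k, [])}
        if any(hamming(sh, c) <= max_distance for c in candidates):
            continue   # too similar to a stored sample: discard
        kept.append(text)
        for k in keys:
            index.setdefault(k, []).append(sh)
    return kept
```

Because the weights come from character counts alone, two texts containing the same characters in a different order receive the same SimHash value in this sketch.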

The embodiment of the invention uses the text similarity algorithm to remove financial short message texts that are excessively similar to one another, which solves the problem of excessive sample similarity in the sample library, allows a sample library of the same size to cover a wider range of sample features, and improves the efficiency of subsequent computation and training.

The sample library construction module 3 labels the first short message text sample to obtain a second short message text sample and establishes a sample library.

The steps specifically include: manually judging whether each first short message text sample is a credit collection short message, labeling credit collection short messages as 1 and non-credit-collection short messages as 0 to obtain second short message text samples, and constructing a sample library from the labeled texts, wherein the sample library contains about twenty thousand samples and the ratio of the two labels is kept at one to one.
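One way to keep the two labels at a one-to-one ratio is to down-sample the larger class. The sketch below assumes the manual labels are already available as `(text, label)` pairs; the function name `build_sample_library`, the random down-sampling strategy and the fixed seed are illustrative assumptions rather than details given in the description.

```python
import random


def build_sample_library(labeled, size=20000, seed=0):
    """Build a balanced library from (text, label) pairs.

    label 1 = credit collection short message, label 0 = any other message.
    Assumption: balance is reached by randomly down-sampling each class.
    """
    labeled = list(labeled)
    rng = random.Random(seed)
    positives = [s for s in labeled if s[1] == 1]
    negatives = [s for s in labeled if s[1] == 0]
    # Each class contributes the same number of samples, at most size // 2.
    per_class = min(len(positives), len(negatives), size // 2)
    library = rng.sample(positives, per_class) + rng.sample(negatives, per_class)
    rng.shuffle(library)
    return library
```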

According to the embodiment of the invention, the sample library is constructed through the marked samples, data support is provided for the subsequent word segmentation, vectorization and training processes, and meanwhile, the proportion of credit collection short messages in the sample library is controlled, so that the training effect of the discrimination model is ensured.

The word segmentation module 4 performs word segmentation processing on the second short message text sample to obtain a third short message text sample.

Referring to fig. 5, the above steps specifically include: performing first word segmentation processing on the second short message text sample by using the stop word bank and the user-defined word bank to obtain a first word segmentation result, wherein the first word segmentation result comprises a plurality of first phrases separated by spaces; calculating a first TF-IDF value of each first phrase in the first word segmentation result, wherein TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining, often for extracting keywords from articles, and, being simple and efficient, is widely used in industry for initial text data cleaning; the TF-IDF value combines two quantities, the term frequency (TF) and the inverse document frequency (IDF); judging whether the first TF-IDF value exceeds a first preset word segmentation threshold; if the first TF-IDF value exceeds the first preset word segmentation threshold, adding the first phrase to the stop word bank as a stop word, and then judging, according to the first word segmentation result, whether any user-defined word has not been distinguished; if the first TF-IDF value does not exceed the first preset word segmentation threshold, directly judging, according to the first word segmentation result, whether any user-defined word has not been distinguished, wherein a user-defined word is generally a finance-related special term that is difficult for a word segmentation tool to distinguish automatically; if a user-defined word has not been distinguished in the first word segmentation result, adding the user-defined word to the user-defined word bank, increasing the word segmentation weight of the user-defined word, and performing second word segmentation processing on the second short message text sample by using the updated stop word bank and the updated user-defined word bank to obtain a third short message text sample; and if no user-defined word remains undistinguished in the first word segmentation result, directly using the first word segmentation result to obtain the third short message text sample.
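The TF-IDF thresholding part of this procedure can be sketched as below. The description does not give the exact TF-IDF formula, so the sketch assumes tf = count / document length and idf = ln(N / df). It also assumes that a phrase is added to the stop word bank if its value exceeds the threshold in any document. The word segmentation tool and the user-defined word bank are not modelled; the input is taken to be text that is already space-separated, and `tfidf` and `update_stop_words` are hypothetical names.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {token: tf-idf} dict per document."""
    n = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        scores.append({t: (c / total) * math.log(n / df[t]) for t, c in counts.items()})
    return scores


def update_stop_words(segmented, stop_words, threshold):
    """Apply the stated rule: phrases whose TF-IDF exceeds the threshold become stop words.

    segmented: space-separated segmentation results, one string per message.
    Returns the updated stop word set and the re-filtered segmentation results.
    """
    docs = [[t for t in line.split(" ") if t and t not in stop_words] for line in segmented]
    new_stop = set(stop_words)
    for doc_scores in tfidf(docs):
        new_stop.update(t for t, s in doc_scores.items() if s > threshold)
    resegmented = [" ".join(t for t in doc if t not in new_stop) for doc in docs]
    return new_stop, resegmented
```

With two messages `"a b"` and `"a c"` and a threshold of 0.3, the phrase `a` scores 0 under this formula because it occurs in every message, while `b` and `c` each score 0.5 x ln 2 and are moved to the stop word set.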

The embodiment of the invention removes meaningless words through this word segmentation method, ensures that special terms related to credit collection are distinguished, and continuously updates the stop word bank and the user-defined word bank according to the word segmentation result. Compared with conventional word segmentation methods, this method is iterative and self-optimizing, yields a more concise word segmentation result, and ensures that domain-specific words are distinguished.

The vectorization module 5 vectorizes the third short message text sample to obtain a corresponding text word vector.

Referring to fig. 6, the foregoing steps specifically include: obtaining a first matrix by using the third short message text sample, wherein the first matrix is the one-hot coding matrix of the iterated word; constructing a first central word matrix and a first context matrix according to the total number of segmented words V and the word vector dimension D of the third short message text sample, wherein the first central word matrix is a mapping matrix of shape V x D that maps all words into a D-dimensional space and is used to map the iterated word; performing a first matrix multiplication operation by using the first matrix and the first central word matrix to obtain a second central word matrix; performing a second matrix multiplication operation by using the second central word matrix and the first context matrix to obtain a first inner product matrix; performing Softmax normalization on the first inner product matrix, wherein the Softmax function "compresses" a K-dimensional vector z of arbitrary real numbers into another K-dimensional real vector in which each element lies in the range (0, 1) and all elements sum to 1; adjusting the first central word matrix and the first context matrix by using the normalization result, wherein the normalization result represents the correlation between the iterated word and each corresponding word, the two matrices are adjusted by the back propagation algorithm of the neural network so that the correlation between the iterated word and the words in its context is as large as possible, and all segmented words are traversed as iterated words so that the loss function of the model is gradually minimized, thereby obtaining a first vectorization model; inputting the third short message text sample into the first vectorization model to obtain the first segmented-word vectors of each text; and summing the first segmented-word vectors of each text and averaging the sum to obtain the text word vector.
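The matrix operations above correspond to a Skip-gram model trained with a full Softmax. The sketch below is a small pure-Python illustration under several assumptions: the window size, learning rate, epoch count, initialization and the names `train_skipgram` and `text_vector` are not taken from the description, and a practical system would more likely rely on an optimized library.

```python
import math
import random


def softmax(z):
    """Map real scores to values in (0, 1) that sum to 1."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def train_skipgram(texts, dim=8, window=2, lr=0.05, epochs=50, seed=0):
    """texts: list of token lists. Returns (vocabulary index, V x D central word matrix)."""
    rng = random.Random(seed)
    vocab = {w: i for i, w in enumerate(sorted({w for t in texts for w in t}))}
    V = len(vocab)
    center = [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for _ in range(V)]
    context = [[0.0] * V for _ in range(dim)]   # D x V context matrix
    for _ in range(epochs):
        for tokens in texts:
            ids = [vocab[w] for w in tokens]
            for pos, c in enumerate(ids):       # c: the iterated (central) word
                lo, hi = max(0, pos - window), min(len(ids), pos + window + 1)
                for ctx_pos in range(lo, hi):
                    if ctx_pos == pos:
                        continue
                    o = ids[ctx_pos]            # observed context word
                    # one-hot(c) times the central word matrix selects row c.
                    h = center[c][:]
                    # Inner products of h with every column of the context matrix.
                    scores = [sum(h[d] * context[d][j] for d in range(dim)) for j in range(V)]
                    p = softmax(scores)
                    # Gradient of the cross-entropy loss with respect to the scores.
                    err = [p[j] - (1.0 if j == o else 0.0) for j in range(V)]
                    grad_h = [sum(err[j] * context[d][j] for j in range(V)) for d in range(dim)]
                    # Back propagation: adjust both matrices.
                    for d in range(dim):
                        for j in range(V):
                            context[d][j] -= lr * err[j] * h[d]
                        center[c][d] -= lr * grad_h[d]
    return vocab, center


def text_vector(tokens, vocab, center):
    """Sum the word vectors of a text and divide by their count."""
    rows = [center[vocab[w]] for w in tokens if w in vocab]
    dim = len(center[0])
    if not rows:
        return [0.0] * dim
    return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
```

Multiplying a one-hot row vector by the V x D matrix is equivalent to selecting one row, which is why the sketch indexes `center[c]` instead of performing the full multiplication.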

The embodiment of the invention uses the Word2vec Skip-gram model, which is based on text context. Word2vec is a family of related models for generating word vectors. These models are shallow two-layer neural networks trained to reconstruct the linguistic context of words: given an input word, the network must guess the words at adjacent positions. After training is completed, the Word2vec model can be used to map each word to a vector that represents the relations between words; this vector is taken from the hidden layer of the neural network. The Skip-gram model is one of the Word2vec models, and its basic principle is to predict the words in the context from the current word. In this way, the text can be vectorized more rigorously and accurately.

The training module 6 aligns each text word vector with its corresponding label and uses the aligned vectors as training data to train a classification model.

The steps specifically include: aligning each vectorized short message text with its corresponding manual label; inputting the aligned vectors into a logistic regression model as training data; and fitting the logistic regression model by gradient descent to obtain an optimal solution, with an L2 regularization term used to mitigate overfitting, to finally obtain the trained classification model.
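A logistic regression fitted by batch gradient descent with an L2 penalty can be sketched as follows. The learning rate, penalty strength, epoch count and the function names are assumptions chosen for illustration; the description does not specify them.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, lr=0.1, l2=0.01, epochs=500):
    """Fit weights w and bias b by batch gradient descent with an L2 penalty on w.

    X: list of feature vectors (text word vectors); y: list of 0/1 labels.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Mean log-loss gradient plus the L2 term; the bias is not regularized.
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(x, w, b):
    """Estimated probability that x belongs to class 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```

Increasing `l2` shrinks the weights more strongly, which trades some fit on the training data for less overfitting.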

The judgment module 7 processes data of the short message text to be judged, and inputs the data into the classification model for prediction to obtain a judgment result.

Referring to fig. 7, the above steps specifically include: performing word segmentation processing on the short message text to be distinguished to obtain a word segmentation result to be distinguished; vectorizing the word segmentation result to be distinguished to obtain a text vector to be distinguished; inputting the text vector to be distinguished into the classification model, and predicting whether the short message text to be distinguished is a credit collection short message text; if the short message text to be distinguished is a credit collection short message text, the distinguishing result is 1; and if the short message text to be distinguished is a non-credit-collection short message text, the distinguishing result is 0.
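The prediction flow (segment, vectorize, classify) can be sketched as one function. Everything passed in is hypothetical: `segment` is any callable that returns a token list, `embeddings` is a word-to-vector mapping assumed to come from an already trained vectorization model, and `weights` and `bias` are assumed to come from the trained classifier. The 0.5 decision threshold is also an assumption.

```python
import math


def discriminate(text, segment, embeddings, weights, bias, threshold=0.5):
    """Return 1 for a credit collection short message, otherwise 0."""
    tokens = segment(text)                         # word segmentation
    rows = [embeddings[t] for t in tokens if t in embeddings]
    dim = len(weights)
    if rows:
        # Text vector: average of the word vectors of the known tokens.
        vec = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
    else:
        vec = [0.0] * dim
    z = sum(w * v for w, v in zip(weights, vec)) + bias
    # Numerically stable logistic function.
    prob = 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))
    return 1 if prob >= threshold else 0
```

A toy call, using made-up two-dimensional embeddings and `str.split` as the segmenter:

```python
toy_embeddings = {"overdue": [1.0, 0.0], "repay": [0.8, 0.2], "welcome": [0.0, 1.0]}
result = discriminate("overdue repay", str.split, toy_embeddings, [2.0, -2.0], 0.0)
```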

Further, referring to fig. 8, performing word segmentation processing on the short message text to be distinguished to obtain a word segmentation result to be distinguished specifically includes: performing third word segmentation processing on the short message text to be distinguished by using the stop word bank and the user-defined word bank to obtain a third word segmentation result, wherein the third word segmentation result comprises at least one second phrase, and the second phrases are separated by spaces; calculating a second TF-IDF value of each second phrase in the third word segmentation result; judging whether the second TF-IDF value exceeds a second preset word segmentation threshold; if the second TF-IDF value exceeds the second preset word segmentation threshold, adding the second phrase to the stop word bank as a stop word, and then judging, according to the third word segmentation result, whether any user-defined word has not been distinguished; if the second TF-IDF value does not exceed the second preset word segmentation threshold, directly judging, according to the third word segmentation result, whether any user-defined word has not been distinguished; if a user-defined word has not been distinguished in the third word segmentation result, adding the user-defined word to the user-defined word bank, increasing the word segmentation weight of the user-defined word, and performing fourth word segmentation processing on the short message text to be distinguished by using the updated stop word bank and the updated user-defined word bank to obtain the word segmentation result to be distinguished; and if no user-defined word remains undistinguished in the third word segmentation result, directly using the third word segmentation result to obtain the word segmentation result to be distinguished.

Further, referring to fig. 9, vectorizing the word segmentation result to be distinguished to obtain a text vector to be distinguished specifically includes: obtaining a second matrix by using the word segmentation result to be distinguished; constructing a third central word matrix and a second context matrix according to the total number of segmented words and the word vector dimension of the word segmentation result to be distinguished; performing a third matrix multiplication operation by using the second matrix and the third central word matrix to obtain a fourth central word matrix; performing a fourth matrix multiplication operation by using the fourth central word matrix and the second context matrix to obtain a second inner product matrix; normalizing the second inner product matrix, and adjusting the third central word matrix and the second context matrix by using the normalization result to obtain a second vectorization model; inputting the word segmentation result to be distinguished into the second vectorization model to obtain the second segmented-word vectors of each text; and summing the second segmented-word vectors of each text and averaging the sum to obtain the text vector to be distinguished.

The embodiment of the invention discloses a credit collection short message distinguishing method, which comprises establishing a sample library by labeling short message text samples, then performing word segmentation and vectorization on the samples to obtain text word vectors, aligning each text word vector with its corresponding label to serve as training data for training a classification model, and finally using the classification model to predict and distinguish the short message texts to be distinguished. According to the embodiment of the invention, the classification model is trained with a machine learning classification algorithm and used to classify and predict texts, which streamlines the complicated manual analysis and template construction processes, avoids frequent modification of templates, effectively improves the text matching efficiency for distinguishing credit collection short messages, and enhances the classification accuracy.

In addition, the embodiment of the invention also provides a credit collection short message distinguishing device, which comprises: a processor and a memory;

the memory is used for storing one or more program instructions;

the processor is used for running the one or more program instructions to execute the steps of the credit collection short message distinguishing method.

In addition, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the credit collection short message distinguishing method described in any of the above.

In an embodiment of the invention, the processor may be an integrated circuit chip having signal processing capability. The processor may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

The processor may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The processor reads the information in the storage medium and completes the steps of the method in combination with its hardware.

The storage medium may be a memory, for example, which may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory.

The nonvolatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash Memory.

The volatile memory may be a Random Access Memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM).

The storage media described in connection with the embodiments of the invention are intended to comprise, without being limited to, these and any other suitable types of memory.

Those skilled in the art will recognize that, in one or more of the examples described above, the functionality described in this disclosure may be implemented in a combination of hardware and software. When implemented in software, the corresponding functionality may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.

Although the invention has been described in detail with respect to the general description and the specific embodiments, it will be apparent to those skilled in the art that modifications and improvements may be made based on the invention. Accordingly, it is intended that all such modifications and alterations be included within the scope of this invention as defined in the appended claims.

Claims

1. A credit collection short message judgment method is characterized by comprising the following steps:

marking the first short message text sample to obtain a second short message text sample, and establishing a sample library;

and processing data of the short message text to be distinguished, inputting the data into the classification model for prediction, and obtaining a distinguishing result.

2. The method as claimed in claim 1, wherein the step of performing word segmentation on the second short message text sample to obtain a third short message text sample comprises:

3. The credit collection short message judgment method as claimed in claim 2, wherein vectorizing the third short message text sample to obtain a corresponding text word vector comprises:

obtaining a first matrix by using the third short message text sample;

4. The credit collection short message judgment method as claimed in claim 3, wherein processing data of the short message text to be judged and inputting the processed data into the classification model for prediction to obtain a judgment result comprises the following steps:

vectorizing the word segmentation result to be distinguished to obtain a text vector to be distinguished;

5. The credit collection short message judgment method as claimed in claim 4, wherein performing word segmentation processing on the short message text to be distinguished to obtain a word segmentation result to be distinguished comprises:

6. The method as claimed in claim 5, wherein vectorizing the word segmentation result to be determined to obtain a text vector to be determined comprises:

7. The credit collection short message judgment method as claimed in any one of claims 1 to 6, wherein before labeling a first short message text sample to obtain a second short message text sample, the method further comprises:

and carrying out duplication elimination processing on the financial short message text according to the text similarity to obtain the first short message text sample.

8. A credit collection short message distinguishing system, the system comprising:

and the judging module is used for processing data of the short message text to be judged, inputting the data into the classification model for prediction and obtaining a judging result.

9. A credit collection short message distinguishing device, said device comprising: a processor and a memory;

the memory for storing one or more program instructions;

the processor is configured to execute the one or more program instructions to perform the steps of the credit collection short message judgment method according to any one of claims 1 to 7.

10. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when executed by a processor, the computer program implements the steps of the credit collection short message judgment method as claimed in any one of claims 1 to 7.