CN117633231A - Model-based classification method, device and equipment for bank system text - Google Patents

Publication number
CN117633231A
Authority
CN
China
Prior art keywords
text
bank system
classified
washed
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311659524.7A
Other languages
Chinese (zh)
Inventor
田荟双
李鑫
李金金
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Bank of China
Original Assignee
Agricultural Bank of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Bank of China
Priority to CN202311659524.7A
Publication of CN117633231A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application provide a model-based method, apparatus and device for classifying bank system texts. The method comprises: acquiring a bank system text to be classified; performing data cleaning on the bank system text to be classified to obtain a cleaned bank system text to be classified; extracting keywords from the cleaned bank system text to be classified to obtain a keyword combination; inputting the keyword combination into a preset classification model to obtain a feature vector of the keyword combination; and, based on the preset classification model, determining the cosine similarity between the feature vector of the keyword combination and the vector of each preset system classification label, and determining the system classification label corresponding to the label vector with the highest cosine similarity as the system classification label of the bank system text to be classified. The time cost is thereby reduced, and the system classification label of the bank system text to be classified is determined more accurately.

Description

Model-based classification method, device and equipment for bank system text
Technical Field
Embodiments of the present application relate to text processing, natural language processing and intelligent model technologies, and in particular to a model-based method, apparatus and device for classifying bank system texts.
Background
Banks store bank system texts, which are the documents that set out a bank's internal rules and regulations. These texts need to be classified so that the system classification label of each bank system text can be determined. For example, the system classification label may be an internal daily work category, a business work processing category, an internal management category, and so on.
In the prior art, a person reads the bank system text and then manually determines its system classification label.
However, manually determining the system classification label of a bank system text depends on the experience of the person classifying it, consumes a large amount of time, and yields low classification accuracy.
Disclosure of Invention
Embodiments of the present application provide a model-based method, apparatus and device for classifying bank system texts, so as to solve the problems of high time cost and low classification accuracy in determining the system classification labels of bank system texts.
In a first aspect, an embodiment of the present application provides a model-based method for classifying bank system texts, including:
acquiring a bank system text to be classified; performing data cleaning on the bank system text to be classified to obtain a cleaned bank system text to be classified;
extracting keywords from the cleaned bank system text to be classified to obtain a keyword combination, wherein the keyword combination comprises at least one keyword and is used to represent the cleaned bank system text to be classified;
inputting the keyword combination into a preset classification model to obtain a feature vector of the keyword combination; and
determining, based on the preset classification model, the cosine similarity between the feature vector of the keyword combination and the vector of each preset system classification label, and determining the system classification label corresponding to the label vector with the highest cosine similarity as the system classification label of the bank system text to be classified.
In a second aspect, an embodiment of the present application provides a model-based apparatus for classifying bank system texts, including:
a first acquisition unit, configured to acquire a bank system text to be classified;
a first processing unit, configured to perform data cleaning on the bank system text to be classified to obtain a cleaned bank system text to be classified;
a first extraction unit, configured to extract keywords from the cleaned bank system text to be classified to obtain a keyword combination, wherein the keyword combination comprises at least one keyword and is used to represent the cleaned bank system text to be classified;
a first determining unit, configured to input the keyword combination into a preset classification model to obtain a feature vector of the keyword combination; and
a second determining unit, configured to determine, based on the preset classification model, the cosine similarity between the feature vector of the keyword combination and the vector of each preset system classification label, and to determine the system classification label corresponding to the label vector with the highest cosine similarity as the system classification label of the bank system text to be classified.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory and a processor;
the memory being configured to store instructions executable by the processor;
wherein the processor is configured to perform the method of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored therein computer-executable instructions for implementing the method of the first aspect when executed by a processor.
In a fifth aspect, embodiments of the present application provide a computer program product comprising: a computer program stored in a readable storage medium, from which it can be read by at least one processor of an electronic device, the at least one processor executing the computer program causing the electronic device to perform the method of the first aspect.
The model-based method, apparatus and device for classifying bank system texts provided by the embodiments acquire a bank system text to be classified, perform data cleaning on it to obtain a cleaned bank system text to be classified, and extract keywords from the cleaned text to obtain a keyword combination; the keyword combination comprises at least one keyword and is used to represent the cleaned bank system text to be classified. The data cleaning automatically removes unnecessary words, and the extracted keywords stand in for the full text, so the amount of data to be processed and the amount of computation during classification are both reduced. The keyword combination is input into a preset classification model to obtain a feature vector of the keyword combination; based on the preset classification model, the cosine similarity between that feature vector and the vector of each preset system classification label is determined, and the system classification label corresponding to the label vector with the highest cosine similarity is determined as the system classification label of the bank system text to be classified. Because the label is determined by a model rather than manually, the time cost is reduced and the system classification label of the bank system text to be classified is determined more accurately.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flowchart of a model-based method for classifying bank system texts according to an embodiment of the present application;
FIG. 2 is a flowchart of another model-based method for classifying bank system texts according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a model-based apparatus for classifying bank system texts according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Specific embodiments thereof have been shown by way of example in the drawings and will herein be described in more detail. These drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but to illustrate the concepts of the present application to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
Banks store bank system texts, which are the documents that set out a bank's internal rules and regulations. These texts need to be classified so that the system classification label of each bank system text can be determined. For example, the system classification label may be an internal daily work category, a business work processing category, an internal management category, a rules and regulations category, a management measures category, and so on.
In one example, a person reads the bank system text and then manually determines its system classification label.
However, manually determining the system classification label depends on the experience of the person classifying the text, and bank system texts are both numerous and long. For example, there may be tens of thousands of bank system texts, each running to thousands of words, and there are many system classification labels, so completing the whole classification process manually takes a long time and incurs a large time cost. In addition, because there are so many classification labels, manual classification depends heavily on the operator's experience, and the classification accuracy is low.
Embodiments of the present application provide a model-based method, apparatus and device for classifying bank system texts, which aim to solve the above technical problems in the prior art.
The following describes the technical solutions of the present application and how the technical solutions of the present application solve the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
FIG. 1 is a flowchart of a model-based method for classifying bank system texts according to an embodiment of the present application. As shown in FIG. 1, the method includes:
101. Acquiring a bank system text to be classified, and performing data cleaning on the bank system text to be classified to obtain a cleaned bank system text to be classified.
The execution body of this embodiment may be an electronic device, a terminal device, a model-based apparatus or device for classifying bank system texts, or any other apparatus or device capable of executing this embodiment. This embodiment is described with an electronic device as the execution body.
First, the electronic device needs to acquire the bank system text to be classified. The electronic device may send a first acquisition instruction to a server, and the server sends the stored bank system text to be classified to the electronic device.
Alternatively, a hardware storage device of the electronic device stores the bank system text to be classified, and the electronic device retrieves the text from that hardware storage device.
The electronic device then performs data cleaning on the bank system text to be classified, removing punctuation marks, numbers, redundant data and the like, to obtain the cleaned bank system text to be classified.
102. Extracting keywords from the cleaned bank system text to be classified to obtain a keyword combination; the keyword combination comprises at least one keyword and is used to represent the cleaned bank system text to be classified.
Illustratively, keywords are extracted from the cleaned bank system text to be classified. For example, the keywords may be extracted based on a preset keyword extraction model, based on text analysis, based on natural language processing, based on a neural word-embedding model (such as word2vec), or based on a linear discriminant analysis model (such as LDA), and so on.
Natural language processing is a branch of computer science concerned with processing and understanding natural language. It is an important component of artificial intelligence and linguistics: a computer is used to analyze natural language text, which enables communication between natural language and computers and supports functions such as intelligent robots.
A keyword combination is thereby obtained, wherein the keyword combination comprises at least one keyword. In this embodiment, each cleaned bank system text to be classified is represented by its own keyword combination.
103. Inputting the keyword combination into a preset classification model to obtain the feature vector of the keyword combination.
Illustratively, a classification model is trained in advance. The keyword combination of each cleaned bank system text to be classified is input into the preset classification model, which outputs the feature vector of that keyword combination. In other words, the preset classification model processes the keyword combination of the cleaned bank system text to be classified so that the keyword combination is represented by a feature vector.
For example, the preset classification model is a deep learning model.
Deep learning refers to combining low-level features to form more abstract high-level attributes or features, so as to discover distributed feature representations of data. Compared with constructing features through hand-written rules, deep learning learns features from large amounts of data and can better capture the rich information inherent in the data.
Deep learning is an effective way to model and analyze data. Machine learning models based on deep learning are more general, can handle complex problems, and offer improved classification accuracy and prediction performance. In essence, deep learning constructs an artificial neural network model containing multiple hidden layers. It inherits the training and operation modes of traditional neural network models, but a deep learning model generally has more hidden layers, so it can extract more features from the original input text: the vector representation of the input text in the original vector space is transformed into a new vector space, and the additional features extracted there drive the output. In addition, deep learning can draw more feature information from large amounts of data, which significantly improves classification and prediction accuracy.
104. Based on the preset classification model, determining the cosine similarity between the feature vector of the keyword combination and the vector of each preset system classification label, and determining the system classification label corresponding to the label vector with the highest cosine similarity as the system classification label of the bank system text to be classified.
Illustratively, the preset system classification labels are already provided in the preset classification model, and each preset system classification label undergoes feature processing to obtain its vector. The cosine similarity between the feature vector of the keyword combination and the vector of each preset system classification label can therefore be calculated based on the preset classification model.
The system classification label corresponding to the label vector with the highest cosine similarity is then determined as the system classification label of the bank system text to be classified.
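Step 104 can be illustrated with a minimal sketch. The three-dimensional vectors below are invented for illustration only; in the described method both the feature vector and the label vectors would come from the preset classification model, and the function names `cosine_similarity` and `pick_label` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors (an assumption)
    return dot / (norm_u * norm_v)

def pick_label(feature_vector, label_vectors):
    """Return the label whose vector is most similar to the feature vector."""
    return max(
        label_vectors,
        key=lambda label: cosine_similarity(feature_vector, label_vectors[label]),
    )

# Toy label vectors (hypothetical values, not produced by any real model).
label_vectors = {
    "internal daily work": [1.0, 0.0, 0.0],
    "business work processing": [0.0, 1.0, 0.0],
    "internal management": [0.0, 0.0, 1.0],
}
feature_vector = [0.2, 0.9, 0.1]  # stands in for the keyword-combination vector
predicted = pick_label(feature_vector, label_vectors)
```

Because cosine similarity ignores vector length, only the direction of the feature vector relative to each label vector affects the chosen label.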
In this embodiment, a bank system text to be classified is acquired and data cleaning is performed on it to obtain a cleaned bank system text to be classified; keywords are extracted from the cleaned text to obtain a keyword combination, which comprises at least one keyword and is used to represent the cleaned bank system text to be classified. The data cleaning automatically removes unnecessary words, and the extracted keywords stand in for the full text, so the amount of data to be processed and the amount of computation during classification are both reduced. The keyword combination is input into a preset classification model to obtain a feature vector of the keyword combination; based on the preset classification model, the cosine similarity between that feature vector and the vector of each preset system classification label is determined, and the system classification label corresponding to the label vector with the highest cosine similarity is determined as the system classification label of the bank system text to be classified. Because the label is determined by a model rather than manually, the time cost is reduced and the system classification label of the bank system text to be classified is determined more accurately.
FIG. 2 is a flowchart of another model-based method for classifying bank system texts according to an embodiment of the present application. As shown in FIG. 2, the method includes:
201. Acquiring the bank system text to be classified.
The execution body of this embodiment may be an electronic device, a terminal device, a model-based apparatus or device for classifying bank system texts, or any other apparatus or device capable of executing this embodiment. This embodiment is described with an electronic device as the execution body.
For this step, refer to step 101 above; details are not repeated here.
202. Performing data cleaning on the bank system text to be classified to obtain the cleaned bank system text to be classified.
In one example, step 202 includes the following process:
removing noise information from the bank system text to be classified by regular-expression matching to obtain a denoised bank system text to be classified, wherein the noise information includes numbers and punctuation marks; performing word segmentation on the denoised bank system text to be classified to obtain the words of the denoised text; and removing stop words from those words to obtain the cleaned bank system text to be classified.
First, noise information is removed from the bank system text to be classified by regular-expression matching to obtain the denoised bank system text to be classified. The noise information includes numbers, punctuation marks and other redundant information. Numbers and punctuation marks are noise and do not help the classification process. Other redundant information is, for example, advertisements.
Word segmentation is then performed on the denoised bank system text to be classified to obtain each word of the denoised text.
Special symbols are then removed from the words; they also need to be deleted because they are redundant, meaningless information for the classification task. Stop words are then removed from the words to obtain the cleaned bank system text to be classified.
This data cleaning process removes numbers, punctuation marks, other redundant information, special symbols and the like from the text. This prevents such information from degrading the performance of the subsequent classification model and also reduces the load on the classification model to some extent.
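The cleaning process of step 202 can be sketched as follows. This is only an illustration under stated assumptions: the sample sentence is English, `segment` is a hypothetical whitespace tokenizer standing in for a real Chinese word segmenter, and `STOP_WORDS` is a small invented list rather than the stop-word list used by the applicant.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "of", "and", "a", "to", "is", "in", "by"}

def remove_noise(text):
    """Strip numbers, then punctuation and special symbols, via regular expressions."""
    text = re.sub(r"\d+", " ", text)         # numbers
    text = re.sub(r"[^\w\s]|_", " ", text)   # punctuation and special symbols
    return text

def segment(text):
    """Placeholder word segmentation: lowercase and split on whitespace."""
    return text.lower().split()

def clean(text):
    """Denoise, segment, then drop stop words."""
    words = segment(remove_noise(text))
    return [w for w in words if w not in STOP_WORDS]

sample = "Article 12: The approval of loans, in 2023, is handled by the branch."
cleaned_words = clean(sample)
```

For Chinese bank system texts, the whitespace split would need to be replaced by a dedicated word segmentation tool, since Chinese words are not separated by spaces.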
203. Determining the total word count of the cleaned bank system text to be classified, wherein the total word count represents the total number of words in the bank system text; and determining the number of occurrences of each word in the cleaned bank system text to be classified to obtain the occurrence count information of each word.
Illustratively, this embodiment uses the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, which represents a bank system text as a vector of real-valued components in a vector space. Each component of the vector corresponds to a term, and the dimension of the vector corresponds to the size of the vocabulary, so the bank system text is represented as a point in that space; the TF-IDF value of a word is its importance weight.
First, the total number of words in the cleaned bank system text to be classified is counted to obtain the total word count. Then, for each word in the cleaned bank system text to be classified, the number of times the word occurs in that text is counted to obtain the occurrence count information of the word.
204. Determining the ratio of the occurrence count information of each word to the total word count to obtain the word frequency information of each word.
For each word in the cleaned bank system text to be classified, the ratio of the word's occurrence count information to the total word count is determined as the word frequency information of that word.
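Steps 203 and 204 amount to a relative-frequency count. A minimal sketch, assuming the cleaned text is already available as a list of words (the function name `term_frequencies` and the sample words are hypothetical):

```python
from collections import Counter

def term_frequencies(words):
    """Map each word to (occurrences of the word) / (total number of words)."""
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(words)
    return {word: count / total for word, count in counts.items()}

words = ["loan", "approval", "loan", "branch"]
tf = term_frequencies(words)
```

Here `tf["loan"]` is 0.5 because the word accounts for two of the four words in the text.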
205. Determining total number information of the bank system texts in a preset corpus, wherein the total number information represents the total number of bank system texts in the preset corpus; determining the text quantity of each word, wherein the text quantity of a word represents the number of bank system texts in the preset corpus that contain the word; and determining the inverse document frequency of each word according to the total number information and the text quantity of the word.
In one example, the inverse document frequency of each word is $\mathrm{idf}(t) = \log\left(\frac{N}{n_t + 1}\right)$, where $t$ denotes the word $t$, $N$ is the total number information, and $n_t$ is the text quantity of the word $t$.
Illustratively, a preset corpus is provided, which includes a plurality of bank system texts. The total number of bank system texts in the preset corpus is counted to obtain the total number information $N$.
Then, for each word in the cleaned bank system text to be classified, the number of bank system texts in the preset corpus that contain the word is counted to obtain the text quantity of the word. That is, for a word $t$, its text quantity $n_t$ is obtained.
Then, for each word $t$ in the cleaned bank system text to be classified, the inverse document frequency $\mathrm{idf}(t) = \log\left(\frac{N}{n_t + 1}\right)$ is determined according to the total number information $N$ and the text quantity $n_t$ of the word. In this formula, the more common a word is, the larger the denominator, and the smaller the inverse document frequency, which approaches 0. The 1 is added to the denominator to prevent the denominator from being 0, which would otherwise happen when none of the bank system texts contains the word. $\log$ denotes taking the logarithm of the resulting value.
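The inverse document frequency of step 205 can be computed directly from the formula above. In this sketch the corpus is a tiny invented list of already-segmented texts, the function name is hypothetical, and the natural logarithm is an assumption, since the source does not state the base of the logarithm.

```python
import math

def inverse_document_frequencies(words, corpus):
    """idf(t) = log(N / (n_t + 1)) for each distinct word t in `words`.

    `corpus` is a list of texts, each given as a list of words.
    """
    n_total = len(corpus)                       # N
    doc_sets = [set(doc) for doc in corpus]
    idf = {}
    for word in set(words):
        n_t = sum(1 for doc in doc_sets if word in doc)
        idf[word] = math.log(n_total / (n_t + 1))  # natural log (assumption)
    return idf

corpus = [
    ["loan", "approval"],
    ["loan", "audit"],
    ["loan", "branch"],
    ["branch", "rule"],
]
idf = inverse_document_frequencies(["loan", "audit"], corpus)
```

With this toy corpus, "loan" appears in three of four texts and gets an inverse document frequency of 0, while the rarer "audit" gets log 2. Note that with the added 1, a word that appears in every text receives a slightly negative value.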
206. Determining the importance weight of each word according to the word frequency information of each word and the inverse document frequency of each word.
In one example, the importance weight of each word is $\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$, where $t$ denotes the word $t$; $d$ denotes the cleaned bank system text to be classified; $\mathrm{tf}(t, d)$ is the word frequency information of the word $t$; and $\mathrm{idf}(t)$ is the inverse document frequency of the word $t$.
Illustratively, for each word $t$ in the cleaned bank system text to be classified, the importance weight $\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$ of the word is determined according to its word frequency information $\mathrm{tf}(t, d)$ and its inverse document frequency $\mathrm{idf}(t)$.
207. Sorting the words in the cleaned bank system text to be classified by importance weight from high to low to obtain sorted words.
Illustratively, the words in the cleaned bank system text to be classified are sorted by their importance weights from high to low to obtain the sorted words.
208. Determining the words whose importance weights rank in the top M as the keywords of the cleaned bank system text to be classified, so as to obtain the keyword combination, wherein M is a positive integer greater than or equal to 1. The keyword combination comprises at least one keyword and is used to represent the cleaned bank system text to be classified.
Illustratively, the words whose importance weights rank in the top M are determined as the keywords of the cleaned bank system text to be classified, so as to obtain the keyword combination. In this embodiment, the keyword combination is used to represent the cleaned bank system text to be classified. For example, M = 8.
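Steps 206 to 208 can be sketched as a weight, sort and truncate pipeline. The sketch assumes the importance weight is the standard TF-IDF product of word frequency and inverse document frequency; the function name `top_keywords`, the alphabetical tie-break and the numeric values are hypothetical.

```python
def top_keywords(tf, idf, m=8):
    """Return the m words with the highest tf * idf weight, highest first."""
    # Importance weight as the TF-IDF product (assumed form of the weight).
    weights = {word: tf[word] * idf.get(word, 0.0) for word in tf}
    # Sort by weight from high to low; ties broken alphabetically (an assumption).
    ranked = sorted(weights, key=lambda word: (-weights[word], word))
    return ranked[:m]

# Invented word frequencies and inverse document frequencies.
tf = {"loan": 0.5, "audit": 0.25, "branch": 0.25}
idf = {"loan": 0.1, "audit": 0.7, "branch": 0.3}
keyword_combination = top_keywords(tf, idf, m=2)
```

In this toy example the frequent but common word "loan" is outranked by the rarer "audit" and "branch", which is the effect the inverse document frequency is meant to have.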
209. Inputting the keyword combination into the preset classification model to obtain the feature vector of the keyword combination.
In one example, the preset classification model is a BERT model with a fully connected layer.
The following process is also performed before step 209:
acquiring a text set to be trained, wherein the text set to be trained comprises at least one bank system text to be trained, and each bank system text to be trained has an initial system classification label;
performing data cleaning on the bank system text to be trained to obtain a cleaned bank system text to be trained;
extracting keywords from the cleaned bank system text to be trained to obtain a keyword combination of the cleaned bank system text to be trained, wherein the keyword combination comprises at least one keyword and is used to represent the cleaned bank system text to be trained;
inputting the keyword combination of the cleaned bank system text to be trained and the initial system classification label of the cleaned bank system text to be trained into an initial model to obtain a recognized system classification label of the cleaned bank system text to be trained;
if the initial system classification label of the cleaned bank system text to be trained matches its recognized system classification label, assigning a first preset value to the cleaned bank system text to be trained;
if the initial system classification label of the cleaned bank system text to be trained does not match its recognized system classification label, assigning a second preset value to the cleaned bank system text to be trained; and
modifying the parameters of the initial model based on the first preset value and the second preset value to obtain the preset classification model.
Illustratively, the preset classification model is obtained by training. The preset classification model is a BERT model with a softmax fully connected layer. The BERT model is trained on Chinese general-domain data; for bank system text, a PAIR pre-training method is provided: taking the BERT model as the basis, a text set to be trained is constructed and pre-training of the BERT model is continued, thereby obtaining the preset classification model.
Acquiring a text set to be trained; the text set to be trained comprises at least one bank system text to be trained, and the bank system text to be trained has an initial system classification label.
Data cleaning processing is performed to obtain the cleaned bank system text to be trained. In one example, noise information in the bank system text to be trained is removed by regular-expression matching to obtain the denoised bank system text to be trained, wherein the noise information includes data and punctuation marks; word segmentation is performed on the denoised bank system text to be trained to obtain its words; and the stop words among the words are removed to obtain the washed bank system text to be trained.
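The cleaning steps above (denoising by regular-expression matching, word segmentation, stop-word removal) can be sketched as follows. This is only an illustrative sketch under stated assumptions: the stop-word list `STOP_WORDS`, the pattern `NOISE_PATTERN`, and the function name `clean_text` are hypothetical, digits are assumed to stand for the "data" noise, and whitespace splitting stands in for a real word segmenter, which the specification does not name.

```python
import re
from typing import List

# Hypothetical stop-word list; a real system would load a curated one.
STOP_WORDS = {"the", "of", "and", "to", "a", "in"}

# Assumption: "data and punctuation marks" is read here as digit runs
# plus any character that is neither a word character nor whitespace.
NOISE_PATTERN = re.compile(r"\d+|[^\w\s]")


def clean_text(text: str) -> List[str]:
    """Denoise, segment, and drop stop words; returns the remaining words."""
    # Step 1: remove noise by regular-expression matching.
    denoised = NOISE_PATTERN.sub(" ", text)
    # Step 2: word segmentation (whitespace split as a stand-in segmenter).
    words = denoised.lower().split()
    # Step 3: remove stop words.
    return [word for word in words if word not in STOP_WORDS]
```

For Chinese text, the whitespace split would have to be replaced by a dedicated word segmenter; the rest of the flow is unchanged.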
Then, extracting keywords in the washed bank system text to be trained to obtain a keyword combination of the washed bank system text to be trained; the keyword combination comprises at least one keyword, and the keyword combination is used for representing the washed banking system text to be trained.
In one example, determining the total word number in the washed bank system text to be trained; wherein the total word number represents the total number of words in the bank system text; and determining the occurrence times of each word in the washed banking system text to be trained, and obtaining the occurrence times information of each word.
And determining the ratio of the occurrence number information of each word to the total word number, and obtaining word frequency information of each word.
Determining total number information of the bank system texts in the preset corpus, wherein the total number information represents the total number of the bank system texts in the preset corpus; determining the text quantity of each word, wherein the text quantity characterization of each word comprises the quantity of banking system texts comprising each word in a preset corpus; and determining the inverse document frequency of each word according to the total number information and the text quantity of each word.
And determining importance weight of each word according to the word frequency information of each word and the inverse document frequency of each word.
And sequencing the words in the washed banking system text to be trained according to the importance weight from high to low to obtain sequenced words.
Determining the words with the importance weights of the first M as keywords in the washed banking system text to be trained so as to obtain keyword combinations; wherein M is a positive integer greater than or equal to 1. The keyword combination comprises at least one keyword, and the keyword combination is used for representing the washed banking system text to be trained.
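The keyword extraction described above can be sketched as follows. The sketch assumes that the importance weight is the product of word frequency and inverse document frequency (the usual TF-IDF weight), that the logarithm is the natural logarithm, and that ties are broken alphabetically; none of these details is fixed by the text. The function name `top_m_keywords` and its parameters are hypothetical.

```python
import math
from collections import Counter
from typing import List


def top_m_keywords(words: List[str], corpus: List[List[str]], m: int = 8) -> List[str]:
    """Rank the words of one cleaned text by tf * idf and keep the top m.

    words  -- words of the cleaned bank system text
    corpus -- the preset corpus, one list of words per bank system text
              (assumed non-empty)
    """
    total_words = len(words)              # total word number
    counts = Counter(words)               # occurrence count of each word
    n_texts = len(corpus)                 # total number of texts, N
    text_sets = [set(text) for text in corpus]

    weights = {}
    for word, count in counts.items():
        tf = count / total_words                              # word frequency
        n_t = sum(1 for text in text_sets if word in text)    # texts containing the word
        idf = math.log(n_texts / (n_t + 1))                   # inverse document frequency
        weights[word] = tf * idf                              # importance weight (assumed product)

    # Sort from high to low importance weight; alphabetical order breaks ties.
    ranked = sorted(weights, key=lambda w: (-weights[w], w))
    return ranked[:m]
```

With the "+1" in the denominator, a word that occurs in every text of the corpus receives a negative weight and therefore ranks last.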
The bank system texts to be trained after cleaning comprise texts with correct initial system classification labels and texts with incorrect initial system classification labels.
For example, if 1200 to-be-trained bank system texts are included in the to-be-trained text set, the initial system classification labels of 1000 to-be-trained bank system texts are correct, and the initial system classification labels of 200 to-be-trained bank system texts are wrong.
And then, the keyword combination of the washed bank system text to be trained and the initial system classification label of the washed bank system text to be trained are input into an initial model, and the identified system classification label of the washed bank system text to be trained is obtained. The initial model at this time is the BERT model.
The BERT (Bidirectional Encoder Representations from Transformers) model is a language model trained without supervision for natural language processing tasks. The BERT model extracts semantics using bidirectional encoding and uses the context of each word when encoding the input text. Compared with a unidirectional encoder, which can extract semantics only from the preceding text, the BERT model has a stronger semantic information extraction capability.
The main structure of the BERT model is a stack of Transformers, and its use is typically divided into two stages, pre-training and fine-tuning. The input of the BERT model has three parts: token embedding, position embedding, and segment embedding. The token embedding is the word vector, and the first token is the CLS mark, which can be used for subsequent classification tasks. The position embedding is encoded by sine and cosine functions, and there are also learnable position-encoding schemes that can extend the position code to sequences of unseen length, for example sequences appearing during testing that are longer than any text in the training samples. The segment embedding is used to distinguish between two sentences, because pre-training includes not only the masked LM task but also a classification task that takes two sentences as input.
A softmax fully connected layer is added to the BERT model. Based on the softmax fully connected layer, if the initial system classification label of the washed bank system text to be trained matches its identified system classification label, a first preset value is given to the washed bank system text to be trained, the first preset value being 1. If the initial system classification label does not match the identified system classification label, a second preset value is given to the washed bank system text to be trained, the second preset value being 0.
Then, the parameters of the initial model, that is, the BERT model with the softmax fully connected layer, are modified based on the first preset value and the second preset value, thereby obtaining the preset classification model.
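The assignment of the first and second preset values can be sketched as follows. This is a minimal sketch, not the training procedure itself: the per-label scores `logits` are assumed to come from the fully connected layer, the names `softmax` and `match_value` are hypothetical, and the way the preset values are then used to update the model parameters is not specified here and is left out.

```python
import math
from typing import List

FIRST_PRESET_VALUE = 1   # initial label matches the identified label
SECOND_PRESET_VALUE = 0  # initial label does not match the identified label


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def match_value(logits: List[float], label_names: List[str], initial_label: str) -> int:
    """Return the preset value for one washed training text.

    logits        -- hypothetical per-label scores from the fully connected layer
    label_names   -- label name for each position in logits
    initial_label -- initial system classification label of the text
    """
    probs = softmax(logits)
    # The identified label is the one with the highest softmax probability.
    identified = label_names[probs.index(max(probs))]
    return FIRST_PRESET_VALUE if identified == initial_label else SECOND_PRESET_VALUE
```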
Thus, the keyword combination of the bank system text to be classified after washing obtained in step 208 is input into a preset classification model, and the feature vector of the keyword combination is output based on the BERT model in the preset classification model.
210. Based on a preset classification model, determining cosine similarity between the feature vector of the keyword combination and each preset system classification label vector, and determining a system classification label corresponding to the system classification label vector with the highest cosine similarity, wherein the system classification label is the system classification label of the bank system text to be classified.
Illustratively, based on the fully connected layer in the preset classification model, the cosine similarity is determined between the feature vector of the keyword combination of the bank system text to be classified (the keyword combination obtained in step 208) and each preset system classification label vector. The system classification label corresponding to the system classification label vector with the highest cosine similarity is then determined as the system classification label of the bank system text to be classified.
Based on the similarity matching mode, the system classification label of the bank system text to be classified can be accurately determined. Among them, similarity matching is a widely used technique in the field of computer science, which refers to determining a relationship between two or more data by comparing the degree of similarity between them, such as "whether they match" or "strength of relationship"; data such as numbers, words or images may be compared to determine similarity between them; for text data, similarity matching is a text comparison technology, and an algorithm is used for measuring the matching degree between two text segments, and is mainly used for scenes such as text search, similar file comparison, content sharing and the like. Similarity matching may use character similarity, sentence similarity, or the degree of coincidence of keywords to compare the similarity of texts.
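The similarity matching in step 210 can be sketched as follows. The sketch assumes that the feature vector and the label vectors are plain lists of floats of equal length; the names `cosine_similarity`, `classify`, and `label_vectors` are hypothetical, and returning 0.0 for a zero vector is a choice made here, not something the text prescribes.

```python
import math
from typing import Dict, List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(feature_vector: List[float], label_vectors: Dict[str, List[float]]) -> str:
    """Return the label whose vector has the highest cosine similarity."""
    return max(
        label_vectors,
        key=lambda label: cosine_similarity(feature_vector, label_vectors[label]),
    )
```

A brief usage example: with `label_vectors = {"credit": [1.0, 0.0], "audit": [0.0, 1.0]}`, the call `classify([0.9, 0.1], label_vectors)` selects `"credit"`, because the feature vector points almost entirely along the first axis.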
In this embodiment, a bank system text to be classified is acquired, and data cleaning processing is performed on it to obtain the cleaned bank system text to be classified. Then, keywords in the washed bank system text to be classified are extracted based on TF-IDF to obtain a keyword combination; the keyword combination comprises at least one keyword and is used to represent the washed bank system text to be classified. In this way, features of the bank system text to be classified are extracted; the TF-IDF-based preprocessing performs preliminary feature extraction on the bank system text to be classified, which can effectively improve the classification precision of the system classification labels. Next, the keyword combination is input into a preset classification model to obtain the feature vector of the keyword combination; based on the preset classification model, the cosine similarity between the feature vector of the keyword combination and the vector of each preset system classification label is determined, and the system classification label corresponding to the system classification label vector with the highest cosine similarity is determined as the system classification label of the bank system text to be classified. By using natural language processing and classifying bank system texts through deep learning, both the efficiency and the accuracy of bank system text classification can be effectively improved.
In the process of training to obtain the preset classification model, the bank system text to be trained and its initial system classification label are input into the BERT model to perform a classification task, so that the model is trained to judge whether the input bank system text and the system classification label match. The retraining process is completed based on the fully connected layer, so that the model learns more feature information and the model effect is improved. In addition, a network that captures the correlation between bank system texts can be pre-trained, which can effectively improve the classification effect for bank system texts.
Fig. 3 is a schematic structural diagram of a classification device for a model-based banking system text according to an embodiment of the present application, and as shown in fig. 3, the device includes:
a first obtaining unit 31 is configured to obtain a banking text to be classified.
The first processing unit 32 is configured to perform data cleaning processing on the bank system text to be classified, so as to obtain cleaned bank system text to be classified.
A first extracting unit 33, configured to extract keywords in the washed bank system text to be classified, so as to obtain a keyword combination; the keyword combination comprises at least one keyword, and the keyword combination is used for representing the washed bank system text to be classified.
The first determining unit 34 is configured to input the keyword combinations into a preset classification model, and obtain feature vectors of the keyword combinations.
The second determining unit 35 is configured to determine, based on a preset classification model, a cosine similarity between the feature vector of the keyword combination and a vector of each preset system classification label, and determine a system classification label corresponding to the system classification label vector with the highest cosine similarity, which is a system classification label of a bank system text to be classified.
In one example, the first processing unit 32 is specifically configured to:
removing noise information in the bank system text to be classified based on a regular matching mode to obtain the bank system text to be classified after denoising; wherein the noise information includes data and punctuation marks; word segmentation processing is carried out on the denoised bank system text to be classified, so that words of the denoised bank system text to be classified are obtained; and removing stop words in the words to obtain the washed bank system text to be classified.
In one example, the first extraction unit 33 is specifically configured to:
determining the total word number in the washed bank system text to be classified; wherein the total word number represents the total number of words in the bank system text; determining the occurrence times of each word in the washed bank system text to be classified, and obtaining the occurrence times information of each word; determining the ratio between the occurrence number information of each word and the total word number, and obtaining word frequency information of each word; determining total number information of the bank system texts in the preset corpus, wherein the total number information represents the total number of the bank system texts in the preset corpus; determining the text quantity of each word, wherein the text quantity characterization of each word comprises the quantity of banking system texts comprising each word in a preset corpus; determining the inverse document frequency of each word according to the total number information and the text quantity of each word; determining importance weight of each word according to word frequency information of each word and inverse document frequency of each word; sequencing all the words in the washed bank system text to be classified according to the importance weight from high to low to obtain sequenced words; determining the words with the importance weights of the first M as keywords in the washed bank system text to be classified so as to obtain keyword combinations; wherein M is a positive integer greater than or equal to 1.
In one example, the inverse document frequency of each word is $\mathrm{idf}(t) = \log\left(\frac{N}{N_t + 1}\right)$, wherein $t$ represents the word $t$; $N$ is the total number information, and $N_t$ is the text quantity of the word $t$.
In one example, the importance weight of each word is $\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$, wherein $t$ represents the word $t$; $d$ represents the washed bank system text to be classified; $\mathrm{tf}(t, d)$ is the word frequency information of the word $t$, and $\mathrm{idf}(t)$ is the inverse document frequency of the word $t$.
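As a worked illustration of these two quantities, consider hypothetical numbers: the word $t$ occurs 4 times in a washed text $d$ of 80 words, the preset corpus contains $N = 1000$ bank system texts, and $N_t = 99$ of them contain $t$. The natural logarithm and the product form of the importance weight are assumptions made for this illustration.

```latex
\begin{align*}
\mathrm{tf}(t, d) &= \frac{4}{80} = 0.05 \\
\mathrm{idf}(t) &= \ln\frac{N}{N_t + 1} = \ln\frac{1000}{99 + 1} = \ln 10 \approx 2.3026 \\
\mathrm{tf}(t, d) \times \mathrm{idf}(t) &\approx 0.05 \times 2.3026 \approx 0.1151
\end{align*}
```

A word that appears in fewer texts of the corpus has a larger inverse document frequency and, for the same word frequency, a larger importance weight.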
In one example, the apparatus further comprises:
the second acquisition unit is used for acquiring a text set to be trained; the text set to be trained comprises at least one bank system text to be trained, and the bank system text to be trained has an initial system classification label.
And the second processing unit is used for cleaning the data to obtain the cleaned bank system text to be trained.
The second extraction unit is used for extracting keywords in the washed bank system text to be trained to obtain a keyword combination of the washed bank system text to be trained; the keyword combination comprises at least one keyword, and the keyword combination is used for representing the washed banking system text to be trained.
And the third determining unit is used for inputting the keyword combination of the washed banking system text to be trained and the initial system classification label of the washed banking system text to be trained into the initial model to obtain the identified system classification label of the washed banking system text to be trained.
And the fourth determining unit is used for giving a first preset value to the washed to-be-trained bank system text if the initial system classification label of the washed to-be-trained bank system text is matched with the identified system classification label of the washed to-be-trained bank system text.
And a fifth determining unit, configured to assign a second preset value to the washed to-be-trained bank system text if it is determined that the initial system classification label of the washed to-be-trained bank system text is not matched with the identified system classification label of the washed to-be-trained bank system text.
The modification unit is used for modifying parameters of the initial model based on the first preset value and the second preset value to obtain a preset classification model.
In one example, the preset classification model is a BERT model with a fully connected layer.
Illustratively, for this embodiment reference may be made to the above method embodiment; the principles and technical effects are similar and are not repeated here.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application, where, as shown in fig. 4, the electronic device includes: a memory 41, a processor 42;
the memory 41 is configured to store instructions executable by the processor 42;
Wherein the processor 42 is configured to perform the method as provided in the above embodiments.
The electronic device further comprises a receiver 43 and a transmitter 44. The receiver 43 is for receiving instructions and data transmitted from an external device, and the transmitter 44 is for transmitting instructions and data to the external device.
Fig. 5 is a block diagram of an electronic device, which may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, etc., provided in an embodiment of the present application.
The apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the apparatus 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the apparatus 800. Examples of such data include instructions for any application or method operating on the device 800, contact data, phonebook data, messages, pictures, videos, and the like. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 806 provides power to the various components of the device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the apparatus 800 is in an operational mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing status assessment of various aspects of the apparatus 800. For example, the sensor assembly 814 may detect an on/off state of the device 800 and the relative positioning of components, such as the display and keypad of the device 800; the sensor assembly 814 may also detect a change in position of the device 800 or of one of its components, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in temperature of the device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communication between the apparatus 800 and other devices, either in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 804 including instructions executable by processor 820 of apparatus 800 to perform the above-described method. For example, the non-transitory computer readable storage medium may be ROM, Random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
Embodiments of the present application also provide a non-transitory computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the method provided by the above embodiments.
The embodiment of the application also provides a computer program product, which comprises: a computer program stored in a readable storage medium, from which at least one processor of an electronic device can read, the at least one processor executing the computer program causing the electronic device to perform the solution provided by any one of the embodiments described above.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method for classifying a model-based banking system text, the method comprising:
acquiring a bank system text to be classified; carrying out data cleaning treatment on the bank system text to be classified to obtain cleaned bank system text to be classified;
extracting keywords in the washed bank system text to be classified to obtain a keyword combination; the keyword combination comprises at least one keyword, and the keyword combination is used for representing the washed bank system text to be classified;
inputting the keyword combination into a preset classification model to obtain a feature vector of the keyword combination;
and determining cosine similarity between the feature vector of the keyword combination and the vector of each preset system classification label based on the preset classification model, and determining the system classification label corresponding to the system classification label vector with the highest cosine similarity as the system classification label of the bank system text to be classified.
2. The method according to claim 1, wherein the step of performing data cleaning processing on the banking text to be classified to obtain cleaned banking text to be classified comprises:
removing noise information in the bank system text to be classified based on a regular matching mode to obtain the bank system text to be classified after denoising; wherein the noise information includes data and punctuation marks;
word segmentation processing is carried out on the denoised bank system text to be classified, so that words of the denoised bank system text to be classified are obtained;
and removing the stop words in the words to obtain the washed bank system text to be classified.
3. The method according to claim 1, wherein the washed banking text to be classified comprises at least one term; extracting keywords in the washed bank system text to be classified to obtain a keyword combination, wherein the keyword combination comprises the following steps:
determining the total word number in the washed bank system text to be classified; wherein the total word number characterizes the total number of words in a banking system text;
determining the occurrence times of each word in the washed bank system text to be classified, and obtaining the occurrence times information of each word;
Determining the ratio between the occurrence frequency information of each word and the total word number, and obtaining word frequency information of each word;
determining total number information of the bank system texts in a preset corpus, wherein the total number information represents the total number of the bank system texts in the preset corpus; determining the text quantity of each word, wherein the text quantity characterization of each word comprises the quantity of banking system texts comprising each word in a preset corpus;
determining the inverse document frequency of each word according to the total number information and the text quantity of each word;
determining importance weight of each word according to word frequency information of each word and inverse document frequency of each word;
sequencing the words in the washed bank system text to be classified according to the importance weight from high to low to obtain sequenced words;
determining the words with the importance weights of the first M as keywords in the washed bank system text to be classified so as to obtain keyword combinations; wherein M is a positive integer greater than or equal to 1.
4. A method according to claim 3, wherein the inverse document frequency of each word is $\mathrm{idf}(t) = \log\left(\frac{N}{N_t + 1}\right)$; wherein $t$ represents the word $t$; $N$ is the total number information, and $N_t$ is the text quantity of the word $t$.
5. A method according to claim 3, wherein the importance weight of each word is $\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$; wherein $t$ represents the word $t$; $d$ represents the washed bank system text to be classified; $\mathrm{tf}(t, d)$ is the word frequency information of the word $t$, and $\mathrm{idf}(t)$ is the inverse document frequency of the word $t$.
6. The method according to any one of claims 1-5, further comprising:
acquiring a text set to be trained; the text set to be trained comprises at least one bank system text to be trained, and the bank system text to be trained is provided with an initial system classification label;
carrying out data cleaning treatment to obtain a cleaned bank system text to be trained;
extracting keywords in the washed bank system text to be trained to obtain a keyword combination of the washed bank system text to be trained; the keyword combination comprises at least one keyword, and the keyword combination is used for representing the washed banking system text to be trained;
inputting the keyword combination of the washed bank system text to be trained and the initial system classification label of the washed bank system text to be trained into an initial model to obtain the identified system classification label of the washed bank system text to be trained;
If the initial system classification label of the washed to-be-trained bank system text is matched with the identified system classification label of the washed to-be-trained bank system text, giving a first preset value to the washed to-be-trained bank system text;
if the initial system classification label of the washed to-be-trained bank system text is not matched with the identified system classification label of the washed to-be-trained bank system text, giving a second preset value to the washed to-be-trained bank system text;
and modifying parameters of the initial model based on the first preset value and the second preset value to obtain the preset classification model.
7. The method according to any one of claims 1-5, wherein the pre-set classification model is a BERT model with fully connected layers.
8. A model-based classification device for bank system text, the device comprising:
the first acquisition unit is used for acquiring the bank system text to be classified;
the first processing unit is used for carrying out data cleaning processing on the bank system text to be classified to obtain a washed bank system text to be classified;
the first extraction unit is used for extracting keywords in the washed bank system text to be classified to obtain a keyword combination; the keyword combination comprises at least one keyword, and the keyword combination is used for representing the washed bank system text to be classified;
the first determining unit is used for inputting the keyword combination into a preset classification model to obtain a feature vector of the keyword combination;
and the second determining unit is used for determining, based on the preset classification model, the cosine similarity between the feature vector of the keyword combination and the vector of each preset system classification label, determining the system classification label corresponding to the system classification label vector with the highest cosine similarity, and taking that system classification label as the system classification label of the bank system text to be classified.
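The label selection performed by the second determining unit can be sketched as below. The feature vector and label vectors would come from the preset classification model; here they are small hand-written lists, and the names `cosine_similarity`, `classify` and the example labels are assumptions for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(feature_vector, label_vectors):
    """Return the label whose vector has the highest cosine similarity to the feature vector."""
    return max(
        label_vectors,
        key=lambda lab: cosine_similarity(feature_vector, label_vectors[lab]),
    )


# Hypothetical label vectors and keyword-combination feature vector.
label_vectors = {
    "credit": [1.0, 0.0, 0.0],
    "retail": [0.0, 1.0, 0.0],
    "risk": [0.0, 0.0, 1.0],
}
predicted = classify([0.2, 0.9, 0.1], label_vectors)
```

Because cosine similarity ignores vector length, a long and a short text with the same keyword profile are assigned the same label.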
9. An electronic device, the electronic device comprising: a memory and a processor;
the memory is used for storing instructions executable by the processor;
wherein the processor is configured to perform the method of any one of claims 1-7.
10. A computer readable storage medium having stored therein computer executable instructions which when executed by a processor are adapted to carry out the method of any one of claims 1-7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311659524.7A CN117633231A (en) 2023-12-05 2023-12-05 Model-based classification method, device and equipment for bank system text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311659524.7A CN117633231A (en) 2023-12-05 2023-12-05 Model-based classification method, device and equipment for bank system text

Publications (1)

Publication Number Publication Date
CN117633231A CN117633231A (en) 2024-03-01

Family

ID=90021334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311659524.7A Pending CN117633231A (en) 2023-12-05 2023-12-05 Model-based classification method, device and equipment for bank system text

Country Status (1)

Country Link
CN (1) CN117633231A (en)

Similar Documents

Publication Title
CN109522424B (en) Data processing method and device, electronic equipment and storage medium
CN113792207B (en) Cross-modal retrieval method based on multi-level feature representation alignment
CN110781305A (en) Text classification method and device based on classification model and model training method
CN110827831A (en) Voice information processing method, device, equipment and medium based on man-machine interaction
CN112699686B (en) Semantic understanding method, device, equipment and medium based on task type dialogue system
CN108121736A (en) A kind of descriptor determines the method for building up, device and electronic equipment of model
CN108345581A (en) A kind of information identifying method, device and terminal device
CN108345612A (en) A kind of question processing method and device, a kind of device for issue handling
CN113095085B (en) Emotion recognition method and device for text, electronic equipment and storage medium
CN111753091A (en) Classification method, classification model training method, device, equipment and storage medium
CN107424612B (en) Processing method, apparatus and machine-readable medium
CN111046927B (en) Method and device for processing annotation data, electronic equipment and storage medium
CN111274389B (en) Information processing method, device, computer equipment and storage medium
CN112328793A (en) Comment text data processing method and device and storage medium
CN116166843B (en) Text video cross-modal retrieval method and device based on fine granularity perception
CN111984765B (en) Knowledge base question-answering process relation detection method and device
CN116127062A (en) Training method of pre-training language model, text emotion classification method and device
CN115718801A (en) Text processing method, model training method, device, equipment and storage medium
CN115730073A (en) Text processing method, device and storage medium
CN115146633A (en) Keyword identification method and device, electronic equipment and storage medium
CN110968246A (en) Intelligent Chinese handwriting input recognition method and device
CN111400443B (en) Information processing method, device and storage medium
CN111667829B (en) Information processing method and device and storage medium
CN117633231A (en) Model-based classification method, device and equipment for bank system text
CN104699668B (en) Determine the method and device of Words similarity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination