CN110377731A - Complaint text processing method, device, computer equipment and storage medium - Google Patents

Complaint text processing method, device, computer equipment and storage medium

Info

Publication number
CN110377731A
CN110377731A (application CN201910528626.2A)
Authority
CN
China
Prior art keywords
text
risk
complaint
words
phrases
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910528626.2A
Other languages
Chinese (zh)
Inventor
田鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
Original Assignee
OneConnect Smart Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Smart Technology Co Ltd filed Critical OneConnect Smart Technology Co Ltd
Priority to CN201910528626.2A priority Critical patent/CN110377731A/en
Publication of CN110377731A publication Critical patent/CN110377731A/en
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/35: Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

This application discloses a complaint text processing method, apparatus, computer equipment and storage medium. The complaint text processing method comprises: receiving multiple complaint texts to be processed and performing text preprocessing on each complaint text; vectorizing each preprocessed complaint text and inputting it into a preset text classification model for text classification; extracting the sensitive risk words and phrases in each classified complaint text and scoring each of them in a predetermined manner to obtain the risk score corresponding to each sensitive risk word or phrase; calculating, from the sensitive risk words and phrases and their corresponding risk scores, the urgency score corresponding to each classified complaint text; and sorting the classified complaint texts according to preset rules based on the urgency scores. The complaint text processing method improves the processing efficiency of complaint texts and helps staff handle, in a timely manner, the complaint texts that reflect more serious problems.

Description

Complaint text processing method, device, computer equipment and storage medium
Technical field
This application relates to the technical field of data processing, and in particular to a complaint text processing method, apparatus, computer equipment and storage medium.
Background technique
With the continuous improvement of people's living standards and the continuous development of network technology, people can lodge complaints with relevant enterprises or government agencies over the network, for example quickly through dedicated website platforms or mobile applications. This produces massive amounts of unstructured complaint text data, which must be processed promptly and analyzed accurately for the various problems it contains. In the prior art, however, this processing and analysis relies mainly on manual work, which is very inefficient and easily causes large backlogs of complaint text data; moreover, complaint texts that reflect more serious problems can easily lead to adverse consequences if they are not handled in time.
Summary of the invention
The main purpose of this application is to provide a complaint text processing method, apparatus, computer equipment and storage medium, intended to improve the processing efficiency of unstructured complaint texts while helping staff handle, in a timely manner, the complaint texts that reflect more serious problems.
This application proposes a complaint text processing method, comprising:
receiving multiple complaint texts to be processed and performing text preprocessing on each complaint text, where the text preprocessing includes word segmentation, stopword removal and punctuation removal;
vectorizing each preprocessed complaint text and inputting each vectorized complaint text into a preset text classification model for text classification;
extracting the sensitive risk words and phrases in each classified complaint text and scoring each of them in a predetermined manner to obtain the risk score corresponding to each sensitive risk word or phrase;
calculating, from the sensitive risk words and phrases and their corresponding risk scores, the urgency score corresponding to each classified complaint text;
sorting the classified complaint texts according to preset rules based on the urgency scores.
Further, the text classification model is a multilayer perceptron model trained to a preset precision, and before the step of receiving multiple complaint texts to be processed and performing text preprocessing on each complaint text, the method further comprises:
obtaining a first complaint text corpus with class labels, and performing text preprocessing on the first complaint text corpus;
converting the preprocessed first complaint text corpus into a set of text space vectors;
randomly selecting a specified number of text space vectors from the set as a training set, with the remaining text space vectors serving as a test set;
inputting the training set into the multilayer perceptron model for training;
inputting the test set into the trained multilayer perceptron model for verification, to judge whether the precision of the multilayer perceptron model reaches the preset precision;
if it does not, iteratively adjusting the parameters of the multilayer perceptron model until its precision reaches the preset precision.
Further, the step of converting the preprocessed first complaint text corpus into a set of text space vectors comprises:
calculating the TF-IDF value of each word in the preprocessed first complaint text corpus using a preset TF-IDF algorithm;
selecting from the preprocessed first complaint text corpus the words whose TF-IDF value exceeds a preset threshold as feature words, and collecting the feature words to generate a feature dictionary;
converting the preprocessed first complaint text corpus into the set of text space vectors according to the feature dictionary.
Further, the step of extracting the sensitive risk words and phrases in each classified complaint text, scoring each of them in a predetermined manner and obtaining the corresponding risk scores comprises:
extracting, based on a preset sensitive risk dictionary, the sensitive risk words and phrases in each classified complaint text by means of regular expressions, where the sensitive risk dictionary stores sensitive risk words and phrases, the risk levels corresponding to them and the risk score corresponding to each risk level;
querying the sensitive risk dictionary for the risk level corresponding to each extracted sensitive risk word or phrase, and determining the corresponding risk score from that risk level.
Further, before the step of extracting, based on the preset sensitive risk dictionary, the sensitive risk words and phrases in each classified complaint text by means of regular expressions, the method further comprises:
obtaining a second complaint text corpus and, based on the preset sensitive risk dictionary, identifying by regular expressions the sensitive risk words and phrases in the second complaint text corpus and attaching the corresponding risk level labels, where the sensitive risk dictionary stores sensitive risk words and phrases, risk levels and risk scores;
deleting from the second complaint text corpus the sensitive risk words and phrases that carry risk level labels, and performing text preprocessing on the resulting corpus to obtain the keywords in the second complaint text corpus;
collecting the keywords to generate bag-of-words data;
converting each keyword in the bag-of-words data into a corresponding first feature vector, and converting the sensitive risk words and phrases in the sensitive risk dictionary into corresponding second feature vectors;
performing similarity matching between the first feature vectors and the second feature vectors using a preset similarity matching algorithm, and determining the first feature vectors that match a second feature vector;
adding the keywords corresponding to the matched first feature vectors under the corresponding risk level categories in the sensitive risk dictionary, to form the sensitive risk dictionary.
Further, the step of sorting the classified complaint texts according to preset rules based on the urgency scores comprises:
sorting the classified complaint texts from high to low by urgency score;
determining and labeling the risk level of each complaint text according to a preset score-to-level mapping table and the urgency score corresponding to each complaint text, where the score-to-level mapping table stores multiple urgency score intervals and the risk level corresponding to each interval.
Further, after the step of vectorizing each preprocessed complaint text and inputting each vectorized complaint text into the preset text classification model for text classification, the method further comprises:
obtaining a third complaint text corpus;
collecting entities from vertical websites and compiling the collected entities into an entity dictionary;
identifying, according to the entity dictionary, the named entities in the third complaint text corpus using regular expressions and natural language processing tools, and annotating the identified named entities;
converting the annotated named entities into first word vectors, inputting the first word vectors into a preset Bi-LSTM-CRF model, and training the parameters of the Bi-LSTM-CRF model by backpropagation to obtain a Bi-LSTM-CRF model with optimal parameters;
obtaining the second word vectors of each classified complaint text through a preset word vector model, inputting the second word vectors into the Bi-LSTM-CRF model for named entity recognition, and outputting the named entity recognition result of each complaint text.
This application also proposes a complaint text processing apparatus, comprising:
a preprocessing module for receiving multiple complaint texts to be processed and performing text preprocessing on each complaint text, where the text preprocessing includes word segmentation, stopword removal and punctuation removal;
a classification module for vectorizing each preprocessed complaint text and inputting each vectorized complaint text into a preset text classification model for text classification;
an extraction module for extracting the sensitive risk words and phrases in each classified complaint text and scoring each of them in a predetermined manner to obtain the corresponding risk scores;
a computing module for calculating, from the sensitive risk words and phrases and their corresponding risk scores, the urgency score corresponding to each classified complaint text;
a sorting module for sorting the classified complaint texts according to preset rules based on the urgency scores.
This application also proposes a computer equipment comprising a memory and a processor, the memory storing a computer program, where the processor implements the steps of the aforementioned complaint text processing method when executing the computer program.
This application also proposes a computer-readable storage medium on which a computer program is stored, where the computer program implements the steps of the aforementioned complaint text processing method when executed by a processor.
The beneficial effects of this application are as follows: the complaint text processing method provided by the embodiments of this application preprocesses and vectorizes the complaint texts to be processed, so that the unstructured complaint texts are represented as structured text space vectors, and then uses a preset text classification model to classify the multiple complaint texts to be processed automatically. After classification, the urgency of each complaint text is analyzed by extracting the sensitive risk words and phrases that reflect the severity of the problems in the text, and the complaint texts under each text category are scored and sorted by urgency, so that the complaint texts reflecting more serious problems are ranked higher. By classifying and sorting unstructured complaint texts automatically and in a targeted way, the method not only effectively improves the processing efficiency of complaint texts but also helps staff handle, in a timely manner, the complaint texts that reflect more serious problems, reducing the risk of adverse consequences.
Detailed description of the invention
Fig. 1 is a schematic flowchart of the complaint text processing method in an embodiment of this application;
Fig. 2 is a schematic structural diagram of the complaint text processing apparatus in an embodiment of this application;
Fig. 3 is a schematic structural diagram of the first conversion module in an embodiment of this application;
Fig. 4 is a schematic structural diagram of the complaint text processing apparatus in another embodiment of this application;
Fig. 5 is a schematic structural diagram of the extraction module in an embodiment of this application;
Fig. 6 is a schematic structural diagram of the sorting module in an embodiment of this application;
Fig. 7 is a schematic structural diagram of the computer equipment in an embodiment of this application.
The realization of the objectives, the functional features and the advantages of this application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Specific embodiment
It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to limit it.
Referring to Fig. 1, an embodiment of this application proposes a complaint text processing method, comprising:
S11: receiving multiple complaint texts to be processed and performing text preprocessing on each complaint text, where the text preprocessing includes word segmentation, stopword removal and punctuation removal;
S12: vectorizing each preprocessed complaint text and inputting each vectorized complaint text into a preset text classification model for text classification;
S13: extracting the sensitive risk words and phrases in each classified complaint text and scoring each of them in a predetermined manner to obtain the corresponding risk scores;
S14: calculating, from the sensitive risk words and phrases and their corresponding risk scores, the urgency score corresponding to each classified complaint text;
S15: sorting the classified complaint texts according to preset rules based on the urgency scores.
In S11, specifically, the multiple complaint texts to be processed entered by staff are received. An existing Chinese word segmentation tool (such as the jieba segmenter) can then be used to segment each complaint text, after which an existing Chinese stopword list (such as the Baidu stopword list) is used to delete from the segmented complaint text the words without substantive meaning (such as modal particles) as well as the punctuation marks, in preparation for subsequent operations.
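The preprocessing in S11 can be sketched as follows. This is a minimal illustration only: the stopword list is a hypothetical fragment, and the input is assumed to be already segmented into tokens (a real system would segment with a tool such as jieba first).

```python
import re

# Hypothetical mini stopword list (a real system would load e.g. the
# Baidu stopword list); tokens are assumed to be pre-segmented.
STOPWORDS = {"的", "了", "吗", "啊"}
PUNCTUATION = re.compile(r"^[\W_]+$")  # tokens made only of punctuation

def preprocess(tokens):
    """Drop stopwords and pure-punctuation tokens (S11)."""
    return [t for t in tokens if t not in STOPWORDS and not PUNCTUATION.match(t)]

tokens = ["馒头", "的", "添加剂", "超标", "了", "！"]
print(preprocess(tokens))  # ['馒头', '添加剂', '超标']
```

The surviving tokens are what would later be vectorized in S12.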
In S12, since a computer can only operate on numerical data, and the preprocessed complaint texts are character data that the computer cannot process directly, the preprocessed complaint texts need to be vectorized. Specifically, each preprocessed complaint text can be vectorized through a bag-of-words model to obtain the text space vector corresponding to each complaint text; the resulting text space vectors are then input into the preset text classification model, which outputs the text category of each complaint text, thereby achieving automated classification of the complaint texts. Here, the text classification model is a specially trained neural network model whose role is to classify the complaint texts into different text categories.
In S13, the problems contained in a complaint text can generally be identified through certain specific words or sentences. For example, a complaint text about food safety will usually contain keywords such as "expired", "food safety standard", "food poisoning" or "exceeding limits", while a complaint text about a government agency will usually contain keywords such as "corruption", "accepting bribes" or "behaving badly", or certain specific sentences. The severity of the problems reflected by different keywords or sentences also varies: in a complaint text about a government agency, keywords such as "corruption" or "accepting bribes" very probably indicate a serious problem, whereas keywords such as "behaving badly" or "swearing" indicate a relatively mild one. Certain specific words and sentences can therefore serve as sensitive risk words and phrases, and the sensitive risk words and phrases contained in a complaint text can be used to analyze the severity of the problems it reflects. Specifically, the sensitive risk words and phrases (i.e. sensitive risk words and sensitive risk sentence patterns) contained in each classified complaint text can be extracted using a sensitive word dictionary built from terms commonly used in the complaints domain, and the extracted sensitive risk words and phrases are then scored in a predetermined manner. For example, the risk score of each sensitive risk word or phrase can be predefined in the constructed sensitive word dictionary, so that the risk scores are obtained at the same time as the sensitive risk words and phrases are extracted.
In S14, the urgency score of each complaint text can be calculated by the following formula:

Y = Σᵢ mᵢ · xᵢ

where Y is the urgency score of the complaint text, mᵢ is the risk score corresponding to the i-th sensitive risk word or phrase in the complaint text, and xᵢ is the term frequency of the i-th sensitive risk word or phrase in that complaint text.
For example, suppose the sensitive risk words and phrases extracted from a complaint text are "corruption", "accepting bribes", "XX department", "xxx expresses strong dissatisfaction with xxx", "complaint" and "XX committee", where the term frequency of "corruption" in the complaint text is 2 (it appears twice) and the term frequency of each of the remaining sensitive risk words and phrases is 1, and the risk scores predefined for these words and phrases in the sensitive word dictionary are, in order, 4, 4, 2, 3, 1 and 3. Then by the above formula the urgency score of the complaint text is Y = 4 × 2 + 4 × 1 + 2 × 1 + 3 × 1 + 1 × 1 + 3 × 1 = 21. The higher the urgency score, the more serious the problem reflected by the complaint text.
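The worked example above can be reproduced directly. The terms (English glosses of the examples in the text), risk scores and frequencies below are those given in the example:

```python
# Each entry: sensitive risk word/phrase -> (risk score m_i, term frequency x_i),
# taken from the worked example in the text.
extracted = [
    ("corruption", 4, 2),                             # appears twice
    ("accepting bribes", 4, 1),
    ("XX department", 2, 1),
    ("xxx expresses strong dissatisfaction with xxx", 3, 1),
    ("complaint", 1, 1),
    ("XX committee", 3, 1),
]

# Urgency score: Y = sum of (risk score x term frequency) over all terms.
Y = sum(m * x for _, m, x in extracted)
print(Y)  # 21
```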
In S15, once the urgency score of each complaint text under each text category has been calculated, the complaint texts can be sorted according to preset rules, for example by sorting the complaint texts under each text category from high to low by urgency score. In this way the complaint texts reflecting more serious problems are ranked higher, which helps staff handle them in a timely manner.
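The sorting rule of S15 reduces to a sort by urgency score in descending order; the texts and scores below are illustrative placeholders:

```python
# Sort classified complaint texts from high to low by urgency score (S15).
complaints = [("text A", 7), ("text B", 21), ("text C", 12)]
ranked = sorted(complaints, key=lambda c: c[1], reverse=True)
print([name for name, _ in ranked])  # ['text B', 'text C', 'text A']
```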
In this embodiment, the complaint text processing method preprocesses and vectorizes the complaint texts to be processed, so that the unstructured complaint texts are represented as structured text space vectors, and then uses the preset text classification model to classify the multiple complaint texts to be processed automatically. After classification, the urgency of each complaint text is analyzed by extracting the sensitive risk words and phrases that reflect the severity of the problems in the text, and the complaint texts under each text category are scored and sorted by urgency, so that the complaint texts reflecting more serious problems are ranked higher. By classifying and sorting unstructured complaint texts automatically and in a targeted way, the method not only effectively improves the processing efficiency of complaint texts but also helps staff handle, in a timely manner, the complaint texts that reflect more serious problems, reducing the risk of adverse consequences.
In an alternative embodiment, the text classification model is a multilayer perceptron model trained to a preset precision, and before the step of receiving multiple complaint texts to be processed and performing text preprocessing on each complaint text, the method further comprises:
S101: obtaining a first complaint text corpus with class labels, and performing text preprocessing on the first complaint text corpus;
S102: converting the preprocessed first complaint text corpus into a set of text space vectors;
S103: randomly selecting a specified number of text space vectors from the set as a training set, with the remaining text space vectors serving as a test set;
S104: inputting the training set into the multilayer perceptron model for training;
S105: inputting the test set into the trained multilayer perceptron model for verification, to judge whether the precision of the multilayer perceptron model reaches the preset precision;
S106: if it does not, iteratively adjusting the parameters of the multilayer perceptron model until its precision reaches the preset precision.
In S101, the first complaint text corpus can be obtained through a data collection tool (such as a web crawler). The first complaint text corpus is a collection of complaint texts carrying different category labels, such as complaint texts about food safety, complaint texts about government agencies, complaint texts about consumer services, and so on; the specific categories and quantities depend on actual use requirements and are not limited here. After the first complaint text corpus is obtained, each complaint text in it can further undergo text preprocessing operations such as word segmentation, stopword removal and punctuation removal, in preparation for subsequent operations.
In S102, the preprocessed first complaint text corpus can be converted through a bag-of-words model into a set of text space vectors (i.e. the collection of the individual text space vectors).
In S103, specifically, 70% of the text space vectors under each label category can be randomly selected as the training set, with the remaining 30% serving as the test set.
In S104, the multilayer perceptron model comprises an input layer, a hidden layer and an output layer. Adjacent layers are fully connected, while units within the same layer are not connected to each other. The connection weights of each layer can be adjusted through learning, so the model can learn a large number of pattern mappings and describe the mapping between input and output without any prior knowledge of a mathematical function relating them. To map input patterns to the desired output patterns, it is only necessary to train the model on known patterns; through learning, the model acquires the ability to map highly nonlinear relationships. The number of neurons in the output layer can be designed according to the actual classification requirements; for example, for five-way classification the output layer can be designed with 5 neurons.
In S105, after the multilayer perceptron model has been trained on the training set, the test set is used to further verify its precision. Specifically, the test set is fed through the input layer to the hidden layer of the multilayer perceptron model, and after the hidden layer's processing the text classification results are output at the output layer; the precision of the multilayer perceptron model is then obtained by calculating the percentage of the total number of complaint texts in the test set that were classified correctly, and it is verified whether this precision reaches the preset precision (for example 80%).
In S106, if the precision of the multilayer perceptron model reaches the preset precision, the model can be used directly for subsequent text classification; if it does not, the parameters of the multilayer perceptron model (such as the thresholds in each network layer and the weights between neurons) can be adjusted repeatedly through the backpropagation algorithm until the precision of the multilayer perceptron model reaches the preset precision. In this way, by training a multilayer perceptron model with good precision to serve as the text classification model, automated classification of complaint texts is achieved and the processing efficiency of complaint texts is improved.
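The train-verify-iterate loop of S104 to S106 can be sketched as below. Everything here is an illustrative assumption rather than the patent's implementation: the synthetic data and labels, the network sizes (10 inputs, 8 hidden units, 1 output), the learning rate, and the 80% target precision.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 10))                              # 200 "text space vectors"
y = (X[:, :5].sum(1) > X[:, 5:].sum(1)).astype(int)    # synthetic binary labels
X_train, y_train = X[:140], y[:140]                    # ~70/30 split as in S103
X_test, y_test = X[140:], y[140:]

# One-hidden-layer perceptron: input -> hidden -> output, fully connected.
W1 = rng.normal(0, 0.5, (10, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1));  b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def accuracy(Xs, ys):
    h = sigmoid(Xs @ W1 + b1)
    pred = sigmoid(h @ W2 + b2).ravel() > 0.5
    return (pred == ys).mean()                         # precision check (S105)

lr, target = 0.5, 0.80                                 # 80% preset precision
for epoch in range(500):                               # iterate parameters (S106)
    h = sigmoid(X_train @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    delta_out = (out - y_train[:, None]) * out * (1 - out)   # backpropagated error
    delta_h = delta_out @ W2.T * h * (1 - h)
    W2 -= lr * h.T @ delta_out / len(X_train); b2 -= lr * delta_out.mean(0)
    W1 -= lr * X_train.T @ delta_h / len(X_train); b1 -= lr * delta_h.mean(0)
    if accuracy(X_test, y_test) >= target:             # stop once target is met
        break

print(round(accuracy(X_test, y_test), 3))
```

On this toy, linearly separable data the loop usually stops early; on real complaint vectors, convergence to the target is of course not guaranteed.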
In an alternative embodiment, the step of converting the preprocessed first complaint text corpus into a set of text space vectors comprises:
S1021: calculating the TF-IDF value of each word in the preprocessed first complaint text corpus using a preset TF-IDF algorithm;
S1022: selecting from the preprocessed first complaint text corpus the words whose TF-IDF value exceeds a preset threshold as feature words, and collecting the feature words to generate a feature dictionary;
S1023: converting the preprocessed first complaint text corpus into the set of text space vectors according to the feature dictionary.
In S1021, the TF-IDF value of each word in the first complaint text corpus can be determined by the following formulas (the TF-IDF algorithm):

TF = t / T,  IDF = log(N / n),  TF-IDF = TF × IDF

where TF is the term frequency, IDF is the inverse document frequency, t is the number of times a word occurs in a given complaint text, T is the total number of words in that complaint text, N is the total number of complaint texts in the first complaint text corpus, and n is the number of complaint texts in the first complaint text corpus that contain the word. Specifically, the larger the TF-IDF value of a word, the greater its weight and the more important it is to the corresponding complaint text; therefore, the larger the TF-IDF value of a word, the greater its discriminative power, i.e. the better it characterizes a text category.
In S1022, since a word with a larger TF-IDF value better characterizes a text category, the words whose TF-IDF value exceeds the preset threshold can be selected from the preprocessed first complaint text corpus as feature words, and the feature words are collected to obtain the feature dictionary. This excludes most of the words with low discriminative power in the first complaint text corpus, which both achieves dimensionality reduction (the feature dictionary contains far fewer words than the bag-of-words data used by a bag-of-words model, which speeds up data processing) and helps reduce the number of subsequent model training iterations (the better the quality of the training set, the easier it is to obtain a text classification model with good precision).
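A minimal sketch of S1021 and S1022, computing TF-IDF as (t/T) × log(N/n) in line with the variable definitions in the text, then selecting feature words above a threshold. The three-document corpus and the 0.2 threshold are illustrative assumptions:

```python
import math

# Tiny illustrative corpus of pre-segmented complaint texts.
corpus = [
    ["馒头", "添加剂", "超标", "添加剂"],
    ["服务", "态度", "差"],
    ["添加剂", "违规"],
]
N = len(corpus)  # total number of complaint texts

def tf_idf(word, doc):
    t = doc.count(word)                       # occurrences in this text
    T = len(doc)                              # total words in this text
    n = sum(1 for d in corpus if word in d)   # texts containing the word
    return (t / T) * math.log(N / n)

doc = corpus[0]
scores = {w: tf_idf(w, doc) for w in set(doc)}
threshold = 0.2
feature_words = sorted(w for w, s in scores.items() if s > threshold)
print(feature_words)
```

In a full implementation the scores would be computed over every document and the surviving words merged into the feature dictionary used in S1023.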
In S1023, specifically, each word contained in the preprocessed first complaint text corpus is looked up in the feature dictionary; if a word from the first complaint text corpus is present in the feature dictionary, then, according to the position of that word in the feature dictionary, the corresponding dimension of the space vector is set to 1, and otherwise to 0. For example, suppose a preprocessed complaint text in the first complaint text corpus is: "xxx manufacturer / production / steamed bun / contains / violation / additive / Hangzhou / one / citizen / purchase / eating / after / appearance / food poisoning", and the feature dictionary contains the words: 1. food poisoning, 2. murder, 3. corruption, 4. behaving badly, 5. violation, 6. expired, 7. XX department, 8. additive, 9. steamed bun, 10. complaint. Through this feature dictionary the complaint text can be converted into the text space vector [1 0 0 0 1 0 0 1 1 0]. In this way, according to the feature dictionary, the preprocessed first complaint text corpus can be converted into the set of text space vectors (i.e. the collection of text space vectors).
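The conversion in this example can be reproduced as follows (using the English glosses of the words given in the text):

```python
# Feature dictionary in its stated order; each position becomes one vector dimension.
feature_dict = ["food poisoning", "murder", "corruption", "behaving badly",
                "violation", "expired", "XX department", "additive",
                "steamed bun", "complaint"]

# The segmented example complaint text from the description.
tokens = ["xxx manufacturer", "production", "steamed bun", "contains",
          "violation", "additive", "Hangzhou", "one", "citizen",
          "purchase", "eating", "after", "appearance", "food poisoning"]

# 1 if the dictionary word occurs in the text, else 0 (S1023).
vector = [1 if w in tokens else 0 for w in feature_dict]
print(vector)  # [1, 0, 0, 0, 1, 0, 0, 1, 1, 0]
```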
In an alternative embodiment, the step of extracting the sensitive risk words and phrases in each classified complaint text, scoring each of them in a predetermined manner and obtaining the corresponding risk scores comprises:
S131: extracting, based on a preset sensitive risk dictionary, the sensitive risk words and phrases in each classified complaint text by means of regular expressions, where the sensitive risk dictionary stores sensitive risk words and phrases, the risk levels corresponding to them and the risk score corresponding to each risk level;
S132: querying the sensitive risk dictionary for the risk level corresponding to each extracted sensitive risk word or phrase, and determining the corresponding risk score from that risk level.
In the present embodiment, the sensitive risk dictionary stores sensitive risk vocabularies of different risk levels, and different risk levels correspond to different risk scores, so every sensitive risk word or phrase in the same sensitive risk vocabulary has the same risk score. The risk score is a score assigned to a sensitive risk word or phrase through risk assessment, and can be set reasonably according to expert experience in the complaint domain. For example, the risk levels can be divided into a high risk level, a medium risk level, an ordinary risk level and a low risk level, with corresponding risk scores of 4, 3, 2 and 1 respectively. By setting up the sensitive risk dictionary in this way, the sensitive risk words and phrases contained in each complaint text can be conveniently extracted and scored.
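A hedged sketch of steps S131-S132, assuming a small hypothetical dictionary and the 4/3/2/1 level scores described above:

```python
import re

# Sketch of S131/S132: a hypothetical sensitive risk dictionary mapping
# risk levels to word lists, with one risk score per level.
RISK_DICTIONARY = {
    "high":     ["corruption", "bribery", "food poisoning"],
    "medium":   ["violation", "expired"],
    "ordinary": ["misconduct"],
    "low":      ["complaint"],
}
LEVEL_SCORES = {"high": 4, "medium": 3, "ordinary": 2, "low": 1}

def extract_and_score(text):
    """Return {sensitive word: (risk level, risk score)} found in text."""
    found = {}
    for level, words in RISK_DICTIONARY.items():
        pattern = re.compile("|".join(re.escape(w) for w in words))
        for match in pattern.findall(text):
            found[match] = (level, LEVEL_SCORES[level])
    return found

hits = extract_and_score("the report alleges corruption and an expired permit")
print(hits)  # {'corruption': ('high', 4), 'expired': ('medium', 3)}
```

The lookup goes word → level → score, so adjusting a level's score retunes every word of that level at once, which matches the "same vocabulary, same score" property stated above.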
In an alternative embodiment, before the step of extracting, based on the preset sensitive risk dictionary and by means of regular expressions, the sensitive risk words and phrases in each classified complaint text, the method further comprises:
S1301: obtaining a second complaint text corpus, and, based on a preset sensitive risk dictionary, identifying the sensitive risk words and phrases in the second complaint text corpus by regular expressions and attaching the corresponding risk level labels, wherein the sensitive risk dictionary prestores the above sensitive risk words and phrases, risk levels and risk scores;
S1302: deleting the sensitive risk words and phrases with risk level labels from the second complaint text corpus, and performing text preprocessing on the second complaint text corpus with the sensitive risk words and phrases deleted, to obtain each keyword in the second complaint text corpus;
S1303: collecting the keywords to generate bag-of-words data;
S1304: converting each keyword in the bag-of-words data into a corresponding first feature vector, and converting the sensitive risk words and phrases in the sensitive risk dictionary into corresponding second feature vectors;
S1305: performing similarity matching between the first feature vectors and the second feature vectors using a preset similarity matching algorithm, and determining the first feature vectors that match a second feature vector;
S1306: adding the keywords corresponding to the matched first feature vectors under the corresponding risk level categories in the sensitive risk dictionary, to form the sensitive risk dictionary.
In the above S1301, the second complaint text corpus can be obtained by a data acquisition tool (such as a web crawler). The second complaint text corpus and the above first complaint text corpus can be the same data set or two different data sets. The preset sensitive risk dictionary is a simple dictionary obtained by summarizing early-stage expert experience; the number of sensitive risk words and phrases contained in the sensitive risk vocabulary of each risk level in this dictionary is relatively limited, so it needs to be continuously improved later.
In the above S1302, before the text preprocessing operations of segmenting the second complaint text corpus and removing stop words and punctuation marks, the sensitive risk words and phrases with risk level labels in the second complaint text corpus can first be deleted. This not only improves the accuracy of the subsequent word segmentation, but also reduces the text content of the second complaint text corpus and improves the efficiency of text preprocessing.
In the above S1303, collecting the keywords obtained after text preprocessing means collecting the distinct keywords, so that repeated keywords are not added to the bag-of-words data more than once; the generated bag-of-words data therefore contains only distinct keywords.
In the above S1304, each keyword in the bag-of-words data can be converted into a corresponding first feature vector by means of word embedding (for example, through a Word2vec model), and the sensitive risk words and phrases in the sensitive risk dictionary can likewise be converted into corresponding second feature vectors. What is realized is converting words or sentences of character type into feature vectors of numeric type.
In the above S1305, the preset similarity matching algorithm is the cosine similarity algorithm. After obtaining the first feature vector of each keyword and the second feature vector of each sensitive risk word or phrase, similarity matching can be performed between each first feature vector and each second feature vector. When the similarity between a first feature vector and a second feature vector exceeds a preset similarity threshold, it can be determined that the first feature vector matches the second feature vector, which also indicates that the keyword corresponding to the first feature vector is semantically close to the sensitive risk word or phrase corresponding to the second feature vector.
In the above S1306, when a first feature vector matches a second feature vector in the sensitive risk vocabulary under some risk level category, the keyword corresponding to the first feature vector is semantically close to the sensitive risk word or phrase corresponding to that second feature vector, so the keyword can be added to the sensitive risk vocabulary under that risk level category. In this way, by performing similarity matching of feature vectors, keywords similar to the sensitive risk words and phrases in the sensitive risk dictionary are found, and the similar keywords found are added to the sensitive risk dictionary to supplement and improve it, thereby obtaining a sensitive risk dictionary with a large number of sensitive risk words and phrases.
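The matching and expansion step can be sketched as follows; the 3-dimensional vectors are fabricated stand-ins for real Word2vec embeddings, and all words and the threshold are illustrative assumptions:

```python
import math

# Sketch of S1304-S1306: match keyword embeddings against sensitive-word
# embeddings by cosine similarity and add matches under the same level.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

keyword_vectors = {                 # first feature vectors (fabricated)
    "graft":   [0.9, 0.1, 0.0],
    "weather": [0.0, 0.1, 0.9],
}
sensitive_vectors = {               # second feature vectors, with levels
    "corruption": ("high", [1.0, 0.0, 0.0]),
}
dictionary = {"high": ["corruption"]}
THRESHOLD = 0.8                     # preset similarity threshold

for keyword, kv in keyword_vectors.items():
    for word, (level, sv) in sensitive_vectors.items():
        if cosine(kv, sv) > THRESHOLD:
            dictionary[level].append(keyword)   # semantically close: add it

print(dictionary)  # {'high': ['corruption', 'graft']}
```

Here "graft" points in nearly the same direction as "corruption" and is absorbed into the high-risk vocabulary, while the unrelated "weather" is rejected, which is the supplementation behavior S1306 describes.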
In an alternative embodiment, the step of sorting the classified complaint texts according to preset rules based on their urgency scores comprises:
S151: sorting the classified complaint texts from high to low according to their urgency scores;
S152: determining and labeling the risk level of each complaint text according to a preset score-level relation table and the urgency score corresponding to each complaint text, wherein the score-level relation table prestores multiple urgency score intervals and the risk level corresponding to each urgency score interval.
In the present embodiment, after the urgency score of each complaint text under each text category is calculated, the complaint texts under each text category can be sorted from high to low according to their urgency scores. Meanwhile, the urgency score interval to which each complaint text's urgency score belongs can be determined by querying the preset score-level relation table, and the risk level of each complaint text can then be determined according to the score interval found, so that complaint texts with different urgency scores are labeled with different risk level labels: the higher the urgency score, the higher the risk level. For example, according to expert experience in the complaint domain, the risk levels can be divided into a high risk level, a medium risk level, an ordinary risk level and a low risk level. In this way, a complaint text that reflects a more serious problem both ranks higher and carries a higher risk level. With the risk level labels, staff can intuitively recognize the urgency of each complaint text; even if a large number of complaint texts fall out of order for various reasons, the complaint texts reflecting more serious problems can still be located quickly by their risk level labels, so that staff can handle them in a timely manner.
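A minimal sketch of the sorting and labeling in S151-S152, with illustrative (not prescribed) score intervals:

```python
# Sketch of S151/S152: sort complaints by urgency score and label a risk
# level via a score-level relation table. The interval bounds and the
# example texts/scores are fabricated for illustration.
SCORE_LEVEL_TABLE = [        # (lower bound inclusive, risk level)
    (30, "high risk"),
    (20, "medium risk"),
    (10, "ordinary risk"),
    (0,  "low risk"),
]

def risk_level(score):
    for lower, level in SCORE_LEVEL_TABLE:
        if score >= lower:
            return level
    return "low risk"

complaints = [("text A", 21), ("text B", 5), ("text C", 34)]
ranked = sorted(complaints, key=lambda c: c[1], reverse=True)  # S151
labeled = [(text, score, risk_level(score)) for text, score in ranked]  # S152
print(labeled)
# [('text C', 34, 'high risk'), ('text A', 21, 'medium risk'), ('text B', 5, 'low risk')]
```

Keeping the table sorted by descending lower bound lets the first satisfied interval decide the level, so adding or retuning intervals only means editing the table.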
In an alternative embodiment, after the step of vectorizing each complaint text that has undergone text preprocessing and inputting each vectorized complaint text into the preset text classification model for text classification, the method further comprises:
S12a: obtaining a third complaint text corpus;
S12b: acquiring entities from vertical websites, and collecting the acquired entities to generate an entity dictionary;
S12c: identifying, according to the entity dictionary, the named entities in the third complaint text corpus using regular expressions and a natural language processing tool, and labeling the identified named entities;
S12d: converting the labeled named entities into first word vectors, inputting the first word vectors into a preset Bi-LSTM-CRF model, and training the parameters of the Bi-LSTM-CRF model using back propagation, to obtain a Bi-LSTM-CRF model with optimized parameters;
S12e: obtaining the second word vectors of each classified complaint text through a preset word vector model, and inputting the second word vectors into the Bi-LSTM-CRF model for named entity recognition, to output the named entity recognition result of each complaint text.
In the above S12a, the third complaint text corpus can be obtained by a data acquisition tool (such as a web crawler). The third complaint text corpus and the above first complaint text corpus can be the same data set or two different data sets.
In the above S12b, entities such as person names, place names, organization names, product names, company names and dates can be acquired from vertical websites by a data acquisition tool (such as a web crawler), and the acquired entities can be collected to obtain an entity dictionary.
In the above S12c, after the entity dictionary is obtained, the named entities in the third complaint text corpus (named entities such as person names, place names, organization names, product names, company names and dates) can be identified using regular expressions and a natural language processing tool (such as the Stanford CoreNLP toolkit), and each identified named entity is labeled with its corresponding class label; for example, if an identified named entity is a person name, it is labeled with the person-name class label.
In the above S12d, after a large number of labeled named entities are obtained, the labeled named entities can be converted into first word vectors by a Word2vec model, and the first word vectors are then input into the preset Bi-LSTM-CRF model; the parameters of the Bi-LSTM-CRF model are trained using back propagation, so that a Bi-LSTM-CRF model with optimized parameters is finally obtained, and named entity recognition can subsequently be performed directly with this Bi-LSTM-CRF model.
In the above S12e, the preset word vector model is a Word2vec model, which can be obtained by training on the above third complaint text corpus. In this step, the second word vectors of each classified complaint text can be obtained through the Word2vec model, and the second word vectors are then input into the Bi-LSTM-CRF model for named entity recognition, so that the named entity recognition result of each complaint text is output, i.e., the named entities contained in each complaint text (such as person names, place names, organization names, product names, company names and dates). By outputting the named entity result of each complaint text, staff can roughly recognize the things involved in each complaint text, such as who and which organization are involved, even without checking the text's specific content.
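As a simplified stand-in for the S12c labeling step (plain dictionary lookup instead of the full regular-expression and NLP-tool pipeline; all dictionary entries and the sample sentence are fabricated):

```python
import re

# Sketch of entity labeling against an entity dictionary: each hit is
# annotated with its class label and character position, which is the
# kind of labeled data the Bi-LSTM-CRF training step consumes.
ENTITY_DICTIONARY = {
    "Zhang San":    "PERSON",
    "Hangzhou":     "PLACE",
    "XX committee": "ORGANIZATION",
}

def label_entities(text):
    """Return (entity, class label, start position) for each hit."""
    labels = []
    for entity, tag in ENTITY_DICTIONARY.items():
        for match in re.finditer(re.escape(entity), text):
            labels.append((entity, tag, match.start()))
    return sorted(labels, key=lambda item: item[2])

result = label_entities("Zhang San of Hangzhou complained to the XX committee")
print(result)
# [('Zhang San', 'PERSON', 0), ('Hangzhou', 'PLACE', 13), ('XX committee', 'ORGANIZATION', 40)]
```

Dictionary lookup alone cannot generalize to unseen entities; that is exactly the gap the trained Bi-LSTM-CRF model closes in S12d/S12e.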
Referring to Fig. 2, an embodiment of the present application also proposes a complaint text processing apparatus, comprising:
a preprocessing module 11, for receiving multiple complaint texts to be processed and performing text preprocessing on each complaint text, wherein the text preprocessing includes word segmentation, stop word removal and punctuation removal;
a classification module 12, for vectorizing each complaint text after text preprocessing and inputting each vectorized complaint text into a preset text classification model for text classification;
an extraction module 13, for extracting the sensitive risk words and phrases in each classified complaint text, and scoring each sensitive risk word or phrase in a preset manner to obtain its corresponding risk score;
a computing module 14, for calculating the urgency score corresponding to each classified complaint text according to the sensitive risk words and phrases and their corresponding risk scores;
a sorting module 15, for sorting the classified complaint texts according to preset rules based on their urgency scores.
In the above preprocessing module 11, specifically, the preprocessing module 11 receives the multiple complaint texts to be processed that are input by staff, then segments the complaint texts using an existing Chinese word segmentation tool (such as the jieba segmentation tool), and then deletes the words without actual meaning (such as modal particles) and the punctuation marks from the segmented complaint texts using an existing Chinese stopword list (such as the Baidu stopword list), for subsequent operation.
In the above classification module 12, since a computer can only calculate on numeric data while the preprocessed complaint texts are of character type and cannot be calculated directly, the preprocessed complaint texts need to be vectorized. Specifically, the classification module 12 can vectorize each preprocessed complaint text through a bag-of-words model to obtain the text space vector corresponding to each complaint text, and then input the obtained text space vectors into the preset text classification model to output the text category of each complaint text, thereby realizing automatic classification of the complaint texts, wherein the text classification model is a specially trained neural network model whose role is to classify complaint texts into different text categories.
In the above extraction module 13, generally, the problems contained in a complaint text can usually be embodied by some specific words or sentences. For example, a complaint text of the food safety category generally contains keywords such as "expired", "food safety standard", "food poisoning" and "exceeding the limit"; a complaint text of the government agency category generally contains keywords such as "corruption", "bribery" and "misconduct", or specific sentences. The seriousness of the problems reflected by different keywords or specific sentences also varies: for a complaint text of the government agency category, if the text contains keywords such as "corruption" or "bribery", it is very likely that the problem it describes is serious, whereas if the text contains keywords such as "misconduct" or "verbal abuse", the problem it describes is relatively lighter. Therefore, these specific words or sentences can be taken as sensitive risk words and phrases, and the severity of the problem reflected by a complaint text can be analyzed through the sensitive risk words and phrases it contains. Specifically, the extraction module 13 can extract the sensitive risk words and phrases (i.e., sensitive risk words and sensitive risk sentence patterns) contained in each classified complaint text by constructing a sensitive vocabulary commonly used in the complaint domain, and then score the extracted sensitive risk words and phrases in a preset manner; for example, the risk score of each sensitive risk word or phrase in the constructed sensitive vocabulary can be predefined, so that the risk score of each sensitive risk word or phrase is obtained at the same time as it is extracted.
In the above computing module 14, the urgency score of each complaint text can be calculated by the following formula:
Y = Σi mi × xi
wherein Y is the urgency score of the complaint text, mi is the risk score corresponding to the i-th sensitive risk word or phrase in the complaint text, and xi is the word frequency of the i-th sensitive risk word or phrase in the complaint text.
For example, suppose the sensitive risk words and phrases extracted by the extraction module 13 from a complaint text are "corruption", "bribery", "XX department", "xxx expresses strong dissatisfaction with xxx", "complaint" and "XX committee", wherein the word frequency of the sensitive risk word "corruption" in the complaint text is 2 (it occurs twice in the complaint text) and the word frequency of each remaining sensitive risk word or phrase is 1, and the risk scores predefined in the sensitive vocabulary for these sensitive words and phrases are, in turn: 4, 4, 2, 3, 1, 3. The computing module 14 then calculates, by the above formula, the urgency score of the complaint text as Y = 4×2 + 4×1 + 2×1 + 3×1 + 1×1 + 3×1 = 21. The higher the urgency score, the more serious the problem reflected by the complaint text.
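The worked example above can be reproduced directly; the urgency score is a weighted sum of word frequencies by risk score:

```python
# Sketch of the urgency score computation Y = sum(m_i * x_i), using the
# risk scores and word frequencies from the worked example in the text.
risk_scores = [4, 4, 2, 3, 1, 3]   # m_i per sensitive risk word/phrase
word_freqs  = [2, 1, 1, 1, 1, 1]   # x_i: occurrences in the complaint

def urgency_score(scores, freqs):
    return sum(m * x for m, x in zip(scores, freqs))

print(urgency_score(risk_scores, word_freqs))  # 21
```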
In the above sorting module 15, after the computing module 14 has calculated the urgency score of each complaint text under each text category, the sorting module 15 can sort the complaint texts according to preset rules, for example sorting the complaint texts under each text category from high to low according to their urgency scores. In this way, a complaint text that reflects a more serious problem ranks higher, which helps staff handle the complaint texts reflecting more serious problems in a timely manner.
In the present embodiment, the complaint text processing apparatus preprocesses and vectorizes the complaint texts to be processed, so that the unstructured complaint texts are represented by structured text space vectors, and then uses the preset text classification model to automatically classify the multiple complaint texts to be processed. After the classification is completed, the urgency of each complaint text is analyzed by extracting the sensitive risk words and phrases that reflect problem severity, and the complaint texts under each text category are then scored and sorted according to their urgency, so that complaint texts reflecting more serious problems rank higher. By performing targeted automatic classification and sorting on unstructured complaint texts, the processing efficiency of complaint texts is effectively improved, staff can handle the complaint texts reflecting more serious problems in time, and the risk of adverse consequences is reduced.
Referring to Fig. 4, in an alternative embodiment, the text classification model is a multilayer perceptron model trained to a preset precision, and the above complaint text processing apparatus further comprises:
a first obtaining module 101, for obtaining a first complaint text corpus with class labels, and performing text preprocessing on the first complaint text corpus;
a first conversion module 102, for converting the preprocessed first complaint text corpus into a text space vector set;
a selection module 103, for randomly selecting a specified number of text space vectors from the text space vector set as a training set, the remaining text space vectors serving as a test set;
a training module 104, for inputting the training set into the multilayer perceptron model for training;
a verification module 105, for inputting the test set into the trained multilayer perceptron model for verification, to judge whether the precision of the multilayer perceptron model reaches the preset precision;
a parameter adjustment module 106, for iteratively adjusting the parameters in the multilayer perceptron model when the precision of the multilayer perceptron model does not reach the preset precision, until the precision of the multilayer perceptron model reaches the preset precision.
In the above first obtaining module 101, the first obtaining module 101 can obtain the first complaint text corpus by a data acquisition tool (such as a web crawler). The first complaint text corpus is a set of complaint texts with different class labels, such as complaint texts of the food safety category, complaint texts of the government agency category and complaint texts of the consumer service category; the specific categories and quantities of the complaint texts depend on actual use demands and are not specifically limited here. After the first obtaining module 101 obtains the first complaint text corpus, it can further perform text preprocessing operations such as word segmentation, stop word removal and punctuation removal on each complaint text in the first complaint text corpus, for subsequent operation.
In the above first conversion module 102, the first conversion module 102 can convert the preprocessed first complaint text corpus into a text space vector set (i.e., a set of text space vectors) through a bag-of-words model.
In the above selection module 103, specifically, the selection module 103 can randomly select 70% of the text space vectors under each label category as the training set, with the remaining 30% of the text space vectors serving as the test set.
In the above training module 104, the multilayer perceptron model includes an input layer, hidden layers and an output layer. The layers are mostly fully interconnected, with no interconnection between units in the same layer. The connection weights of each layer can be adjusted through training, and the model can learn a large number of pattern mapping relationships, describing the mapping between input and output without any prior knowledge of the mathematical function and thereby mapping input patterns to the desired output patterns. Model training only needs to be performed on known patterns; through learning, the model acquires the ability to map this highly nonlinear relationship. The number of neurons in the output layer can be designed according to the actual classification demand; for example, if five-way classification is needed, the output layer can be designed with 5 neurons.
In the above verification module 105, after the training module 104 has trained the multilayer perceptron model with the training set, the verification module 105 can further verify the precision of the multilayer perceptron model using the test set. Specifically, the verification module 105 inputs the test set into the hidden layers of the multilayer perceptron model through the input layer; after processing by the hidden layers, the text classification result is output at the output layer. By calculating the percentage of correctly classified complaint texts out of the total number of complaint texts in the test set, the precision of the multilayer perceptron model is obtained, and it is then verified whether the obtained precision reaches the preset precision (e.g., 80%).
In the above parameter adjustment module 106, if the precision of the multilayer perceptron model reaches the preset precision, the model can be used directly for subsequent text classification; if the precision of the multilayer perceptron model does not reach the preset precision, the parameter adjustment module 106 can repeatedly adjust the parameters in the multilayer perceptron model (such as the thresholds in each network layer and the weights between neurons) through the back propagation algorithm, until the precision of the multilayer perceptron model reaches the preset precision. In this way, by training a multilayer perceptron model with good accuracy as the text classification model, automatic classification of complaint texts is realized and the processing efficiency of complaint texts is improved.
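The 70/30 split and precision check around the model can be sketched as follows; `classify` is a toy stand-in for the trained multilayer perceptron, not the patented network, and the data set is fabricated:

```python
import random

# Sketch of the train/test workflow: 70/30 random split, then
# precision = correctly classified / total, compared against a preset
# precision threshold such as 80%.
random.seed(0)
dataset = [([i % 2, 1 - i % 2], i % 2) for i in range(10)]  # (vector, label)

random.shuffle(dataset)
split = int(len(dataset) * 0.7)
train_set, test_set = dataset[:split], dataset[split:]

def classify(vector):
    return 1 if vector[0] == 1 else 0   # toy stand-in classifier

correct = sum(1 for vec, label in test_set if classify(vec) == label)
precision = correct / len(test_set)
PRESET_PRECISION = 0.8
print(len(train_set), len(test_set), precision >= PRESET_PRECISION)  # 7 3 True
```

If `precision >= PRESET_PRECISION` fails, the parameter-adjustment loop described above retrains (here one would adjust weights via back propagation) and re-evaluates until the threshold is met.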
Referring to Fig. 2 and Fig. 3, in an alternative embodiment, the above first conversion module 102 comprises:
a computing unit 1021, for calculating the TF-IDF value of each word in the preprocessed first complaint text corpus using a preset TF-IDF algorithm;
a selection unit 1022, for selecting from the preprocessed first complaint text corpus the words whose TF-IDF values exceed a preset threshold as feature words, and collecting the feature words to generate a feature dictionary;
a converting unit 1023, for converting the preprocessed first complaint text corpus into a text space vector set according to the feature dictionary.
In the above computing unit 1021, the computing unit 1021 can determine the TF-IDF value of each word in the first complaint text corpus by the following formulas (i.e., the TF-IDF algorithm):
TF = t / T, IDF = log(N / n), TF-IDF = TF × IDF
wherein TF is the term frequency, IDF is the inverse document frequency, t is the number of times a word occurs in a complaint text, T is the total number of words in that complaint text, N is the total number of complaint texts in the first complaint text corpus, and n is the number of complaint texts in the first complaint text corpus that contain the word. Specifically, the larger the TF-IDF value of a word, the larger the weight of the word and the more important the word is to the corresponding complaint text; therefore, the larger a word's TF-IDF value, the stronger its discriminating power, i.e., the better it characterizes a text category.
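A sketch of the calculation, assuming the natural logarithm (the embodiment does not fix the logarithm base) and illustrative counts:

```python
import math

# Sketch of the TF-IDF calculation described above:
# TF = t / T, IDF = log(N / n), TF-IDF = TF * IDF.
def tf_idf(t, T, N, n):
    return (t / T) * math.log(N / n)

# A word occurring 3 times in a 100-word complaint text, present in 10
# of the 1000 complaint texts in the corpus (illustrative numbers):
value = tf_idf(t=3, T=100, N=1000, n=10)
print(round(value, 4))  # 0.1382
```

A word that appears in nearly every complaint text drives `N / n` toward 1 and its IDF toward 0, which is why high-TF-IDF words are the discriminative ones worth keeping as feature words.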
In the above selection unit 1022, since a word with a larger TF-IDF value is more capable of characterizing a text category, the selection unit 1022 can select the words whose TF-IDF values exceed the preset threshold from the preprocessed first complaint text corpus as feature words, and then collect the feature words to obtain a feature dictionary, thereby excluding the many words in the first complaint text corpus that have little discriminating power. This both achieves dimensionality reduction (the feature dictionary contains far fewer words than the bag-of-words data used by the bag-of-words model, which speeds up data processing) and reduces the amount of subsequent model training (the better the quality of the training set, the easier it is to obtain a text classification model with good accuracy).
In the above converting unit 1023, specifically, the converting unit 1023 looks up each word contained in the preprocessed first complaint text corpus in the feature dictionary; if the word is present in the feature dictionary, the dimension of the space vector corresponding to that word's position in the feature dictionary is set to 1, and otherwise it is set to 0. For example, suppose a preprocessed complaint text in the first complaint text corpus is: "xxx manufacturer / produced / steamed bun / contains / violation / additive / Hangzhou / a / citizen / purchase / eating / after / suffered / food poisoning", and the feature dictionary contains the words: 1. food poisoning, 2. murder, 3. corruption, 4. misconduct, 5. violation, 6. expired, 7. xx department, 8. additive, 9. steamed bun, 10. complaint. Using this feature dictionary, the complaint text can be converted into the text space vector [1 0 0 0 1 0 0 1 1 0], so that the converting unit 1023 can, according to the feature dictionary, convert the preprocessed first complaint text corpus into a text space vector set (i.e., a set of text space vectors).
Referring to Fig. 2 and Fig. 5, in an alternative embodiment, the above extraction module 13 comprises:
an extraction unit 131, for extracting, based on a preset sensitive risk dictionary and by means of regular expressions, the sensitive risk words and phrases in each classified complaint text, wherein the sensitive risk dictionary prestores sensitive risk words and phrases, risk levels corresponding to the sensitive risk words and phrases, and risk scores corresponding to each risk level;
a determination unit 132, for querying the sensitive risk dictionary for the risk level corresponding to each extracted sensitive risk word or phrase, and determining the corresponding risk score according to the risk level.
In the present embodiment, the sensitive risk dictionary stores sensitive risk vocabularies of different risk levels, and different risk levels correspond to different risk scores, so every sensitive risk word or phrase in the same sensitive risk vocabulary has the same risk score. The risk score is a score assigned to a sensitive risk word or phrase through risk assessment, and can be set reasonably according to expert experience in the complaint domain. For example, the risk levels can be divided into a high risk level, a medium risk level, an ordinary risk level and a low risk level, with corresponding risk scores of 4, 3, 2 and 1 respectively. By setting up the sensitive risk dictionary in this way, the sensitive risk words and phrases contained in each complaint text can be conveniently extracted and scored.
Referring to Fig. 4, in an alternative embodiment, the above complaint text processing apparatus further comprises:
a first identification module 1301, for obtaining a second complaint text corpus, and, based on a preset sensitive risk dictionary, identifying the sensitive risk words and phrases in the second complaint text corpus by regular expressions and attaching the corresponding risk level labels, wherein the sensitive risk dictionary prestores the above sensitive risk words and phrases, risk levels and risk scores;
a deletion module 1302, for deleting the sensitive risk words and phrases with risk level labels from the second complaint text corpus, and performing text preprocessing on the second complaint text corpus with the sensitive risk words and phrases deleted, to obtain each keyword in the second complaint text corpus;
a summarizing module 1303, for collecting the keywords to generate bag-of-words data;
a conversion module 1304, for converting each keyword in the bag-of-words data into a corresponding first feature vector, and converting the sensitive risk words and phrases in the sensitive risk dictionary into corresponding second feature vectors;
a matching module 1305, for performing similarity matching between the first feature vectors and the second feature vectors using a preset similarity matching algorithm, and determining the first feature vectors that match a second feature vector;
an adding module, for adding the keywords corresponding to the matched first feature vectors under the corresponding risk level categories in the sensitive risk dictionary, to form the sensitive risk dictionary.
For the first identification module 1301: the second complaint text corpus can be obtained through a data acquisition tool (such as a web crawler); this corpus may be the same data set as the above-mentioned first complaint text corpus, or a different one. The preset sensitive risk dictionary is a simple dictionary compiled in advance from expert experience; the vocabulary of each risk level in it contains a relatively limited number of sensitive risk words and phrases, so the dictionary needs to be continuously enriched afterwards.
For the removing module 1302: before the text preprocessing operations on the second complaint text corpus, such as word segmentation, stop-word removal and punctuation removal, the removing module 1302 first deletes the sensitive risk words and phrases carrying risk level labels. This not only improves the accuracy of the subsequent word segmentation, but also reduces the amount of text in the second complaint text corpus and thus improves the efficiency of text preprocessing.
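A minimal sketch of this preprocessing order (the sensitive word list, stop-word list and whitespace-style tokenizer are simplifications; Chinese complaint text would need a real segmenter such as jieba rather than the regex used here):

```python
import re

def preprocess(text, sensitive_words, stop_words):
    """Delete labelled sensitive risk words first, then tokenize and drop stop words/punctuation."""
    # Step 1: delete the labelled sensitive risk words before segmentation.
    for w in sensitive_words:
        text = text.replace(w, " ")
    # Step 2: strip punctuation and split into tokens (a stand-in for real word segmentation).
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    # Step 3: remove stop words, leaving the keywords.
    return [t for t in tokens if t not in stop_words]
```

Deleting the sensitive words before tokenization, as the module description requires, keeps multi-word sensitive phrases from being split apart by the segmenter.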
For the summarizing module 1303: aggregating the keywords obtained after text preprocessing means aggregating the distinct keywords, so that repeated keywords are not added to the bag-of-words data more than once; the bag-of-words data generated by the summarizing module 1303 therefore contains only distinct keywords.
For the conversion module 1304: the conversion module 1304 can use word embeddings (for example a Word2vec model) to convert each keyword in the bag-of-words data into a corresponding first feature vector, and to convert the sensitive risk words and phrases in the sensitive risk dictionary into corresponding second feature vectors, thereby turning character-type words and phrases into numeric feature vectors.
For the matching module 1305: the preset similarity matching algorithm is the cosine similarity algorithm. After the conversion module 1304 has produced the first feature vector of each keyword and the second feature vector of each sensitive risk word or phrase, the matching module 1305 performs similarity matching between each first feature vector and each second feature vector. When the similarity between a first feature vector and a second feature vector exceeds a preset similarity threshold, the matching module 1305 determines that the two match, which indicates that the keyword corresponding to the first feature vector is semantically close to the sensitive risk word or phrase corresponding to the second feature vector.
For the adding module 1306: when a first feature vector matches a second feature vector belonging to the vocabulary of some risk level category, the keyword corresponding to that first feature vector is semantically close to the sensitive risk word or phrase corresponding to that second feature vector, so the adding module 1306 adds the keyword to the sensitive risk vocabulary under that risk level category. In this way, similarity matching of feature vectors finds keywords similar to the sensitive risk words and phrases already in the dictionary, and the keywords found are added to the dictionary as a supplement, yielding a sensitive risk dictionary with a large number of sensitive risk words and phrases.
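The matching and adding steps can be sketched with a plain cosine similarity over toy vectors (the example embeddings and the 0.9 threshold are hypothetical; in practice the vectors would come from a trained Word2vec model and the threshold would be tuned):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def expand_dictionary(keyword_vecs, risk_vecs, threshold=0.9):
    """Add keywords whose first feature vector matches a sensitive risk word's second feature vector.

    keyword_vecs: {keyword: first feature vector}
    risk_vecs:    {(risk_level, sensitive_word): second feature vector}
    Returns {risk_level: [keywords to add under that level]}.
    """
    added = {}
    for kw, kv in keyword_vecs.items():
        for (level, _word), rv in risk_vecs.items():
            if cosine(kv, rv) > threshold:
                added.setdefault(level, []).append(kw)
                break  # one matching sensitive word is enough to place the keyword
    return added
```

Keywords that clear the threshold are filed under the matched word's risk level, which is exactly the dictionary-enrichment step the adding module 1306 performs.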
Referring to Fig. 2 and Fig. 6, in an alternative embodiment, the above-mentioned sorting module 15 comprises:
Sequencing unit 151, configured to sort the classified complaint texts from high to low by urgency score;
Marking unit 152, configured to determine and label the risk level of each complaint text according to a preset score-to-level relationship table and the urgency score of each complaint text, wherein the score-to-level relationship table prestores a plurality of urgency score intervals and the risk level corresponding to each urgency score interval.
In the present embodiment, once the computing module 14 has calculated the urgency score of each complaint text under each text category, the sequencing unit 151 sorts the complaint texts under each category from high to low by urgency score. Meanwhile, the marking unit 152 looks up the preset score-to-level relationship table to determine the urgency score interval to which each complaint text's score belongs, and from that interval determines the risk level of the complaint text, so that complaint texts with different urgency scores are marked with different risk level labels: the higher the urgency score, the higher the risk level. For example, following expert experience in the complaint field, the risk levels may be divided into a high-risk level, a medium-risk level, an average-risk level and a low-risk level, and texts ranked nearer the front carry higher risk levels. With these risk level labels, staff can intuitively see the urgency of each complaint text; even when a large number of complaint texts arrive for various reasons, the texts reflecting more serious problems can be quickly located by their risk level labels and handled in time.
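A sketch of the score-to-level table lookup combined with the sorting step (the interval boundaries and level names below are hypothetical placeholders for the preset score-to-level relationship table):

```python
import bisect

# Hypothetical score-to-level table: scores in [0, 5) are low risk,
# [5, 10) average risk, [10, 20) medium risk, and 20+ high risk.
BOUNDARIES = [5, 10, 20]
LEVELS = ["low", "average", "medium", "high"]

def risk_level(urgency_score):
    """Map an urgency score to a risk level via its score interval."""
    return LEVELS[bisect.bisect_right(BOUNDARIES, urgency_score)]

def rank_and_label(texts_with_scores):
    """Sort (text, score) pairs by descending urgency score and attach risk level labels."""
    ranked = sorted(texts_with_scores, key=lambda p: p[1], reverse=True)
    return [(text, score, risk_level(score)) for text, score in ranked]
```

Higher-scoring texts sort to the front and receive higher risk levels, matching the behaviour of the sequencing unit 151 and the marking unit 152.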
Referring to Fig. 4, in an alternative embodiment, the above-mentioned complaint text processing apparatus further includes:
Second obtaining module 12a, configured to obtain a third complaint text corpus;
Generation module 12b, configured to collect entities from vertical websites and aggregate the collected entities to generate an entity dictionary;
Second identification module 12c, configured to identify, according to the entity dictionary and using regular expressions and a natural language processing tool, the named entities in the third complaint text corpus, and to annotate the identified named entities;
Second conversion module 12d, configured to convert the annotated named entities into first word vectors, input the first word vectors into a preset Bi-LSTM-CRF model, and train the parameters of the Bi-LSTM-CRF model by back propagation to obtain a Bi-LSTM-CRF model with optimal parameters;
Named entity recognition module 12e, configured to obtain the second word vectors of each classified complaint text through a preset word vector model, input the second word vectors into the Bi-LSTM-CRF model for named entity recognition, and output the named entity recognition result of each complaint text.
For the second obtaining module 12a: the third complaint text corpus can be obtained through a data acquisition tool (such as a web crawler); it may be the same data set as the above-mentioned first complaint text corpus, or a different one.
For the generation module 12b: the generation module 12b can use a data acquisition tool (such as a web crawler) to collect entities such as person names, place names, organization names, product names, company names and dates from vertical websites, and aggregate the collected entities into an entity dictionary.
For the second identification module 12c: after the entity dictionary has been obtained by the generation module 12b, the second identification module 12c can use regular expressions and a natural language processing tool (such as Stanford CoreNLP) to identify the named entities in the third complaint text corpus (person names, place names, organization names, product names, company names, dates, and so on), and mark each identified named entity with the corresponding category label; for example, a named entity identified as a person name is labelled as a person name.
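A minimal sketch of the dictionary-driven, regex-based part of this labelling (the dictionary entries are hypothetical; a production system would combine this with an NLP tool such as Stanford CoreNLP, as described above):

```python
import re

# Hypothetical entity dictionary: entity string -> category label.
ENTITY_DICT = {
    "Acme Corp": "company",
    "Beijing": "place",
    "Alice": "person",
}

def label_entities(text):
    """Find dictionary entities in the text via a regex and return (entity, label) pairs."""
    # Longest entries first so longer entities win over their substrings.
    pattern = "|".join(re.escape(e) for e in sorted(ENTITY_DICT, key=len, reverse=True))
    return [(m.group(0), ENTITY_DICT[m.group(0)]) for m in re.finditer(pattern, text)]
```

Each match carries its category label, which is the annotation the second identification module 12c attaches before the Bi-LSTM-CRF training step.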
For the second conversion module 12d: after a large number of named entities have been annotated by the second identification module 12c, the second conversion module 12d can convert the annotated named entities into first word vectors through a Word2vec model, input the first word vectors into the preset Bi-LSTM-CRF model, and train the parameters of the Bi-LSTM-CRF model by back propagation, finally obtaining a Bi-LSTM-CRF model with optimal parameters that can subsequently be used directly for named entity recognition.
For the named entity recognition module 12e: the preset word vector model is a Word2vec model, which can be trained on the above-mentioned third complaint text corpus. In this step, the named entity recognition module 12e obtains the second word vectors of each classified complaint text through the Word2vec model, inputs the second word vectors into the Bi-LSTM-CRF model for named entity recognition, and outputs the named entity recognition result of each complaint text, i.e. the named entities contained in the text (person names, place names, organization names, product names, company names, dates, and so on). By outputting each complaint text's named entity results, staff can get a rough idea of what each complaint text involves, such as which person or which organization, without reading its full content.
Referring to Fig. 7, an embodiment of the present application also provides a computer device, which may be a server and whose internal structure may be as shown in Fig. 7. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device provides computing and control capability. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device stores data such as the complaint text processing method program. The network interface of the computer device communicates with external terminals through a network connection. When executed by the processor, the computer program implements the complaint text processing method of any of the above embodiments.
An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the complaint text processing method of any of the above embodiments.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
It should be noted that, in this document, the terms "include" and "comprise" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, device, article or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article or method. In the absence of further restrictions, an element qualified by the phrase "including a ..." does not exclude the existence of other identical elements in the process, device, article or method that includes that element.
The above description is only a preferred embodiment of the present invention and does not limit the scope of the invention; any equivalent structure or equivalent process transformation made using the description and drawings of the present invention, applied directly or indirectly in other related technical fields, is likewise included within the protection scope of the present invention.

Claims (10)

1. A complaint text processing method, characterized by comprising:
receiving a plurality of complaint texts to be processed, and performing text preprocessing on each complaint text, wherein the text preprocessing includes word segmentation, stop-word removal and punctuation removal;
vectorizing each complaint text after text preprocessing, and inputting each vectorized complaint text into a preset text classification model for text classification;
extracting the sensitive risk words and phrases in each classified complaint text, scoring each sensitive risk word or phrase in a predetermined manner, and obtaining the risk score corresponding to each sensitive risk word or phrase;
calculating, according to the sensitive risk words and phrases and their corresponding risk scores, the urgency score corresponding to each classified complaint text;
sorting the classified complaint texts according to preset rules based on the urgency scores.
2. The complaint text processing method according to claim 1, characterized in that the text classification model is a multilayer perceptron model trained to a preset precision, and before the step of receiving the plurality of complaint texts to be processed and performing text preprocessing on each complaint text, the method further comprises:
obtaining a first complaint text corpus with category labels, and performing the text preprocessing on the first complaint text corpus;
converting the preprocessed first complaint text corpus into a text space vector set;
randomly selecting a specified number of text space vectors from the text space vector set as a training set, with the remaining text space vectors as a test set;
inputting the training set into the multilayer perceptron model for training;
inputting the test set into the trained multilayer perceptron model for verification, to judge whether the precision of the multilayer perceptron model reaches the preset precision;
if the precision of the multilayer perceptron model does not reach the preset precision, iterating the parameters of the multilayer perceptron model until its precision reaches the preset precision.
3. The complaint text processing method according to claim 2, characterized in that the step of converting the preprocessed first complaint text corpus into a text space vector set comprises:
calculating the TF-IDF value of each word in the preprocessed first complaint text corpus using a preset TF-IDF algorithm;
selecting from the preprocessed first complaint text corpus the words whose TF-IDF value is greater than a preset threshold as feature words, and aggregating the feature words to generate a feature dictionary;
converting the preprocessed first complaint text corpus into the text space vector set according to the feature dictionary.
4. The complaint text processing method according to claim 1, characterized in that the step of extracting the sensitive risk words and phrases in each classified complaint text, scoring each sensitive risk word or phrase in a predetermined manner, and obtaining the risk score corresponding to each sensitive risk word or phrase comprises:
extracting, based on a preset sensitive risk dictionary, the sensitive risk words and phrases in each classified complaint text by regular expression, wherein the sensitive risk dictionary prestores sensitive risk words and phrases, risk levels corresponding to the sensitive risk words and phrases, and risk scores corresponding to the risk levels;
querying the sensitive risk dictionary for the risk level corresponding to each extracted sensitive risk word or phrase, and determining the corresponding risk score according to the risk level.
5. The complaint text processing method according to claim 4, characterized in that before the step of extracting, based on the preset sensitive risk dictionary, the sensitive risk words and phrases in each classified complaint text by regular expression, the method further comprises:
obtaining a second complaint text corpus and, based on the preset sensitive risk dictionary, identifying by regular expression the sensitive risk words and phrases in the second complaint text corpus and attaching corresponding risk level labels, wherein the sensitive risk dictionary prestores the sensitive risk words and phrases, the risk levels and the risk scores;
deleting from the second complaint text corpus the sensitive risk words and phrases carrying risk level labels, and performing the text preprocessing on the second complaint text corpus from which the sensitive risk words and phrases have been deleted, obtaining the keywords in the second complaint text corpus;
aggregating the keywords to generate bag-of-words data;
converting each keyword in the bag-of-words data into a corresponding first feature vector, and converting the sensitive risk words and phrases in the sensitive risk dictionary into corresponding second feature vectors;
performing similarity matching between the first feature vectors and the second feature vectors using a preset similarity matching algorithm, and determining the first feature vectors that match a second feature vector;
adding the keywords corresponding to the matched first feature vectors under the corresponding risk level categories in the sensitive risk dictionary, thereby forming the sensitive risk dictionary.
6. The complaint text processing method according to claim 1, characterized in that the step of sorting the classified complaint texts according to preset rules based on the urgency scores comprises:
sorting the classified complaint texts from high to low by urgency score;
determining and labelling the risk level of each complaint text according to a preset score-to-level relationship table and the urgency score of each complaint text, wherein the score-to-level relationship table prestores a plurality of urgency score intervals and the risk level corresponding to each urgency score interval.
7. The complaint text processing method according to any one of claims 1 to 6, characterized in that after the step of vectorizing each complaint text after text preprocessing and inputting each vectorized complaint text into the preset text classification model for text classification, the method further comprises:
obtaining a third complaint text corpus;
collecting entities from vertical websites and aggregating the collected entities to generate an entity dictionary;
identifying, according to the entity dictionary and using regular expressions and a natural language processing tool, the named entities in the third complaint text corpus, and annotating the identified named entities;
converting the annotated named entities into first word vectors, inputting the first word vectors into a preset Bi-LSTM-CRF model, and training the parameters of the Bi-LSTM-CRF model by back propagation to obtain the Bi-LSTM-CRF model with optimal parameters;
obtaining the second word vectors of each classified complaint text through a preset word vector model, inputting the second word vectors into the Bi-LSTM-CRF model for named entity recognition, and outputting the named entity recognition result of each complaint text.
8. A complaint text processing apparatus, characterized by comprising:
a preprocessing module, configured to receive a plurality of complaint texts to be processed and perform text preprocessing on each complaint text, wherein the text preprocessing includes word segmentation, stop-word removal and punctuation removal;
a categorization module, configured to vectorize each complaint text after text preprocessing and input each vectorized complaint text into a preset text classification model for text classification;
an extraction module, configured to extract the sensitive risk words and phrases in each classified complaint text, score each sensitive risk word or phrase in a predetermined manner, and obtain the risk score corresponding to each sensitive risk word or phrase;
a computing module, configured to calculate, according to the sensitive risk words and phrases and their corresponding risk scores, the urgency score corresponding to each classified complaint text;
a sorting module, configured to sort the classified complaint texts according to preset rules based on the urgency scores.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the complaint text processing method of any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the complaint text processing method of any one of claims 1 to 7.
CN201910528626.2A 2019-06-18 2019-06-18 Complain text handling method, device, computer equipment and storage medium Pending CN110377731A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910528626.2A CN110377731A (en) 2019-06-18 2019-06-18 Complain text handling method, device, computer equipment and storage medium


Publications (1)

Publication Number Publication Date
CN110377731A true CN110377731A (en) 2019-10-25

Family

ID=68249302

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910528626.2A Pending CN110377731A (en) 2019-06-18 2019-06-18 Complain text handling method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110377731A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107515877A (en) * 2016-06-16 2017-12-26 百度在线网络技术(北京)有限公司 The generation method and device of sensitive theme word set
CN107908614A (en) * 2017-10-12 2018-04-13 北京知道未来信息技术有限公司 A kind of name entity recognition method based on Bi LSTM
CN108108352A (en) * 2017-12-18 2018-06-01 广东广业开元科技有限公司 A kind of enterprise's complaint risk method for early warning based on machine learning Text Mining Technology
CN108269115A (en) * 2016-12-30 2018-07-10 北京国双科技有限公司 A kind of advertisement safety evaluation method and system
CN108710651A (en) * 2018-05-08 2018-10-26 华南理工大学 A kind of large scale customer complaint data automatic classification method
CN109670837A (en) * 2018-11-30 2019-04-23 平安科技(深圳)有限公司 Recognition methods, device, computer equipment and the storage medium of bond default risk


Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110880142A (en) * 2019-11-22 2020-03-13 深圳前海微众银行股份有限公司 Risk entity acquisition method and device
CN110880142B (en) * 2019-11-22 2024-01-19 深圳前海微众银行股份有限公司 Risk entity acquisition method and device
CN111177386A (en) * 2019-12-27 2020-05-19 安徽商信政通信息技术股份有限公司 Proposal classification method and system
CN111177386B (en) * 2019-12-27 2021-05-14 安徽商信政通信息技术股份有限公司 Proposal classification method and system
CN111522950B (en) * 2020-04-26 2023-06-27 成都思维世纪科技有限责任公司 Rapid identification system for unstructured massive text sensitive data
CN111522950A (en) * 2020-04-26 2020-08-11 成都思维世纪科技有限责任公司 Rapid identification system for unstructured massive text sensitive data
CN112182207A (en) * 2020-09-16 2021-01-05 神州数码信息系统有限公司 Invoice false-proof risk assessment method based on keyword extraction and rapid text classification
CN112182207B (en) * 2020-09-16 2023-07-11 神州数码信息系统有限公司 Invoice virtual offset risk assessment method based on keyword extraction and rapid text classification
CN112307770A (en) * 2020-10-13 2021-02-02 深圳前海微众银行股份有限公司 Sensitive information detection method and device, electronic equipment and storage medium
CN112597752B (en) * 2020-12-18 2023-09-19 平安银行股份有限公司 Complaint text processing method and device, electronic equipment and storage medium
CN112597752A (en) * 2020-12-18 2021-04-02 平安银行股份有限公司 Complaint text processing method and device, electronic equipment and storage medium
CN112631888A (en) * 2020-12-30 2021-04-09 航天信息股份有限公司 Fault prediction method and device of distributed system, storage medium and electronic equipment
CN112860876A (en) * 2021-03-31 2021-05-28 中国工商银行股份有限公司 Session auxiliary processing method and device
CN115442832A (en) * 2021-06-03 2022-12-06 中国移动通信集团四川有限公司 Complaint problem positioning method and device and electronic equipment
CN115442832B (en) * 2021-06-03 2024-04-09 中国移动通信集团四川有限公司 Complaint problem positioning method and device and electronic equipment
CN113705200A (en) * 2021-08-31 2021-11-26 中国平安财产保险股份有限公司 Method, device and equipment for analyzing complaint behavior data and storage medium
CN113705200B (en) * 2021-08-31 2023-09-15 中国平安财产保险股份有限公司 Analysis method, analysis device, analysis equipment and analysis storage medium for complaint behavior data
CN116523555A (en) * 2023-05-12 2023-08-01 珍岛信息技术(上海)股份有限公司 Clue business opportunity insight system based on NLP text processing technology
CN116523555B (en) * 2023-05-12 2024-07-16 珍岛信息技术(上海)股份有限公司 Clue business opportunity insight system based on NLP text processing technology

Similar Documents

Publication Publication Date Title
CN110377731A (en) Complain text handling method, device, computer equipment and storage medium
CN110427623B (en) Semi-structured document knowledge extraction method and device, electronic equipment and storage medium
CN108520343B (en) Risk model training method, risk identification device, risk identification equipment and risk identification medium
TWI735543B (en) Method and device for webpage text classification, method and device for webpage text recognition
US20200279105A1 (en) Deep learning engine and methods for content and context aware data classification
CN102332028B (en) Webpage-oriented unhealthy Web content identifying method
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN108376151A (en) Question classification method, device, computer equipment and storage medium
CN111767716B (en) Method and device for determining enterprise multi-level industry information and computer equipment
CN112632989B (en) Method, device and equipment for prompting risk information in contract text
CN111028934B (en) Diagnostic quality inspection method, diagnostic quality inspection device, electronic equipment and storage medium
CN110098961A (en) A kind of Data Quality Assessment Methodology, device and storage medium
CN110008309A (en) A kind of short phrase picking method and device
Sheshikala et al. Natural language processing and machine learning classifier used for detecting the author of the sentence
CN109933783A (en) A kind of essence of a contract method of non-performing asset operation field
KR20230163983A (en) Similar patent extraction methods using neural network model and device for the method
Oelke et al. Visual evaluation of text features for document summarization and analysis
CN116578703A (en) Intelligent identification system and method
CN112732908B (en) Test question novelty evaluation method and device, electronic equipment and storage medium
CN109635289A (en) Entry classification method and audit information abstracting method
CN113868431A (en) Financial knowledge graph-oriented relation extraction method and device and storage medium
CN109300031A (en) Data digging method and device based on stock comment data
Intani et al. Automating Public Complaint Classification Through JakLapor Channel: A Case Study of Jakarta, Indonesia
Cholissodin et al. Audit system development for government institution documents using stream deep learning to support smart governance
CN112115258A (en) User credit evaluation method, device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination