CN113553431B - User tag extraction method, device, equipment and medium - Google Patents

Info

Publication number
CN113553431B
Authority
CN
China
Prior art keywords
sample
text
samples
label
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110851246.XA
Other languages
Chinese (zh)
Other versions
CN113553431A (en)
Inventor
陈贝妮
王坚
李婷
赵炀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Pingan Integrated Financial Services Co ltd
Original Assignee
Shenzhen Pingan Integrated Financial Services Co ltd
Application filed by Shenzhen Pingan Integrated Financial Services Co ltd
Priority to CN202110851246.XA
Publication of CN113553431A
Application granted
Publication of CN113553431B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374Thesaurus
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the field of artificial intelligence and provides a user tag extraction method, device, equipment and medium. Text enhancement is used to recall as many text expressions as possible, and sensitive word recall is then used to screen the enhanced text, so that keywords are located as completely and accurately as possible for subsequent model training, which improves the training effect of the model. The user tag extraction model is obtained by combining sensitive word recall with deep learning training. This fuses the strong adaptability of sensitive word recognition in public opinion scenarios with the high accuracy of deep learning algorithms in judging the emotional tone of text: a large number of public-opinion-related texts are first recalled through sensitive word recognition, and the positive or negative emotion of the recalled texts is then judged through deep learning, which effectively improves the accuracy of tag recognition. In addition, the invention also relates to blockchain technology, and the user tag extraction model can be stored in a blockchain node.

Description

User tag extraction method, device, equipment and medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a medium for extracting a user tag.
Background
As the economy develops, the state continues to tighten its regulation of the financial industry. Under this trend of sustained strict supervision, every large financial group needs more advanced and effective risk management tools and aims to provide better products and services to its clients, so accurately identifying a client's emotional state during service is necessary.
However, when a public opinion event occurs, the customer service agent is usually the first to encounter it. On the one hand, agents lack awareness of the overall picture and tend to underestimate the impact of the event, so public opinion events go unreported; on the other hand, management has a stronger awareness of the overall picture but cannot learn of the event right away. As a result, public opinion events are often neither reported from the bottom up nor managed from the top down.
In addition, machine learning is currently the usual means of analyzing user emotion, and it has the following main defects:
(1) Deep learning algorithms have difficulty with the long interactive texts between customer service and users and require a large amount of computing resources; moreover, because public-opinion-related texts occur extremely rarely, a large number of non-public-opinion texts must be labeled in order to recall enough public opinion samples, so the labor cost is extremely high;
(2) Public opinion often arises from a single keyword, such as a swear word, and the features of such a word within a long sentence are not prominent and are difficult to recognize.
Disclosure of Invention
In view of the above, it is necessary to provide a user tag extraction method, apparatus, device, and medium that obtain a user tag extraction model by combining sensitive word recall with deep learning training. The approach fuses the strong adaptability of sensitive word recognition in public opinion scenarios with the high accuracy of deep learning algorithms in judging the emotional tone of text: a large number of public-opinion-related texts are first recalled through sensitive word recognition, and the positive or negative emotion of the recalled texts is then judged through deep learning, so that the accuracy of tag recognition is effectively improved.
A user tag extraction method, the user tag extraction method comprising:
acquiring an initial sample, and performing text cleaning on the initial sample to obtain a first sample set;
performing text enhancement on samples in the first sample set to obtain a second sample set;
performing sensitive word recall processing on samples in the second sample set to obtain training samples;
performing label processing on samples in the training samples to obtain label samples;
vectorizing the label samples to obtain a vector set;
training a preset classification network by using the vector set to obtain a user tag extraction model;
when receiving a text to be processed corresponding to a target user, converting the text to be processed into a vector to be processed, inputting the vector to be processed into the user tag extraction model, and generating a public opinion tag of the target user according to the output of the user tag extraction model.
According to a preferred embodiment of the present invention, the performing text cleaning on the initial sample to obtain a first sample set includes:
configuring a plurality of windows of specified sizes;
scanning each text in the initial sample with each of the windows;
when identical text is scanned, determining the identical text as a group of repeated words, and retaining one word from each group of repeated words to obtain a first intermediate set;
extracting numeric texts and time texts in the first intermediate set by using a regular expression;
replacing the numeric texts with a first preset value, and replacing the time texts with a second preset value, to obtain a second intermediate set;
performing word segmentation on the texts in the second intermediate set to obtain a third intermediate set;
invoking a pre-configured stop word dictionary, and querying the stop word dictionary with the texts in the third intermediate set;
and deleting from the third intermediate set the queried words that match words in the stop word dictionary, to obtain the first sample set.
According to a preferred embodiment of the present invention, the performing text enhancement on the samples in the first sample set to obtain a second sample set includes:
obtaining a pre-constructed fault-tolerant dictionary;
querying the fault-tolerant dictionary with each sample in the first sample set, and determining the queried text that matches each sample as the enhanced text of that sample;
adding the enhanced text of each sample to the first sample set to obtain the second sample set.
According to a preferred embodiment of the present invention, the performing sensitive word recall processing on the samples in the second sample set to obtain training samples includes:
obtaining pre-configured public opinion categories and obtaining the sensitive words of each public opinion category;
constructing a regular expression corresponding to each public opinion category according to the sensitive words of that category;
identifying the public opinion category of each sample in the second sample set;
acquiring the regular expression corresponding to the public opinion category of each sample, and traversing each sample with that regular expression to obtain the candidate sensitive words of each sample;
and repairing the candidate sensitive words of each sample to obtain the training samples.
According to a preferred embodiment of the present invention, the performing label processing on the samples in the training samples to obtain label samples includes:
identifying emotion words with an emotional orientation in the training samples;
deleting samples without emotion words from the training samples to obtain preliminary screening samples;
adjusting the proportion of positive samples to negative samples in the preliminary screening samples to a preset proportion through undersampling, wherein a positive sample is a sample containing emotion words with a positive emotional orientation, and a negative sample is a sample containing emotion words with a negative emotional orientation;
sending the positive samples and the negative samples to a designated platform for labeling;
and constructing the label samples from the labeled positive samples and negative samples fed back by the designated platform.
According to a preferred embodiment of the present invention, the vectorizing the label samples to obtain a vector set includes:
converting, with a word2vec algorithm, the words composing each sample in the label samples into vectors of a specified dimension;
longitudinally splicing the vectors corresponding to the words composing each sample to obtain a text feature matrix of each sample;
and combining the text feature matrices of the samples to obtain the vector set.
According to a preferred embodiment of the present invention, after generating the public opinion label of the target user according to the output of the user label extraction model, the method further includes:
Acquiring a current service scene, and when the current service scene is an experience feedback scene, positioning a target problem according to a public opinion label of the target user, and sending the target problem to a designated terminal device; or when the current service scene is a consultation scene, connecting a designated customer service, and sending prompt information to the designated customer service, wherein the prompt information is used for prompting the designated customer service to assist in soothing the emotion of the target user;
and adding the public opinion label of the target user to a specified label library.
A user tag extraction apparatus, the user tag extraction apparatus comprising:
a cleaning unit, used for acquiring an initial sample and performing text cleaning on the initial sample to obtain a first sample set;
an enhancement unit, used for performing text enhancement on the samples in the first sample set to obtain a second sample set;
a recall unit, used for performing sensitive word recall processing on the samples in the second sample set to obtain training samples;
a label unit, used for performing label processing on the samples in the training samples to obtain label samples;
a vectorization unit, used for vectorizing the label samples to obtain a vector set;
a training unit, used for training a preset classification network by using the vector set to obtain a user tag extraction model;
a generating unit, used for converting a text to be processed into a vector to be processed when receiving the text to be processed corresponding to a target user, inputting the vector to be processed into the user tag extraction model, and generating the public opinion tag of the target user according to the output of the user tag extraction model.
A computer device, the computer device comprising:
a memory storing at least one instruction; and
a processor that executes the instructions stored in the memory to implement the user tag extraction method.
A computer-readable storage medium having stored therein at least one instruction that is executed by a processor in a computer device to implement the user tag extraction method.
According to the above technical scheme, the invention acquires an initial sample and performs text cleaning on it to obtain a first sample set; cleaning away text that hinders text recognition makes the features of the retained text more prominent and easier to recognize. Text enhancement is performed on the samples in the first sample set to obtain a second sample set, which adds generalization on top of the original samples so that similar expressions can be recalled and the recall rate improves. Sensitive word recall processing is performed on the samples in the second sample set to obtain training samples: text enhancement first recalls as many text expressions as possible, and sensitive word recall then screens the enhanced text, so that keywords are located as completely and accurately as possible for subsequent model training, which improves the training effect of the model. Label processing is performed on the training samples to obtain label samples, the label samples are vectorized to obtain a vector set, and a preset classification network is trained with the vector set to obtain a user tag extraction model. When a text to be processed corresponding to a target user is received, it is converted into a vector to be processed and input into the user tag extraction model, and a public opinion tag of the target user is generated according to the output of the model. The user tag extraction model is thus obtained by combining sensitive word recall with deep learning training, fusing the strong adaptability of sensitive word recognition in public opinion scenarios with the high accuracy of deep learning algorithms in judging the emotional tone of text: a large number of public-opinion-related texts are first recalled through sensitive word recognition, and the positive or negative emotion of the recalled texts is then judged through deep learning, so that the accuracy of tag recognition is effectively improved.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of the user tag extraction method of the present invention.
FIG. 2 is a functional block diagram of a preferred embodiment of the user tag extraction apparatus of the present invention.
FIG. 3 is a schematic structural diagram of a computer device for implementing a preferred embodiment of the user tag extraction method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flow chart of a preferred embodiment of the user tag extraction method of the present invention. The order of the steps in the flowchart may be changed and some steps may be omitted according to various needs.
The user tag extraction method is applied to one or more computer devices, where a computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device and the like.
The computer device may be any electronic product capable of human-computer interaction with a user, such as a personal computer, a tablet computer, a smart phone, a personal digital assistant (PDA), a game console, an interactive Internet Protocol television (IPTV), a smart wearable device, etc.
The computer device may also include a network device and/or a user device. The network device includes, but is not limited to, a single network server, a server group composed of a plurality of network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing.
The network in which the computer device is located includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), and the like.
S10, acquiring an initial sample, and performing text cleaning on the initial sample to obtain a first sample set.
In at least one embodiment of the present invention, the initial sample may be obtained from a database of the relevant enterprise, which the present invention does not limit.
For example: the initial sample can be a chat record between a user and customer service on any platform of the relevant enterprise.
In at least one embodiment of the present invention, the performing text cleaning on the initial sample to obtain a first sample set includes:
configuring a plurality of windows of specified sizes;
scanning each text in the initial sample with each of the windows;
when identical text is scanned, determining the identical text as a group of repeated words, and retaining one word from each group of repeated words to obtain a first intermediate set;
extracting numeric texts and time texts in the first intermediate set by using a regular expression;
replacing the numeric texts with a first preset value, and replacing the time texts with a second preset value, to obtain a second intermediate set;
performing word segmentation on the texts in the second intermediate set to obtain a third intermediate set;
invoking a pre-configured stop word dictionary, and querying the stop word dictionary with the texts in the third intermediate set;
and deleting from the third intermediate set the queried words that match words in the stop word dictionary, to obtain the first sample set.
For example, when the window size is 4, four words appear in the window at a time. When the text in the window at the current position t is the same as the text in the window at position t+4, the text in the window is considered a repeated word.
Specifically, the program takes two input parameters. The first is the text: text = "or one more that is the salesman who is given that business at the time, you also go me". The second is the maximum window width: max_ngram_length = 4. If the preset maximum window width is 4, the window size traverses 4, 3, 2 and 1 in turn.
text = "or one more is given to me at the time that business person, you also send me to you";
max_ngram_length = 4;
result = trimmer(text, max_ngram_length);
When the program runs with a window width of 4, the window slides over the text from left to right and "also one" is found to be a repeated word; when the window width is 2, a further repeated word is found.
After the repeated words are deleted, the output is: "or still another is to give me that salesman at the time, you also go me".
It will be appreciated that the interaction text between a customer and an agent contains a large number of repeated words, which dilute the text features available to NLU (Natural Language Understanding), so deleting repeated words sharpens the text features.
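The sliding-window deduplication described above can be sketched as follows. This is a minimal illustration and not the patent's implementation: the name `trimmer` and the parameter `max_ngram_length` come from the text, while the body, the use of a token list instead of a raw string, and the English sample sentence are assumptions.

```python
def trimmer(tokens, max_ngram_length=4):
    """Collapse immediately repeated n-grams, keeping one copy of each."""
    tokens = list(tokens)
    # Try the widest window first, then 3, 2, 1, as in the description.
    for width in range(max_ngram_length, 0, -1):
        i = 0
        while i + 2 * width <= len(tokens):
            if tokens[i:i + width] == tokens[i + width:i + 2 * width]:
                # The window at position i equals the window at i + width:
                # drop the second copy and re-check the same position.
                del tokens[i + width:i + 2 * width]
            else:
                i += 1
    return tokens


# Hypothetical sample sentence with repeats at widths 2 and 1.
sample = "I I want to to cancel cancel the card the card".split()
cleaned = " ".join(trimmer(sample, 4))
```

For Chinese text the same function could be applied to a list of characters, since the comparison only relies on sequence equality.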
Further, financial scenarios involve many descriptions of amounts, such as three hundred seventy yuan or forty thousand yuan, and many descriptions of time, such as the thirtieth of this month, Friday, May twenty-eighth, or tomorrow. Such descriptions are highly varied and are easily mistaken for features by the classification algorithm.
Therefore, this embodiment uses regular expressions to generalize the descriptions of amounts and times, reducing the interference that their variety causes in text classification.
For example, when the first preset value is configured as 300 and the second preset value is configured as 2 p.m., all amounts are uniformly generalized to 300 yuan and all times are uniformly generalized to 2 p.m.
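As a rough sketch of this generalization step, the snippet below replaces amounts and times with fixed values. The two patterns are hypothetical and cover only Arabic-digit amounts followed by "yuan" and HH:MM times; the patent does not disclose its regular expressions, and spoken forms such as "three hundred seventy yuan" would need additional patterns.

```python
import re

# Hypothetical patterns, not taken from the patent.
AMOUNT_RE = re.compile(r"\d+(?:\.\d+)?\s*yuan")
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}\b")


def generalize(text, amount_value="300 yuan", time_value="14:00"):
    """Replace every amount and every time with a fixed preset value."""
    text = AMOUNT_RE.sub(amount_value, text)
    return TIME_RE.sub(time_value, text)


generalized = generalize("I repaid 370 yuan at 9:45 and 40000 yuan at 18:30")
```

After this step, texts that differ only in the amount or the time collapse to the same string, so the classifier cannot treat a particular number as a feature.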
Further, in this embodiment, LAC (Lexical Analysis of Chinese, a Chinese word segmentation tool) may be used to perform word segmentation on the texts in the second intermediate set, which the present invention does not limit.
Further, the stop word dictionary may be customized according to actual requirements. For example, it may include certain Chinese function words, certain prepositions, and scripted phrases (such as formulaic opening and closing remarks).
Through the text cleaning, the text which is not beneficial to text recognition is cleaned, so that the characteristics of the reserved text are more obvious and easy to recognize.
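The stop word deletion at the end of the cleaning step amounts to a dictionary lookup per token. The sketch below assumes the text has already been segmented into tokens; the entries in `STOP_WORDS` are hypothetical placeholders, not the patent's dictionary.

```python
# Hypothetical stop word dictionary: fillers and scripted greetings.
STOP_WORDS = {"um", "uh", "hello", "goodbye"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop word dictionary."""
    return [tok for tok in tokens if tok not in stop_words]


kept = remove_stop_words(["hello", "um", "I", "want", "refund", "goodbye"])
```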
S11, carrying out text enhancement on the samples in the first sample set to obtain a second sample set.
In at least one embodiment of the present invention, the performing text enhancement on the samples in the first sample set to obtain a second sample set includes:
obtaining a pre-constructed fault-tolerant dictionary;
querying the fault-tolerant dictionary with each sample in the first sample set, and determining the queried text that matches each sample as the enhanced text of that sample;
adding the enhanced text of each sample to the first sample set to obtain the second sample set.
In this embodiment, the fault-tolerant dictionary is an ASR (Automatic Speech Recognition) fault-tolerant dictionary built from a continuously accumulated massive corpus.
For example, during a sudden epidemic, many clients of enterprises in the financial field call in to request deferred repayment. Because the keyword "epidemic" appeared so rarely in everyday use before, ASR cannot transcribe it accurately, and the speech may be transcribed as unrelated words such as "happy", "heterology" or "heteromorphism". As the massive corpus accumulates, the ASR fault-tolerant dictionary is continuously improved; text that ASR cannot transcribe correctly can be enhanced by looking up the corresponding text in the dictionary, which covers ASR fault tolerance for keywords in most financial scenarios.
Text enhancement also synthesizes texts of similar meaning in order to cope with the diversity of spoken expression. For example, when a client calls in to cancel a card, the client may lack professional knowledge of credit cards and not use the keyword "log off", and may instead use other expressions such as "cancel". Through text enhancement, texts with a meaning similar to "log off" can be accurately identified, which helps in understanding the client's intent.
Through text enhancement, generalization can be added on the basis of the original sample, so that similar expressions can be recalled, and the recall rate is improved.
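One way to picture the dictionary lookup is shown below. The layout of `FAULT_TOLERANT_DICT` (a canonical keyword mapped to known variants) and its entries are assumptions for illustration; the patent does not describe the dictionary's structure.

```python
# Hypothetical fault-tolerant dictionary: canonical keyword -> known variants
# (ASR mis-transcriptions or colloquial synonyms).
FAULT_TOLERANT_DICT = {
    "cancel the card": ["close the card", "log off the card"],
}


def enhance(samples, dictionary=FAULT_TOLERANT_DICT):
    """Return the samples plus one enhanced copy per matching variant."""
    enhanced = list(samples)
    for sample in samples:
        for canonical, variants in dictionary.items():
            for variant in variants:
                if variant in sample:
                    candidate = sample.replace(variant, canonical)
                    if candidate not in enhanced:
                        enhanced.append(candidate)
    return enhanced


second_set = enhance(["please log off the card", "what is my balance"])
```

The original samples are kept and the enhanced texts are appended, mirroring the step that adds the enhanced text to the first sample set to form the second sample set.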
S12, performing sensitive word recall processing on the samples in the second sample set to obtain training samples.
In at least one embodiment of the present invention, the performing sensitive word recall processing on the samples in the second sample set to obtain training samples includes:
obtaining pre-configured public opinion categories and obtaining the sensitive words of each public opinion category;
constructing a regular expression corresponding to each public opinion category according to the sensitive words of that category;
identifying the public opinion category of each sample in the second sample set;
acquiring the regular expression corresponding to the public opinion category of each sample, and traversing each sample with that regular expression to obtain the candidate sensitive words of each sample;
and repairing the candidate sensitive words of each sample to obtain the training samples.
For example, 54 public opinion categories and the sensitive words related to each category are pre-configured, and regular expressions are used to recall text that may be related to public opinion, namely the candidate sensitive words. To make the recall result as accurate as possible, the recalled text is repaired through bad-case analysis. Take the public opinion category "complaint to social regulatory authorities": it covers two complaint channels, the regulatory authorities and the consumer protection association, so its keywords include "market regulation" and its enhanced text, and "consumer protection association" and its enhanced text. Because "12315" is the hotline of the consumer protection association, "12315" is also a keyword of this category. Bad-case analysis shows, however, that clients can hit "12315" by accident when reading out an identity card number, insurance policy number, license plate number and the like, so a regular expression is used to exclude text in which "12315" is preceded or followed by other digits or letters. That is, if the client says a string of digits such as "0112315" or "1231507", the text does not hit the public opinion category "complaint to social regulatory authorities".
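The "12315" repair rule can be expressed with lookaround assertions. The pattern below is one possible formulation and is an assumption, since the patent does not give the actual expression; the function name is hypothetical.

```python
import re

# "12315" counts only when it is not glued to other digits or letters.
HOTLINE_RE = re.compile(r"(?<![0-9A-Za-z])12315(?![0-9A-Za-z])")


def hits_regulator_category(text):
    """True if the text mentions the 12315 hotline as a standalone number."""
    return HOTLINE_RE.search(text) is not None


standalone = hits_regulator_category("I will call 12315 to complain")
embedded_prefix = hits_regulator_category("my number ends in 0112315")
embedded_suffix = hits_regulator_category("the policy number is 1231507")
```

Because the lookarounds only reject ASCII digits and letters, a hotline number surrounded by Chinese characters or punctuation still matches.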
It can be understood that the dialogue text between a user and an agent is usually long, and long text is difficult for a deep learning model to recognize. In this embodiment, therefore, a large number of public-opinion-related texts are recalled through sensitive word recognition and the text is truncated around the located keywords, so that only the text located by a keyword is passed on to the subsequent emotion classification, which optimizes the training effect of the model.
In this embodiment, text enhancement first recalls as many text expressions as possible, and sensitive word recall then screens the enhanced text, so that keywords are located as completely and accurately as possible for subsequent model training, which improves the training effect of the model.
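Keyword-located truncation can be sketched as taking a fixed amount of context around each regular expression match. The function name `intercept` and the `context` width are assumptions; the patent does not state how much surrounding text is kept.

```python
import re


def intercept(text, pattern, context=20):
    """Return the snippets of `text` around each match of `pattern`."""
    snippets = []
    for match in re.finditer(pattern, text):
        # Keep `context` characters on each side, clipped to the text bounds.
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        snippets.append(text[start:end])
    return snippets


long_dialogue = "a" * 50 + "12315" + "b" * 50
snippets = intercept(long_dialogue, r"12315", context=3)
```

Only the returned snippets would be passed to the emotion classifier, which keeps its input short even when the full dialogue is long.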
S13, carrying out label processing on the samples in the training samples to obtain label samples.
In at least one embodiment of the present invention, the performing label processing on the samples in the training samples to obtain label samples includes:
identifying emotion words with an emotional orientation in the training samples;
deleting samples without emotion words from the training samples to obtain preliminary screening samples;
adjusting the proportion of positive samples to negative samples in the preliminary screening samples to a preset proportion through undersampling, wherein a positive sample is a sample containing emotion words with a positive emotional orientation, and a negative sample is a sample containing emotion words with a negative emotional orientation;
sending the positive samples and the negative samples to a designated platform for labeling;
and constructing the label samples from the labeled positive samples and negative samples fed back by the designated platform.
Specifically, a regular expression matching algorithm may be used to identify the emotion words with emotion orientations in the training samples, or may also identify the emotion words with emotion orientations in the training samples according to a preconfigured emotion word bank, rules, and the like, which is not limited by the present invention.
For example: "I want to go to court to tell your, we can mark emotion as" negative "; "I go to court to give me son a meal to bump the car" and mark emotion as forward.
In this embodiment, the preset proportion may be custom-configured, for example 1:1.
Specifically, when the proportion of positive samples to negative samples in the preliminary screening samples is adjusted to the preset proportion through undersampling, surplus samples containing emotion words with a negative orientation can be deleted, so that the positive and negative samples satisfy the preset proportion.
In this embodiment, the designated platform is used to label the samples. The positive and negative samples may be labeled on the designated platform with manual assistance, which is not limited by the present invention.
In this embodiment, the samples are screened first, which avoids the workload of labeling a large number of redundant samples and improves working efficiency; the proportion of positive to negative samples is then adjusted through undersampling, so that the subsequently trained model is more objective.
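The undersampling step can be illustrated with a minimal sketch. It assumes each preliminary screening sample is a (text, polarity) pair; the function name `undersample`, the polarity strings, and the fixed random seed are illustrative assumptions rather than part of the described method.

```python
import random

def undersample(samples, ratio=(1, 1), seed=0):
    """Randomly drop surplus samples so that positive:negative equals `ratio`.

    `samples` is a list of (text, polarity) pairs, an assumed data layout.
    """
    pos = [s for s in samples if s[1] == "positive"]
    neg = [s for s in samples if s[1] == "negative"]
    p, q = ratio
    # Largest multiple of the ratio that both classes can supply.
    k = min(len(pos) // p, len(neg) // q)
    rng = random.Random(seed)
    return rng.sample(pos, p * k) + rng.sample(neg, q * k)

preliminary = [("text a", "positive"), ("text b", "positive")] + [
    (f"text n{i}", "negative") for i in range(5)
]
balanced = undersample(preliminary, ratio=(1, 1))
```

With two positive and five negative samples and a 1:1 proportion, three surplus negative samples are dropped before the remaining samples are sent for labeling.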
S14, carrying out vectorization processing on the label sample to obtain a vector set.
In at least one embodiment of the present invention, the vectorizing the label sample to obtain a vector set includes:
a word2vec algorithm is adopted to convert words composing each sample in the label sample into a vector with appointed dimension;
longitudinally splicing vectors corresponding to the words composing each sample to obtain a text feature matrix of each sample;
and combining the text feature matrix of each sample to obtain the vector set.
Of course, in other embodiments, the words that make up each sample in the label sample may be converted into vectors of a specified dimension in a random initialization manner.
The specified dimension may be 1x512, according to the data requirements of the subsequent model training. Longitudinal splicing then yields a text feature matrix of n x 512, where n is a positive integer, for example 100, and is the maximum input length of the subsequent model.
Of course, in other embodiments, each of the label samples may also be encoded using a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model.
Through the embodiment, the label sample can be converted into the vector which can be read and processed by the model, so that the model can be trained conveniently later.
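As a rough illustration of the vectorization step, the sketch below uses the random-initialization variant mentioned above instead of a trained word2vec model, so that it runs with the standard library only. The helper names, the zero-padding of short samples, and the truncation of long samples are assumptions; the text does not state how samples shorter or longer than n are handled.

```python
import random

DIM = 512      # specified vector dimension (from the text)
MAX_LEN = 100  # n, the maximum input length of the subsequent model (from the text)

def build_embeddings(vocab, dim=DIM, seed=0):
    """Randomly initialise one vector per word (stand-in for word2vec)."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for w in vocab}

def text_feature_matrix(words, embeddings, max_len=MAX_LEN, dim=DIM):
    """Stack word vectors row by row into a max_len x dim matrix."""
    rows = [list(embeddings.get(w, [0.0] * dim)) for w in words[:max_len]]
    # Assumed: pad short samples with zero rows up to max_len.
    rows += [[0.0] * dim for _ in range(max_len - len(rows))]
    return rows

label_samples = [["court", "sue", "you"], ["court", "meal"]]
vocab = sorted({w for sample in label_samples for w in sample})
emb = build_embeddings(vocab)
vector_set = [text_feature_matrix(s, emb) for s in label_samples]
```

Each entry of `vector_set` is then a 100 x 512 matrix, matching the n x 512 shape described for the text feature matrix.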
S15, training a preset classification network by using the vector set to obtain a user tag extraction model.
It can be understood that, after sensitive word recall, although some false hits are removed by bad-case repair, some cases remain that regular expressions cannot exclude. For example, when a customer says "I am going to court to sue you", the text contains the sensitive word "court" and should hit the "pursue legal liability" tag, whereas "I went to the court to bring my son a meal and hit a car" also contains the sensitive word "court" but should not hit the "pursue legal liability" tag.
To identify tags accurately even in such cases, this embodiment further trains the user tag extraction model with deep learning.
Specifically, a TextCNN model may be employed as the preset classification network to train the user tag extraction model.
Further, the output probability of the positive and negative labels is obtained by adopting a softmax activation function, and the label with the highest probability is selected as the final label output.
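The softmax output step can be sketched as follows. The two raw scores stand in for the final-layer output of the classification network, which is not reproduced here; the function names and the label order are assumptions for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, labels=("positive", "negative")):
    """Return the label with the highest probability, plus all probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs

label, probs = predict_label([0.2, 1.7])
```

Here the second score is larger, so the "negative" label is selected as the final output.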
S16, when receiving a text to be processed corresponding to a target user, converting the text to be processed into a vector to be processed, inputting the vector to be processed into the user tag extraction model, and generating a public opinion tag of the target user according to the output of the user tag extraction model.
It should be noted that conventional approaches use only a single deep learning model, which consumes substantial computing resources, incurs high manual labeling costs, and leaves the accuracy of tag extraction to be improved.
In this embodiment, the user tag extraction model is obtained by combining sensitive word recall with deep learning training. This fuses the strong adaptability of sensitive word recognition in public opinion scenarios with the high accuracy of deep learning in judging the overall emotional tone of a text: sensitive word recognition first recalls a large amount of text related to public opinion, and deep learning then judges whether the emotion of the recalled text is positive or negative, which effectively improves the accuracy of tag recognition.
Moreover, experiments show that the composite algorithm adopted in this embodiment raises the accuracy of public opinion tag mining from 81.3% to 92.9%, while maintaining millisecond-level recognition speed.
In at least one embodiment of the present invention, the generating the public opinion tag of the target user according to the output of the user tag extraction model includes:
when the output of the user tag extraction model is a tag representing negative emotion, sorting the tags in descending order of their output probabilities, and taking the tags ranked within the first preset number of positions as the public opinion tags of the target user; or
when the output of the user tag extraction model is a tag representing positive emotion, not generating a public opinion tag for the target user.
The preset number of positions may be custom-configured, for example 3.
In the above embodiment, public opinion tags are established only for users with negative emotion, so that these users receive attention and their state can be followed closely.
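The tag generation rule can be sketched as below. The sketch assumes that the model output has already been reduced to an overall polarity plus a probability per candidate tag; this data layout, the function name, and the example tag names other than those quoted in the text are hypothetical.

```python
def generate_public_opinion_tags(polarity, tag_probs, preset_positions=3):
    """Return the top-ranked tags for negative output, nothing for positive."""
    if polarity != "negative":
        return []  # no public opinion tag is generated for positive emotion
    ranked = sorted(tag_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [tag for tag, _ in ranked[:preset_positions]]

tag_probs = {
    "pursue legal liability": 0.62,
    "complaint to social regulatory agencies": 0.21,
    "hypothetical tag C": 0.10,
    "hypothetical tag D": 0.07,
}
tags = generate_public_opinion_tags("negative", tag_probs, preset_positions=3)
```

With a preset number of positions of 3, the lowest-probability tag is dropped, and a positive polarity yields an empty list.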
In at least one embodiment of the present invention, after generating the public opinion label of the target user according to the output of the user label extraction model, the method further includes:
acquiring a current service scenario; when the current service scenario is an experience feedback scenario, locating a target problem according to the public opinion tag of the target user and sending the target problem to a designated terminal device; or, when the current service scenario is a consultation scenario, connecting a designated customer service agent and sending prompt information to the designated customer service agent, wherein the prompt information prompts the designated customer service agent to help soothe the emotion of the target user;
and adding the public opinion tag of the target user to a designated tag library.
The designated terminal device may be a device of relevant staff, such as the maintenance personnel of the APP that received the experience feedback, so that the problem can be solved in time and the user experience is improved.
The designated customer service agent may be an outstanding agent with strong performance and high service ratings, who assists in risk management.
The designated tag library stores a list of users requiring attention and the public opinion tags corresponding to each user, so that all personnel can pay attention to these users, which facilitates overall management of sensitive users.
It should be noted that, to further improve data security and prevent malicious tampering, the user tag extraction model may be stored in a blockchain node.
According to the above technical solution, the user tag extraction model can be obtained by combining sensitive word recall with deep learning training. This fuses the strong adaptability of sensitive word recognition in public opinion scenarios with the high accuracy of deep learning in judging the overall emotional tone of a text: sensitive word recognition first recalls a large amount of text related to public opinion, and deep learning then judges whether the emotion of the recalled text is positive or negative, which effectively improves the accuracy of tag recognition.
Fig. 2 is a functional block diagram of a preferred embodiment of the user tag extraction apparatus of the present invention. The user tag extraction apparatus 11 includes a cleaning unit 110, an enhancing unit 111, a recall unit 112, a tag unit 113, a vectorization unit 114, a training unit 115, and a generation unit 116. The module/unit referred to in the present invention refers to a series of computer program segments capable of being executed by the processor 13 and of performing a fixed function, which are stored in the memory 12. In the present embodiment, the functions of the respective modules/units will be described in detail in the following embodiments.
The cleaning unit 110 obtains an initial sample, and performs text cleaning on the initial sample to obtain a first sample set.
In at least one embodiment of the present invention, the initial sample may be obtained from a database of the relevant enterprise, and the present invention is not limited.
For example: the initial sample can be a chat record between a user and customer service on any platform of the relevant enterprise.
In at least one embodiment of the present invention, the cleaning unit 110 performing text cleaning on the initial sample to obtain a first sample set includes:
Configuring a plurality of windows of a specified size;
scanning each text in the initial sample with the windows, respectively;
when identical text is scanned, determining the identical text as a group of repeated words, and retaining one word of each group of repeated words in the initial sample to obtain a first intermediate set;
extracting numeric text and time text in the first intermediate set by using regular expressions;
replacing the numeric text with a first preset value, and replacing the time text with a second preset value, to obtain a second intermediate set;
performing word segmentation on the texts in the second intermediate set to obtain a third intermediate set;
invoking a pre-configured stop word dictionary, and querying the stop word dictionary with the text in the third intermediate set;
and deleting from the third intermediate set the queried words that match words in the stop word dictionary, to obtain the first sample set.
For example, when the window size is 4, the window covers 4 words at a time. When the window text at the current position t is the same as the window text at position t+4, that window text is considered a repeated word.
Specifically, the program takes two input parameters: the text to be processed, text, and the maximum window width, max_ngram_length. If the preset maximum window width is 4, the window size traverses 4, 3, 2 and 1 in turn.
text = "or one more is given to me at the time that business person, you also send me to you";
max_ngram_length = 4;
result = trimmer(text, max_ngram_length);
When the program runs with a window width of 4, the window slides over the text from left to right and finds that "also one" is a repeated word; with a window width of 2, another repeated word is found.
After the repeated words are deleted, the output is: "or still another is to give me that salesman at the time, you also go me".
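The implementation of trimmer is not given in the text. The following is a minimal sketch of one way the sliding-window deduplication could work, assuming each character is one token (as would be natural for Chinese text) and that only immediately adjacent repeats are removed. The ASCII input is purely illustrative.

```python
def trimmer(text: str, max_ngram_length: int) -> str:
    """Collapse immediately repeated n-grams, widest window first."""
    tokens = list(text)  # assumption: one token per character
    # The window size traverses max_ngram_length, ..., 2, 1.
    for width in range(max_ngram_length, 0, -1):
        t = 0
        while t + 2 * width <= len(tokens):
            # Compare the window at position t with the window at t + width.
            if tokens[t:t + width] == tokens[t + width:t + 2 * width]:
                del tokens[t + width:t + 2 * width]  # keep a single copy
            else:
                t += 1
    return "".join(tokens)

result = trimmer("abcdabcdefefg", 4)
```

In this illustration the window of width 4 removes the second "abcd" and the window of width 2 removes the second "ef", leaving "abcdefg".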
It can be understood that the interaction text between customers and agents contains a large number of repeated words, which dilute the text features available for NLU (Natural Language Understanding); processing the repeated words therefore optimizes the text features.
Further, financial scenarios involve many descriptions of amounts, such as three hundred seventy yuan or forty thousand yuan, and descriptions of time, such as the thirtieth of this month, Friday, May twenty-eighth, or tomorrow. Such descriptions are highly varied and are easily mistaken for features by the classification algorithm.
Therefore, this embodiment uses regular expressions to generalize the descriptions of amounts and of time, so as to reduce the interference that the diversity of amounts and times causes for text classification.
For example, when the first preset value is configured as 300 yuan and the second preset value as 14:00 (2 p.m.), all amounts are uniformly generalized to 300 yuan and all times are uniformly generalized to 2 p.m.
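A minimal sketch of this generalization follows. The patterns are deliberately simplified assumptions that only cover Arabic-numeral amounts followed by "yuan", clock times, and a few English time words; a production system for Chinese text would need patterns for Chinese numerals and date expressions, which the text does not specify.

```python
import re

# Assumed, simplified patterns for illustration only.
AMOUNT_RE = re.compile(r"\d+(?:\.\d+)?\s*yuan")
TIME_RE = re.compile(
    r"\b\d{1,2}:\d{2}\b|\b(?:today|tomorrow|friday)\b", re.IGNORECASE
)

def generalize(text, amount_value="300 yuan", time_value="14:00"):
    """Replace every amount and every time with its preset value."""
    text = AMOUNT_RE.sub(amount_value, text)
    return TIME_RE.sub(time_value, text)

generalized = generalize("repay 40000.5 yuan at 9:30")
```

After generalization, texts that differ only in the specific amount or time become identical, so the classifier no longer sees those values as distinguishing features.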
Further, in this embodiment, LAC (Lexical Analysis of Chinese, a Chinese word segmentation tool) may be used to perform word segmentation on the text in the second intermediate set, which is not limited by the present invention.
Further, the stop word dictionary may be custom-configured according to actual requirements; for example, it may include some Chinese particles, some prepositions, and scripted phrases (such as formulaic opening and closing remarks).
Through text cleaning, text that does not help text recognition is removed, so that the features of the retained text are more distinct and easier to recognize.
The enhancement unit 111 performs text enhancement on the samples in the first sample set to obtain a second sample set.
In at least one embodiment of the present invention, the enhancing unit 111 performs text enhancement on the samples in the first sample set, to obtain a second sample set includes:
Obtaining a pre-constructed fault-tolerant dictionary;
querying in the fault-tolerant dictionary by utilizing each sample in the first sample set, and determining the queried text matched with each sample as enhanced text of each sample;
Adding the enhanced text of each sample to the first sample set to obtain the second sample set.
In this embodiment, the fault-tolerant dictionary is an ASR (Automatic Speech Recognition) fault-tolerant dictionary built from a continuously accumulated, massive corpus.
For example, during a sudden epidemic, many customers of financial enterprises call in to request deferred repayment. Because the keyword "epidemic" rarely occurred in everyday speech before, ASR cannot transcribe it accurately and may transcribe the speech as unrelated words such as "happy", "heterology", or "heteromorphism". As the massive corpus accumulates, the ASR fault-tolerant dictionary is continuously improved, and text that ASR failed to transcribe can be enhanced by looking up the corresponding text in the dictionary, which handles ASR errors for the keywords of most financial scenarios.
Text enhancement also synthesizes text of similar meaning to cope with the diversity of spoken expressions. For example, a customer who calls in to cancel a card may lack credit card expertise and not say the keyword "log off", using expressions such as "cancel" instead; through text enhancement, text similar in meaning to "log off" can be accurately identified, which helps in understanding the customer's intent.
Through text enhancement, generalization is added on the basis of the original samples, so that similar expressions can be recalled and the recall rate is improved.
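The dictionary lookup can be sketched as follows, assuming the fault-tolerant dictionary maps a mis-transcribed or alternative expression to the intended keyword. The dictionary layout, the function name `enhance`, and the substring-replacement strategy are assumptions; the text only states that matched text is added to the sample set as enhanced text.

```python
def enhance(first_sample_set, fault_tolerant_dict):
    """Add an enhanced copy of every sample that matches a dictionary entry."""
    second_sample_set = list(first_sample_set)
    for sample in first_sample_set:
        for observed, intended in fault_tolerant_dict.items():
            if observed in sample:
                enhanced = sample.replace(observed, intended)
                if enhanced not in second_sample_set:
                    second_sample_set.append(enhanced)
    return second_sample_set

fault_tolerant_dict = {"heterology": "epidemic", "cancel": "log off"}
first_sample_set = [
    "because of the heterology I need to delay repayment",
    "I want to cancel my card",
]
second_sample_set = enhance(first_sample_set, fault_tolerant_dict)
```

The original samples are kept and the enhanced variants are appended, so the second sample set is a superset of the first.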
Recall unit 112 performs sensitive word recall processing on samples in the second sample set to obtain training samples.
In at least one embodiment of the present invention, the recall unit 112 performs sensitive word recall processing on the samples in the second sample set, and obtaining training samples includes:
obtaining pre-configured public opinion categories and obtaining sensitive words of each public opinion category;
constructing a regular expression corresponding to each public opinion category according to the sensitive word of each public opinion category;
Identifying a public opinion category of each sample in the second sample set;
Acquiring regular expressions corresponding to the public opinion categories of each sample, traversing each sample by utilizing the regular expressions corresponding to the public opinion categories of each sample, and obtaining candidate sensitive words of each sample;
And repairing the candidate sensitive words of each sample to obtain the training sample.
For example, 54 public opinion categories and the sensitive words related to each category are pre-configured, and regular expressions are used to recall text that may be related to public opinion, namely the candidate sensitive words. To make the recall result as accurate as possible, the recalled text is repaired through bad-case analysis. For instance, the public opinion category "complaint to social regulatory agencies" covers two complaint channels, the regulatory authorities and the consumer protection association, so its keywords include "market regulation" and the related enhanced text, as well as "consumer protection association" and the related enhanced text. Because "12315" is the hotline of the consumer protection association, "12315" is also a keyword of this category. Bad-case analysis shows, however, that "12315" is falsely hit when customers recite an ID card number, insurance policy number, license plate number, and the like, so a regular expression is used to exclude text in which other digits or letters appear immediately before or after "12315". That is, if the customer says a string of digits such as "0112315" or "1231507", the text does not hit the public opinion category "complaint to social regulatory agencies".
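The "12315" bad-case repair can be expressed with lookaround assertions. The sketch below is one possible pattern, not the one used by the authors; it rejects a match when an ASCII digit or letter sits immediately before or after the hotline number.

```python
import re

# "12315" not immediately preceded or followed by a digit or ASCII letter.
HOTLINE_RE = re.compile(r"(?<![0-9A-Za-z])12315(?![0-9A-Za-z])")

def hits_hotline(text: str) -> bool:
    """True if the text contains a standalone mention of 12315."""
    return HOTLINE_RE.search(text) is not None

examples = {
    "I will call 12315 to complain": hits_hotline("I will call 12315 to complain"),
    "0112315": hits_hotline("0112315"),
    "1231507": hits_hotline("1231507"),
}
```

Only the first example hits the category; the two digit strings from the text are excluded.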
It can be understood that the dialogue text between a user and an agent is usually long, and long text is difficult for a deep learning model to recognize. Therefore, in this embodiment, a large amount of text related to public opinion is recalled through sensitive word recognition, the text is intercepted by keyword positioning, and only the text located by the keywords is passed on to subsequent emotion classification, so as to optimize the training effect of the model.
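Keyword-positioned interception might look like the sketch below. The text does not say how much context is kept around a keyword, so the fixed character radius and the function name are assumptions.

```python
import re

def intercept_around_keywords(text, pattern, radius=20):
    """Return one snippet per keyword match, with `radius` characters of context."""
    snippets = []
    for match in re.finditer(pattern, text):
        start = max(0, match.start() - radius)
        end = min(len(text), match.end() + radius)
        snippets.append(text[start:end])
    return snippets

long_text = "a" * 50 + "court" + "b" * 50
snippets = intercept_around_keywords(long_text, r"court", radius=20)
```

Only the 45-character snippet around "court" is kept for emotion classification, instead of the full 105-character dialogue text.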
In this embodiment, text enhancement first recalls as many text expressions as possible, and sensitive word recall then screens the enhanced text, so that keywords are located as completely and accurately as possible for subsequent model training, which improves the training effect of the model.
The label unit 113 performs label processing on the samples in the training samples to obtain label samples.
In at least one embodiment of the present invention, the label unit 113 performing label processing on the training samples to obtain label samples includes:
identifying emotion words with an emotional orientation in the training samples;
deleting samples that contain no emotion words from the training samples to obtain preliminary screening samples;
adjusting the proportion of positive samples to negative samples in the preliminary screening samples to a preset proportion through undersampling, wherein the positive samples are samples containing emotion words with a positive orientation, and the negative samples are samples containing emotion words with a negative orientation;
sending the positive samples and the negative samples to a designated platform for labeling;
and constructing the label samples from the labeled positive samples and labeled negative samples fed back by the designated platform.
Specifically, the emotion words with an emotional orientation in the training samples may be identified with a regular expression matching algorithm, or according to a preconfigured emotion lexicon, rules, and the like, which is not limited by the present invention.
For example, the emotion of "I am going to court to sue you" can be labeled "negative", while the emotion of "I went to the court to bring my son a meal and hit a car" can be labeled "positive".
In this embodiment, the preset proportion may be custom-configured, for example 1:1.
Specifically, when the proportion of positive samples to negative samples in the preliminary screening samples is adjusted to the preset proportion through undersampling, surplus samples containing emotion words with a negative orientation can be deleted, so that the positive and negative samples satisfy the preset proportion.
In this embodiment, the designated platform is used to label the samples. The positive and negative samples may be labeled on the designated platform with manual assistance, which is not limited by the present invention.
In this embodiment, the samples are screened first, which avoids the workload of labeling a large number of redundant samples and improves working efficiency; the proportion of positive to negative samples is then adjusted through undersampling, so that the subsequently trained model is more objective.
Vectorization unit 114 performs vectorization processing on the label samples to obtain a vector set.
In at least one embodiment of the present invention, the vectorizing unit 114 performs vectorization processing on the tag samples, to obtain a vector set includes:
a word2vec algorithm is adopted to convert words composing each sample in the label sample into a vector with appointed dimension;
longitudinally splicing vectors corresponding to the words composing each sample to obtain a text feature matrix of each sample;
and combining the text feature matrix of each sample to obtain the vector set.
Of course, in other embodiments, the words that make up each sample in the label sample may be converted into vectors of a specified dimension in a random initialization manner.
The specified dimension may be 1x512, according to the data requirements of the subsequent model training. Longitudinal splicing then yields a text feature matrix of n x 512, where n is a positive integer, for example 100, and is the maximum input length of the subsequent model.
Of course, in other embodiments, each of the label samples may also be encoded using a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model.
Through the embodiment, the label sample can be converted into the vector which can be read and processed by the model, so that the model can be trained conveniently later.
The training unit 115 uses the vector set to train a preset classification network to obtain a user tag extraction model.
It can be understood that, after sensitive word recall, although some false hits are removed by bad-case repair, some cases remain that regular expressions cannot exclude. For example, when a customer says "I am going to court to sue you", the text contains the sensitive word "court" and should hit the "pursue legal liability" tag, whereas "I went to the court to bring my son a meal and hit a car" also contains the sensitive word "court" but should not hit the "pursue legal liability" tag.
To identify tags accurately even in such cases, this embodiment further trains the user tag extraction model with deep learning.
Specifically, a TextCNN model may be employed as the preset classification network to train the user tag extraction model.
Further, the output probability of the positive and negative labels is obtained by adopting a softmax activation function, and the label with the highest probability is selected as the final label output.
When receiving a text to be processed corresponding to a target user, the generating unit 116 converts the text to be processed into a vector to be processed, inputs the vector to be processed into the user tag extraction model, and generates a public opinion tag of the target user according to the output of the user tag extraction model.
It should be noted that conventional approaches use only a single deep learning model, which consumes substantial computing resources, incurs high manual labeling costs, and leaves the accuracy of tag extraction to be improved.
In this embodiment, the user tag extraction model is obtained by combining sensitive word recall with deep learning training. This fuses the strong adaptability of sensitive word recognition in public opinion scenarios with the high accuracy of deep learning in judging the overall emotional tone of a text: sensitive word recognition first recalls a large amount of text related to public opinion, and deep learning then judges whether the emotion of the recalled text is positive or negative, which effectively improves the accuracy of tag recognition.
Moreover, experiments show that the composite algorithm adopted in this embodiment raises the accuracy of public opinion tag mining from 81.3% to 92.9%, while maintaining millisecond-level recognition speed.
In at least one embodiment of the present invention, the generation unit 116 generating the public opinion tag of the target user according to the output of the user tag extraction model includes:
when the output of the user tag extraction model is a tag representing negative emotion, sorting the tags in descending order of their output probabilities, and taking the tags ranked within the first preset number of positions as the public opinion tags of the target user; or
when the output of the user tag extraction model is a tag representing positive emotion, not generating a public opinion tag for the target user.
The preset number of positions may be custom-configured, for example 3.
In the above embodiment, public opinion tags are established only for users with negative emotion, so that these users receive attention and their state can be followed closely.
In at least one embodiment of the present invention, after the public opinion tag of the target user is generated according to the output of the user tag extraction model, a current service scenario is acquired; when the current service scenario is an experience feedback scenario, a target problem is located according to the public opinion tag of the target user and sent to a designated terminal device; or, when the current service scenario is a consultation scenario, a designated customer service agent is connected and prompt information is sent to the designated customer service agent, wherein the prompt information prompts the designated customer service agent to help soothe the emotion of the target user;
and the public opinion tag of the target user is added to a designated tag library.
The designated terminal device may be a device of relevant staff, such as the maintenance personnel of the APP that received the experience feedback, so that the problem can be solved in time and the user experience is improved.
The designated customer service agent may be an outstanding agent with strong performance and high service ratings, who assists in risk management.
The designated tag library stores a list of users requiring attention and the public opinion tags corresponding to each user, so that all personnel can pay attention to these users, which facilitates overall management of sensitive users.
It should be noted that, to further improve data security and prevent malicious tampering, the user tag extraction model may be stored in a blockchain node.
According to the above technical solution, the user tag extraction model is obtained by combining sensitive word recall with deep learning training. This fuses the strong adaptability of sensitive word recognition in public opinion scenarios with the high accuracy of deep learning in judging the overall emotional tone of a text: sensitive word recognition first recalls a large amount of text related to public opinion, and deep learning then judges whether the emotion of the recalled text is positive or negative, which effectively improves the accuracy of tag recognition.
Fig. 3 is a schematic structural diagram of a computer device according to a preferred embodiment of the present invention for implementing the user tag extraction method.
The computer device 1 may comprise a memory 12, a processor 13 and a bus, and may further comprise a computer program, such as a user tag extraction program, stored in the memory 12 and executable on the processor 13.
Those skilled in the art will appreciate that the schematic diagram is merely an example of the computer device 1 and does not limit it. The computer device 1 may have a bus-type or star-type structure, and may include more or fewer hardware or software components than illustrated, or a different arrangement of components; for example, the computer device 1 may further include an input/output device, a network access device, and the like.
It should be noted that the computer device 1 is merely an example; other existing or future electronic products that are applicable to the present invention also fall within the scope of protection of the present invention and are incorporated herein by reference.
The memory 12 includes at least one type of readable storage medium including flash memory, a removable hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 12 may in some embodiments be an internal storage unit of the computer device 1, such as a removable hard disk of the computer device 1. The memory 12 may also be an external storage device of the computer device 1 in other embodiments, such as a plug-in mobile hard disk, a smart memory card (SMART MEDIA CARD, SMC), a Secure Digital (SD) card, a flash memory card (FLASH CARD) or the like, which are provided on the computer device 1. Further, the memory 12 may also include both an internal storage unit and an external storage device of the computer device 1. The memory 12 may be used not only for storing application software installed in the computer device 1 and various types of data, such as codes of user tag extraction programs, etc., but also for temporarily storing data that has been output or is to be output.
The processor 13 may in some embodiments be composed of integrated circuits, for example a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (Central Processing Unit, CPU), microprocessors, digital processing chips, graphics processors, various control chips, and the like. The processor 13 is the control unit (Control Unit) of the computer device 1; it connects the components of the entire computer device 1 through various interfaces and lines, and performs the various functions of the computer device 1 and processes data by running or executing the programs or modules stored in the memory 12 (for example, the user tag extraction program) and invoking the data stored in the memory 12.
The processor 13 executes the operating system of the computer device 1 and various types of applications installed. The processor 13 executes the application program to implement the steps of the various user tag extraction method embodiments described above, such as the steps shown in fig. 1.
Illustratively, the computer program may be partitioned into one or more modules/units that are stored in the memory 12 and executed by the processor 13 to complete the present invention. The one or more modules/units may be a series of computer readable instruction segments capable of performing the specified functions, which instruction segments describe the execution of the computer program in the computer device 1. For example, the computer program may be partitioned into a cleaning unit 110, an enhancement unit 111, a recall unit 112, a label unit 113, a vectorization unit 114, a training unit 115, a generation unit 116.
The integrated units implemented in the form of software functional modules described above may be stored in a computer readable storage medium. The software functional module is stored in a storage medium, and includes instructions for causing a computer device (which may be a personal computer, a computer device, or a network device, etc.) or a processor (processor) to execute portions of the user tag extraction method according to the embodiments of the present invention.
The modules/units integrated in the computer device 1 may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on this understanding, the present invention may also be implemented by a computer program for instructing a relevant hardware device to implement all or part of the procedures of the above-mentioned embodiment method, where the computer program may be stored in a computer readable storage medium and the computer program may be executed by a processor to implement the steps of each of the above-mentioned method embodiments.
The computer program comprises computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, such as a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created from the use of blockchain nodes, and the like.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
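As a rough illustration of "data blocks linked by cryptographic methods", the following minimal Python sketch chains blocks by storing each block's predecessor hash. It is not the storage design of this patent; the field names `tx`, `prev`, and `hash` and the use of SHA-256 over a JSON payload are assumptions chosen for the example.

```python
import hashlib
import json


def _digest(transactions, prev_hash):
    # Hash the block content together with the previous block's hash.
    payload = json.dumps({"tx": transactions, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_block(transactions, prev_hash):
    """Build a block whose hash covers its transactions and its predecessor."""
    return {"tx": transactions, "prev": prev_hash,
            "hash": _digest(transactions, prev_hash)}


def is_valid(chain):
    """Check every block's own hash and its link to the previous block."""
    for i, block in enumerate(chain):
        if block["hash"] != _digest(block["tx"], block["prev"]):
            return False
        if i > 0 and block["prev"] != chain[i - 1]["hash"]:
            return False
    return True


genesis = make_block(["label library created"], prev_hash="")
second = make_block(["tag added for user 42"], prev_hash=genesis["hash"])
chain = [genesis, second]
```

Changing the content of any earlier block changes its hash, so the link stored in the following block no longer matches and `is_valid` returns `False`.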
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, etc. For ease of illustration, only one straight line is shown in fig. 3, but this does not mean that there is only one bus or only one type of bus. The bus is arranged to enable connection and communication between the memory 12, the at least one processor 13, and other components.
Although not shown, the computer device 1 may further comprise a power source (such as a battery) for powering the various components. Preferably, the power source is logically connected to the at least one processor 13 via a power management device, so that charge management, discharge management, and power consumption management are achieved through the power management device. The power source may also include one or more of a direct current or alternating current power supply, a recharging device, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like. The computer device 1 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be described in detail herein.
Further, the computer device 1 may also comprise a network interface, which optionally comprises a wired interface and/or a wireless interface (e.g. a Wi-Fi interface, a Bluetooth interface, etc.) and is typically used for establishing a communication connection between the computer device 1 and other computer devices.
The computer device 1 may optionally further comprise a user interface, which may be a display or an input unit such as a keyboard, and may also be a standard wired interface or a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be referred to as a display screen or display unit, and is used for displaying information processed in the computer device 1 and for displaying a visual user interface.
It should be understood that the embodiments described are for illustrative purposes only, and that the scope of the patent application is not limited by this configuration.
Fig. 3 shows only a computer device 1 with components 12-13, it being understood by those skilled in the art that the structure shown in fig. 3 is not limiting of the computer device 1 and may include fewer or more components than shown, or may combine certain components, or a different arrangement of components.
With reference to fig. 1, the memory 12 in the computer device 1 stores a plurality of instructions for implementing a user tag extraction method, and the processor 13 can execute the plurality of instructions to implement:
acquiring an initial sample, and performing text cleaning on the initial sample to obtain a first sample set;
performing text enhancement on the samples in the first sample set to obtain a second sample set;
performing sensitive word recall processing on the samples in the second sample set to obtain training samples;
performing label processing on the samples in the training samples to obtain label samples;
vectorizing the label samples to obtain a vector set;
training a preset classification network by using the vector set to obtain a user tag extraction model;
when receiving a text to be processed corresponding to a target user, converting the text to be processed into a vector to be processed, inputting the vector to be processed into the user tag extraction model, and generating a public opinion tag of the target user according to the output of the user tag extraction model.
Specifically, the specific implementation method of the above instructions by the processor 13 may refer to the description of the relevant steps in the corresponding embodiment of fig. 1, which is not repeated herein.
In the several embodiments provided in the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. The units or means stated in the invention may also be implemented by one unit or means, either by software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (9)

1. A user tag extraction method, characterized by comprising the following steps:
acquiring an initial sample, and performing text cleaning on the initial sample to obtain a first sample set;
performing text enhancement on the samples in the first sample set to obtain a second sample set;
performing sensitive word recall processing on the samples in the second sample set to obtain training samples;
performing label processing on the samples in the training samples to obtain label samples;
vectorizing the label samples to obtain a vector set;
training a preset classification network by using the vector set to obtain a user tag extraction model;
when receiving a text to be processed corresponding to a target user, converting the text to be processed into a vector to be processed, inputting the vector to be processed into the user tag extraction model, and generating a public opinion tag of the target user according to the output of the user tag extraction model;
wherein performing text cleaning on the initial sample to obtain the first sample set comprises:
configuring a plurality of windows of specified sizes;
scanning each text in the first sample set with each of the windows;
when identical text is scanned, determining the scanned identical text to be a group of repeated words, and retaining one word of each group of repeated words in the first sample set to obtain a first intermediate set;
extracting digital texts and time texts in the first intermediate set by using a regular expression;
replacing the digital texts with a first preset value, and replacing the time texts with a second preset value, to obtain a second intermediate set;
performing word segmentation on the texts in the second intermediate set to obtain a third intermediate set;
invoking a pre-configured stop word dictionary, and querying the stop word dictionary with the texts in the third intermediate set;
deleting from the third intermediate set the queried words that match words in the stop word dictionary, to obtain the first sample set;
wherein the stop word dictionary comprises certain particles, certain prepositions, and colloquial expressions.
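The cleaning steps recited above can be sketched roughly as follows. This is a minimal Python illustration, not the patented implementation: whitespace splitting stands in for a real word segmenter, and the placeholder tokens `<NUM>` and `<TIME>`, the stop word list, the window sizes, and both regular expressions are assumptions made for the example.

```python
import re

# Hypothetical preset values and stop words; the claim does not fix them.
NUM_TOKEN = "<NUM>"
TIME_TOKEN = "<TIME>"
STOP_WORDS = {"um", "uh", "the", "of"}

# Assumed patterns: clock times such as 10:30 or 10:30:05, and plain numbers.
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")
NUM_RE = re.compile(r"\b\d+(?:\.\d+)?\b")


def collapse_repeats(tokens, window_sizes=(1, 2, 3)):
    """Scan with each window size and keep one copy of adjacent repeats."""
    for size in window_sizes:
        out, i = [], 0
        while i < len(tokens):
            window = tokens[i:i + size]
            if len(window) == size and tokens[i + size:i + 2 * size] == window:
                i += size  # drop this copy; the identical copy after it is kept
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens


def clean_text(text):
    """Deduplicate, replace times and numbers, segment, drop stop words."""
    deduped = " ".join(collapse_repeats(text.lower().split()))
    # Times are replaced first so the number pattern cannot split them.
    replaced = NUM_RE.sub(NUM_TOKEN, TIME_RE.sub(TIME_TOKEN, deduped))
    return [tok for tok in replaced.split() if tok not in STOP_WORDS]
```

For example, `clean_text("um the app is very very slow at 10:30 and charged 200 yuan")` keeps a single "very", maps "10:30" and "200" to the two placeholder tokens, and drops "um" and "the".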
2. The method of claim 1, wherein performing text enhancement on the samples in the first sample set to obtain a second sample set comprises:
obtaining a pre-constructed fault-tolerant dictionary;
querying the fault-tolerant dictionary with each sample in the first sample set, and determining the queried text that matches each sample as the enhanced text of that sample;
adding the enhanced text of each sample to the first sample set to obtain the second sample set.
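One possible reading of this enhancement step is sketched below in Python. The structure of the fault-tolerant dictionary is not specified in the claim; here it is assumed to map a word to a list of variant spellings, and the entries shown are hypothetical.

```python
# Hypothetical fault-tolerant dictionary: word -> common variants or typos.
FAULT_TOLERANT_DICT = {
    "refund": ["refnd", "re-fund"],
    "scam": ["scamm"],
}


def enhance(samples, dictionary=FAULT_TOLERANT_DICT):
    """Return the original samples plus one variant sample per dictionary hit."""
    enhanced = [list(tokens) for tokens in samples]
    for tokens in samples:
        for i, token in enumerate(tokens):
            for variant in dictionary.get(token, []):
                # Replace only the matched word; the rest of the sample is kept.
                enhanced.append(tokens[:i] + [variant] + tokens[i + 1:])
    return enhanced
```

Under this assumption, a sample containing "refund" yields two additional samples, one per variant, while samples without any dictionary entry are passed through unchanged.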
3. The method of claim 1, wherein performing sensitive word recall processing on the samples in the second sample set to obtain training samples comprises:
obtaining pre-configured public opinion categories, and obtaining the sensitive words of each public opinion category;
constructing a regular expression corresponding to each public opinion category according to the sensitive words of that public opinion category;
identifying the public opinion category of each sample in the second sample set;
acquiring the regular expression corresponding to the public opinion category of each sample, and traversing each sample with that regular expression to obtain candidate sensitive words of each sample;
repairing the candidate sensitive words of each sample to obtain the training samples.
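The recall step can be illustrated with one compiled pattern per category. The following Python sketch makes several assumptions: the category names and sensitive words are invented, the category of a sample is supplied by the caller rather than identified automatically, and the unspecified "repairing" step is reduced to order-preserving deduplication as a placeholder.

```python
import re

# Hypothetical public opinion categories and their sensitive words.
SENSITIVE_WORDS = {
    "complaint": ["refund", "scam", "sue"],
    "service": ["rude", "slow"],
}

# One alternation pattern per category; longer words are tried first.
PATTERNS = {
    category: re.compile(
        r"\b(?:" + "|".join(
            re.escape(word) for word in sorted(words, key=len, reverse=True)
        ) + r")\b"
    )
    for category, words in SENSITIVE_WORDS.items()
}


def recall(sample_text, category):
    """Return the candidate sensitive words found in the sample, in order."""
    hits = PATTERNS[category].findall(sample_text)
    # Placeholder for the repair step: drop duplicates, keep first occurrence.
    return list(dict.fromkeys(hits))
```

The word boundaries keep "sue" from matching inside "issue"; for unsegmented Chinese text a different boundary strategy would be needed.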
4. The user tag extraction method of claim 1, wherein performing label processing on the samples in the training samples to obtain label samples comprises:
identifying emotion words with an emotion orientation in the training samples;
deleting samples without emotion words from the training samples to obtain preliminary screening samples;
adjusting the ratio of positive samples to negative samples in the preliminary screening samples to a preset ratio through undersampling, wherein the positive samples are samples with emotion words having a positive emotion orientation, and the negative samples are samples with emotion words having a negative emotion orientation;
sending the positive samples and the negative samples to a designated platform for labeling;
constructing the label samples from the labeled positive samples and the labeled negative samples fed back by the designated platform.
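The screening and undersampling steps can be sketched as follows. This Python example is illustrative only: the emotion word lists are hypothetical, a sample containing both kinds of words is simply treated as negative, and the preset ratio is expressed as positives per negative.

```python
import random

# Hypothetical emotion lexicons.
POSITIVE_WORDS = {"great", "helpful"}
NEGATIVE_WORDS = {"rude", "scam"}


def polarity(tokens):
    """Return 'neg', 'pos', or None when the sample has no emotion word."""
    if any(tok in NEGATIVE_WORDS for tok in tokens):
        return "neg"
    if any(tok in POSITIVE_WORDS for tok in tokens):
        return "pos"
    return None


def screen_and_balance(samples, pos_per_neg=1.0, seed=0):
    """Drop samples without emotion words, then undersample to the ratio."""
    pos = [s for s in samples if polarity(s) == "pos"]
    neg = [s for s in samples if polarity(s) == "neg"]
    rng = random.Random(seed)
    max_pos = int(len(neg) * pos_per_neg)
    if len(pos) > max_pos:
        # Too many positives for the target ratio: sample them down.
        pos = rng.sample(pos, max_pos)
    else:
        # Otherwise sample the negatives down to match the positives.
        neg = rng.sample(neg, min(len(neg), int(len(pos) / pos_per_neg)))
    return pos, neg
```

The two returned lists are what would then be sent to the labeling platform; that external step is outside the scope of this sketch.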
5. The method of claim 1, wherein vectorizing the label samples to obtain a vector set comprises:
converting the words composing each sample in the label samples into vectors of a specified dimension by using a word2vec algorithm;
vertically concatenating the vectors corresponding to the words composing each sample to obtain a text feature matrix of each sample;
combining the text feature matrices of the samples to obtain the vector set.
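The stacking of word vectors into a text feature matrix can be shown without training a model. In the Python sketch below, a small hand-written lookup table stands in for trained word2vec embeddings; the table contents, the dimension of 3, and the zero vector used for unknown words are all assumptions.

```python
# Stand-in for trained word2vec embeddings (hypothetical values, dimension 3).
DIM = 3
EMBEDDINGS = {
    "app": [0.1, 0.2, 0.3],
    "slow": [0.4, 0.5, 0.6],
}


def text_matrix(tokens):
    """Stack one row per word, giving a len(tokens) x DIM feature matrix."""
    return [list(EMBEDDINGS.get(tok, [0.0] * DIM)) for tok in tokens]


def vector_set(samples):
    """Combine the feature matrices of all samples into the vector set."""
    return [text_matrix(tokens) for tokens in samples]
```

Each sample therefore becomes a matrix with as many rows as it has words; a real classification network would usually also need padding or truncation to a fixed number of rows, which the claim does not describe.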
6. The user tag extraction method of claim 1, wherein after generating the public opinion tag of the target user according to the output of the user tag extraction model, the method further comprises:
acquiring a current service scene, and when the current service scene is an experience feedback scene, locating a target problem according to the public opinion tag of the target user, and sending the target problem to a designated terminal device; or when the current service scene is a consultation scene, connecting a designated customer service agent, and sending prompt information to the designated customer service agent, wherein the prompt information is used for prompting the designated customer service agent to help soothe the emotion of the target user;
adding the public opinion tag of the target user to a specified tag library.
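The scene-dependent handling can be pictured as a small dispatcher. The Python sketch below is hypothetical throughout: the scene names, the `outbox` list that stands in for sending messages, and the message wording are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Dispatcher:
    """Routes a generated public opinion tag according to the service scene."""
    tag_library: list = field(default_factory=list)
    outbox: list = field(default_factory=list)  # stands in for real delivery

    def handle(self, scene, user_id, tag):
        if scene == "experience_feedback":
            # Locate the problem from the tag and notify the terminal device.
            self.outbox.append(("terminal_device", f"target problem: {tag}"))
        elif scene == "consultation":
            # Prompt the designated customer service agent to soothe the user.
            self.outbox.append(("customer_service",
                                f"please help soothe user {user_id}"))
        # In every scene the tag is stored in the tag library.
        self.tag_library.append((user_id, tag))
```

A call such as `Dispatcher().handle("consultation", "u42", "negative")` would queue one prompt for customer service and record the tag.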
7. A user tag extraction apparatus, characterized in that the user tag extraction apparatus comprises:
a cleaning unit, used for acquiring an initial sample, and performing text cleaning on the initial sample to obtain a first sample set;
an enhancement unit, used for performing text enhancement on the samples in the first sample set to obtain a second sample set;
a recall unit, used for performing sensitive word recall processing on the samples in the second sample set to obtain training samples;
a label unit, used for performing label processing on the samples in the training samples to obtain label samples;
a vectorization unit, used for vectorizing the label samples to obtain a vector set;
a training unit, used for training a preset classification network by using the vector set to obtain a user tag extraction model;
a generating unit, used for converting a text to be processed into a vector to be processed when receiving the text to be processed corresponding to a target user, inputting the vector to be processed into the user tag extraction model, and generating a public opinion tag of the target user according to the output of the user tag extraction model;
wherein performing text cleaning on the initial sample to obtain the first sample set comprises:
configuring a plurality of windows of specified sizes;
scanning each text in the first sample set with each of the windows;
when identical text is scanned, determining the scanned identical text to be a group of repeated words, and retaining one word of each group of repeated words in the first sample set to obtain a first intermediate set;
extracting digital texts and time texts in the first intermediate set by using a regular expression;
replacing the digital texts with a first preset value, and replacing the time texts with a second preset value, to obtain a second intermediate set;
performing word segmentation on the texts in the second intermediate set to obtain a third intermediate set;
invoking a pre-configured stop word dictionary, and querying the stop word dictionary with the texts in the third intermediate set;
deleting from the third intermediate set the queried words that match words in the stop word dictionary, to obtain the first sample set;
wherein the stop word dictionary comprises certain particles, certain prepositions, and colloquial expressions.
8. A computer device, the computer device comprising:
a memory storing at least one instruction; and a processor executing instructions stored in the memory to implement the user tag extraction method of any one of claims 1 to 6.
9. A computer-readable storage medium, characterized by: the computer-readable storage medium having stored therein at least one instruction for execution by a processor in a computer device to implement the user tag extraction method of any one of claims 1 to 6.
CN202110851246.XA 2021-07-27 2021-07-27 User tag extraction method, device, equipment and medium Active CN113553431B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110851246.XA CN113553431B (en) 2021-07-27 2021-07-27 User tag extraction method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110851246.XA CN113553431B (en) 2021-07-27 2021-07-27 User tag extraction method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN113553431A CN113553431A (en) 2021-10-26
CN113553431B CN113553431B (en) 2024-05-10

Family

ID=78104567

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110851246.XA Active CN113553431B (en) 2021-07-27 2021-07-27 User tag extraction method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113553431B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114239591B (en) * 2021-12-01 2023-08-18 马上消费金融股份有限公司 Sensitive word recognition method and device
CN114943228B (en) * 2022-06-06 2023-11-24 北京百度网讯科技有限公司 Training method of end-to-end sensitive text recall model and sensitive text recall method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444326A (en) * 2020-03-30 2020-07-24 腾讯科技(深圳)有限公司 Text data processing method, device, equipment and storage medium
CN112100332A (en) * 2020-09-14 2020-12-18 腾讯科技(深圳)有限公司 Word embedding expression learning method and device and text recall method and device
CN112686022A (en) * 2020-12-30 2021-04-20 平安普惠企业管理有限公司 Method and device for detecting illegal corpus, computer equipment and storage medium
CN112699232A (en) * 2019-10-17 2021-04-23 北京京东尚科信息技术有限公司 Text label extraction method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20220043135A (en) * 2019-07-12 2022-04-05 삼성전자주식회사 Method and apparatus for generating structured relationship information based on text input

Also Published As

Publication number Publication date
CN113553431A (en) 2021-10-26

Similar Documents

Publication Publication Date Title
Wu et al. Ai-generated content (aigc): A survey
US20200226212A1 (en) Adversarial Training Data Augmentation Data for Text Classifiers
US20210150398A1 (en) Conversational interchange optimization
CN108304468A (en) A kind of file classification method and document sorting apparatus
CN113553431B (en) User tag extraction method, device, equipment and medium
US20220358292A1 (en) Method and apparatus for recognizing entity, electronic device and storage medium
US10831990B1 (en) Debiasing textual data while preserving information
US10339534B2 (en) Segregation of chat sessions based on user query
US11074043B2 (en) Automated script review utilizing crowdsourced inputs
CN113360654B (en) Text classification method, apparatus, electronic device and readable storage medium
US20230092274A1 (en) Training example generation to create new intents for chatbots
CN110377733A (en) A kind of text based Emotion identification method, terminal device and medium
US20230237276A1 (en) System and Method for Incremental Estimation of Interlocutor Intents and Goals in Turn-Based Electronic Conversational Flow
US20190164061A1 (en) Analyzing product feature requirements using machine-based learning and information retrieval
CN113204698B (en) News subject term generation method, device, equipment and medium
CN114420168A (en) Emotion recognition method, device, equipment and storage medium
CN113379432A (en) Sales system customer matching method based on machine learning
Singh et al. Knowing what and how: a multi-modal aspect-based framework for complaint detection
CN116450797A (en) Emotion classification method, device, equipment and medium based on multi-modal dialogue
CN114548114B (en) Text emotion recognition method, device, equipment and storage medium
CN114528851B (en) Reply sentence determination method, reply sentence determination device, electronic equipment and storage medium
CN115718807A (en) Personnel relationship analysis method, device, equipment and storage medium
CN115510219A (en) Method and device for recommending dialogs, electronic equipment and storage medium
US20220343073A1 (en) Quantitative comment summarization
CN114492446A (en) Legal document processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant