CN116805147B - Text labeling method and device applied to urban brain natural language processing - Google Patents

Text labeling method and device applied to urban brain natural language processing

Info

Publication number: CN116805147B
Application number: CN202310204225.8A
Authority: CN (China)
Other versions: CN116805147A (Chinese, zh)
Inventors: 申永生, 陈冲杰, 叶晓华, 凌从礼
Assignee (original and current): Hangzhou City Brain Co., Ltd.
Legal status: Active (the legal status is an assumption; Google has not performed a legal analysis)
Prior art keywords: emotion, word, clause, text, natural language
Abstract

The invention provides a text labeling method and device applied to urban brain natural language processing. Emotion labeling is performed on the basis of an emotion keyword set, and the emotion intensity of the natural language text is then judged. For an emotionally intense text, during business labeling each clause is intersected with the union of a business keyword set and the emotion keyword set, yielding intersection elements that contain both emotion keywords and business keywords and form a business keyword sequence. The business keyword sequence is input into a trained model for classification; texts with the same or similar semantics but different emotion intensities are labeled accurately according to whether emotion keywords are present in the sequence, and the class with the highest confidence output by the model is taken as the business label of the natural language text.

Description

Text labeling method and device applied to urban brain natural language processing
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a text labeling method and device applied to urban brain natural language processing and electronic equipment.
Background
The urban brain is the product of combining the Internet brain architecture with smart-city construction; it is a city-level, brain-like, complex intelligent giant system. With the joint participation of human intelligence and machine intelligence, and supported by frontier technologies such as the Internet of Things, big data, artificial intelligence, edge computing, 5G, cloud robots and digital twins, the urban neuron network and the urban cloud reflex arc are the key points of urban brain construction. The urban brain improves the operating efficiency of a city, solves the complex problems faced in its operation, and better meets the different needs of each member of the city.
The urban brain is an intelligent system that takes the information generated by urban operation as its input. Urban operation not only produces massive data but also produces data in non-uniform formats, so how to obtain effective information from disordered information has become a research hotspot in the industry. Text classification is one of the most basic tasks in the field of Natural Language Processing (NLP); it can effectively screen information and has important applications in information retrieval, automatic summarization and other areas. Current text classification focuses mainly on classifying the business type of a text; emotion classification is rarely involved, and emotion classification and business classification are performed independently of each other.
With the continuous spread of the Internet of Things, public opinion information concerning people's daily lives is increasingly collected by the relevant departments in data form. Such information includes the feedback itself as well as the feedback provider's emotion and sense of urgency about the business concerned, so both the business category and the emotion category of such text need to be analyzed in order to better guide the relevant departments in solving problems quickly and in an orderly manner. However, since such public opinion information usually contains meaningless redundant words, directly using the original text as the corpus for model training in business classification ignores the influence of redundant words on classification accuracy, resulting in poor accuracy or failure to classify at all. A large number of meaningless redundant words also makes emotion classification much harder, and the separation of emotion labeling from business labeling makes it difficult for the information recipient to judge the importance of massive amounts of information. As a result, public opinion information concerning daily life is currently labeled mainly by hand, which consumes a great deal of human resources.
Disclosure of Invention
The invention provides a text labeling method, a text labeling device and electronic equipment applied to urban brain natural language processing, and aims to overcome the defects of the prior art.
In order to achieve the above object, the present invention provides a text labeling method applied to urban brain natural language processing, comprising:
preprocessing the obtained natural language text, including clause segmentation and word segmentation processing of each clause;
traversing the word segmentation results of each clause and screening, based on a preset part-of-speech set, the part of speech of each single word in the segmentation results, so as to generate a business word set and an emotion word set respectively;
extracting a business keyword set and an emotion keyword set of a text from the business word set and the emotion word set respectively;
matching each clause with the emotion keyword set to obtain a clause emotion word sequence corresponding to each clause, and matching each emotion word in the clause emotion word sequence with the emotion dictionary to obtain an emotion value corresponding to each emotion word; obtaining emotion total scores of the natural language text based on emotion values of the emotion words;
matching the emotion total score of the natural language text with a preset emotion threshold value to mark an emotion label and judging whether the natural language text is an emotion strong text or not based on a strong emotion threshold value;
if the current natural language text is judged to be an emotionally intense text, the emotion words are considered to influence business label marking, and each clause is intersected with the union of the business keyword set and the emotion keyword set to obtain intersection elements containing emotion keywords and business keywords, forming a business keyword sequence; if the current natural language text is judged not to be emotionally intense, the emotion words are considered not to affect business label marking, and each clause is intersection-matched with the business keyword set to obtain intersection elements containing business keywords, forming a business keyword sequence;
inputting the business keyword sequence into a trained FastText model for classification, so that texts with the same or similar business keywords but different emotion intensities are labeled separately according to whether emotion keywords exist in the business keyword sequence; meanwhile, the class with the highest confidence output by the FastText model is taken as the business class of the natural language text to obtain the business label.
According to one embodiment of the invention, in calculating emotion values for each emotion word within a sequence of clause emotion words:
judging the part of speech of each emotion word in the clause emotion word sequence to determine whether the current clause contains emotion degree words, wherein the emotion degree words comprise auxiliary words, dynamic adverbs and adverbs;
if judging that the current clause only comprises one or more single adjectives and has no emotion degree word, calculating emotion values according to a preset first calculation rule only related to the single adjectives; if judging that the current clause contains the emotion degree word, calculating an emotion value by combining emotion degree word weights on the basis of one or more single adjectives according to a second calculation rule.
According to one embodiment of the invention, when the current clause is judged to contain emotion degree words, the single adjective that is closest to each emotion degree word and appears behind it is obtained on the basis of the word spacing distance, and the emotion value of that nearest single adjective is updated according to the weight of the emotion degree word.
According to one embodiment of the invention, each single adjective in the clause is taken as a node, the clause is windowed with a sliding window M between adjacent nodes, each emotion degree word is matched, with the sliding window M as the unit of measure, to the nearest single adjective appearing behind it, and the emotion value of the single adjective behind the emotion degree word is updated according to the weight of the emotion degree word.
According to one embodiment of the invention, when the emotion value of each emotion word in the clause emotion word sequence is calculated, whether the emotion words in the clause emotion word sequence contain conjunctions or not is judged; if yes, fusing the interlinking weight on the basis of the first calculation rule or the second calculation rule.
According to an embodiment of the present invention, preprocessing the obtained natural language text includes:
dividing the obtained natural language text into a plurality of clauses and constructing a clause set T = {S_1, S_2, …, S_n};
performing word segmentation on each clause S_i in the clause set to obtain segmentation results S_i = {W_1 = {w_1, p_1}, W_2, …, W_n}, where each segmentation result W_i comprises a segmented single word w_i and the part of speech p_i of that word;
filtering meaningless stop words out of each clause S_i through a preset stop-word set ST.
According to one embodiment of the invention, when the emotion word set is obtained, the corresponding emotion keyword set is extracted by the following steps:
based on the co-occurrence relation among emotion words, constructing a candidate emotion keyword undirected weighted graph with the emotion words w_i as nodes and with edges between similar words appearing within a sliding window H;
iteratively propagating the weights of all nodes until convergence according to the following formula to obtain the candidate emotion keyword weight set TRE:

TRE(w_i) = (1 − d) + d × Σ_{w_j ∈ In(w_i)} [ WE_ji / Σ_{w_k ∈ Out(w_j)} WE_jk ] × TRE(w_j)

where TRE(w_i) is the weight of word w_i; d is the damping coefficient, set to 0.85; In(w_i) is the set of nodes pointing to w_i; Out(w_i) is the set of nodes w_i points to; WE_ji is the connection weight from node w_j to node w_i; WE_jk is the connection weight from node w_j to node w_k; and TRE(w_j) is the weight of word w_j;
sorting the obtained weight value set TRE of the candidate emotion keywords in a descending order according to the weight value to obtain an emotion keyword set KWE;
and extracting the business keyword set KWB from the business word set by the same steps.
On the other hand, the invention also provides a text labeling device applied to urban brain natural language processing, which comprises a preprocessing unit, a word set screening unit, a keyword extraction unit, an emotion value calculation unit, an emotion labeling unit, a business parameter extraction unit and a model output unit. The preprocessing unit preprocesses the obtained natural language text, including clause segmentation and word segmentation processing of each clause. The word set screening unit is used for screening the parts of speech of a single word in each word segmentation result based on the preset part of speech set to traverse the word segmentation result of each clause so as to respectively generate a business word set and an emotion word set. And the keyword extraction unit is used for extracting the business keyword set and the emotion keyword set of the text from the business word set and the emotion word set respectively. The emotion value calculation unit matches each clause with the emotion keyword set to obtain a clause emotion word sequence corresponding to each clause, and matches each emotion word in the clause emotion word sequence with the emotion dictionary to obtain an emotion value corresponding to each emotion word; and obtaining the emotion total score of the natural language text based on the emotion values of the plurality of emotion words. And the emotion marking unit is used for matching the emotion total score of the natural language text with a preset emotion threshold value so as to mark an emotion label and judging whether the natural language text is an emotion strong text or not based on the strong emotion threshold value. 
If the emotion labeling unit judges that the current natural language text is emotionally intense, the emotion words are considered to influence business label marking, and the business parameter extraction unit intersects each clause with the union of the business keyword set and the emotion keyword set to obtain intersection elements containing emotion keywords and business keywords, forming a business keyword sequence; if the text is judged not to be emotionally intense, the emotion words are considered not to affect business label marking, and the business parameter extraction unit intersection-matches each clause with the business keyword set to obtain intersection elements containing business keywords, forming a business keyword sequence. The model output unit inputs the business keyword sequence into a trained FastText model for classification, labels texts with the same or similar business keywords but different emotion intensities according to whether emotion keywords exist in the sequence, and takes the class with the highest confidence output by the FastText model as the business class of the natural language text to obtain the business label.
In another aspect, the present invention also provides an electronic device including one or more processors and a storage device. The storage device is used for storing one or more programs. The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the text labeling method described above as being applied to urban brain natural language processing.
In summary, the text labeling method applied to urban brain natural language processing provided by the invention performs business labeling and emotion labeling separately on each natural language text, realizing a multidimensional display of text information. The business label is marked on the basis of business keywords; for events that are the same but differ in emotional intensity, such as a suggestion versus a complaint about the same uncivilized incident, using business keywords alone would assign the same label. The invention therefore retains the emotion keywords of emotionally intense texts and uses them as business keywords to assist business classification, which helps improve classification accuracy, makes it easy to quickly screen out severe events, and improves event-handling capability. Furthermore, an emotion value calculation scheme based on emotion degree words and conjunctions is proposed; it considers emotion from multiple part-of-speech dimensions so as to accurately extract emotion labels from complex public-opinion texts. In addition, each clause is screened and matched against the constructed business keyword set and emotion keyword set, and redundant words are removed to form the clause emotion word sequences and business keyword sequences; this effectively resolves the influence of numerous redundant words on classification labels and improves classification accuracy, while also avoiding the waste of computing resources caused by a huge corpus.
The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of preferred embodiments, as illustrated in the accompanying drawings.
Drawings
Fig. 1 is a flowchart of a text labeling method applied to urban brain natural language processing according to an embodiment of the present invention.
Fig. 2 shows a specific step of calculating the emotion value of the emotion word in step S40 in fig. 1.
Fig. 3 is a schematic structural diagram of a text labeling device applied to urban brain natural language processing according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The text labeling method applied to urban brain natural language processing can be used in computer equipment. In one possible implementation, the computer device may be a terminal, which may be a mobile phone, a computer, a tablet computer, or other types of terminals. In another possible implementation, the computer device may include a server and a terminal.
Fig. 1 shows a text labeling method applied to urban brain natural language processing, which includes:
step S10, preprocessing the obtained natural language text, wherein the preprocessing comprises clause segmentation and word segmentation processing of each clause. In this embodiment, description will be given taking, as an example, a natural language text, which includes a plurality of redundant words and has various emotion tendencies. However, the present invention is not limited in any way thereto. In other embodiments, the text labeling method applied to urban brain natural language processing provided by the invention is also applicable to classification labeling of natural language texts such as internet vocabulary entries, medical information vocabulary entries and the like.
For the preprocessing in step S10, the specific flow includes:
step 101: each natural language text is segmented to obtain clause S n Construct clause set t= { S 1 ,S 2 ,…,S n }。
Step 102: each clause in the set is segmented and part-of-speech labeled by using a JieBa algorithm, and then S= { W 1 ={w 1 ,p 1 },W 2 ,…,W n }. Wherein: w (W) i Representing the word segmentation result, w represents a single word after word segmentation, and p represents the part of speech of the word.
Step 102: filtering clauses S by a preset stop word set ST i Nonsensical stop words; i.e. arbitrary vocabulary w.epsilon.ST andif the clause set T is empty after screening, the text is considered to be meaningless, and the labeling processing is not performed any more.
Step S20 is executed after the segmentation results of each clause are obtained through preprocessing: based on preset part-of-speech sets, traverse the segmentation results of each clause and screen the part of speech of each single word, so as to generate a business word set TB and an emotion word set TE from the same clause set T. Specifically, each segmentation result W_i in every clause of T is traversed, and its part of speech p_i is screened against a preset business part-of-speech set PB, requiring p_i ∈ PB; that is, for W_i = {w_i, p_i}, if p_i ∈ PB, then w_i is put into the business word set TB, and otherwise w_i is regarded as noise and ignored. After traversal the business word set TB = {wb_1, wb_2, …, wb_n} is obtained. Similarly, each segmentation result W_i is screened through the emotion part-of-speech set PE, requiring p_i ∈ PE, to obtain the emotion word set TE = {we_1, we_2, …, we_n}. In this embodiment, PB = {n, s, v, ns, vn, nt}, where n denotes a common noun, s a place noun, v a common verb, ns a place name, vn a verbal noun, and nt an organization name.
PE = {a, u, d, vd, c}, where a denotes an adjective, u an auxiliary word, d an adverb, vd an adverbial verb, and c a conjunction.
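With the part-of-speech sets PB and PE above, the screening of step S20 can be sketched as follows; the sample clause tokens are illustrative assumptions.

```python
PB = {"n", "s", "v", "ns", "vn", "nt"}  # business parts of speech
PE = {"a", "u", "d", "vd", "c"}         # emotion parts of speech

def screen_by_pos(segmented_clauses, pos_set):
    # Traverse every segmentation result W_i = {w_i, p_i}; keep w_i when
    # p_i is in the given part-of-speech set, otherwise treat it as noise.
    return [w for clause in segmented_clauses
              for (w, p) in clause if p in pos_set]

# One hypothetical clause: 单车 "bicycle"/noun, 非常 "very"/adverb,
# 生气 "angry"/adjective
clauses = [[("单车", "n"), ("非常", "d"), ("生气", "a")]]
TB = screen_by_pos(clauses, PB)  # business word set
TE = screen_by_pos(clauses, PE)  # emotion word set
```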
After obtaining the service vocabulary TB and the emotion vocabulary TE, step S30 is performed: and extracting keywords aiming at the two word sets respectively to form a business keyword set KWB and an emotion keyword set KWE. The embodiment provides a keyword extraction method based on a TextRank algorithm, which is specifically as follows:
step S301, according to co-occurrence relation of co-current emotion words, emotion words we are used i And constructing a candidate emotion keyword undirected weighted graph GE= (TE, E) for the nodes and based on similar words appearing in the sliding window H, wherein E represents a non-empty finite set of each edge between emotion word sets TE. Iteratively propagating weights of all nodes until convergence to obtain a candidate emotion keyword weight value set TRE, wherein the calculation formula is as follows:
wherein TRE (we i ) Is node we i Weights of (2); d represents a damping coefficient and is set to 0.85; in (we) i ) Representative pointing we i A node set; out (we) i ) Representative we i The set of nodes pointed to; WEE (web-defined element) ji Representative node we i To node we j Is a connection weight of (2); WEE (web-defined element) jk Representative node we j To node we k Is a connection weight of (2); TRE (we) j ) Is the word we j Is a weight value of (a).
Similarly, according to co-occurrence relation among the emotion business words, business words wb are used i And constructing a candidate business keyword undirected weighted graph GB= (TB, B) for the nodes and based on similar words appearing in the sliding window H, wherein B represents a non-empty finite set of each edge between the business word sets TB. Iteratively propagating weights of all nodes until convergence to obtain a candidate business keyword weight value set TRB, wherein the calculation formula is as follows:
wherein TRB (wb) i ) For node wb i Weights of (2); d represents a damping coefficient and is set to 0.85; in (wb) i ) Representative pointing direction wb i A node set; out (wb) i ) Represents wb i The set of nodes pointed to; web (WEB) ji Representative node wb i To node wb j Is a connection weight of (2); web (WEB) jk Representative node wb j To node wb k Is a connection weight of (2); TRB (wb) j ) For the word wb j Is a weight value of (a).
Step S302: sort the candidate emotion keyword weight set TRE and the candidate business keyword weight set TRB obtained in step S301 in descending order of weight (i.e., TextRank value), and take the top 50 words of each as the final keyword sets, obtaining the emotion keyword set KWE and the business keyword set KWB. If a set contains fewer than 50 candidates, all candidate words in the set are taken as keywords.
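The weight-propagation step can be sketched as a plain-Python iteration. The graph contents and weights below are illustrative; the patent builds edges from co-occurrence within the sliding window H.

```python
def textrank(nodes, edges, d=0.85, max_iter=100, tol=1e-6):
    # edges maps a directed pair (w_j, w_i) to its connection weight WE_ji;
    # an undirected co-occurrence edge is entered in both directions.
    tr = {w: 1.0 for w in nodes}
    out_weight = {w: 0.0 for w in nodes}
    for (j, i), w in edges.items():
        out_weight[j] += w
    for _ in range(max_iter):
        new = {}
        for i in nodes:
            # TR(w_i) = (1 - d) + d * sum_j (WE_ji / sum_k WE_jk) * TR(w_j)
            incoming = sum(w * tr[j] / out_weight[j]
                           for (j, tgt), w in edges.items()
                           if tgt == i and out_weight[j] > 0)
            new[i] = (1 - d) + d * incoming
        delta = max(abs(new[w] - tr[w]) for w in nodes)
        tr = new
        if delta < tol:
            break
    return tr

# Tiny illustrative graph: "b" co-occurs with both "a" and "c"
scores = textrank(["a", "b", "c"],
                  {("a", "b"): 1.0, ("b", "a"): 1.0,
                   ("b", "c"): 1.0, ("c", "b"): 1.0})
```

Sorting `scores` in descending order and keeping the top entries then yields the keyword set, as in step S302.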
After obtaining the emotion keyword set KWE, executing step S40, matching each clause with the emotion keyword set to obtain a clause emotion word sequence corresponding to each clause, and matching each emotion word in the clause emotion word sequence with an emotion dictionary to obtain an emotion value corresponding to each emotion word; and obtaining emotion scores of the natural language text based on emotion values of the plurality of emotion words. In this embodiment, after the emotion value of each emotion word is obtained, emotion values of a plurality of emotion words together form emotion scores of the clauses, and emotion scores of the plurality of clauses are summed to obtain emotion scores of the natural language text. However, the present invention is not limited in any way thereto.
Specifically, step S40 includes:
step S401, matching the clause set T in the step S10 with the emotion keyword set KWE, reserving emotion words of each clause, and using we for emotion words of the ith clause i Representation, i.e. we i E, T.U. KWE, obtaining the i-th clause emotion word sequence ET i ,ET i Matching with emotion dictionary, if hit corresponding emotion word, assigning emotion value to emotion word, and calculating emotion value Score of ith clause according to clause emotion calculation rule i
For step S401, the embodiment further provides a method for further improving accuracy of emotion value calculation based on emotion word judgment, which is specifically as follows:
step S4011, judge the part of speech of emotion word in the emotion word sequence of clause in order to confirm whether the current clause includes emotion degree word, emotion degree word includes the auxiliary word, moves adverb and adverb.
Step S4012: if it is determined that the current clause contains only one or more single adjectives and no emotion degree word, the emotion value is calculated according to a preset first calculation rule that involves only the single adjectives:

Score_i = Σ_{j=1}^{N} s_j

where s_j is the emotion value of the j-th single adjective, N is the number of single adjectives in the clause, and Score_i is the emotion value of the i-th clause.
Step S4013: if step S4011 determines that the current clause contains emotion degree words, the emotion value is calculated according to a second calculation rule that combines the emotion degree word weights with the one or more single adjectives. Specifically, each single adjective in the clause is taken as a node, and the clause is windowed with a sliding window M between adjacent nodes. With the sliding window M as the unit of measure, each emotion degree word is matched to the nearest single adjective appearing behind it, and the emotion value of that single adjective is updated by the weight of the emotion degree word. If an emotion degree word, measured by the sliding window M, only matches a single adjective in front of it, it is considered not to influence that adjective's emotion value; the degree word is then ignored and the clause emotion value is computed with the first calculation rule. As the window slides, if a single adjective appears behind the emotion degree word, the degree word is considered to influence the emotion value of that adjective, and the emotion value is computed with the second calculation rule:

Score_i = Σ_{j=1}^{N} ( Π_{k=1}^{K} Weight_jk ) × s_j

where s_j is the emotion value of the j-th single adjective; N is the number of single adjectives; Weight_jk is the weight of the k-th emotion degree word acting on emotion word s_j, obtainable by matching an emotion degree dictionary; K is the number of emotion degree words corresponding to emotion word s_j; and the sliding window size is set to M = 2. However, the present invention places no limitation on the second calculation rule. In other embodiments, the calculation rate may also be improved by counting the emotion degree words contained in a clause and directly assigning corresponding weights according to their number.
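A sketch of the two clause-level rules, using a simple look-back of M positions in place of the sliding-window matching; the dictionaries and their values are illustrative assumptions, not the patent's emotion or degree dictionaries.

```python
def clause_score(words, emotion_values, degree_weights, M=2):
    # First rule: each single adjective contributes its dictionary value.
    # Second rule: a degree word within M positions in front of an adjective
    # scales that adjective's value by its weight; a degree word with no
    # adjective behind it is ignored.
    score = 0.0
    for idx, w in enumerate(words):
        if w in emotion_values:
            s = emotion_values[w]
            for back in range(max(0, idx - M), idx):
                if words[back] in degree_weights:
                    s *= degree_weights[words[back]]
            score += s
    return score

# 生气 "angry" carries value -2.0; 非常 "very" amplifies it by 1.5
with_degree = clause_score(["非常", "生气"], {"生气": -2.0}, {"非常": 1.5})
without_degree = clause_score(["生气"], {"生气": -2.0}, {"非常": 1.5})
```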
In addition, the distance between an emotion degree word and an emotion word may also be calculated and judged through the word spacing distance. For example, when the current clause is judged to contain emotion degree words, the single adjective that is closest to each emotion degree word and appears behind it is obtained based on the word spacing distance, and the emotion value of that nearest single adjective is updated according to the weight of the emotion degree word.
Further, step S401 also includes step S4014: judging whether the emotion words in the clause emotion word sequence contain conjunctions; if so, step S4015 is executed to fuse the conjunction weight on the basis of the first or second calculation rule, specifically:

Score_i' = Score_i × (1 + Weight_ij)

where Weight_ij denotes the weight of the conjunction, Score_i is the emotion value of the i-th clause calculated by the first or second calculation rule, and Score_i' is the emotion value of the i-th clause updated by the conjunction weight. In this embodiment, different weights are assigned according to the nature of the conjunction, specifically:
(1) Adversative (turning) relation
a) If the conjunction emphasizes the preceding clause, Weight_ij = 0.5
b) If the conjunction emphasizes the following clause, Weight_ij = 1.5
(2) Progressive relation
a) If the emotion intensity of the clauses increases across the progressive relation, Weight_ij = 1.5
(3) Coordinate and causal relations
a) For coordinate and causal relations, where the emotion of the preceding and following clauses is the same, Weight_ij = 1.0
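The conjunction fusion of steps S4014-S4015 can be sketched as follows. The conjunctions listed follow the relations above, but the specific word-to-weight mapping is an illustrative assumption.

```python
# Hypothetical conjunction dictionary: 虽然 "although" emphasizes the
# preceding clause (0.5); 但是 "but" emphasizes the following clause and
# 而且 "moreover" is progressive (1.5); 并且 "and" is coordinate/causal (1.0).
CONJ_WEIGHTS = {"虽然": 0.5, "但是": 1.5, "而且": 1.5, "并且": 1.0}

def fuse_conjunction(score_i, clause_emotion_words):
    # Score_i' = Score_i * (1 + Weight_ij) when the clause emotion word
    # sequence contains a conjunction; otherwise Score_i is unchanged.
    for w in clause_emotion_words:
        if w in CONJ_WEIGHTS:
            return score_i * (1 + CONJ_WEIGHTS[w])
    return score_i
```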
Step S402: the emotion values Score_i of all clauses are summed with weighted averaging to obtain the total emotion score TotalScore of the natural language text.
After that, step S50 is performed: the total emotion score of the natural language text is matched against preset emotion threshold values to label the emotion label. In this embodiment the emotion thresholds TH1 are 1 and −1 respectively, and the labeling rule is as follows. However, the present invention is not limited thereto.
Emotion label: if TotalScore > 1 the text is labeled as positive; if TotalScore < −1 it is labeled as negative; otherwise it is labeled as neutral.
Further, step S50 also determines whether the natural language text is emotionally intense based on a strong emotion threshold. Specifically, the absolute value of the total text emotion score is taken to obtain AbsTotalScore, and AbsTotalScore is compared with the strong emotion threshold TH2; if AbsTotalScore > TH2, the text is considered emotionally intense, where TH2 is greater than the absolute value of the emotion threshold TH1, i.e. TH2 > |TH1|.
If step S50 judges that the current natural language text is an emotion-strong text, step S601 is executed: each clause S_i in the clause set T is intersected with the union of the business keyword set KWB and the emotion keyword set KWE to obtain the intersecting elements containing emotion keywords and business keywords, which form the business keyword sequence, i.e. w_i ∈ T ∩ (KWB ∪ KWE); emotion words are thereby fused into the business labeling and treated as business keywords. If the current natural language text is judged to be a non-emotion-strong text, step S602 is executed: each clause is matched against the business keyword set KWB to obtain the intersecting elements containing business keywords, which form the business keyword sequence, i.e. w_i ∈ T ∩ KWB.
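Steps S601/S602 reduce to a set-membership filter over the clause words. The keyword sets below are hypothetical stand-ins for the extracted KWB and KWE.

```python
def business_keyword_sequence(clauses, kwb, kwe, is_strong):
    """Keep the words of each clause that fall in the relevant keyword set,
    preserving their order of appearance.
    Strong text (S601): filter against KWB | KWE; otherwise (S602): KWB only."""
    allowed = kwb | kwe if is_strong else kwb
    return [w for clause in clauses for w in clause if w in allowed]

# Illustrative clause set and keyword sets (not from the patent's corpus).
clauses = [["shared", "bicycle", "broken"], ["very", "angry"]]
kwb = {"bicycle", "broken"}   # business keyword set KWB
kwe = {"angry"}               # emotion keyword set KWE
```

For an emotion-strong text the emotion keyword "angry" survives into the business keyword sequence; for a non-strong text it is dropped.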
Step S70, the business keyword sequence obtained in step S601 or step S602 is input into a trained FastText model for classification; texts with the same or similar business keywords but different emotion degrees are labeled separately based on whether emotion keywords are present in the business keyword sequence, and the classification with the highest confidence output by the model is taken as the business classification of the natural language text to obtain the business tag. If the highest confidence does not meet the minimum requirement, the text is deemed not to belong to any classification.
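The tag-selection rule of step S70 can be sketched over a plain dict of class probabilities (in practice these would come from the FastText model's prediction); the minimum-confidence value is an assumption, as the text does not state one.

```python
MIN_CONFIDENCE = 0.5  # assumed "minimum requirement" on the top confidence

def pick_business_label(class_probs):
    """Take the highest-confidence class; reject the text if even the best
    class falls below the minimum confidence."""
    label, conf = max(class_probs.items(), key=lambda kv: kv[1])
    if conf < MIN_CONFIDENCE:
        return None, conf   # text does not belong to any classification
    return label, conf
```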
For example, suppose the input natural language text is: "Beautiful: in the section (from Shanqian Street to Wu Qiaolu), an ill-mannered person smashed a shared bicycle 18 meters on the southeast side of the Wenyan music, which is infuriating; hopefully the relevant departments will penalize this person, it really makes one furious (1 person, broken brake handle, green orange, no door sign)". Applying the text labeling method for urban brain natural language processing of this embodiment, the output may be as follows:
service label: urban and rural construction complaints/urban construction and municipal administration
Matching degree: 99.89%
Emotion label: is strongly and negatively
Emotion score: −3.427
In this example, the strong emotion threshold is set to TH2 = 3. When the input natural language text is a strong text containing emotion keywords, the model outputs a corresponding business label that reflects the severity of the situation, such as "urban and rural construction complaints" in this case. In other embodiments, when the business keyword sequence does not include an emotion keyword, then according to the preset business tags, the tag assigned to the text either contains no emotion word or contains a business-tag emotion word of low emotion degree, such as "urban and rural construction suggestion" or "urban and rural construction feedback". When an information processor receives natural language texts with the same or similar business keywords, strongly negative public-opinion information can be screened quickly and accurately based on the emotion-related words in the business tags, so that the problems reflected in the natural language texts can be handled promptly and the information processing speed is improved.
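The screening described above can be sketched as a filter over already-labeled texts: items whose business tag carries a strong emotion word (here the word "complaints", as in "urban and rural construction complaints") are surfaced first. The tag strings and the emotion-word list are illustrative assumptions.

```python
# Hypothetical list of tag words that mark strong negative sentiment.
STRONG_TAG_WORDS = ("complaints",)

def screen_strong_negative(labeled_texts):
    """Return the texts whose business tag contains a strong emotion word,
    so an information processor can prioritize them."""
    return [t for t in labeled_texts
            if any(w in t["tag"] for w in STRONG_TAG_WORDS)]

inbox = [
    {"id": 1, "tag": "urban and rural construction complaints"},
    {"id": 2, "tag": "urban and rural construction suggestion"},
]
```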
In this embodiment, the business classification of natural language text is based on the FastText model, which is trained using the following steps:
Step S100, business labeling is performed on natural language texts expressing public sentiment by manual annotation, and the labeled texts are divided in an 8:2 ratio into a training set Tr = {{T_r1, Label}, {T_r2, Label}, …, {T_rn, Label}} and a test set Te = {{T_e1, Label}, {T_e2, Label}, …, {T_en, Label}}.
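The 8:2 split of step S100 might be sketched as follows; the fixed random seed is an assumption added for reproducibility.

```python
import random

def split_train_test(samples, train_ratio=0.8, seed=42):
    """Shuffle the labeled samples and cut them into training and test sets."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)   # deterministic shuffle (assumed seed)
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]
```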
Step S200, the preprocessing step S10 is performed on each sample in the training set Tr: each training sample T_ri is segmented into clauses, and each clause is segmented into words.
Step S300, for each training sample T_ri, step S20 is performed to obtain the training text business keywords TrKWB_i and the training text emotion keywords TrKWE_i. For TrKWE_i, steps S30 and S40 are executed to calculate the total emotion score of the training text and determine whether the training text is an emotion-strong text.
Step S400, if the training text is judged to be an emotion-strong text, T_ri is intersected with the union of TrKWB_i and TrKWE_i, retaining the intersecting elements, i.e. w_ri ∈ T_ri ∩ (TrKWB_i ∪ TrKWE_i); the w_ri satisfying this condition are combined into the training text keyword sequence. Otherwise, T_ri is matched with TrKWB_i, retaining the intersecting elements, i.e. w_ri ∈ T_ri ∩ TrKWB_i; the w_ri satisfying this condition are combined into the training text keyword sequence.
Step S500, N-gram processing is performed on the training text keyword sequences of the training set Tr, which are taken as input; a hierarchical Softmax output layer based on a Huffman tree is built, and the FastText model is constructed, wherein the learning rate lr = 0.1, the word vector dimension dim = 100, the number of iterations epoch = 10, and the minimum word frequency min_count = 1.
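The N-gram processing of step S500 augments each keyword sequence with adjacent-word combinations before training. A minimal word-level generator (bigrams by default, in the style of FastText's word n-gram features) might look like this; the joining convention with `_` is an assumption.

```python
def word_ngrams(tokens, n=2):
    """Return the original tokens plus their n-gram concatenations,
    e.g. ["a", "b", "c"] -> ["a", "b", "c", "a_b", "b_c"] for n = 2."""
    grams = ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens + grams
```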
Step S600, the test set Te is introduced, and the precision Precision, recall Recall, and harmonic mean F1 of the model are calculated to evaluate the model, with the calculation formulas:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)

where TP is the number of correctly classified texts of a class, FP is the number of texts incorrectly assigned to that class, and FN is the number of texts of that class that were missed. If the harmonic mean F1 does not meet the requirement, the process returns to step S500, more training data is introduced, and the model is updated.
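The evaluation of step S600, under the standard definitions of Precision = TP/(TP+FP), Recall = TP/(TP+FN), and F1 as their harmonic mean, can be sketched as:

```python
def evaluate(tp, fp, fn):
    """Compute Precision, Recall, and F1 from the confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```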
Corresponding to the above text labeling method applied to urban brain natural language processing, this embodiment further provides a text labeling device applied to urban brain natural language processing, which includes a preprocessing unit 10, a word set screening unit 20, a keyword extraction unit 30, an emotion value calculation unit 40, an emotion labeling unit 50, a business parameter extraction unit 60, and a model output unit 70. The preprocessing unit 10 preprocesses the obtained natural language text, including clause segmentation and word segmentation of each clause. The word set screening unit 20 screens the part of speech of each single word in the word segmentation results based on a preset part-of-speech set to generate a business word set and an emotion word set respectively. The keyword extraction unit 30 extracts the business keyword set and the emotion keyword set of the text from the business word set and the emotion word set respectively. The emotion value calculation unit 40 matches each clause with the emotion keyword set to obtain the clause emotion word sequence corresponding to each clause, and matches each emotion word in the clause emotion word sequence with the emotion dictionary to obtain the emotion value corresponding to each emotion word; the emotion values of the emotion words jointly form the emotion score of the clause, and the emotion scores of the clauses are summed to obtain the total emotion score of the natural language text. The emotion labeling unit 50 matches the total emotion score of the natural language text against preset emotion thresholds to label the emotion label, and judges whether the natural language text is an emotion-strong text based on the strong emotion threshold.
If the current natural language text is judged to be an emotion-strong text, the emotion words are considered to influence the business label marking, and the business parameter extraction unit 60 intersects each clause with the union of the business keyword set and the emotion keyword set to obtain the intersecting elements containing emotion keywords and business keywords, which form the business keyword sequence; if the current natural language text is judged to be a non-emotion-strong text, the emotion words are considered not to affect the business label marking, and each clause is matched against the business keyword set to obtain the intersecting elements containing business keywords, which form the business keyword sequence. The model output unit 70 inputs the business keyword sequence into the trained FastText model, which labels texts with the same or similar business keywords but different emotion degrees based on whether emotion keywords are present in the business keyword sequence; the classification with the highest confidence output by the FastText model is taken as the business classification of the natural language text to obtain the business tag.
Since each function of the text labeling device applied to the urban brain natural language processing is described in detail in the corresponding method steps S10 to S70, the description thereof is omitted.
Fig. 3 shows a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure. It should be noted that the electronic device shown in fig. 3 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present disclosure. The electronic device 100 includes one or more processors 101 and a storage 102. The storage 102 is used to store one or more programs. The one or more programs, when executed by the one or more processors 101, cause the one or more processors 101 to implement the text labeling method applied to urban brain natural language processing provided in the present embodiment.
In summary, the text labeling method applied to urban brain natural language processing provided by the invention performs business labeling and emotion labeling on each natural language text separately, so as to realize a multidimensional display of the text information. Further, the business label is assigned on the basis of business keywords; for texts describing the same event with different emotion degrees, such as a suggestion and a complaint about the same uncivilized incident, using business keywords alone would assign the same label to both. The invention therefore retains the emotion keywords of emotion-strong texts and uses them as business keywords to assist business classification, which helps improve classification accuracy, facilitates rapid screening of serious events, and improves event handling capability. Furthermore, an emotion value calculation method based on emotion degree words and conjunctions is provided, which considers the emotion value from the multiple dimensions of part of speech so as to accurately extract emotion labels from complex civic texts. In addition, each clause is screened and matched by constructing the business keyword set and the emotion keyword set, and redundant words are removed to form the clause emotion word sequences and business keyword sequences respectively, which effectively mitigates the influence of redundant words on the classification labels and improves classification accuracy, while avoiding the waste of computing resources caused by a huge corpus.
Although the invention has been described with reference to the preferred embodiments, it should be understood that the invention is not limited thereto, but rather may be modified and varied by those skilled in the art without departing from the spirit and scope of the invention.

Claims (9)

1. A text labeling method applied to urban brain natural language processing, characterized by comprising the following steps:
preprocessing the obtained natural language text, including clause segmentation and word segmentation processing of each clause;
based on a preset word part set, traversing word segmentation results of each clause, and screening the word parts of single words in each word segmentation result to respectively generate a business word set and an emotion word set;
extracting a business keyword set and an emotion keyword set of a text from the business word set and the emotion word set respectively;
matching each clause with the emotion keyword set to obtain a clause emotion word sequence corresponding to each clause, and matching each emotion word in the clause emotion word sequence with the emotion dictionary to obtain an emotion value corresponding to each emotion word; obtaining emotion total scores of the natural language text based on emotion values of the emotion words;
matching the emotion total score of the natural language text with a preset emotion threshold value to mark an emotion label and judging whether the natural language text is an emotion strong text or not based on a strong emotion threshold value;
if judging that the current natural language text is an emotion strong text, considering that the emotion words influence service label marking, intersecting each clause with a union set of a service keyword set and the emotion keyword set to obtain intersecting elements containing the emotion keywords and the service keywords so as to form a service keyword sequence; if judging that the current natural language text is a non-emotion strong text, considering that emotion words do not affect service label marking, and matching each clause with a service keyword set in an intersecting manner to obtain intersecting elements containing service keywords so as to form a service keyword sequence;
inputting the service keyword sequence into a trained FastText model for classification, and labeling texts with the same or similar service keywords but different emotion degrees based on whether emotion keywords exist in the service keyword sequence; and taking the classification with the highest confidence output by the FastText model as the service classification of the natural language text to obtain the service label.
2. The text labeling method applied to urban brain natural language processing according to claim 1, wherein, when calculating emotion values of each emotion word in the clause emotion word sequence:
judging the part of speech of each emotion word in the clause emotion word sequence to determine whether the current clause contains emotion degree words, wherein the emotion degree words comprise auxiliary words, dynamic adverbs and adverbs;
if judging that the current clause only comprises one or more single adjectives and has no emotion degree word, calculating emotion values according to a preset first calculation rule only related to the single adjectives; if judging that the current clause contains the emotion degree word, calculating an emotion value by combining emotion degree word weights on the basis of one or more single adjectives according to a second calculation rule.
3. The text labeling method applied to urban brain natural language processing according to claim 2, wherein, when it is judged that the current clause contains emotion degree words, the single adjective that is closest to each emotion degree word and appears after it is acquired based on word spacing distance, and the emotion value of that nearest single adjective after the emotion degree word is updated according to the weight of the emotion degree word.
4. The text labeling method applied to urban brain natural language processing according to claim 2, wherein each single adjective in a clause is taken as a node; a sliding window M is adopted between adjacent nodes to divide the clause into windows; with the sliding window M as the unit of measurement, each emotion degree word is matched to the closest single adjective appearing after it, and the emotion value of that single adjective after the emotion degree word is updated according to the weight of the emotion degree word.
5. The text labeling method applied to urban brain natural language processing according to claim 2, wherein, when the emotion value of each emotion word in the clause emotion word sequence is calculated, it is judged whether the emotion words in the clause emotion word sequence contain a conjunction; if yes, the conjunction weight is fused on the basis of the first calculation rule or the second calculation rule.
6. The text labeling method applied to urban brain natural language processing according to claim 1, wherein preprocessing the obtained natural language text comprises:
dividing the obtained natural language text into a plurality of clauses and constructing a clause set T = {S_1, S_2, …, S_n};
performing word segmentation on each clause S_i in the clause set to obtain a plurality of word segmentation results, each clause S_i = {W_1 = {w_1, p_1}, W_2, …, W_n}, wherein each word segmentation result W_i comprises a segmented single word w_i and the part of speech p_i of that word;
filtering out meaningless stop words from each clause S_i through a preset stop word set ST.
7. The text labeling method applied to urban brain natural language processing according to claim 1, wherein the emotion keyword set is extracted by the following steps:
based on the co-occurrence relations among emotion words, taking each emotion word w_i as a node, and constructing an undirected weighted graph of candidate emotion keywords based on co-occurring words appearing within a sliding window H;
iteratively propagating the weights of all nodes until convergence according to the following formula, to obtain the candidate emotion keyword weight set TRE:

TRE(w_i) = (1 − d) + d × Σ_{w_j ∈ In(w_i)} [ WE_ji / Σ_{w_k ∈ Out(w_j)} WE_jk ] × TRE(w_j)

wherein TRE(w_i) is the weight of word w_i; d is the damping coefficient, set to 0.85; In(w_i) is the set of nodes pointing to w_i; Out(w_i) is the set of nodes that w_i points to; WE_ji is the connection weight from node w_j to node w_i; WE_jk is the connection weight from node w_j to node w_k; and TRE(w_j) is the weight of word w_j;
sorting the obtained weight value set TRE of the candidate emotion keywords in a descending order according to the weight value to obtain an emotion keyword set KWE;
and extracting the business keyword set KWB from the business word set by the same steps.
8. A text labeling device for urban brain natural language processing, comprising:
the preprocessing unit is used for preprocessing the obtained natural language text and comprises clause segmentation and word segmentation processing of each clause;
the word set screening unit, which traverses the word segmentation results of each clause based on a preset part-of-speech set and screens the part of speech of each single word therein, so as to generate a business word set and an emotion word set respectively;
the keyword extraction unit, which extracts the business keyword set and the emotion keyword set of the text from the business word set and the emotion word set respectively;
the emotion value calculation unit is used for matching each clause with the emotion keyword set to obtain a clause emotion word sequence corresponding to each clause, and matching each emotion word in the clause emotion word sequence with the emotion dictionary to obtain an emotion value corresponding to each emotion word; obtaining emotion total scores of the natural language text based on emotion values of the emotion words;
the emotion marking unit is used for matching the emotion total score of the natural language text with a preset emotion threshold value to mark an emotion label and judging whether the natural language text is an emotion strong text or not based on a strong emotion threshold value;
the business parameter extraction unit considers that the emotion words influence business label marking if judging that the current natural language text is an emotion strong text, and intersects each clause with a union set of a business keyword set and the emotion keyword set to obtain intersected elements containing the emotion keywords and the business keywords so as to form a business keyword sequence; if judging that the current natural language text is a non-emotion strong text, considering that emotion words do not affect service label marking, and matching each clause with a service keyword set in an intersecting manner to obtain intersecting elements containing service keywords so as to form a service keyword sequence;
the model output unit, which inputs the service keyword sequence into the trained FastText model for classification, and labels texts with the same or similar service keywords but different emotion degrees based on whether emotion keywords are present in the service keyword sequence; the classification with the highest confidence output by the FastText model is taken as the service classification of the natural language text to obtain the service label.
9. An electronic device, comprising:
one or more processors;
a storage means for storing one or more programs;
when executed by the one or more processors, causes the one or more processors to implement the text labeling method of any of claims 1-7 for application to urban brain natural language processing.
CN202310204225.8A 2023-02-27 2023-02-27 Text labeling method and device applied to urban brain natural language processing Active CN116805147B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310204225.8A CN116805147B (en) 2023-02-27 2023-02-27 Text labeling method and device applied to urban brain natural language processing

Publications (2)

Publication Number Publication Date
CN116805147A CN116805147A (en) 2023-09-26
CN116805147B true CN116805147B (en) 2024-03-22

Family

ID=88078718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310204225.8A Active CN116805147B (en) 2023-02-27 2023-02-27 Text labeling method and device applied to urban brain natural language processing

Country Status (1)

Country Link
CN (1) CN116805147B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016173742A (en) * 2015-03-17 2016-09-29 株式会社Jsol Face mark emotion information extraction system, method and program
CN108874937A (en) * 2018-05-31 2018-11-23 南通大学 A kind of sensibility classification method combined based on part of speech with feature selecting
CN109858026A (en) * 2019-01-17 2019-06-07 深圳壹账通智能科技有限公司 Text emotion analysis method, device, computer equipment and storage medium
CN110598219A (en) * 2019-10-23 2019-12-20 安徽理工大学 Emotion analysis method for broad-bean-net movie comment
KR20200127590A (en) * 2019-05-03 2020-11-11 주식회사 자이냅스 An apparatus for automatic sentiment information labeling to news articles
CN114219337A (en) * 2021-12-21 2022-03-22 中国农业银行股份有限公司 Service quality evaluation method, system, equipment and readable storage medium
US11450124B1 (en) * 2022-04-21 2022-09-20 Morgan Stanley Services Group Inc. Scoring sentiment in documents using machine learning and fuzzy matching

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on the emotional evolution characteristics of online public opinion based on sentiment orientation analysis; Jiang Zhiyi et al.; Modern Information (《现代情报》); Vol. 38, No. 4, pp. 50–57 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant