CN116805147B - Text labeling method and device applied to urban brain natural language processing - Google Patents

Text labeling method and device applied to urban brain natural language processing

Info

Publication number: CN116805147B
Application number: CN202310204225.8A
Authority: CN (China)
Other versions: CN116805147A (Chinese, zh)
Inventors: 申永生, 陈冲杰, 叶晓华, 凌从礼
Assignee (original and current): Hangzhou City Brain Co., Ltd.
Legal status: Active (the legal status is an assumption; Google has not performed a legal analysis)
Prior art keywords: emotion, word, clause, text, natural language
Abstract

The invention provides a text labeling method and device applied to urban brain natural language processing. Emotion labeling is performed on the basis of an emotion keyword set, and the emotion intensity of the natural language text is then judged. For an emotionally intense text, during business labeling each clause is intersected with the union of a business keyword set and the emotion keyword set, yielding intersection elements that contain both emotion keywords and business keywords and form a business keyword sequence. The business keyword sequence is input into a trained model for classification; texts with the same or similar semantics but different emotion intensities are labeled accurately according to whether emotion keywords are present in the sequence, and the class with the highest confidence output by the model is taken as the business label of the natural language text.

Description

Text labeling method and device applied to urban brain natural language processing
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a text labeling method and device applied to urban brain natural language processing and electronic equipment.
Background
The urban brain is the product of combining the Internet brain architecture with smart-city construction; it is a city-level, brain-like, complex intelligent giant system. With the joint participation of human intelligence and machine intelligence, and supported by frontier technologies such as the Internet of Things, big data, artificial intelligence, edge computing, 5G, cloud robots and digital twins, the urban neuron network and the urban cloud reflex arc are the key points of urban brain construction. The urban brain improves the operating efficiency of a city, solves the complex problems faced in its operation, and better meets the different needs of each member of the city.
The urban brain is an intelligent system that takes the information generated by urban operation as its input. Urban operation not only produces massive data but also produces data in non-uniform formats, so how to obtain effective information from disordered information has become a research hotspot in the industry. Text classification is one of the most basic tasks in the field of Natural Language Processing (NLP); it can effectively screen information and has important applications in information retrieval, automatic summarization and other areas. Current text classification focuses mainly on classifying the business type of a text; emotion classification is rarely involved, and emotion classification and business classification are performed independently of each other.
With the continuous spread of the Internet of Things, public opinion information concerning people's daily lives is increasingly collected by the relevant departments in data form. Such information includes the feedback itself as well as the feedback provider's emotion and sense of urgency about the business concerned, so both the business category and the emotion category of such text need to be analyzed in order to better guide the relevant departments in solving problems quickly and in an orderly manner. However, since such public opinion information usually contains meaningless redundant words, directly using the original text as the corpus for model training in business classification ignores the influence of redundant words on classification accuracy, resulting in poor accuracy or failure to classify at all. A large number of meaningless redundant words also makes emotion classification much harder, and the separation of emotion labeling from business labeling makes it difficult for the information recipient to judge the importance of massive amounts of information. As a result, public opinion information concerning daily life is currently labeled mainly by hand, which consumes a great deal of human resources.
Disclosure of Invention
The invention provides a text labeling method, a text labeling device and electronic equipment applied to urban brain natural language processing, and aims to overcome the defects of the prior art.
In order to achieve the above object, the present invention provides a text labeling method applied to urban brain natural language processing, comprising:
preprocessing the obtained natural language text, including clause segmentation and word segmentation processing of each clause;
traversing the word segmentation results of each clause and screening, based on a preset part-of-speech set, the part of speech of each single word in the segmentation results, so as to generate a business word set and an emotion word set respectively;
extracting a business keyword set and an emotion keyword set of a text from the business word set and the emotion word set respectively;
matching each clause with the emotion keyword set to obtain a clause emotion word sequence corresponding to each clause, and matching each emotion word in the clause emotion word sequence with the emotion dictionary to obtain an emotion value corresponding to each emotion word; obtaining emotion total scores of the natural language text based on emotion values of the emotion words;
matching the emotion total score of the natural language text with a preset emotion threshold value to mark an emotion label and judging whether the natural language text is an emotion strong text or not based on a strong emotion threshold value;
if the current natural language text is judged to be an emotionally intense text, the emotion words are considered to influence business label marking, and each clause is intersected with the union of the business keyword set and the emotion keyword set to obtain intersection elements containing emotion keywords and business keywords, forming a business keyword sequence; if the current natural language text is judged not to be emotionally intense, the emotion words are considered not to affect business label marking, and each clause is intersection-matched with the business keyword set to obtain intersection elements containing business keywords, forming a business keyword sequence;
inputting the business keyword sequence into a trained FastText model for classification, so that texts with the same or similar business keywords but different emotion intensities are labeled separately according to whether emotion keywords exist in the business keyword sequence; meanwhile, the class with the highest confidence output by the FastText model is taken as the business class of the natural language text to obtain the business label.
According to one embodiment of the invention, in calculating emotion values for each emotion word within a sequence of clause emotion words:
judging the part of speech of each emotion word in the clause emotion word sequence to determine whether the current clause contains emotion degree words, wherein the emotion degree words comprise auxiliary words, dynamic adverbs and adverbs;
if judging that the current clause only comprises one or more single adjectives and has no emotion degree word, calculating emotion values according to a preset first calculation rule only related to the single adjectives; if judging that the current clause contains the emotion degree word, calculating an emotion value by combining emotion degree word weights on the basis of one or more single adjectives according to a second calculation rule.
According to one embodiment of the invention, when the current clause is judged to contain emotion degree words, the single adjective that is closest to each emotion degree word and appears behind it is obtained on the basis of the word spacing distance, and the emotion value of that nearest single adjective is updated according to the weight of the emotion degree word.
According to one embodiment of the invention, each single adjective in the clause is taken as a node, the clause is windowed with a sliding window M between adjacent nodes, each emotion degree word is matched, with the sliding window M as the unit of measure, to the nearest single adjective appearing behind it, and the emotion value of the single adjective behind the emotion degree word is updated according to the weight of the emotion degree word.
According to one embodiment of the invention, when the emotion value of each emotion word in the clause emotion word sequence is calculated, whether the emotion words in the clause emotion word sequence contain conjunctions or not is judged; if yes, fusing the interlinking weight on the basis of the first calculation rule or the second calculation rule.
According to an embodiment of the present invention, preprocessing the obtained natural language text includes:
dividing the obtained natural language text into a plurality of clauses and constructing a clause set T = {S_1, S_2, …, S_n};
performing word segmentation on each clause S_i in the clause set to obtain segmentation results S_i = {W_1 = {w_1, p_1}, W_2, …, W_n}, where each segmentation result W_i comprises a segmented single word w_i and the part of speech p_i of that word;
filtering meaningless stop words out of each clause S_i through a preset stop-word set ST.
According to one embodiment of the invention, when the emotion word set is obtained, the corresponding emotion keyword set is extracted by the following steps:
based on the co-occurrence relation among emotion words, constructing a candidate emotion keyword undirected weighted graph with the emotion words w_i as nodes and with edges between similar words appearing within a sliding window H;
iteratively propagating the weights of all nodes until convergence according to the following formula to obtain the candidate emotion keyword weight set TRE:

TRE(w_i) = (1 − d) + d × Σ_{w_j ∈ In(w_i)} [ WE_ji / Σ_{w_k ∈ Out(w_j)} WE_jk ] × TRE(w_j)

where TRE(w_i) is the weight of word w_i; d is the damping coefficient, set to 0.85; In(w_i) is the set of nodes pointing to w_i; Out(w_i) is the set of nodes w_i points to; WE_ji is the connection weight from node w_j to node w_i; WE_jk is the connection weight from node w_j to node w_k; and TRE(w_j) is the weight of word w_j;
sorting the obtained weight value set TRE of the candidate emotion keywords in a descending order according to the weight value to obtain an emotion keyword set KWE;
and extracting the business keyword set KWB from the business word set by the same steps.
On the other hand, the invention also provides a text labeling device applied to urban brain natural language processing, which comprises a preprocessing unit, a word set screening unit, a keyword extraction unit, an emotion value calculation unit, an emotion labeling unit, a business parameter extraction unit and a model output unit. The preprocessing unit preprocesses the obtained natural language text, including clause segmentation and word segmentation processing of each clause. The word set screening unit is used for screening the parts of speech of a single word in each word segmentation result based on the preset part of speech set to traverse the word segmentation result of each clause so as to respectively generate a business word set and an emotion word set. And the keyword extraction unit is used for extracting the business keyword set and the emotion keyword set of the text from the business word set and the emotion word set respectively. The emotion value calculation unit matches each clause with the emotion keyword set to obtain a clause emotion word sequence corresponding to each clause, and matches each emotion word in the clause emotion word sequence with the emotion dictionary to obtain an emotion value corresponding to each emotion word; and obtaining the emotion total score of the natural language text based on the emotion values of the plurality of emotion words. And the emotion marking unit is used for matching the emotion total score of the natural language text with a preset emotion threshold value so as to mark an emotion label and judging whether the natural language text is an emotion strong text or not based on the strong emotion threshold value. 
If the emotion labeling unit judges that the current natural language text is emotionally intense, the emotion words are considered to influence business label marking, and the business parameter extraction unit intersects each clause with the union of the business keyword set and the emotion keyword set to obtain intersection elements containing emotion keywords and business keywords, forming a business keyword sequence; if the text is judged not to be emotionally intense, the emotion words are considered not to affect business label marking, and the business parameter extraction unit intersection-matches each clause with the business keyword set to obtain intersection elements containing business keywords, forming a business keyword sequence. The model output unit inputs the business keyword sequence into a trained FastText model for classification, labels texts with the same or similar business keywords but different emotion intensities according to whether emotion keywords exist in the sequence, and takes the class with the highest confidence output by the FastText model as the business class of the natural language text to obtain the business label.
In another aspect, the present invention also provides an electronic device including one or more processors and a storage device. The storage device is used for storing one or more programs. The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the text labeling method described above as being applied to urban brain natural language processing.
In summary, the text labeling method applied to urban brain natural language processing provided by the invention performs business labeling and emotion labeling separately on each natural language text, realizing a multidimensional display of text information. The business label is marked on the basis of business keywords; for events that are the same but differ in emotional intensity, such as a suggestion versus a complaint about the same uncivilized incident, using business keywords alone would assign the same label. The invention therefore retains the emotion keywords of emotionally intense texts and uses them as business keywords to assist business classification, which helps improve classification accuracy, makes it easy to quickly screen out severe events, and improves event-handling capability. Furthermore, an emotion value calculation scheme based on emotion degree words and conjunctions is proposed; it considers emotion from multiple part-of-speech dimensions so as to accurately extract emotion labels from complex public-opinion texts. In addition, each clause is screened and matched against the constructed business keyword set and emotion keyword set, and redundant words are removed to form the clause emotion word sequences and business keyword sequences; this effectively resolves the influence of numerous redundant words on classification labels and improves classification accuracy, while also avoiding the waste of computing resources caused by a huge corpus.
The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of preferred embodiments, as illustrated in the accompanying drawings.
Drawings
Fig. 1 is a flowchart of a text labeling method applied to urban brain natural language processing according to an embodiment of the present invention.
Fig. 2 shows a specific step of calculating the emotion value of the emotion word in step S40 in fig. 1.
Fig. 3 is a schematic structural diagram of a text labeling device applied to urban brain natural language processing according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The text labeling method applied to urban brain natural language processing can be used in computer equipment. In one possible implementation, the computer device may be a terminal, which may be a mobile phone, a computer, a tablet computer, or other types of terminals. In another possible implementation, the computer device may include a server and a terminal.
Fig. 1 shows a text labeling method applied to urban brain natural language processing, which includes:
step S10, preprocessing the obtained natural language text, wherein the preprocessing comprises clause segmentation and word segmentation processing of each clause. In this embodiment, description will be given taking, as an example, a natural language text, which includes a plurality of redundant words and has various emotion tendencies. However, the present invention is not limited in any way thereto. In other embodiments, the text labeling method applied to urban brain natural language processing provided by the invention is also applicable to classification labeling of natural language texts such as internet vocabulary entries, medical information vocabulary entries and the like.
For the preprocessing in step S10, the specific flow includes:
step 101: each natural language text is segmented to obtain clause S n Construct clause set t= { S 1 ,S 2 ,…,S n }。
Step 102: each clause in the set is segmented and part-of-speech labeled by using a JieBa algorithm, and then S= { W 1 ={w 1 ,p 1 },W 2 ,…,W n }. Wherein: w (W) i Representing the word segmentation result, w represents a single word after word segmentation, and p represents the part of speech of the word.
Step 102: filtering clauses S by a preset stop word set ST i Nonsensical stop words; i.e. arbitrary vocabulary w.epsilon.ST andif the clause set T is empty after screening, the text is considered to be meaningless, and the labeling processing is not performed any more.
Step S20 is executed after the segmentation results of each clause are obtained through preprocessing: based on preset part-of-speech sets, traverse the segmentation results of each clause and screen the part of speech of each single word, so as to generate a business word set TB and an emotion word set TE from the same clause set T. Specifically, each segmentation result W_i in every clause of T is traversed, and its part of speech p_i is screened against a preset business part-of-speech set PB, requiring p_i ∈ PB; that is, for W_i = {w_i, p_i}, if p_i ∈ PB, then w_i is put into the business word set TB, and otherwise w_i is regarded as noise and ignored. After traversal the business word set TB = {wb_1, wb_2, …, wb_n} is obtained. Similarly, each segmentation result W_i is screened through the emotion part-of-speech set PE, requiring p_i ∈ PE, to obtain the emotion word set TE = {we_1, we_2, …, we_n}. In this embodiment, PB = {n, s, v, ns, vn, nt}, where n denotes a common noun, s a place noun, v a common verb, ns a place name, vn a verbal noun, and nt an organization name.
PE = {a, u, d, vd, c}, where a denotes an adjective, u an auxiliary word, d an adverb, vd an adverbial verb, and c a conjunction.
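With the part-of-speech sets PB and PE above, the screening of step S20 can be sketched as follows; the sample clause tokens are illustrative assumptions.

```python
PB = {"n", "s", "v", "ns", "vn", "nt"}  # business parts of speech
PE = {"a", "u", "d", "vd", "c"}         # emotion parts of speech

def screen_by_pos(segmented_clauses, pos_set):
    # Traverse every segmentation result W_i = {w_i, p_i}; keep w_i when
    # p_i is in the given part-of-speech set, otherwise treat it as noise.
    return [w for clause in segmented_clauses
              for (w, p) in clause if p in pos_set]

# One hypothetical clause: 单车 "bicycle"/noun, 非常 "very"/adverb,
# 生气 "angry"/adjective
clauses = [[("单车", "n"), ("非常", "d"), ("生气", "a")]]
TB = screen_by_pos(clauses, PB)  # business word set
TE = screen_by_pos(clauses, PE)  # emotion word set
```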
After obtaining the service vocabulary TB and the emotion vocabulary TE, step S30 is performed: and extracting keywords aiming at the two word sets respectively to form a business keyword set KWB and an emotion keyword set KWE. The embodiment provides a keyword extraction method based on a TextRank algorithm, which is specifically as follows:
step S301, according to co-occurrence relation of co-current emotion words, emotion words we are used i And constructing a candidate emotion keyword undirected weighted graph GE= (TE, E) for the nodes and based on similar words appearing in the sliding window H, wherein E represents a non-empty finite set of each edge between emotion word sets TE. Iteratively propagating weights of all nodes until convergence to obtain a candidate emotion keyword weight value set TRE, wherein the calculation formula is as follows:
wherein TRE (we i ) Is node we i Weights of (2); d represents a damping coefficient and is set to 0.85; in (we) i ) Representative pointing we i A node set; out (we) i ) Representative we i The set of nodes pointed to; WEE (web-defined element) ji Representative node we i To node we j Is a connection weight of (2); WEE (web-defined element) jk Representative node we j To node we k Is a connection weight of (2); TRE (we) j ) Is the word we j Is a weight value of (a).
Similarly, according to co-occurrence relation among the emotion business words, business words wb are used i And constructing a candidate business keyword undirected weighted graph GB= (TB, B) for the nodes and based on similar words appearing in the sliding window H, wherein B represents a non-empty finite set of each edge between the business word sets TB. Iteratively propagating weights of all nodes until convergence to obtain a candidate business keyword weight value set TRB, wherein the calculation formula is as follows:
wherein TRB (wb) i ) For node wb i Weights of (2); d represents a damping coefficient and is set to 0.85; in (wb) i ) Representative pointing direction wb i A node set; out (wb) i ) Represents wb i The set of nodes pointed to; web (WEB) ji Representative node wb i To node wb j Is a connection weight of (2); web (WEB) jk Representative node wb j To node wb k Is a connection weight of (2); TRB (wb) j ) For the word wb j Is a weight value of (a).
Step S302: sort the candidate emotion keyword weight set TRE and the candidate business keyword weight set TRB obtained in step S301 in descending order of weight (i.e., TextRank value), and take the top 50 words of each as the final keyword sets, obtaining the emotion keyword set KWE and the business keyword set KWB. If a set contains fewer than 50 candidates, all candidate words in the set are taken as keywords.
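The weight-propagation step can be sketched as a plain-Python iteration. The graph contents and weights below are illustrative; the patent builds edges from co-occurrence within the sliding window H.

```python
def textrank(nodes, edges, d=0.85, max_iter=100, tol=1e-6):
    # edges maps a directed pair (w_j, w_i) to its connection weight WE_ji;
    # an undirected co-occurrence edge is entered in both directions.
    tr = {w: 1.0 for w in nodes}
    out_weight = {w: 0.0 for w in nodes}
    for (j, i), w in edges.items():
        out_weight[j] += w
    for _ in range(max_iter):
        new = {}
        for i in nodes:
            # TR(w_i) = (1 - d) + d * sum_j (WE_ji / sum_k WE_jk) * TR(w_j)
            incoming = sum(w * tr[j] / out_weight[j]
                           for (j, tgt), w in edges.items()
                           if tgt == i and out_weight[j] > 0)
            new[i] = (1 - d) + d * incoming
        delta = max(abs(new[w] - tr[w]) for w in nodes)
        tr = new
        if delta < tol:
            break
    return tr

# Tiny illustrative graph: "b" co-occurs with both "a" and "c"
scores = textrank(["a", "b", "c"],
                  {("a", "b"): 1.0, ("b", "a"): 1.0,
                   ("b", "c"): 1.0, ("c", "b"): 1.0})
```

Sorting `scores` in descending order and keeping the top entries then yields the keyword set, as in step S302.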
After obtaining the emotion keyword set KWE, executing step S40, matching each clause with the emotion keyword set to obtain a clause emotion word sequence corresponding to each clause, and matching each emotion word in the clause emotion word sequence with an emotion dictionary to obtain an emotion value corresponding to each emotion word; and obtaining emotion scores of the natural language text based on emotion values of the plurality of emotion words. In this embodiment, after the emotion value of each emotion word is obtained, emotion values of a plurality of emotion words together form emotion scores of the clauses, and emotion scores of the plurality of clauses are summed to obtain emotion scores of the natural language text. However, the present invention is not limited in any way thereto.
Specifically, step S40 includes:
step S401, matching the clause set T in the step S10 with the emotion keyword set KWE, reserving emotion words of each clause, and using we for emotion words of the ith clause i Representation, i.e. we i E, T.U. KWE, obtaining the i-th clause emotion word sequence ET i ,ET i Matching with emotion dictionary, if hit corresponding emotion word, assigning emotion value to emotion word, and calculating emotion value Score of ith clause according to clause emotion calculation rule i
For step S401, the embodiment further provides a method for further improving accuracy of emotion value calculation based on emotion word judgment, which is specifically as follows:
step S4011, judge the part of speech of emotion word in the emotion word sequence of clause in order to confirm whether the current clause includes emotion degree word, emotion degree word includes the auxiliary word, moves adverb and adverb.
Step S4012: if it is determined that the current clause contains only one or more single adjectives and no emotion degree word, the emotion value is calculated according to a preset first calculation rule that involves only the single adjectives:

Score_i = Σ_{j=1}^{N} s_j

where s_j is the emotion value of the j-th single adjective, N is the number of single adjectives in the clause, and Score_i is the emotion value of the i-th clause.
Step S4013: if step S4011 determines that the current clause contains emotion degree words, the emotion value is calculated according to a second calculation rule that combines the emotion degree word weights with the one or more single adjectives. Specifically, each single adjective in the clause is taken as a node, and the clause is windowed with a sliding window M between adjacent nodes. With the sliding window M as the unit of measure, each emotion degree word is matched to the nearest single adjective appearing behind it, and the emotion value of that single adjective is updated by the weight of the emotion degree word. If an emotion degree word, measured by the sliding window M, only matches a single adjective in front of it, it is considered not to influence that adjective's emotion value; the degree word is then ignored and the clause emotion value is computed with the first calculation rule. As the window slides, if a single adjective appears behind the emotion degree word, the degree word is considered to influence the emotion value of that adjective, and the emotion value is computed with the second calculation rule:

Score_i = Σ_{j=1}^{N} ( Π_{k=1}^{K} Weight_jk ) × s_j

where s_j is the emotion value of the j-th single adjective; N is the number of single adjectives; Weight_jk is the weight of the k-th emotion degree word acting on emotion word s_j, obtainable by matching an emotion degree dictionary; K is the number of emotion degree words corresponding to emotion word s_j; and the sliding window size is set to M = 2. However, the present invention places no limitation on the second calculation rule. In other embodiments, the calculation rate may also be improved by counting the emotion degree words contained in a clause and directly assigning corresponding weights according to their number.
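A sketch of the two clause-level rules, using a simple look-back of M positions in place of the sliding-window matching; the dictionaries and their values are illustrative assumptions, not the patent's emotion or degree dictionaries.

```python
def clause_score(words, emotion_values, degree_weights, M=2):
    # First rule: each single adjective contributes its dictionary value.
    # Second rule: a degree word within M positions in front of an adjective
    # scales that adjective's value by its weight; a degree word with no
    # adjective behind it is ignored.
    score = 0.0
    for idx, w in enumerate(words):
        if w in emotion_values:
            s = emotion_values[w]
            for back in range(max(0, idx - M), idx):
                if words[back] in degree_weights:
                    s *= degree_weights[words[back]]
            score += s
    return score

# 生气 "angry" carries value -2.0; 非常 "very" amplifies it by 1.5
with_degree = clause_score(["非常", "生气"], {"生气": -2.0}, {"非常": 1.5})
without_degree = clause_score(["生气"], {"生气": -2.0}, {"非常": 1.5})
```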
In addition, the distance between an emotion degree word and an emotion word may also be calculated and judged through the word spacing distance. For example, when the current clause is judged to contain emotion degree words, the single adjective that is closest to each emotion degree word and appears behind it is obtained based on the word spacing distance, and the emotion value of that nearest single adjective is updated according to the weight of the emotion degree word.
Further, step S401 also includes step S4014: judging whether the emotion words in the clause emotion word sequence contain conjunctions; if so, step S4015 is executed to fuse the conjunction weight on the basis of the first or second calculation rule, specifically:

Score_i' = Score_i × (1 + Weight_ij)

where Weight_ij denotes the weight of the conjunction, Score_i is the emotion value of the i-th clause calculated by the first or second calculation rule, and Score_i' is the emotion value of the i-th clause updated by the conjunction weight. In this embodiment, different weights are assigned according to the nature of the conjunction, specifically:
(1) Adversative (turning) relation
a) If the conjunction emphasizes the preceding clause, Weight_ij = 0.5
b) If the conjunction emphasizes the following clause, Weight_ij = 1.5
(2) Progressive relation
a) If the emotion intensity of the clauses increases across the progressive relation, Weight_ij = 1.5
(3) Coordinate and causal relations
a) For coordinate and causal relations, where the emotion of the preceding and following clauses is the same, Weight_ij = 1.0
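The conjunction fusion of steps S4014-S4015 can be sketched as follows. The conjunctions listed follow the relations above, but the specific word-to-weight mapping is an illustrative assumption.

```python
# Hypothetical conjunction dictionary: 虽然 "although" emphasizes the
# preceding clause (0.5); 但是 "but" emphasizes the following clause and
# 而且 "moreover" is progressive (1.5); 并且 "and" is coordinate/causal (1.0).
CONJ_WEIGHTS = {"虽然": 0.5, "但是": 1.5, "而且": 1.5, "并且": 1.0}

def fuse_conjunction(score_i, clause_emotion_words):
    # Score_i' = Score_i * (1 + Weight_ij) when the clause emotion word
    # sequence contains a conjunction; otherwise Score_i is unchanged.
    for w in clause_emotion_words:
        if w in CONJ_WEIGHTS:
            return score_i * (1 + CONJ_WEIGHTS[w])
    return score_i
```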
Step S402: the emotion values Score_i of all clauses are summed with weighted averaging to obtain the total emotion score TotalScore of the natural language text.
After that, step S50 is performed: the total emotion score of the natural language text is matched against preset emotion threshold values to label the emotion label. In this embodiment the emotion thresholds TH1 are 1 and −1 respectively, and the labeling rule is as follows. However, the present invention is not limited thereto.
Emotion label: if TotalScore > 1 the text is labeled as positive; if TotalScore < −1 it is labeled as negative; otherwise it is labeled as neutral.
Further, step S50 also determines whether the natural language text is emotionally intense based on a strong emotion threshold. Specifically, the absolute value of the total text emotion score is taken to obtain AbsTotalScore, and AbsTotalScore is compared with the strong emotion threshold TH2; if AbsTotalScore > TH2, the text is considered emotionally intense, where TH2 is greater than the absolute value of the emotion threshold TH1, i.e. TH2 > |TH1|.
If step S50 judges that the current natural language text is an emotion-strong text, step S601 is executed: each clause S_i in the clause set T is intersected with the union of the business keyword set KWB and the emotion keyword set KWE to obtain the intersecting elements containing emotion keywords and business keywords, which form the business keyword sequence, i.e. w_i ∈ T ∩ (KWB ∪ KWE); emotion words are thereby fused into the business labeling and treated as business keywords. If the current natural language text is judged to be a non-emotion-strong text, step S602 is executed: each clause is matched against the business keyword set KWB to obtain the intersecting elements containing business keywords, which form the business keyword sequence, i.e. w_i ∈ T ∩ KWB.
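Steps S601/S602 reduce to a set-membership filter over the clause words. The keyword sets below are hypothetical stand-ins for the extracted KWB and KWE.

```python
def business_keyword_sequence(clauses, kwb, kwe, is_strong):
    """Keep the words of each clause that fall in the relevant keyword set,
    preserving their order of appearance.
    Strong text (S601): filter against KWB | KWE; otherwise (S602): KWB only."""
    allowed = kwb | kwe if is_strong else kwb
    return [w for clause in clauses for w in clause if w in allowed]

# Illustrative clause set and keyword sets (not from the patent's corpus).
clauses = [["shared", "bicycle", "broken"], ["very", "angry"]]
kwb = {"bicycle", "broken"}   # business keyword set KWB
kwe = {"angry"}               # emotion keyword set KWE
```

For an emotion-strong text the emotion keyword "angry" survives into the business keyword sequence; for a non-strong text it is dropped.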
Step S70, the business keyword sequence obtained in step S601 or step S602 is input into a trained FastText model for classification; texts with the same or similar business keywords but different emotion degrees are labeled separately based on whether emotion keywords are present in the business keyword sequence, and the classification with the highest confidence output by the model is taken as the business classification of the natural language text to obtain the business tag. If the highest confidence does not meet the minimum requirement, the text is deemed not to belong to any classification.
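The tag-selection rule of step S70 can be sketched over a plain dict of class probabilities (in practice these would come from the FastText model's prediction); the minimum-confidence value is an assumption, as the text does not state one.

```python
MIN_CONFIDENCE = 0.5  # assumed "minimum requirement" on the top confidence

def pick_business_label(class_probs):
    """Take the highest-confidence class; reject the text if even the best
    class falls below the minimum confidence."""
    label, conf = max(class_probs.items(), key=lambda kv: kv[1])
    if conf < MIN_CONFIDENCE:
        return None, conf   # text does not belong to any classification
    return label, conf
```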
For example, suppose the input natural language text is: "Beautiful: in the section (from Shanqian Street to Wu Qiaolu), an ill-mannered person smashed a shared bicycle 18 meters on the southeast side of the Wenyan music, which is infuriating; hopefully the relevant departments will penalize this person, it really makes one furious (1 person, broken brake handle, green orange, no door sign)". Applying the text labeling method for urban brain natural language processing of this embodiment, the output may be as follows:
service label: urban and rural construction complaints/urban construction and municipal administration
Matching degree: 99.89%
Emotion label: is strongly and negatively
Emotion score: −3.427
In this example, the strong emotion threshold is set to TH2 = 3. When the input natural language text is a strong text containing emotion keywords, the model outputs a corresponding business label that reflects the severity of the situation, such as "urban and rural construction complaints" in this case. In other embodiments, when the business keyword sequence does not include an emotion keyword, then according to the preset business tags, the tag assigned to the text either contains no emotion word or contains a business-tag emotion word of low emotion degree, such as "urban and rural construction suggestion" or "urban and rural construction feedback". When an information processor receives natural language texts with the same or similar business keywords, strongly negative public-opinion information can be screened quickly and accurately based on the emotion-related words in the business tags, so that the problems reflected in the natural language texts can be handled promptly and the information processing speed is improved.
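The screening described above can be sketched as a filter over already-labeled texts: items whose business tag carries a strong emotion word (here the word "complaints", as in "urban and rural construction complaints") are surfaced first. The tag strings and the emotion-word list are illustrative assumptions.

```python
# Hypothetical list of tag words that mark strong negative sentiment.
STRONG_TAG_WORDS = ("complaints",)

def screen_strong_negative(labeled_texts):
    """Return the texts whose business tag contains a strong emotion word,
    so an information processor can prioritize them."""
    return [t for t in labeled_texts
            if any(w in t["tag"] for w in STRONG_TAG_WORDS)]

inbox = [
    {"id": 1, "tag": "urban and rural construction complaints"},
    {"id": 2, "tag": "urban and rural construction suggestion"},
]
```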
In this embodiment, the business classification of natural language text is based on the FastText model, which is trained using the following steps:
Step S100, business labeling is performed on natural language texts expressing public sentiment by manual annotation, and the labeled texts are divided in an 8:2 ratio into a training set Tr = {{T_r1, Label}, {T_r2, Label}, …, {T_rn, Label}} and a test set Te = {{T_e1, Label}, {T_e2, Label}, …, {T_en, Label}}.
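The 8:2 split of step S100 might be sketched as follows; the fixed random seed is an assumption added for reproducibility.

```python
import random

def split_train_test(samples, train_ratio=0.8, seed=42):
    """Shuffle the labeled samples and cut them into training and test sets."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)   # deterministic shuffle (assumed seed)
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]
```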
Step S200, the preprocessing step S10 is performed on each sample in the training set Tr: each training sample T_ri is segmented into clauses, and each clause is segmented into words.
Step S300, for each training sample T_ri, step S20 is performed to obtain the training text business keywords TrKWB_i and the training text emotion keywords TrKWE_i. For TrKWE_i, steps S30 and S40 are executed to calculate the total emotion score of the training text and determine whether the training text is an emotion-strong text.
Step S400, if the training text is judged to be an emotion-strong text, T_ri is intersected with the union of TrKWB_i and TrKWE_i, retaining the intersecting elements, i.e. w_ri ∈ T_ri ∩ (TrKWB_i ∪ TrKWE_i); the w_ri satisfying this condition are combined into the training text keyword sequence. Otherwise, T_ri is matched with TrKWB_i, retaining the intersecting elements, i.e. w_ri ∈ T_ri ∩ TrKWB_i; the w_ri satisfying this condition are combined into the training text keyword sequence.
Step S500, N-gram processing is performed on the training text keyword sequences of the training set Tr, which are taken as input; a hierarchical Softmax output layer based on a Huffman tree is built, and the FastText model is constructed, wherein the learning rate lr = 0.1, the word vector dimension dim = 100, the number of iterations epoch = 10, and the minimum word frequency min_count = 1.
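The N-gram processing of step S500 augments each keyword sequence with adjacent-word combinations before training. A minimal word-level generator (bigrams by default, in the style of FastText's word n-gram features) might look like this; the joining convention with `_` is an assumption.

```python
def word_ngrams(tokens, n=2):
    """Return the original tokens plus their n-gram concatenations,
    e.g. ["a", "b", "c"] -> ["a", "b", "c", "a_b", "b_c"] for n = 2."""
    grams = ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens + grams
```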
Step S600, the test set Te is introduced, and the precision Precision, recall Recall, and harmonic mean F1 of the model are calculated to evaluate the model, with the calculation formulas:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)

where TP is the number of correctly classified texts of a class, FP is the number of texts incorrectly assigned to that class, and FN is the number of texts of that class that were missed. If the harmonic mean F1 does not meet the requirement, the process returns to step S500, more training data is introduced, and the model is updated.
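The evaluation of step S600, under the standard definitions of Precision = TP/(TP+FP), Recall = TP/(TP+FN), and F1 as their harmonic mean, can be sketched as:

```python
def evaluate(tp, fp, fn):
    """Compute Precision, Recall, and F1 from the confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```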
Corresponding to the above text labeling method applied to urban brain natural language processing, this embodiment further provides a text labeling device applied to urban brain natural language processing, which includes a preprocessing unit 10, a word set screening unit 20, a keyword extraction unit 30, an emotion value calculation unit 40, an emotion labeling unit 50, a business parameter extraction unit 60, and a model output unit 70. The preprocessing unit 10 preprocesses the obtained natural language text, including clause segmentation and word segmentation of each clause. The word set screening unit 20 screens the part of speech of each single word in the word segmentation results based on a preset part-of-speech set to generate a business word set and an emotion word set respectively. The keyword extraction unit 30 extracts the business keyword set and the emotion keyword set of the text from the business word set and the emotion word set respectively. The emotion value calculation unit 40 matches each clause with the emotion keyword set to obtain the clause emotion word sequence corresponding to each clause, and matches each emotion word in the clause emotion word sequence with the emotion dictionary to obtain the emotion value corresponding to each emotion word; the emotion values of the emotion words jointly form the emotion score of the clause, and the emotion scores of the clauses are summed to obtain the total emotion score of the natural language text. The emotion labeling unit 50 matches the total emotion score of the natural language text against preset emotion thresholds to label the emotion label, and judges whether the natural language text is an emotion-strong text based on the strong emotion threshold.
If the current natural language text is judged to be an emotion-strong text, the emotion words are considered to influence the business label marking, and the business parameter extraction unit 60 intersects each clause with the union of the business keyword set and the emotion keyword set to obtain the intersecting elements containing emotion keywords and business keywords, which form the business keyword sequence; if the current natural language text is judged to be a non-emotion-strong text, the emotion words are considered not to affect the business label marking, and each clause is matched against the business keyword set to obtain the intersecting elements containing business keywords, which form the business keyword sequence. The model output unit 70 inputs the business keyword sequence into the trained FastText model, which labels texts with the same or similar business keywords but different emotion degrees based on whether emotion keywords are present in the business keyword sequence; the classification with the highest confidence output by the FastText model is taken as the business classification of the natural language text to obtain the business tag.
Since each function of the text labeling device applied to the urban brain natural language processing is described in detail in the corresponding method steps S10 to S70, the description thereof is omitted.
Fig. 3 shows a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure. It should be noted that the electronic device shown in fig. 3 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present disclosure. The electronic device 100 includes one or more processors 101 and a storage 102. The storage 102 is used to store one or more programs. The one or more programs, when executed by the one or more processors 101, cause the one or more processors 101 to implement the text labeling method applied to urban brain natural language processing provided in the present embodiment.
In summary, the text labeling method applied to urban brain natural language processing provided by the invention performs business labeling and emotion labeling on each natural language text separately, so as to realize a multidimensional display of the text information. Further, the business label is assigned on the basis of business keywords; for texts describing the same event with different emotion degrees, such as a suggestion and a complaint about the same uncivilized incident, using business keywords alone would assign the same label to both. The invention therefore retains the emotion keywords of emotion-strong texts and uses them as business keywords to assist business classification, which helps improve classification accuracy, facilitates rapid screening of serious events, and improves event handling capability. Furthermore, an emotion value calculation method based on emotion degree words and conjunctions is provided, which considers the emotion value from the multiple dimensions of part of speech so as to accurately extract emotion labels from complex civic texts. In addition, each clause is screened and matched by constructing the business keyword set and the emotion keyword set, and redundant words are removed to form the clause emotion word sequences and business keyword sequences respectively, which effectively mitigates the influence of redundant words on the classification labels and improves classification accuracy, while avoiding the waste of computing resources caused by a huge corpus.
Although the invention has been described with reference to the preferred embodiments, it should be understood that the invention is not limited thereto, but rather may be modified and varied by those skilled in the art without departing from the spirit and scope of the invention.

Claims (9)

1. A text labeling method applied to urban brain natural language processing, characterized by comprising the following steps:
preprocessing the obtained natural language text, including clause segmentation and word segmentation processing of each clause;
based on a preset word part set, traversing word segmentation results of each clause, and screening the word parts of single words in each word segmentation result to respectively generate a business word set and an emotion word set;
extracting a business keyword set and an emotion keyword set of a text from the business word set and the emotion word set respectively;
matching each clause with the emotion keyword set to obtain a clause emotion word sequence corresponding to each clause, and matching each emotion word in the clause emotion word sequence with the emotion dictionary to obtain an emotion value corresponding to each emotion word; obtaining emotion total scores of the natural language text based on emotion values of the emotion words;
matching the emotion total score of the natural language text with a preset emotion threshold value to mark an emotion label and judging whether the natural language text is an emotion strong text or not based on a strong emotion threshold value;
if judging that the current natural language text is an emotion strong text, considering that the emotion words influence service label marking, intersecting each clause with a union set of a service keyword set and the emotion keyword set to obtain intersecting elements containing the emotion keywords and the service keywords so as to form a service keyword sequence; if judging that the current natural language text is a non-emotion strong text, considering that emotion words do not affect service label marking, and matching each clause with a service keyword set in an intersecting manner to obtain intersecting elements containing service keywords so as to form a service keyword sequence;
inputting the service keyword sequence into a trained FastText model for classification, and labeling texts with the same or similar service keywords but different emotion degrees based on whether emotion keywords exist in the service keyword sequence; and taking the classification with the highest confidence output by the FastText model as the service classification of the natural language text to obtain the service label.
2. The text labeling method applied to urban brain natural language processing according to claim 1, wherein, when calculating emotion values of each emotion word in the clause emotion word sequence:
judging the part of speech of each emotion word in the clause emotion word sequence to determine whether the current clause contains emotion degree words, wherein the emotion degree words comprise auxiliary words, dynamic adverbs and adverbs;
if judging that the current clause only comprises one or more single adjectives and has no emotion degree word, calculating emotion values according to a preset first calculation rule only related to the single adjectives; if judging that the current clause contains the emotion degree word, calculating an emotion value by combining emotion degree word weights on the basis of one or more single adjectives according to a second calculation rule.
3. The text labeling method applied to urban brain natural language processing according to claim 2, wherein, when it is judged that the current clause contains emotion degree words, the single adjective that is closest to each emotion degree word and appears after it is acquired based on word spacing distance, and the emotion value of that nearest single adjective after the emotion degree word is updated according to the weight of the emotion degree word.
4. The text labeling method applied to urban brain natural language processing according to claim 2, wherein each single adjective in a clause is taken as a node; a sliding window M is adopted between adjacent nodes to divide the clause into windows; with the sliding window M as the unit of measurement, each emotion degree word is matched to the closest single adjective appearing after it, and the emotion value of that single adjective after the emotion degree word is updated according to the weight of the emotion degree word.
5. The text labeling method applied to urban brain natural language processing according to claim 2, wherein, when the emotion value of each emotion word in the clause emotion word sequence is calculated, it is judged whether the emotion words in the clause emotion word sequence contain a conjunction; if yes, the conjunction weight is fused on the basis of the first calculation rule or the second calculation rule.
6. The text labeling method applied to urban brain natural language processing according to claim 1, wherein preprocessing the obtained natural language text comprises:
dividing the obtained natural language text into a plurality of clauses and constructing a clause set T = {S_1, S_2, …, S_n};
performing word segmentation on each clause S_i in the clause set to obtain a plurality of word segmentation results, each clause S_i = {W_1 = {w_1, p_1}, W_2, …, W_n}, wherein each word segmentation result W_i comprises a segmented single word w_i and the part of speech p_i of that word;
filtering out meaningless stop words from each clause S_i through a preset stop word set ST.
7. The text labeling method applied to urban brain natural language processing according to claim 1, wherein the emotion keyword set is extracted by the following steps:
based on the co-occurrence relations among emotion words, taking each emotion word w_i as a node, and constructing an undirected weighted graph of candidate emotion keywords based on co-occurring words appearing within a sliding window H;
iteratively propagating the weights of all nodes until convergence according to the following formula, to obtain the candidate emotion keyword weight set TRE:

TRE(w_i) = (1 − d) + d × Σ_{w_j ∈ In(w_i)} [ WE_ji / Σ_{w_k ∈ Out(w_j)} WE_jk ] × TRE(w_j)

wherein TRE(w_i) is the weight of word w_i; d is the damping coefficient, set to 0.85; In(w_i) is the set of nodes pointing to w_i; Out(w_i) is the set of nodes that w_i points to; WE_ji is the connection weight from node w_j to node w_i; WE_jk is the connection weight from node w_j to node w_k; and TRE(w_j) is the weight of word w_j;
sorting the obtained weight value set TRE of the candidate emotion keywords in a descending order according to the weight value to obtain an emotion keyword set KWE;
and extracting the business keyword set KWB from the business word set by the same steps.
8. A text labeling device for urban brain natural language processing, comprising:
the preprocessing unit is used for preprocessing the obtained natural language text and comprises clause segmentation and word segmentation processing of each clause;
the word set screening unit, which traverses the word segmentation results of each clause based on a preset part-of-speech set and screens the part of speech of each single word therein, so as to generate a business word set and an emotion word set respectively;
the keyword extraction unit, which extracts the business keyword set and the emotion keyword set of the text from the business word set and the emotion word set respectively;
the emotion value calculation unit is used for matching each clause with the emotion keyword set to obtain a clause emotion word sequence corresponding to each clause, and matching each emotion word in the clause emotion word sequence with the emotion dictionary to obtain an emotion value corresponding to each emotion word; obtaining emotion total scores of the natural language text based on emotion values of the emotion words;
the emotion marking unit is used for matching the emotion total score of the natural language text with a preset emotion threshold value to mark an emotion label and judging whether the natural language text is an emotion strong text or not based on a strong emotion threshold value;
the business parameter extraction unit considers that the emotion words influence business label marking if judging that the current natural language text is an emotion strong text, and intersects each clause with a union set of a business keyword set and the emotion keyword set to obtain intersected elements containing the emotion keywords and the business keywords so as to form a business keyword sequence; if judging that the current natural language text is a non-emotion strong text, considering that emotion words do not affect service label marking, and matching each clause with a service keyword set in an intersecting manner to obtain intersecting elements containing service keywords so as to form a service keyword sequence;
the model output unit, which inputs the service keyword sequence into the trained FastText model for classification, and labels texts with the same or similar service keywords but different emotion degrees based on whether emotion keywords are present in the service keyword sequence; the classification with the highest confidence output by the FastText model is taken as the service classification of the natural language text to obtain the service label.
9. An electronic device, comprising:
one or more processors;
a storage means for storing one or more programs;
when executed by the one or more processors, causes the one or more processors to implement the text labeling method of any of claims 1-7 for application to urban brain natural language processing.
CN202310204225.8A 2023-02-27 2023-02-27 Text labeling method and device applied to urban brain natural language processing Active CN116805147B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310204225.8A CN116805147B (en) 2023-02-27 2023-02-27 Text labeling method and device applied to urban brain natural language processing

Publications (2)

Publication Number Publication Date
CN116805147A CN116805147A (en) 2023-09-26
CN116805147B true CN116805147B (en) 2024-03-22

Family

ID=88078718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310204225.8A Active CN116805147B (en) 2023-02-27 2023-02-27 Text labeling method and device applied to urban brain natural language processing

Country Status (1)

Country Link
CN (1) CN116805147B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016173742A (en) * 2015-03-17 2016-09-29 株式会社Jsol Face mark emotion information extraction system, method and program
CN108874937A (en) * 2018-05-31 2018-11-23 南通大学 A kind of sensibility classification method combined based on part of speech with feature selecting
CN109858026A (en) * 2019-01-17 2019-06-07 深圳壹账通智能科技有限公司 Text emotion analysis method, device, computer equipment and storage medium
CN110598219A (en) * 2019-10-23 2019-12-20 安徽理工大学 Emotion analysis method for broad-bean-net movie comment
KR20200127590A (en) * 2019-05-03 2020-11-11 주식회사 자이냅스 An apparatus for automatic sentiment information labeling to news articles
CN114219337A (en) * 2021-12-21 2022-03-22 中国农业银行股份有限公司 Service quality evaluation method, system, equipment and readable storage medium
US11450124B1 (en) * 2022-04-21 2022-09-20 Morgan Stanley Services Group Inc. Scoring sentiment in documents using machine learning and fuzzy matching

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on the emotional evolution characteristics of online public opinion based on sentiment orientation analysis; Jiang Zhiyi et al.; Modern Information (《现代情报》); Vol. 38, No. 4, pp. 50–57 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant