CN112860896A

CN112860896A - Corpus generalization method and man-machine conversation emotion analysis method for industrial field

Info

Publication number: CN112860896A
Application number: CN202110246998.3A
Authority: CN
Inventors: 王健健; 蒋华晨; 刘扬
Original assignee: Sany Heavy Industry Co Ltd
Current assignee: Sany Heavy Industry Co Ltd
Priority date: 2021-03-05
Filing date: 2021-03-05
Publication date: 2021-05-28

Abstract

The invention provides a corpus generalization method and a man-machine conversation emotion analysis method used in the industrial field. The corpus generalization method comprises the following steps: acquiring an initial text corpus in the industrial field, and replacing entity words in the initial text corpus to obtain a first type of text corpus; performing word segmentation on the initial text corpus and/or the first type of text corpus, and replacing words obtained by the word segmentation based on the similar meaning words of the words obtained by the word segmentation to obtain a second type of text corpus; performing dependency syntax analysis on at least one of the initial text corpus, the first type of text corpus and the second type of text corpus, and performing sentence pattern transformation on at least one item based on an analysis result to obtain a third type of text corpus; and generalizing the initial text corpus based on at least two items in the first type of text corpus, the second type of text corpus and the third type of text corpus. By the method, the expansion of text corpora required by functions such as man-machine conversation and the like in the industrial field can be completed.

Description

Corpus generalization method and man-machine conversation emotion analysis method for industrial field

Technical Field

The invention relates to the technical field of data processing, in particular to a corpus generalization method and a man-machine conversation emotion analysis method applied to the industrial field.

Background

In the related professional fields such as industry and the like, a large amount of corpus data is needed to be used as support for model training and effect evaluation, and related corpora are difficult to accumulate in the scenes. Therefore, corpus generalization is required to increase the corpus used for model training and effect assessment.

Corpus generalization refers to expanding a specific sentence into sentences with the same meaning or for similar scenes. At present, corpus generalization is usually performed by manually defining sentence pattern templates for a fixed application scene. This approach is severely limited in both its application scenes and its effect.

Disclosure of Invention

The invention provides a corpus generalization method and a man-machine conversation emotion analysis method for the industrial field, which overcome the severe limitations of prior-art corpus generalization based on manually defined sentence pattern templates, and which realize corpus generalization for man-machine conversation in the industrial field.

The invention provides a corpus generalization method, which comprises the following steps:

acquiring an initial text corpus in the industrial field, and replacing entity words in the initial text corpus to obtain a first type of text corpus;

performing word segmentation on the initial text corpus and/or the first type of text corpus, and replacing words obtained by word segmentation on the basis of near-synonyms of the words obtained by word segmentation to obtain a second type of text corpus;

performing dependency syntax analysis on at least one of the initial text corpus, the first type of text corpus and the second type of text corpus, and performing sentence pattern transformation on the at least one item based on an analysis result to obtain a third type of text corpus;

generalizing the initial text corpus based on at least two of the first type of text corpus, the second type of text corpus, and the third type of text corpus.

According to the corpus generalization method provided by the present invention, the replacing the entity words in the initial text corpus further comprises:

constructing an entity dictionary with the same service scene as the initial text corpus;

identifying entity words in the initial text corpus based on an entity identification model and/or the entity dictionary; the entity recognition model is obtained based on text corpus training with entity word labels.

According to the corpus generalization method provided by the present invention, the replacing of the entity words in the initial text corpus specifically includes:

determining an entity slot corresponding to an entity word in the initial text corpus;

and selecting entity words in the entity dictionary to fill the entity slots based on the similarity between the entity words in the initial text corpus and each entity word in the entity dictionary.

According to the corpus generalization method provided by the present invention, before the words obtained by the word segmentation are replaced based on their near-synonyms to obtain the second type of text corpus, the method further comprises:

determining, among the words obtained by the word segmentation, target words belonging to a target part of speech;

calculating near-synonyms of the target words based on a word vector model;

correspondingly, replacing the words obtained by the word segmentation based on their near-synonyms to obtain the second type of text corpus specifically comprises:

replacing the target words with their near-synonyms to obtain the second type of text corpus.

According to the corpus generalization method provided by the invention, after acquiring the initial text corpus in the industrial field, the method further comprises:

acquiring a target template of the initial text corpus;

filling the target template, and determining a fourth type of text corpus;

correspondingly, generalizing the initial text corpus based on the first type of text corpus, the second type of text corpus, and the third type of text corpus specifically includes:

generalizing the initial text corpus based on at least two of the first type of text corpus, the second type of text corpus, the third type of text corpus, and the fourth type of text corpus.

translating the initial text corpus into another language and then translating it back, to determine a fifth type of text corpus;

generalizing the initial text corpus based on at least two of the first type of text corpus, the second type of text corpus, the third type of text corpus, the fourth type of text corpus, and the fifth type of text corpus.

The invention also provides a man-machine conversation emotion analysis method used in the industrial field, which comprises the following steps:

acquiring human-computer conversation text data to be analyzed;

inputting the human-computer conversation text data to be analyzed into an emotion classification model to obtain an emotion type corresponding to the human-computer conversation text data to be analyzed and output by the emotion classification model;

the emotion classification model is obtained by training a man-machine conversation text data sample carrying an emotion type label, and the man-machine conversation text data sample is obtained by generalization based on any one of the corpus generalization methods.

The invention also provides a corpus generalization system, comprising:

the first-class text corpus generating module is used for acquiring an initial text corpus in the industrial field and replacing entity words in the initial text corpus to obtain a first-class text corpus;

a second-class text corpus generation module, configured to perform word segmentation on the initial text corpus and/or the first-class text corpus, and replace words obtained by the word segmentation based on a near-synonym of the words obtained by the word segmentation to obtain a second-class text corpus;

a third type text corpus generating module, configured to perform dependency syntactic analysis on at least one of the initial text corpus, the first type text corpus, and the second type text corpus, and perform sentence pattern transformation on the at least one item based on an analysis result to obtain a third type text corpus;

and the corpus generalization module is used for generalizing the initial text corpus based on at least two items in the first type of text corpus, the second type of text corpus and the third type of text corpus.

The invention also provides an electronic device comprising a memory, a processor and a computer program that is stored in the memory and can run on the processor, wherein the processor, when executing the program, implements the steps of any of the corpus generalization methods or the man-machine conversation emotion analysis method for the industrial field described above.

The present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of any of the corpus generalization methods or the man-machine conversation emotion analysis method for the industrial field described above.

The corpus generalization method and the man-machine conversation emotion analysis method for the industrial field effectively expand the initial text corpus through entity word replacement, near-synonym replacement and sentence pattern transformation, and accumulate more of the text corpora required by functions such as man-machine interaction or chat conversation in the industrial field.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a flow chart of a corpus generalization method according to an embodiment of the present invention;

FIG. 2 is a second schematic flow chart illustrating a corpus generalization method according to an embodiment of the present invention;

FIG. 3 is a third schematic flow chart of a corpus generalization method according to an embodiment of the present invention;

FIG. 4 is a flow chart of a human-computer interaction emotion analysis method for industrial fields according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a corpus generalization system according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a human-computer interaction emotion analysis system for industrial fields according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of an electronic device provided by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Because existing corpus generalization methods are limited and cannot be applied and deployed quickly, an embodiment of the present invention provides a corpus generalization method to solve this technical problem, including:

S1, acquiring an initial text corpus in the industrial field, and replacing entity words in the initial text corpus to obtain a first type of text corpus;

S2, performing word segmentation on the initial text corpus and/or the first type of text corpus, and replacing the words obtained by the word segmentation based on their near-synonyms to obtain a second type of text corpus;

S3, performing dependency syntax analysis on at least one of the initial text corpus, the first type of text corpus and the second type of text corpus, and performing sentence pattern transformation on the at least one item based on the analysis result to obtain a third type of text corpus;

S4, generalizing the initial text corpus based on at least two of the first type of text corpus, the second type of text corpus and the third type of text corpus.

Specifically, in the corpus generalization method provided in the embodiment of the present invention, the execution main body may be an independent server, the server may be a local server or a cloud server, and the local server may be a computer, a tablet computer, a smart phone, or the like.

Step S1 is performed first. In step S1, the industrial fields include a mechanical industrial field, a manufacturing industrial field, an electronic device industrial field, and an energy industrial field. The corpus refers to language materials, the initial text corpus in the industrial field can be a text corpus actually generated in man-machine interaction or chat interaction in the industrial field, and can be acquired manually or automatically.

Entity words are words with specific meanings in the text corpus; they are the core concepts of a conversation and are related, to a certain extent, to the user's intention. In the embodiment of the present invention, entity words may be divided into two types: the first type comprises common entities such as person names, companies, organizations, places and times; the second type refers to entities in business scenes and comprises industrial products, industrial accessories, industrial raw materials and industrial brands. For example, in "please describe the function of the working machine manufactured by company A", "company A" is a company name entity and "working machine" is an industrial product entity. The first type of text corpus is obtained by replacing entity words in the initial text corpus, for example, replacing "company A" with "company B", "company C" or "company D".
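The entity word replacement of step S1 can be sketched as follows. This is a minimal illustration only: the function name `replace_entities` and the plain string substitution are assumptions made for the example, not details given in the text, and the entity word is taken as already identified.

```python
from typing import List


def replace_entities(sentence: str, entity: str, alternatives: List[str]) -> List[str]:
    """Return one new sentence per alternative entity word (hypothetical helper)."""
    if entity not in sentence:
        return []
    # Skip the original entity so the initial sentence is not reproduced.
    return [sentence.replace(entity, alt) for alt in alternatives if alt != entity]


initial = "Please describe the function of the working machine manufactured by company A."
first_type = replace_entities(initial, "company A", ["company B", "company C", "company D"])
for sentence in first_type:
    print(sentence)
```

Each output sentence keeps the original sentence pattern and changes only the company name entity.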

Then, step S2 is executed. In step S2, the initial text corpus may be subjected to word segmentation, or the first type of text corpus may be subjected to word segmentation, or the initial text corpus and the first type of text corpus are merged and then deduplicated, and then the merged and deduplicated text corpus is subjected to word segmentation.

In the embodiment of the invention, word segmentation refers to the process of segmenting the initial text corpus and/or the first type of text corpus into individual words, that is, recombining a continuous character sequence into a word sequence according to a certain standard. The word is the smallest unit expressing a complete meaning, and word segmentation facilitates subsequent processing. For example, after word segmentation the sentence "WeChat is a good social software" is decomposed into "WeChat/is/good/social/software".

In the embodiment of the invention, the initial text corpus and/or the first type of text corpus can be segmented by a word segmentation method based on dictionary matching. This method cuts and adjusts the text corpus to be segmented according to a certain rule and matches the resulting candidate against the words in a dictionary: if matching succeeds, the candidate is split off as a word; if matching fails, the candidate is adjusted or reselected, and these steps are repeated. The dictionary may be a commonly used Chinese dictionary, a Chinese dictionary collected and sorted manually, or one generated automatically by a computer program.
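One common form of dictionary-matching segmentation is forward maximum matching, sketched below. The text does not name a specific matching rule, so the choice of forward maximum matching, the function name `forward_max_match`, the toy dictionary and the Chinese sample sentence (an assumed rendering of the WeChat example) are all illustrative assumptions.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Segment text by always taking the longest dictionary word at the current position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is accepted as a fallback when nothing matches.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


toy_dictionary = {"微信", "很好", "社交", "软件"}  # hypothetical dictionary
segments = forward_max_match("微信是很好的社交软件", toy_dictionary)
print("/".join(segments))
```

Shrinking the candidate on a failed match corresponds to the "adjust or reselect" step described above.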

In the embodiment of the present invention, a method based on a Conditional Random Field (CRF) may also be used to perform word segmentation on the initial text corpus and/or the first type of text corpus. The method trains a tagger over Chinese characters and, on top of per-character classification, takes the sequence order into account: each character is tagged as the beginning of a word (B), the middle of a word (M), the end of a word (E), or a single-character word (S). After the characters are tagged, each run of characters from a B tag to the following E tag forms a word, and each S-tagged character forms a word by itself, which yields the segmentation.
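The last step, turning a B/M/E/S tag sequence into words, can be sketched as follows. The tags are assumed to come from an already trained CRF tagger, which is not shown; the function name `decode_bmes` and the sample input are hypothetical.

```python
from typing import Iterable, List


def decode_bmes(chars: Iterable[str], tags: Iterable[str]) -> List[str]:
    """Rebuild words from per-character B/M/E/S tags."""
    words = []
    current = ""
    for ch, tag in zip(chars, tags):
        if tag == "S":            # single-character word
            if current:           # flush an unfinished word defensively
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":          # a new word begins
            if current:
                words.append(current)
            current = ch
        elif tag == "M":          # the current word continues
            current += ch
        elif tag == "E":          # the current word ends
            words.append(current + ch)
            current = ""
        else:
            raise ValueError(f"unknown tag: {tag}")
    if current:                   # sequence ended without an E tag
        words.append(current)
    return words


words = decode_bmes("微信是软件", ["B", "E", "S", "B", "E"])
print(words)
```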

In the embodiment of the present invention, after the initial text corpus and/or the first type of text corpus is segmented, a plurality of individual words is obtained; for example, the segmentation "WeChat/is/good/social/software" yields the words "WeChat", "is", "good", "social" and "software". The second type of text corpus is then obtained by finding the near-synonyms of these words and replacing the words with their near-synonyms.

Step S3 is then executed. In step S3, dependency parsing may be performed on the initial text corpus, on the first type of text corpus, or on the second type of text corpus; alternatively, the initial text corpus, the first type of text corpus and the second type of text corpus are merged and de-duplicated, and dependency parsing is performed on the merged, de-duplicated text corpus.

In an embodiment of the present invention, dependency parsing is a syntactic analysis method in natural language processing. Dependency parsing holds that there is a head-dependent relationship between words, which is a binary, asymmetric relationship. In a sentence, if one word modifies another word, the modifying word is called the dependent word, the modified word is called the dominant word (head), and the grammatical relationship between the two is called a dependency relationship. Dependency relationships include the subject-predicate relationship, the verb-object relationship, the fronted object and other grammatical relationships.

In embodiments of the present invention, a graph-based dependency parsing method may be used. The dependency relationships among all words of a sentence in the initial text corpus are expressed as directed edges, and the resulting tree is called a dependency syntax tree. The dependency syntax tree is a subgraph of the complete graph over the words of the sentence. The score of a whole tree is decomposed into the sum of the scores of its edges, and the graph is then searched for the optimal tree, which completes the dependency syntax analysis. Dependency syntax analysis yields the grammatical relations within a sentence of the text corpus, and sentence pattern transformation can be performed on the original sentence based on these relations: for example, dependency nodes corresponding to consecutive modifier structures are deleted, or the node order is transformed for sentence patterns such as the verb-object structure so that an active sentence becomes a passive sentence. For example, the original sentence "I eat bread" may be converted into "bread is eaten by me". Sentence pattern transformation is performed on at least one of the initial text corpus, the first type of text corpus and the second type of text corpus to obtain the third type of text corpus.
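The active-to-passive transformation in the "I eat bread" example can be sketched on top of a dependency parse. In this illustration the parse is written by hand rather than produced by a parser; the relation labels `SBV` (subject-verb) and `VOB` (verb-object), the passive marker "被", the Chinese rendering of the example and the function name `active_to_passive` are assumptions. The sketch handles only a bare subject-verb-object clause.

```python
from typing import List, Optional, Tuple

Arc = Tuple[int, int, str]  # (head index, dependent index, relation label)


def active_to_passive(tokens: List[str], arcs: List[Arc], marker: str = "被") -> Optional[List[str]]:
    """Reorder subject-verb-object into object-marker-subject-verb, or return None."""
    for head, dep, label in arcs:
        if label != "SBV":
            continue
        # Look for an object attached to the same verb as the subject.
        objects = [d for h, d, rel in arcs if h == head and rel == "VOB"]
        if objects:
            subject, obj = dep, objects[0]
            return [tokens[obj], marker, tokens[subject], tokens[head]]
    return None


tokens = ["我", "吃", "面包"]              # "I eat bread"
arcs = [(1, 0, "SBV"), (1, 2, "VOB")]      # hand-written toy parse
passive = active_to_passive(tokens, arcs)
print(passive)
```

Sentences without both a subject arc and an object arc are left untransformed, so they contribute nothing to the third type of text corpus.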

Finally, step S4 is performed. In step S4, at least two of the first type text corpus, the second type text corpus and the third type text corpus are merged, and the merged text corpus is de-duplicated, so as to achieve the purpose of generalizing the initial text corpus.
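The merge and de-duplication of step S4 can be sketched as follows; the function name `merge_and_dedupe` and the sample sentences are hypothetical, and exact string equality is assumed as the duplicate criterion.

```python
from typing import List


def merge_and_dedupe(*corpora: List[str]) -> List[str]:
    """Merge at least two generated corpora and drop exact duplicates, keeping first-seen order."""
    if len(corpora) < 2:
        raise ValueError("at least two corpus types are required")
    merged = [sentence for corpus in corpora for sentence in corpus]
    # dict preserves insertion order, so this removes duplicates stably.
    return list(dict.fromkeys(merged))


first_type = ["how do I service the crane", "how do I service the loader"]
second_type = ["how do I maintain the crane", "how do I service the crane"]
generalized = merge_and_dedupe(first_type, second_type)
print(generalized)
```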

According to the corpus generalization method provided by the embodiment of the invention, the first type of text corpus is obtained by performing entity word replacement on the acquired initial text corpus in the industrial field, which generalizes the initial text corpus for the first time; word segmentation is performed on the initial text corpus and/or the first type of text corpus, and the segmented words are then replaced with near-synonyms to obtain the second type of text corpus, which completes the second generalization of the initial text corpus; finally, dependency syntax analysis is performed on at least one of the initial text corpus, the first type of text corpus and the second type of text corpus, and the sentence pattern is then transformed, which completes the third generalization of the initial text corpus. The method effectively expands the initial text corpus through three different generalization techniques, accumulates more text corpora that can be used for functions such as human-computer interaction or chat conversation in the industrial field, is simple and quick to implement, and reduces the cost of corpus accumulation.

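On the basis of the foregoing embodiment, the corpus generalization method provided in the embodiment of the present invention further includes, before replacing the entity words in the initial text corpus:

constructing an entity dictionary with the same service scene as the initial text corpus;

identifying entity words in the initial text corpus based on an entity recognition model and/or the entity dictionary; the entity recognition model is obtained by training on text corpora with entity word labels.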

Specifically, a business scenario refers to a collection including people, things, and related elements. In the embodiment of the invention, the entity dictionary which has the same service scene with the initial text corpus is constructed by adopting a manual gathering and sorting method. The method comprises the steps of manually determining a service scene corresponding to an initial text corpus in advance, searching entity words related to the service scene through various channels such as books and the internet, and forming an entity dictionary after arrangement. Or automatically acquiring related entity words through analysis of the initial text corpus by a computer program to generate an entity dictionary.

In the embodiment of the invention, the entity dictionary comprises a professional entity dictionary and a common entity dictionary. The professional entity dictionary is the domain-specific entity dictionary of the business scene and comprises dictionaries of industrial products, industrial accessories, industrial raw materials, industrial brands and the like. For example, if the service scene of the initial text corpus is related to mobile phones, the professional entity dictionary may include related entity words such as display screen, chip, processor and lithium. The common entity dictionary contains the names of people, companies, organizations, places and times that are common in the same business scene as the initial text corpus. If the service scene of the initial text corpus is again related to mobile phones, the common entity dictionary may include entity words such as Xiaomi, Lei Jun, Huashi and Samsung.

In the embodiment of the invention, the entity recognition model is a neural network model that is trained in advance, before entity word recognition is performed on the initial text corpus, on a large amount of text corpus with entity word labels. In one implementation of the embodiment of the invention, the entity recognition model can be constructed from a Bi-directional Long Short-Term Memory network (Bi-LSTM) and a CRF. In the Bi-LSTM-CRF method, each character of the input sentence is mapped to a character embedding, which can be randomly initialized, and the Bi-LSTM-CRF model outputs a predicted label for each character. By predefining entity word categories and training the Bi-LSTM-CRF model extensively, the model can identify various types of entity words.

In the embodiment of the invention, the entity recognition model can be trained in advance on predefined professional entity words and/or common entity words, so that it can recognize the professional entity words and/or common entity words in the initial text corpus. The initial text corpus is input into the entity recognition model, which returns the professional entity words and/or common entity words it contains. Alternatively, the professional entity words or common entity words in the initial text corpus are recognized through the professional entity dictionary or the common entity dictionary. In this case the initial text corpus is first segmented, and each word to be matched obtained from the segmentation is matched against the professional entity dictionary or the common entity dictionary; if the word to be matched is successfully matched with an entity word in the dictionary, it is determined to be a professional entity word or a common entity word. Matching is based on the similarity between the word to be matched and the entity words in the professional entity dictionary or the common entity dictionary, and the similarity can be calculated with a word vector method: the word to be matched and the dictionary entity word are each represented as vectors, the cosine distance between the two vectors is calculated, and if the cosine distance is within a preset range, the match is considered successful.
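The word-vector matching step can be sketched as follows. The three-dimensional toy vectors, the threshold value 0.8 and the names `cosine_similarity` and `match_entity` are assumptions for illustration; the sketch thresholds cosine similarity, which is equivalent to requiring the cosine distance (one minus the similarity) to fall within a preset range.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_entity(word_vec: List[float],
                 dictionary_vectors: Dict[str, List[float]],
                 threshold: float = 0.8) -> Optional[str]:
    """Return the most similar dictionary entity word, or None if none reaches the threshold."""
    best_word, best_score = None, threshold
    for entity, vec in dictionary_vectors.items():
        score = cosine_similarity(word_vec, vec)
        if score >= best_score:
            best_word, best_score = entity, score
    return best_word


toy_vectors = {"excavator": [0.9, 0.1, 0.0], "crane": [0.1, 0.9, 0.1]}  # hypothetical embeddings
print(match_entity([0.8, 0.2, 0.0], toy_vectors))
print(match_entity([0.0, 0.0, 1.0], toy_vectors))
```

In practice the vectors would come from a trained word vector model rather than being written by hand.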

In the embodiment of the invention, all entity words in the initial text corpus can be recognized through the entity recognition model, and then the recognized entity words are matched with the corresponding entity dictionary, so that the successfully matched entity words are reserved, and the recognition accuracy is increased.

According to the corpus generalization method provided by the embodiment of the invention, the entity dictionary with the same service scene as the initial text corpus is constructed, so that the entity word recognition efficiency is improved, and the time required by entity word recognition is saved. By identifying the entity words in the initial text corpus, the subsequent corpus generalization process is facilitated.

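On the basis of the foregoing embodiment, in the corpus generalization method provided in the embodiment of the present invention, replacing the entity words in the initial text corpus specifically includes:

determining an entity slot corresponding to an entity word in the initial text corpus;

selecting entity words in the entity dictionary to fill the entity slot, based on the similarity between the entity word in the initial text corpus and each entity word in the entity dictionary.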

In the embodiment of the invention, after the professional entity words and/or common entity words are identified, corresponding entity slots are set according to the identified entity words. An entity slot is a well-defined attribute of an entity word. For example, in "I want to go to Beijing from Mars tomorrow", the departure place slot, the destination slot and the departure time slot correspond to the attributes "departure place", "destination" and "departure time", respectively.

Entity slot filling refers to extracting, from a large-scale corpus, values for the well-defined attributes of a given entity word. That is, the similarity between each entity word identified in the initial text corpus and the entity words in the constructed entity dictionary is calculated, and dictionary entity words with high similarity to the identified entity word are filled into the preset entity slot. This similarity can be calculated by the word-vector-based method described above.
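Slot setting and slot filling can be sketched together as follows. The slot name, the sample sentence, the `top_k` cutoff and the hand-written similarity table are hypothetical; a real system would score candidates with word vectors instead of the toy function used here.

```python
from typing import Callable, List


def fill_entity_slot(sentence: str, entity: str, slot_name: str,
                     dictionary: List[str],
                     similarity: Callable[[str, str], float],
                     top_k: int = 2) -> List[str]:
    """Turn the entity into a slot, then fill it with the top_k most similar dictionary words."""
    placeholder = "{" + slot_name + "}"
    template = sentence.replace(entity, placeholder)
    candidates = [word for word in dictionary if word != entity]
    candidates.sort(key=lambda word: similarity(entity, word), reverse=True)
    return [template.replace(placeholder, word) for word in candidates[:top_k]]


toy_scores = {"crane": 0.82, "loader": 0.77, "lithium": 0.12}  # hypothetical similarities


def toy_similarity(entity: str, candidate: str) -> float:
    """Stand-in for a word-vector similarity; ignores the entity argument."""
    return toy_scores.get(candidate, 0.0)


filled = fill_entity_slot(
    "How often should the excavator be serviced?",
    "excavator", "industrial_product",
    ["excavator", "lithium", "loader", "crane"],
    toy_similarity,
)
print(filled)
```

Ranking by similarity keeps dissimilar dictionary entries such as "lithium" out of the product slot.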

The corpus generalization method provided by the embodiment of the invention expands the initial text corpus and obtains more corpora under the same service scene by setting the entity slot for the identified entity word and filling other entity words similar to the entity word into the entity slot.

On the basis of the foregoing embodiment, in the corpus generalization method provided in the embodiment of the present invention, before the words obtained by the word segmentation are replaced based on their near-synonyms to obtain the second type of text corpus, the method further includes:

determining, among the words obtained by the word segmentation, target words belonging to a target part of speech;

calculating near-synonyms of the target words based on a word vector model.

In the embodiment of the present invention, the initial text corpus and/or the first type of text corpus may be segmented by the word segmentation methods described above to obtain individual words. Part-of-speech tagging can then be used to determine which of the words obtained by word segmentation belong to the target part of speech. Part of speech is a basic grammatical property of a word; parts of speech include nouns, adjectives, verbs, and so on. Part-of-speech tagging is the procedure of labeling each word in the segmentation result with its correct part of speech, that is, determining whether each word is a noun, a verb, an adjective, or another part of speech.

In the embodiment of the invention, part-of-speech tagging can be carried out with a dictionary lookup algorithm based on string matching: for each segmented word, the part of speech is looked up directly in an existing dictionary. Part-of-speech tagging based on a hidden Markov model can also be employed. In this method, the initial probabilities, emission probabilities and transition probabilities are obtained from large-scale corpus statistics. The probability of each part of speech for each word in context is calculated from them, and the Viterbi algorithm then converts the segmented words into a part-of-speech tag sequence based on the initial, emission and transition probabilities.
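A compact version of Viterbi decoding for a hidden Markov model tagger is sketched below. All probabilities are made-up toy values rather than corpus statistics, the tag set `n`/`v` and the function name `viterbi` are assumptions, and plain probabilities are multiplied for readability (a real implementation would normally sum log probabilities to avoid underflow on long sentences).

```python
from typing import Dict, List


def viterbi(words: List[str], tags: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]],
            floor: float = 1e-8) -> List[str]:
    """Return the most probable tag sequence for the given words."""
    if not words:
        return []
    # best[i][t] = (probability of the best path ending in tag t at position i, previous tag)
    best = [{t: (start_p.get(t, floor) * emit_p[t].get(words[0], floor), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p].get(t, floor) * emit_p[t].get(word, floor), p)
                for p in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


tags = ["n", "v"]
start_p = {"n": 0.7, "v": 0.3}
trans_p = {"n": {"n": 0.3, "v": 0.7}, "v": {"n": 0.6, "v": 0.4}}
emit_p = {"n": {"machine": 0.8, "works": 0.2}, "v": {"machine": 0.1, "works": 0.9}}
tag_sequence = viterbi(["machine", "works"], tags, start_p, trans_p, emit_p)
print(tag_sequence)
```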

A word vector measures a word along aspects such as part of speech, emotional color and degree, and represents the word as a set of scores, so that words can be compared and substituted for one another. The conversion from words to vectors is called word vectorization. In an implementation of the embodiment of the present invention, the Word2Vec word vector model may be used. It mainly comprises two models: the Skip-Gram model and the Continuous Bag of Words (CBOW) model. Through training, the model reduces the processing of text content to vector operations in a k-dimensional vector space, and similarity in the vector space can be used to express the semantic similarity of text. Therefore, the distance between two word vectors can be calculated to determine the near-synonyms of the words obtained by word segmentation.

After the near-synonyms of all target words in the segmented initial text corpus are obtained by the above method, the target words in the initial text corpus can be replaced with their near-synonyms, and all text corpora produced by the replacement are combined to obtain the second type of text corpus.
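The near-synonym replacement restricted to target parts of speech can be sketched as follows. The synonym table stands in for the output of a word vector model, and the part-of-speech codes, sample sentence and function name `synonym_variants` are assumptions. One word is replaced per generated sentence, which is one possible replacement strategy; the text does not prescribe one.

```python
from typing import Dict, FrozenSet, List, Tuple


def synonym_variants(tagged_tokens: List[Tuple[str, str]],
                     synonyms: Dict[str, List[str]],
                     target_pos: FrozenSet[str] = frozenset({"n", "v", "a"})) -> List[str]:
    """Generate one sentence per (target word, near-synonym) pair."""
    variants = []
    for i, (word, pos) in enumerate(tagged_tokens):
        if pos not in target_pos:
            continue  # only words of a target part of speech are replaced
        for synonym in synonyms.get(word, []):
            tokens = [w for w, _ in tagged_tokens]
            tokens[i] = synonym
            variants.append(" ".join(tokens))
    return variants


tagged = [("the", "d"), ("machine", "n"), ("works", "v"), ("well", "a")]
toy_synonyms = {"machine": ["equipment"], "works": ["runs", "operates"], "the": ["a"]}
second_type = synonym_variants(tagged, toy_synonyms)
print(second_type)
```

The determiner "the" has an entry in the toy table but is skipped because its part of speech is not a target one.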

The corpus generalization method provided by the embodiment of the invention simplifies the subsequent calculation of near-synonyms by performing part-of-speech tagging on the segmented words. The segmented words are then replaced with near-synonyms, so that the initial text corpus and/or the first type of text corpus is effectively expanded and more text corpora under the same service scene are accumulated.

On the basis of the foregoing embodiment, the corpus generalization method provided in the embodiment of the present invention, after acquiring the initial text corpus in the industrial field, further includes:

acquiring a target template matched with the initial text corpus;

filling the target template, and determining a fourth type of text corpus;

In the embodiment of the present invention, to obtain the target template of the initial text corpus, entity word recognition may first be performed on the initial text corpus by the entity word recognition method described above. After the entity words in the initial text corpus are identified, they are extracted, and the target template is set according to the extracted entity words. After the target template is determined, substitute words with high similarity to the entity words extracted from the template are calculated and filled into the corresponding positions in the template, thereby determining the fourth type of text corpus.
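Template filling with several slots can be sketched as follows. The template string, the slot names and the substitute word lists are hypothetical, and the substitute words are assumed to have been selected by similarity beforehand.

```python
from itertools import product
from typing import Dict, List


def fill_template(template: str, slot_values: Dict[str, List[str]]) -> List[str]:
    """Fill every slot of the template with every combination of substitute words."""
    names = list(slot_values)
    results = []
    for combo in product(*(slot_values[name] for name in names)):
        results.append(template.format(**dict(zip(names, combo))))
    return results


fourth_type = fill_template(
    "What is the price of the {product} made by {brand}?",
    {"product": ["excavator", "crane"], "brand": ["company A", "company B"]},
)
for sentence in fourth_type:
    print(sentence)
```

The number of generated sentences is the product of the slot list sizes, so the substitute lists are usually kept short.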

In the embodiment of the present invention, generalizing the initial text corpus based on at least two items of the first-type text corpus, the second-type text corpus, the third-type text corpus and the fourth-type text corpus means that at least two items of the first-type text corpus, the second-type text corpus, the third-type text corpus and the fourth-type text corpus are combined, and since the same text corpus appears when entity word replacement, near-sense word replacement, sentence pattern conversion and target filling are performed, the combined text corpus needs to be de-duplicated, and repeated text corpora are removed, so that generalization of the initial text corpus is completed.

The corpus generalization method provided by the embodiment of the invention expands the initial text corpus by constructing the sentence pattern template related to the initial text corpus, so as to obtain more text corpora under the same service scene as the initial text corpus.

In the embodiment of the invention, the initial text corpus can be translated into English or another foreign language, and the translated text is then translated back into Chinese. The translation tool can be common translation software such as Baidu Translate, Youdao Dictionary or Google Translate, or translation software for other, less common languages. Because Chinese and the foreign language differ in grammar, the initial text corpus may change after translation and back-translation.

In the embodiment of the present invention, generalizing the initial text corpus based on at least two of the first type of text corpus, the second type of text corpus, the third type of text corpus, the fourth type of text corpus and the fifth type of text corpus means that at least two of these corpora are combined. Because entity word replacement, near-synonym replacement, sentence pattern transformation, template filling and back-translation can produce identical text corpora, the combined text corpus needs to be deduplicated. Removing the repeated text corpora completes the generalization of the initial text corpus.

The corpus generalization method provided by the embodiment of the invention expands the initial text corpus by translating the initial text corpus and then translating the initial text corpus back, thereby obtaining more text corpora under the same service scene as the initial text corpus.
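
A minimal sketch of the round trip is given below. No real translation service is called: the two translators are passed in as plain functions, and the dictionary-backed stand-ins used here are hypothetical. Keeping only the sentences that differ from the original is likewise an assumption made for this sketch.

```python
def back_translate(sentences, to_foreign, to_source):
    """Translate each sentence out and back, keeping only changed results."""
    results = []
    for sentence in sentences:
        round_trip = to_source(to_foreign(sentence))
        if round_trip != sentence:
            results.append(round_trip)
    return results


# Stand-in translators backed by lookup tables (illustrative data only).
ZH_TO_EN = {
    "机器漏油了": "The machine is leaking oil",
    "质量很好": "Good quality",
}
EN_TO_ZH = {
    "The machine is leaking oil": "机器正在漏油",
    "Good quality": "质量很好",
}

fifth_type = back_translate(["机器漏油了", "质量很好"], ZH_TO_EN.get, EN_TO_ZH.get)
print(fifth_type)
```

Passing the translators as arguments keeps the sketch independent of any particular translation software; a real deployment would wrap its chosen tool in functions with the same shape.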

On the basis of the foregoing embodiment, in the corpus generalization method provided in the embodiment of the present invention, generalizing the initial text corpus based on at least two of the first type of text corpus, the second type of text corpus and the third type of text corpus specifically includes:

and combining and de-duplicating at least two items in the first type of text corpus, the second type of text corpus and the third type of text corpus to obtain a generalized text corpus.

In the embodiment of the present invention, because the generated first type of text corpus, second type of text corpus and third type of text corpus contain repeated text corpora, at least two of them need to be combined and the combined text corpus then needs to be deduplicated. The combined text corpus can be deduplicated using the simhash method. The main idea of simhash is dimensionality reduction: high-dimensional feature vectors are mapped to low-dimensional feature vectors, and whether a text is a duplicate or highly similar is determined from the Hamming distance between two such vectors. In information theory, the Hamming distance between two character strings of equal length is the number of positions at which the corresponding characters differ.

The text corpora to be merged are segmented into words to obtain the effective feature vectors, and each feature vector is assigned one of five weight levels, from 1 to 5. The number represents the importance of the word in the whole sentence; a larger number means a more important word.

A hash value is calculated for each feature vector by a hash function. The hash value is a signature consisting of the binary digits 0 and 1, so that a character string is turned into a series of numbers. All feature vectors are then weighted on the basis of their hash values: at each bit position, a 1 contributes the positive weight and a 0 contributes the negative weight. The weighted results of the feature vectors are accumulated position by position to form a single sequence. For each position of the accumulated result, the bit is set to 1 if the value is greater than 0 and to 0 otherwise, which yields the simhash value of the sentence. Finally, the similarity of the text corpora is judged from the Hamming distance between their simhash values: if the Hamming distance between two or more texts is within a preset range, the text corpora are considered similar. Among highly similar texts only one is retained and the rest are removed, which achieves deduplication.
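
The definition of Hamming distance can be stated directly in code. The function name below is illustrative.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


print(hamming_distance("10110", "10011"))  # the third and fifth characters differ
```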
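
The simhash steps described above can be sketched as follows. The embodiment does not name a hash function, so MD5 truncated to 64 bits is assumed here; the fingerprint width, the `max_distance` threshold and all function names are likewise assumptions. Each text is supplied together with its segmented words and their weights from 1 to 5.

```python
import hashlib


def simhash(weighted_tokens, bits=64):
    """Fingerprint of a sentence given (token, weight) pairs.

    `bits` must be a multiple of 8 and at most 128 for the MD5 digest used here.
    """
    totals = [0] * bits
    for token, weight in weighted_tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # A 1 bit adds the weight, a 0 bit subtracts it.
            if (value >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:  # positive sum -> bit 1, otherwise bit 0
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Hamming distance between two integer fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(corpus, max_distance=3):
    """Keep the first text of every group whose fingerprints are close."""
    kept = []
    for text, tokens in corpus:
        fingerprint = simhash(tokens)
        if all(hamming(fingerprint, other) > max_distance for _, other in kept):
            kept.append((text, fingerprint))
    return [text for text, _ in kept]


tokens = [("pump", 5), ("leaking", 4), ("oil", 3)]
print(deduplicate([("the pump is leaking oil", tokens),
                   ("pump leaking oil", tokens)]))
```

In this usage example both texts carry the same weighted tokens, so their fingerprints are identical and only the first text is kept. The preset Hamming distance range trades recall of near-duplicates against the risk of discarding distinct sentences.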

According to the corpus generalization method provided by the embodiment of the invention, the combined text corpus is de-duplicated, so that more accurate text corpus can be obtained, and a man-machine interaction or chat conversation scene can be better supported.

Fig. 2 is a schematic flow chart of a corpus generalization method according to an embodiment of the present invention. As shown in fig. 2, the method includes:

S21, acquiring the initial text corpus.

After S21 has been executed, the following steps are executed simultaneously: S22, replacing entity words in the initial text corpus to generate a first type of text corpus; S23, performing word segmentation on the initial text corpus, calculating near-synonyms of the segmented words, and performing near-synonym replacement to generate a second type of text corpus; and S24, performing dependency syntax analysis and sentence pattern transformation on the initial text corpus to generate a third type of text corpus.

S25, performing corpus generalization based on the first type of text corpus, the second type of text corpus and the third type of text corpus.
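
The flow of FIG. 2 can be sketched as a small pipeline in which the three generation steps run concurrently on the same initial corpus and their outputs are then merged. The generator functions below are trivial string replacements standing in for S22 to S24, and exact-match deduplication is used instead of simhash to keep the sketch short; both are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def generalize(initial_corpus, generators):
    """Run every generator on the initial corpus, then merge and deduplicate."""
    with ThreadPoolExecutor() as pool:
        # pool.map returns results in the order of `generators`.
        outputs = list(pool.map(lambda gen: gen(initial_corpus), generators))
    merged = [sentence for output in outputs for sentence in output]
    # dict.fromkeys removes exact duplicates while preserving order.
    return list(dict.fromkeys(merged))


# Stand-ins for S22 (entity replacement), S23 (near-synonym replacement)
# and S24 (sentence pattern transformation).
def replace_entities(corpus):
    return [s.replace("pump", "valve") for s in corpus]


def replace_near_synonyms(corpus):
    return [s.replace("leaking", "dripping") for s in corpus]


def transform_sentences(corpus):
    return [s.replace("pump", "valve") for s in corpus]  # yields a duplicate


print(generalize(
    ["the pump is leaking"],
    [replace_entities, replace_near_synonyms, transform_sentences],
))
```

Because each generator reads only the initial corpus, the three steps have no dependencies on one another and can safely run in parallel, which is the difference between this flow and the sequential flow of FIG. 3.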

FIG. 3 is a schematic flow chart of a corpus generalization method according to another embodiment of the present invention. As shown in fig. 3, the method includes:

S31, acquiring the initial text corpus.

S32, replacing the entity words in the initial text corpus to generate a first type of text corpus.

S33, performing word segmentation on the first type of text corpus, calculating near-synonyms of the segmented words, and performing near-synonym replacement to generate a second type of text corpus.

S34, performing dependency syntax analysis and sentence pattern transformation on the first type of text corpus and/or the second type of text corpus to generate a third type of text corpus.

S35, performing corpus generalization based on the first type of text corpus, the second type of text corpus and the third type of text corpus.

In the industrial field, corpora are scarce and technical terms are numerous. If current emotion analysis techniques are applied unchanged, the analysis performs poorly and produces inaccurate results, so that the wrong emotion type is obtained. Therefore, the embodiment of the invention provides a man-machine conversation emotion analysis method for the industrial field.

Fig. 4 is a schematic flowchart of a method for analyzing emotion of human-computer conversation in industrial field according to an embodiment of the present invention, as shown in fig. 4, the method includes:

S41, acquiring the man-machine conversation text data to be analyzed;

S42, inputting the man-machine conversation text data to be analyzed into an emotion classification model to obtain an emotion type corresponding to the man-machine conversation text data to be analyzed and output by the emotion classification model;

the emotion classification model is obtained by training a man-machine conversation text data sample carrying an emotion type label, and the man-machine conversation text data sample is obtained by generalization based on the corpus generalization method provided by any one of the embodiments.

Specifically, in the man-machine conversation emotion analysis method for the industrial field provided in the embodiment of the present invention, the execution subject is a server, which may be a local server or a cloud server; the local server may be a computer, a tablet computer, a smartphone or the like, which is not specifically limited in the embodiment of the present invention. The execution subject of this method may be the same as or different from that of the corpus generalization method provided by the above embodiments.

Step S41 is performed first. The man-machine conversation text data to be analyzed refers to man-machine conversation text data whose emotion type needs to be determined in a man-machine conversation scene in the industrial field. A man-machine conversation scene is a scene in which a user converses with a machine, and the machine may be a background service robot or the like. Man-machine conversation voice data can be obtained in such a scene. The man-machine conversation text data is the text data corresponding to the man-machine conversation voice data and can be obtained by performing speech recognition on that voice data. The man-machine conversation voice data may be complete voice data including both user voice data and machine voice data, or may include only user voice data. The user voice data may be obtained from the user-end device and the machine voice data from the machine-end device, which is not specifically limited in the embodiment of the present invention. After the man-machine conversation text data to be analyzed is obtained, preprocessing operations such as cleaning, special character removal and conversion from traditional to simplified Chinese characters can be performed on it.

Then, step S42 is executed. The emotion classification model performs emotion analysis on the input man-machine conversation data to be analyzed, and obtains and outputs the corresponding emotion type. The emotion type can refer to the emotional tendency expressed in the man-machine conversation data toward target objects such as products and/or services in the industrial field, and the emotional tendency can be positive, neutral or negative. Accordingly, the emotion types may include positive emotion, neutral emotion and negative emotion. Positive emotion may be a positive response to the target object, such as "the product quality is good". Neutral emotion may be a response to the target object that is neither positive nor negative, such as "the product quality is average". Negative emotion may be a negative response to the target object, such as "the product is leaking oil".

The emotion classification model can be constructed as a neural network and obtained by training on man-machine conversation data samples carrying emotion type labels. Specifically, the emotion classification model can be constructed as a convolutional neural network and then trained as follows: a man-machine conversation data sample carrying an emotion type label is input into the model, the classification result output by the model is obtained, the difference between the classification result and the carried emotion type label is calculated, and a loss function is calculated from that difference. The model parameters of the emotion classification model are adjusted until the loss function reaches its minimum, at which point training is finished and the trained emotion classification model is obtained.
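
The preprocessing operations can be illustrated with a short sketch. The three-character conversion table is a toy stand-in for a full traditional-to-simplified dictionary, and treating every character that is neither a word character nor whitespace as a "special character" is an assumption of this sketch.

```python
import re

# Illustrative traditional-to-simplified table; a real system would use a
# complete conversion dictionary.
TRAD_TO_SIMP = str.maketrans({"產": "产", "質": "质", "機": "机"})


def preprocess(text):
    """Convert to simplified characters, drop special characters, tidy spaces."""
    text = text.translate(TRAD_TO_SIMP)
    # In Python 3, \w matches letters, digits, underscore and CJK characters.
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text


print(preprocess("  產品質量 很好!!! ##"))
```

The same function can be applied to the corpus text data before generalization, so that training samples and texts to be analyzed are normalized in the same way.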
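
The training procedure (forward pass, loss from the difference between output and label, parameter adjustment to reduce the loss) can be sketched in a few lines. To stay self-contained, the sketch replaces the convolutional neural network with a plain softmax classifier over bag-of-words counts, trained by stochastic gradient descent on a cross-entropy loss; the three samples, the learning rate and the epoch count are illustrative assumptions.

```python
import math

LABELS = ["positive", "neutral", "negative"]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def featurize(text, vocab):
    """Bag-of-words counts over a fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]


def forward(x, weights):
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])


def train(samples, epochs=300, lr=0.2):
    vocab = sorted({w for text, _ in samples for w in text.lower().split()})
    weights = [[0.0] * len(vocab) for _ in LABELS]
    losses = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for text, label in samples:
            x = featurize(text, vocab)
            probs = forward(x, weights)
            target = LABELS.index(label)
            epoch_loss += -math.log(probs[target])  # cross-entropy loss
            for k, row in enumerate(weights):
                # Gradient of the loss with respect to the score of class k.
                grad = probs[k] - (1.0 if k == target else 0.0)
                for j, xi in enumerate(x):
                    row[j] -= lr * grad * xi
        losses.append(epoch_loss / len(samples))
    return vocab, weights, losses


def predict(text, vocab, weights):
    probs = forward(featurize(text, vocab), weights)
    return LABELS[probs.index(max(probs))]


SAMPLES = [
    ("the product quality is good", "positive"),
    ("the product quality is average", "neutral"),
    ("the product is leaking oil", "negative"),
]
vocab, weights, losses = train(SAMPLES)
print(losses[0] > losses[-1])
print(predict("the product is leaking oil", vocab, weights))
```

Three samples are far too few for a usable model, which is the practical reason the training set is first enlarged by corpus generalization.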

It should be noted that, in the emotion classification model training process, the adopted human-computer dialogue data samples are obtained by generalizing the corpus text data in the industrial field by using the corpus generalization method provided in any one of the above embodiments, so that the number of the human-computer dialogue data samples can be enough for training the emotion classification model.

Corpus text data that expresses an emotional tendency in evaluations of industrial products and services can be acquired from man-machine conversation scenes in the industrial field, and preprocessing operations such as cleaning, special character removal and conversion from traditional to simplified Chinese characters can be performed on it. The preprocessed corpus text data is then generalized using the corpus generalization method provided by any one of the above embodiments. Finally, in combination with the constructed entity dictionary, the corpus text data is labeled with emotion type labels such as positive emotion, neutral emotion and negative emotion to obtain the man-machine conversation text data samples.

The man-machine conversation emotion analysis method for the industrial field provided by the embodiment of the invention comprises the following steps: acquiring human-computer conversation text data to be analyzed; and inputting the human-computer conversation text data to be analyzed into an emotion classification model to obtain the emotion type corresponding to the human-computer conversation text data to be analyzed and output by the emotion classification model. Because the human-computer conversation text data sample adopted by the emotion classification model in the training process is obtained by generalizing the corpus text data in the industrial field based on the corpus generalization method provided by any one of the embodiments, the corpus text data volume in the industrial field can be increased, and further the human-computer conversation text data sample volume is increased, so that the emotion classification model has sufficient training samples, the accuracy and stability of the emotion classification model obtained by training are ensured, and the emotion type obtained by the emotion classification model is more accurate.

FIG. 5 is a schematic structural diagram of a corpus generalization system according to an embodiment of the present invention. As shown in fig. 5, the system includes:

a first-class text corpus generating module 501, configured to obtain an initial text corpus in the industrial field, and replace an entity word in the initial text corpus to obtain a first-class text corpus;

a second-class text corpus generating module 502, configured to perform word segmentation on the initial text corpus and/or the first-class text corpus, and replace words obtained by the word segmentation based on a near-synonym of the words obtained by the word segmentation to obtain a second-class text corpus;

a third-type text corpus generating module 503, configured to perform dependency syntactic analysis on at least one of the initial text corpus, the first-type text corpus, and the second-type text corpus, and perform sentence pattern transformation on the at least one item based on an analysis result to obtain a third-type text corpus;

a corpus generalization module 504, configured to generalize the initial text corpus based on at least two of the first type of text corpus, the second type of text corpus, and the third type of text corpus.

Specifically, the corpus generalization system provided by the embodiment of the present invention can be applied to a local server and can also be applied to a cloud.

In the embodiment of the present invention, the first-type text corpus generating module may perform entity word replacement on the initial text corpus to generate the first-type text corpus.

In the embodiment of the present invention, the second-class text corpus generating module may perform word segmentation on the initial text corpus and/or the first-class text corpus, and perform near-synonym calculation and near-synonym replacement on a single word obtained after word segmentation to generate the second-class text corpus.

In this embodiment of the present invention, the third-type text corpus generating module may perform dependency syntactic analysis on at least one of the initial text corpus, the first-type text corpus, and the second-type text corpus, and after obtaining a dependency relationship of sentences, perform sentence pattern transformation on the obtained dependency relationship to generate the third-type text corpus.

In this embodiment of the present invention, the corpus generalization module may merge at least two of the first type of text corpus, the second type of text corpus, and the third type of text corpus to obtain a merged text corpus, so as to generalize the initial text corpus.

According to the corpus generalization system provided by the embodiment of the invention, the three text corpus generation modules are used for generating the new text corpus, and the corpus generalization module is used for combining the new text corpus, so that the expansion of the initial text corpus is completed.

On the basis of the foregoing embodiment, the corpus generalization system provided in the embodiment of the present invention further includes:

and the entity dictionary generating module is used for constructing an entity dictionary which has the same service scene with the initial text corpus.

The entity word recognition module is used for recognizing entity words in the initial text corpus based on an entity recognition model and/or the entity dictionary; the entity recognition model is obtained by training based on the text corpus carrying the entity word labels.

On the basis of the foregoing embodiment, in the corpus generalization system provided in the embodiment of the present invention, the first-class text corpus generating module specifically includes:

an entity slot determining submodule, configured to determine an entity slot corresponding to an entity word in the initial text corpus;

and the entity slot filling sub-module is used for selecting entity words in the entity dictionary to fill the entity slots based on the similarity between the entity words in the initial text corpus and each entity word in the entity dictionary.

the target word determining module is used for determining target words belonging to the target part of speech in the words obtained by word segmentation processing;

a near meaning word determining module, configured to calculate a near meaning word of the target word based on a word vector model;

correspondingly, the second-class text corpus generating module is further specifically configured to:

a target template obtaining module, configured to obtain a target template of the initial text corpus;

the template filling module is used for filling the target template and determining a fourth type of text corpus;

correspondingly, the corpus generalization module is specifically configured to:

the translation module is used for translating the initial text corpus firstly and then translating the initial text corpus back to determine a fifth type of text corpus;

correspondingly, the corpus generalization module is further specifically configured to:

On the basis of the foregoing embodiment, in the corpus generalization system provided in the embodiment of the present invention, the corpus generalization module is further specifically configured to:

FIG. 6 is a schematic structural diagram of a human-computer interaction emotion analysis system for industrial fields according to an embodiment of the present invention. As shown in fig. 6, the system includes:

a text data obtaining module 601, configured to obtain text data of a human-computer conversation to be analyzed;

the emotion analysis module 602 is configured to input the to-be-analyzed human-computer conversation text data to an emotion classification model, and obtain an emotion type corresponding to the to-be-analyzed human-computer conversation text data output by the emotion classification model;

the emotion classification model is obtained by training a man-machine conversation text data sample carrying an emotion type label, and the man-machine conversation text data sample is obtained by generalizing the corpus text data in the industrial field based on the corpus generalization method provided by any one of the embodiments.

Specifically, the functions of the modules in the human-computer conversation emotion analysis system for the industrial field provided in the embodiment of the present invention correspond to the operation flows of the steps in the above method embodiments one to one, and the implementation effects are also consistent.

Fig. 7 illustrates a physical structure diagram of an electronic device, and as shown in fig. 7, the electronic device may include: a processor (processor)710, a communication Interface (Communications Interface)720, a memory (memory)730, and a communication bus 740, wherein the processor 710, the communication Interface 720, and the memory 730 communicate with each other via the communication bus 740. The processor 710 can call the logic instructions in the memory 730 to execute the corpus generalization method or the human-machine conversation emotion analysis method for industrial fields provided in the above embodiments. The corpus generalization method comprises the following steps: acquiring an initial text corpus in the industrial field, and replacing entity words in the initial text corpus to obtain a first type of text corpus; performing word segmentation on the initial text corpus and/or the first type of text corpus, and replacing words obtained by word segmentation on the basis of near-synonyms of the words obtained by word segmentation to obtain a second type of text corpus; performing dependency syntax analysis on at least one of the initial text corpus, the first type of text corpus and the second type of text corpus, and performing sentence pattern transformation on the at least one item based on an analysis result to obtain a third type of text corpus; generalizing the initial text corpus based on at least two of the first type of text corpus, the second type of text corpus, and the third type of text corpus. 
The man-machine conversation emotion analysis method for the industrial field comprises the following steps: acquiring human-computer conversation text data to be analyzed; inputting the human-computer conversation text data to be analyzed into an emotion classification model to obtain an emotion type corresponding to the human-computer conversation text data to be analyzed and output by the emotion classification model; the emotion classification model is obtained by training a man-machine conversation text data sample carrying an emotion type label, and the man-machine conversation text data sample is obtained by generalization based on the corpus generalization method in any one of the embodiments.

In addition, the logic instructions in the memory 730 can be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.

In another aspect, the present invention further provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions, when the program instructions are executed by a computer, the computer being capable of executing the corpus generalization method or the man-machine conversation emotion analysis method for industrial fields provided in the above embodiments. The corpus generalization method comprises the following steps: acquiring an initial text corpus in the industrial field, and replacing entity words in the initial text corpus to obtain a first type of text corpus; performing word segmentation on the initial text corpus and/or the first type of text corpus, and replacing words obtained by word segmentation on the basis of near-synonyms of the words obtained by word segmentation to obtain a second type of text corpus; performing dependency syntax analysis on at least one of the initial text corpus, the first type of text corpus and the second type of text corpus, and performing sentence pattern transformation on the at least one item based on an analysis result to obtain a third type of text corpus; generalizing the initial text corpus based on at least two of the first type of text corpus, the second type of text corpus, and the third type of text corpus. 
The man-machine conversation emotion analysis method for the industrial field comprises the following steps: acquiring human-computer conversation text data to be analyzed; inputting the human-computer conversation text data to be analyzed into an emotion classification model to obtain an emotion type corresponding to the human-computer conversation text data to be analyzed and output by the emotion classification model; the emotion classification model is obtained by training a man-machine conversation text data sample carrying an emotion type label, and the man-machine conversation text data sample is obtained by generalization based on the corpus generalization method in any one of the embodiments.

In still another aspect, the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for generalizing a corpus or the method for analyzing human-computer interaction emotion in the industrial field, provided in the above embodiments. The corpus generalization method comprises the following steps: acquiring an initial text corpus in the industrial field, and replacing entity words in the initial text corpus to obtain a first type of text corpus; performing word segmentation on the initial text corpus and/or the first type of text corpus, and replacing words obtained by word segmentation on the basis of near-synonyms of the words obtained by word segmentation to obtain a second type of text corpus; performing dependency syntax analysis on at least one of the initial text corpus, the first type of text corpus and the second type of text corpus, and performing sentence pattern transformation on the at least one item based on an analysis result to obtain a third type of text corpus; generalizing the initial text corpus based on at least two of the first type of text corpus, the second type of text corpus, and the third type of text corpus. 
The man-machine conversation emotion analysis method for the industrial field comprises the following steps: acquiring human-computer conversation text data to be analyzed; inputting the human-computer conversation text data to be analyzed into an emotion classification model to obtain an emotion type corresponding to the human-computer conversation text data to be analyzed and output by the emotion classification model; the emotion classification model is obtained by training a man-machine conversation text data sample carrying an emotion type label, and the man-machine conversation text data sample is obtained by generalization based on the corpus generalization method in any one of the embodiments.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A corpus generalization method, comprising:

2. The corpus generalization method according to claim 1, wherein said replacing entity words in said initial text corpus further comprises:

3. The corpus generalization method according to claim 2, wherein said replacing entity words in said initial text corpus specifically comprises:

4. The corpus generalization method according to claim 1, wherein the words obtained by word segmentation are replaced based on the near-synonyms of those words to obtain a second type of text corpus, and wherein said method further comprises:

5. The corpus generalization method according to claim 1, wherein said obtaining an initial text corpus of an industrial domain further comprises:

acquiring a target template of the initial text corpus;

filling the target template, and determining a fourth type of text corpus;

6. The corpus generalization method according to claim 5, wherein said obtaining an initial text corpus of an industrial domain further comprises:

7. A human-computer dialogue emotion analysis method for industrial field is characterized by comprising:

acquiring human-computer conversation text data to be analyzed;

the emotion classification model is obtained by training a man-machine conversation text data sample carrying an emotion type label, and the man-machine conversation text data sample is obtained by generalization based on the corpus generalization method of any one of claims 1 to 6.

8. A corpus generalization system, comprising:

9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the corpus generalization method according to any one of claims 1 to 6 or the human-computer interaction emotion analysis method for industrial application according to claim 7 when executing the program.

10. A non-transitory computer readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the corpus generalization method according to any one of claims 1 to 6 or the human-machine interaction emotion analysis method for industrial fields according to claim 7.