CN111985215A - Domain phrase dictionary construction method - Google Patents
- Publication number
- CN111985215A CN111985215A CN202010841791.6A CN202010841791A CN111985215A CN 111985215 A CN111985215 A CN 111985215A CN 202010841791 A CN202010841791 A CN 202010841791A CN 111985215 A CN111985215 A CN 111985215A
- Authority
- CN
- China
- Prior art keywords
- phrase
- word
- words
- dictionary
- phrases
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F40/00—Handling natural language data; G06F40/20—Natural language analysis; G06F40/237—Lexical tools; G06F40/242—Dictionaries
- G06F40/20—Natural language analysis; G06F40/205—Parsing; G06F40/216—Parsing using statistical methods
- G06F40/20—Natural language analysis; G06F40/279—Recognition of textual entities; G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/00—Handling natural language data; G06F40/30—Semantic analysis
Abstract
The application discloses a domain phrase dictionary construction method comprising the following steps: mining phrases; constructing a domain word library; and constructing a dictionary model. Mining phrases includes: preprocessing and segmenting the original data, then applying an adjacent-word-frequency phrase mining method to the segmentation results to extract the set of all phrases that may appear in the sentences. Constructing the domain word library includes: training the phrase set with the TF-IDF algorithm to obtain weighted words, and dividing the words into domain-related and unrelated words by a weight threshold. The method quantifies the degree of correlation between phrases and the domain using statistical word frequency and word weight, and combines a deep learning network with the domain dictionary construction task, which markedly improves the robustness of the domain dictionary. It performs well in constructing consumer-product domain dictionaries, improves the construction of the consumer-product defect-domain dictionary, and achieves high precision, recall, and F1 scores.
Description
Technical Field
The application relates to the technical field of text processing, in particular to a domain phrase dictionary construction method.
Background
With the development of the modern economy, online shopping has become increasingly popular, and consumer products increasingly pervade everyday life. While improving quality of life, different kinds of consumer products can also exhibit various faults. Online shopping websites and apps receive a great deal of consumer-product defect clue reports every day, and this clue information consists mostly of short texts. Accurately mining product fault descriptions from the clue texts helps monitor trends in the consumer-product defect domain, and constructing a consumer-product defect-domain dictionary is the foundational work toward that goal. A domain dictionary expresses the key information of a professional domain in refined, short terms; its content is essentially "information extraction" from text, that is, extracting domain-related words from large amounts of unstructured text and classifying them by topic.
Combined words express text topics more precisely and richly than ordinary single words. For example, compared with the three words "fingerprint", "unlocking", and "failure", the phrase "fingerprint unlocking failure" expresses the product fault information more concisely and accurately; likewise, "shared bicycle", if split into the two words "shared" and "bicycle", loses the meaning of the original phrase entirely. Such compound words with strong internal cohesion are called "domain-related words".
To date, the mainstream industrial method for constructing a domain dictionary is a rule-based expert system: experts manually formulate deterministic flow rules and extract domain words from text by pattern matching. The biggest weakness of this approach is that the system is difficult to maintain and extend; language constantly develops and changes, so the workload of manual maintenance and rule adaptation is enormous. Moreover, multiple rules easily conflict with one another, and experts from different fields are needed to resolve the conflicts.
Disclosure of Invention
The application aims to provide a domain phrase dictionary construction method. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
According to an aspect of an embodiment of the present application, there is provided a domain phrase dictionary construction method, including:
mining phrases;
constructing a domain word stock;
and constructing a dictionary model.
Further, the mining phrases comprise:
preprocessing and segmenting the original data, and then applying an adjacent-word-frequency phrase mining method to the segmentation results to extract the set of all phrases that may appear in the sentences.
Further, the mining phrases comprise:
in a document M, the sentence sequence T = {t0, t1, t2, ..., tn} generates a phrase set by combining adjacent words, t0 with t1, t1 with t2, t2 with t3, and so on; in the process of traversing the document M to generate the phrase set, the generated phrases p are counted, the number of occurrences of p1 being recorded as Cp1 and of pn as Cpn; at the same time, the number of occurrences of a word tn in the entire dataset is recorded as Ctn; a formula is then designed to calculate the importance of each phrase:
defining a parameter manually to set the offset weight between words and phrases; determining by the phrase importance E whether the phrase p is a high-frequency phrase, and adding it to the candidate word library when E is greater than a threshold ε
Further, the constructing of the domain lexicon comprises:
and (3) training the phrase set by using a TF-IDF algorithm to obtain words with weights, and dividing the words into field-related words and irrelevant words by using a weight threshold.
Further, the constructing of the domain lexicon comprises:
calculating the TF-IDF value of each vocabulary item x_i,j over the document set D:
tfidf_i,j = tf_i,j × idf_i
constructing an important vocabulary dictionary Dtf = {x_i,j | tfidf_i,j > θ} according to the tfidf values of the vocabulary, wherein θ is the threshold for deciding whether a word is added to the dictionary, and a word is added to Dtf when its tfidf value is greater than θ; the tfidf values of the words in a phrase are then averaged:
training the weight of each word in the sentence sequence T with the TF-IDF algorithm, and using the weight values to construct a phrase tag library; candidate words extracted from the corpus that do not match any high-quality domain phrase are collected into a noise-containing unrelated word library; those that do match form the related word library.
Further, the constructing the dictionary model includes:
constructing a dictionary model based on the convolutional neural network;
the word embedding layer of the CNN-PD model converts the text and the position characteristics into word vectors containing semantic characteristics;
the convolution layer constructs the word vector into a distributed multi-dimensional feature vector H;
and H, mapping through a full connection layer to obtain the score of each word.
According to another aspect of the embodiments of the present application, there is provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the domain phrase dictionary construction method described above.
According to another aspect of embodiments of the present application, there is provided a computer-readable storage medium having a computer program stored thereon, the program being executed by a processor to implement the above-mentioned domain phrase dictionary construction method.
The technical scheme provided by one aspect of the embodiment of the application can have the following beneficial effects:
According to the domain phrase dictionary construction method, the degree of correlation between phrases and the domain is quantified by statistical word frequency and word weight, and a deep learning network is combined with the domain dictionary construction task, which markedly improves the robustness of the domain dictionary; the method performs well in constructing consumer-product domain dictionaries, improves the construction of the consumer-product defect-domain dictionary, and achieves high precision, recall, and F1 scores.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application, or may be learned by the practice of the embodiments. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
To explain the embodiments of the present application or the prior-art solutions more clearly, the drawings needed for the description are briefly introduced below. The drawings described below obviously cover only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart illustrating the steps of constructing a domain phrase dictionary in one embodiment of the present application;
FIG. 2 is a diagram illustrating a method for generating phrases in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a CNN-PD model in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Text mining refers to the process of mining out potential and valuable information from large-scale text data collections. With the emergence of a large amount of text data in the internet, text mining becomes an important research topic in the field of natural language processing.
The current research work focuses on applying machine learning and deep learning algorithms to domain dictionary construction directions. On the basis, the embodiment of the application extracts the syntactic and semantic features of the corpus by using the deep convolutional neural network, and simultaneously fuses text position information to improve the accuracy and generalization capability of the domain dictionary.
First, binary and ternary phrases are combined from the text using the adjacent-word-frequency analysis method, and the degree of correlation between words and the domain is judged by their occurrence frequency; second, high-quality phrases are filtered from the phrase results with the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm; finally, a convolutional neural network extracts contextual semantic relations from the text, and a domain dictionary model, CNN-PD (Convolutional Neural Network-Phrase Dictionary), is constructed by combining the position information of words.
Definition of
In a document set M, given a sentence sequence T = {t0, t1, t2, ..., tn}, where T is the result of word segmentation, a phrase starting with t0 may be composed of t0 together with t1, t2, ..., tn, and is noted as in equation (1).
The set of phrases generated this way contains every phrase that sentence T may produce. The valid phrases in this set are called high-quality phrases, and the remaining phrases are called noise. A high-quality phrase must meet the following requirements:
(1) High frequency: the most important feature in judging whether a phrase conveys important information about a topic is its frequency of occurrence within the topic. Phrases that occur rarely in a topic are likely irrelevant to it; conversely, phrases that appear very frequently in a topic are likely to play an important role in it.
(2) Collocation: in linguistics, collocation means that a compound word appears in a corpus significantly more often than it would by chance. A common example compares two candidate collocations such as "high quality" and "strong quality": one might expect the two to appear with similar frequency, but in the corpus "high quality" is the accepted form and is used far more often, so it better matches mainstream usage.
(3) Integrity: a phrase is considered complete when it can be interpreted as a complete semantic unit in the given document context. A phrase and its sub-phrases may each count as complete, depending on their contextual information: for example, "relational database systems", "relational databases", and "database systems" may all be valid in a particular context.
Thus, the task of the present embodiment can be stated as: using a method f, find within the set of candidate phrases the subset of high-quality phrases satisfying the above conditions (as shown in equation (2)). Words that meet the high-quality phrase requirements are called domain-related words; words that do not are called domain-unrelated words.
Constructing a domain phrase dictionary
The primary goal of the present embodiment is to build a dictionary library of consumer-product fault-information domain phrases. So that the extracted phrases meet the requirements, the whole task is divided into two main parts. The first part is phrase mining: its input is a corpus, a character sequence of arbitrary length in a specific language and domain, and its output is a phrase quality ranking. The second part is dictionary construction: its input is the phrase word library from the mining result, and its output is the related phrases of the domain.
The main construction steps are shown in FIG. 1. First, the original data is preprocessed and segmented, and an adjacent-word-frequency phrase mining method is applied to the segmentation results to extract the set of all phrases that may appear in the sentences. Next, the phrase set is trained with the TF-IDF algorithm to obtain weighted words, which are divided into domain-related and unrelated words by a weight threshold. Finally, a domain dictionary classification model is obtained by training a convolutional neural network.
Adjacent word frequency phrase mining method
Due to the characteristics of Chinese word segmentation, a combined phrase is usually segmented into two or more isolated words, and the segmented words cannot accurately express the original phrase meaning (such as "fingerprint unlocking failure" and "shared bicycle" mentioned above). The complexity of Chinese word segmentation algorithms, and their variation with context, make phrase mining harder still. This embodiment therefore provides a method that mines phrases by analyzing the combination frequency of adjacent words, which effectively addresses these problems. The algorithm is briefly described as follows:
In a document M, the sentence sequence T = {t0, t1, t2, ..., tn} generates a phrase set by combining adjacent words: t0 with t1, t1 with t2, t2 with t3, and so on. The generation flow is shown in FIG. 2: "shared" and "bicycle" combine into the new word "shared bicycle", "child" and "seat" into "child seat", and so on. While traversing the document M to generate the phrase set, each generated phrase p is counted: the number of occurrences of p1 is recorded as Cp1, and of pn as Cpn; at the same time, the number of occurrences of a word tn in the entire dataset is recorded as Ctn. Equation (3) calculates the importance of a phrase from these counts:
in equation (3): μ is a manually defined parameter that sets the offset weight for words and phrases. Judging whether the phrase p is a high-frequency phrase according to the phrase importance degree E, and adding the phrase p into the candidate word bank when the E is larger than (E >)The algorithm is based on: the combined phrases of a language are generally composed of adjacent words and are in a form of regression from left to right; thus, the set of all possible phrases in a sentence may be exhausted in a manner that combines the phrases from left to right. On the other hand, in the task of constructing the dictionary, the high frequency is one of the important factors of the domain dictionary, and the combination correctness and the collocating property of the phrases can be ensured through the calculation of the formula (3).
Compared with traditional text mining algorithms (such as LDA and TextRank), the adjacent-word-frequency analysis method collects and builds phrase statistics during the phrase mining task and guides the segmentation of the whole document set by computing phrase weights. At the same time, the method uses phrase context and phrase importance during construction to guarantee the validity of high-frequency phrases. The algorithm is notably effective at mining topic words that match the center of the text, and is capable of mining new domain words and rare domain words.
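The adjacent-word mining step described above can be sketched as follows. The importance formula of equation (3) is not reproduced in the text, so the scoring used here, comparing a pair's count against the counts of its component words, is an illustrative assumption, as are the parameter defaults.

```python
from collections import Counter

def mine_phrases(segmented_sentences, mu=1.0, epsilon=0.5):
    """Combine adjacent words into binary phrase candidates and score them.

    mu plays the role of the manually set offset weight and epsilon the
    importance threshold; the scoring formula itself is an assumption,
    not the patent's equation (3).
    """
    word_counts = Counter()    # C_t: word occurrences over the whole dataset
    phrase_counts = Counter()  # C_p: occurrences of each adjacent-word pair
    for sent in segmented_sentences:
        word_counts.update(sent)
        for a, b in zip(sent, sent[1:]):
            phrase_counts[(a, b)] += 1
    total = sum(word_counts.values())
    candidates = {}
    for (a, b), c_p in phrase_counts.items():
        # importance E is high when the pair co-occurs more often than
        # the frequencies of its parts would suggest by chance
        e = c_p * total / (mu * word_counts[a] * word_counts[b])
        if e > epsilon:
            candidates[(a, b)] = e
    return candidates
```

With three sentences containing "shared bicycle", the pair is retained as a candidate because its co-occurrence count dominates the independent frequencies of "shared" and "bicycle".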
Field word library construction based on TF-IDF algorithm
Experimentation yields an important conclusion: although the adjacent-word-frequency phrase mining method produces a large number of combined phrases, most of them are inferior phrases, that is, phrases that do not meet the standard, for example "Beijing and", "they are", and similar combinations of content words with function words. In fact, of the large number of candidate phrases, typically only about 10% are high-quality phrases, and even fewer meet the domain requirements. It is therefore necessary to establish a standard domain-related word library.
Before constructing the word library, the TF-IDF value of each word in the text sequence T must be calculated. The TF-IDF algorithm is a common weighting technique in information retrieval and data mining, and the TF-IDF values of a text can be used to screen and filter the results of a recommendation algorithm. Specifically, TF (term frequency) is the frequency with which a given word occurs in a document, and IDF (inverse document frequency) is a measure of the general importance of the word. The TF-IDF value of a vocabulary item x_i,j over the document set D is calculated as follows:
tfidf_i,j = tf_i,j × idf_i (6)
According to the tfidf values of the vocabulary, an important vocabulary dictionary Dtf = {x_i,j | tfidf_i,j > θ} can be constructed, where θ is the threshold for deciding whether a word is added to the dictionary: when its tfidf value is greater than θ, the word is added to Dtf. The tfidf value of a phrase is the average over its words, computed with equation (7):
The weight of each word in the sentence sequence T is trained with the TF-IDF algorithm, and the weight values are used to construct a phrase tag library. Candidate words extracted from the corpus that do not match any high-quality domain phrase are collected into a noise-containing unrelated word library; those that do match form the related word library.
Using noisy data as a training set is clearly unwise, so a stop-word removal step provides preliminary noise reduction: if a phrase p contains any element of the stop-word set, it is discarded. For example, "Beijing and" contains a stop word, so the phrase is deleted from the candidate word library. After noise removal, the training set of the word library is built from the tfidf weights. Based on the experimental results, words with tfidf values greater than 0.4 (that is, θ = 0.4) that pass manual screening are added to the positive word library as positive samples, and words with tfidf values less than 0.4 are added to the negative word library as negative samples. In addition, 500 manually labeled high-quality words and phrases are added to the positive word library to increase sample diversity.
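The word library construction described above can be sketched as follows. The exact tf and idf variants are not spelled out in the text, so the per-document tf and smoothed idf used here are assumptions, and the manual-screening step is omitted.

```python
import math
from collections import Counter

def build_word_library(documents, stop_words, theta=0.4):
    """Split candidate words into related (positive) and unrelated
    (negative) libraries by tf-idf weight, after stop-word filtering.

    theta=0.4 follows the threshold reported in the text; the smoothed
    idf log(N / (1 + df)) is an illustrative assumption.
    """
    n_docs = len(documents)
    df = Counter()                       # document frequency of each word
    for doc in documents:
        df.update(set(doc))
    positive, negative = set(), set()
    for doc in documents:
        tf = Counter(doc)
        for word, count in tf.items():
            if word in stop_words:
                continue                 # preliminary noise reduction
            tfidf = (count / len(doc)) * math.log(n_docs / (1 + df[word]))
            (positive if tfidf > theta else negative).add(word)
    return positive, negative
```

A word concentrated in one document lands in the positive library, while a word spread across all documents (idf near zero) lands in the negative one; stop words appear in neither.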
Dictionary model constructed based on convolutional neural network
Model training that relies solely on the shallow term-frequency features of the text cannot achieve the expected effect, so this embodiment uses a deep convolutional neural network to extract syntactic and semantic features of sentences and improve task accuracy. Given a sentence sequence T = {t0, t1, t2, ..., tN} of length N, the CNN-PD model calculates a score for each word in the sentence and decides by that score whether the word is a domain-related word. The model structure is shown in FIG. 3, where embedding, convolution, pooling, and scores denote the corresponding layers. First, the word embedding layer of the CNN-PD model converts the text and position features into word vectors containing semantic features; then, the convolutional layer builds the word vectors into a distributed multi-dimensional feature vector H; finally, H is mapped through a fully connected layer to obtain the score of each word.
(1) Word embedding layer
Given the sentence sequence T = {t0, t1, t2, ..., tN}, the word embedding layer updates the word vector of each word through back-propagation of the neural network. With the word-vector matrix Wemb of size V × d, where V is the training dictionary size and d is the manually defined word-vector dimension, the word embedding layer encodes an input word into vector form by the vector dot product in equation (8), in which vp is a one-hot vector with 1 at position p and 0 at the remaining positions:
et = vp Wemb (8)
(2) Position coding layer
In this task, the relative position of a word is an important feature: the positional distance between a word and the target can determine the classification boundary of related words. The relative position information of words is used to track the proximity between the target word and other words. This embodiment adopts Word Position Embedding (WPE) to improve the model: WPE encodes the relative position of the target word and other words into vector form so that it can be fused with other features. For example, the relative positions of "in" to the target words "mobile phone" and "explosion" in FIG. 3 are [-1, 7]. The position values are mapped to a vector of dimension d_wpe, a model hyper-parameter initialized with random values. The position-encoding vectors of sentence T are then concatenated with the word-embedding vectors and fed into the convolutional layer.
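A minimal sketch of computing the relative-position features that WPE encodes; the token sequence and target words in the usage below are illustrative stand-ins echoing the "mobile phone"/"explosion" example.

```python
def relative_positions(tokens, targets):
    """For each token, its signed offset to every target word.

    These integer offsets are what a WPE layer would subsequently map
    through a d_wpe-dimensional embedding table before concatenation
    with the word vectors.
    """
    idx = {t: tokens.index(t) for t in targets}  # first occurrence of each target
    return [[i - idx[t] for t in targets] for i in range(len(tokens))]
```

For the sentence ["the", "phone", "suddenly", "explosion"] with targets "phone" and "explosion", the word "suddenly" receives the offsets [1, -1].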
(3) Convolutional neural network
The convolutional layer performs the dot product of the vector matrix between a convolution kernel of size k and the input word vectors. To handle the indexing of words beyond sentence boundaries, the input vectors are padded with zero vectors:
[ZN]j = max over 1 < n < N of [f(Wc en + bc)] (9)
where Wc is the training weight matrix of the convolutional layer, whose output is the convolution-kernel feature with window size k. The max operation in the formula maps the convolution-kernel output vectors to a vector ZN with the same length as the sentence. Finally, a softmax operation on the output matrix ZN yields the score vector of the words:
score = Softmax(ZN) (10)
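The layers above can be sketched end to end as a shape-level forward pass. All weights here are random and untrained, and the layer sizes (V, d, d_wpe, k, h) are illustrative values rather than the patent's; the sketch shows how embedding, position features, convolution, max pooling, and softmax fit together, not trained behaviour.

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn_pd_forward(token_ids, pos_feats, V=1000, d=8, d_wpe=4, k=3, h=16):
    """One forward pass in the spirit of CNN-PD, with random weights."""
    N = len(token_ids)
    W_emb = rng.standard_normal((V, d))       # word embedding matrix (eq. 8)
    W_wpe = rng.standard_normal((20, d_wpe))  # position embedding table (WPE)
    # embed words, embed positions, concatenate: N x (d + d_wpe)
    E = np.concatenate([W_emb[token_ids], W_wpe[pos_feats]], axis=1)
    # zero-pad so the window never indexes outside sentence boundaries
    pad = np.zeros((k // 2, d + d_wpe))
    Ep = np.vstack([pad, E, pad])
    W_c = rng.standard_normal((h, k * (d + d_wpe)))  # convolution weights
    b_c = np.zeros(h)
    # convolution: each window of k token vectors -> h features, tanh nonlinearity
    Z = np.array([np.tanh(W_c @ Ep[i:i + k].ravel() + b_c) for i in range(N)])
    H = Z.max(axis=0)                      # max pooling over positions
    return np.exp(H) / np.exp(H).sum()     # softmax score vector (eq. 10)
```

The output is a probability vector over the h pooled features; in the trained model a fully connected layer would map H to per-word scores instead.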
The embodiment of the application provides a relatively complete method for constructing a domain phrase professional dictionary; a method for quantifying the correlation degree of phrases and fields by using statistical word frequency and word weight is provided; a method for combining a deep learning network with the direction of constructing a domain dictionary is provided, and the robustness of the domain dictionary is obviously improved. The method provided by the embodiment of the application can effectively solve a plurality of problems of manual dictionary construction.
The embodiment of the application extracts a single word, a combined word or a phrase associated with the defect information of the consumer goods from a large amount of defect clue report data, thereby realizing the construction of a dictionary in the defect field of the consumer goods.
The experimental data of another embodiment of the application are desensitized consumer-product defect clue reports provided by an online shopping website, together with collected internet e-commerce product review data: about 1.5 million clue reports (data set A) and about 3 million internet records (data set B), roughly 4.5 million records in total. The data cover electronic appliances, hardware and building materials, children's toys, and other goods, and each record was submitted by a real consumer; sample data are shown in Table 1. The domain-related words determined by manual screening mainly comprise consumer-product names and fault-description phrases, numbering about 1,000; the domain-unrelated words mainly consist of place names, person names, and domain-unrelated verbs, numbering about 4,000.
TABLE 1 Corpus samples and manually defined domain word samples
Evaluation criteria
The embodiment provides a method for constructing a phrase dictionary in the consumer-product defect domain, which can be divided into three steps: phrase mining based on adjacent word frequency; construction of a domain-related word bank based on the TF-IDF algorithm; and construction of a domain dictionary model based on a convolutional neural network. Because the first two steps have no unified numerical standard, their effect is demonstrated by displaying experimental results; the third step is essentially a multi-label classification process, so this embodiment evaluates the dictionary mining experiment with macro-averaged numerical indices, calculated as in formulas (11)-(13):
P_macro = (1/|C|) Σ_c TP_c / (TP_c + FP_c) (11)
R_macro = (1/|C|) Σ_c TP_c / (TP_c + FN_c) (12)
F1_macro = 2 · P_macro · R_macro / (P_macro + R_macro) (13)
where TP_c is the number of domain-related words predicted correctly for class c; FP_c is the number of words that are actually unrelated but predicted as related; and FN_c is the number of words that are actually related but predicted as unrelated.
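The macro-averaged indices can be computed directly from the per-class counts defined above; a minimal sketch (the label names and toy predictions are illustrative, not from the source):

```python
def macro_prf(y_true, y_pred, labels):
    """Macro-averaged precision, recall, and F1 following formulas
    (11)-(13): per-class P and R from TP_c, FP_c, FN_c, then an
    unweighted mean over the classes."""
    ps, rs = [], []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        ps.append(tp / (tp + fp) if tp + fp else 0.0)
        rs.append(tp / (tp + fn) if tp + fn else 0.0)
    p, r = sum(ps) / len(ps), sum(rs) / len(rs)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Illustrative labels: "rel" = domain-related word, "irr" = unrelated.
y_true = ["rel", "rel", "irr", "irr", "rel"]
y_pred = ["rel", "irr", "irr", "rel", "rel"]
p, r, f1 = macro_prf(y_true, y_pred, ["rel", "irr"])
```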
Results and analysis of the experiments
(1) Word frequency phrase mining
The experiment is based on the segmented corpus; domain phrases are mined from the text with the adjacent word-frequency phrase mining method. Partial results are shown in Table 2, where entries containing a space are phrase words: for example, "charging port" denotes the segmentation result "charging" + "port", whose phrase form is "charging port". E is the normalized result of formula (3). As the table shows, the high-frequency phrases include phrases related to consumer-product failures, such as "quality problem" and "product defect", but also unrelated phrases such as "may" and "in use". Notably, the method performs well on long phrase words such as "quality inspection bureau" and "consumer rights protection law"; the mining results for network neologisms such as "shared bicycle" are also satisfactory.
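The adjacent word-frequency mining step can be sketched as follows (a toy illustration: the adjacent-pair counting follows the description above, but the normalization is a simple stand-in, since formula (3) is not reproduced in this excerpt, and the corpus words are hypothetical):

```python
from collections import Counter

def mine_adjacent_phrases(segmented_sentences, min_count=2):
    """Count candidate phrases formed from adjacent segmented words and
    score each pair by a simple normalized frequency (a stand-in for the
    source's formula (3), which is not reproduced in this excerpt)."""
    word_counts = Counter()
    phrase_counts = Counter()
    for sent in segmented_sentences:
        word_counts.update(sent)
        phrase_counts.update(zip(sent, sent[1:]))   # adjacent word pairs
    scores = {}
    for (w1, w2), c in phrase_counts.items():
        if c < min_count:
            continue                                # drop rare candidates
        # Normalize pair frequency by the component word frequencies.
        scores[(w1, w2)] = c / (word_counts[w1] * word_counts[w2]) ** 0.5
    return scores

corpus = [["charging", "port", "broken"],
          ["charging", "port", "loose"],
          ["port", "broken"]]
scores = mine_adjacent_phrases(corpus)
```

Pairs seen fewer than min_count times are discarded, so only "charging port" and "port broken" survive in this toy corpus.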
TABLE 2 phrase mining Experimental results
(2) Word bank experimental result and analysis constructed based on TF-IDF algorithm
The experimental results of constructing the domain-related word bank based on the TF-IDF algorithm are shown in Table 3. The table lists some words with higher tfidf values (forming the related-word lexicon) and some with lower tfidf values (forming the unrelated-word lexicon); the relevance threshold used for the split is 0.4.
TABLE 3 phrase thesaurus construction Experimental results
From the above table it can be concluded that words related to the consumer-product defect domain have higher tfidf values, such as "flash screen", "spontaneous combustion", and "explosion"; in contrast, the tfidf values of domain-unrelated words are generally small, such as "cause", "find", and "cause a problem". Meanwhile, although "charging port" occurred frequently in the word-frequency mining experiment, its tfidf value falls below the threshold θ even though it is a domain-related word. Observation and analysis show that this is mainly due to the TF-IDF algorithm itself: the number of clue reports containing "charging port" in the corpus is so large that the denominator term in formula (4) grows, lowering the tfidf value in formula (5). This also illustrates a limitation of the TF-IDF algorithm.
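The thresholded TF-IDF split described above can be sketched as follows (the source's exact formulas (4)-(5) are not reproduced in this excerpt, so the idf variant, the toy θ, and the corpus words below are all illustrative assumptions):

```python
import math
from collections import Counter

def tfidf_lexicon(docs, theta):
    """Score each word by tf * idf and split the vocabulary at the
    threshold theta into domain-related and unrelated word sets.
    The idf variant log(N / (1 + df)) is an illustrative choice; the
    source's exact formulas (4)-(5) are not reproduced here."""
    n_docs = len(docs)
    df = Counter()                              # document frequency per word
    for doc in docs:
        df.update(set(doc))
    best = {}                                   # best tfidf seen per word
    for doc in docs:
        tf = Counter(doc)
        for w, c in tf.items():
            tfidf = (c / len(doc)) * math.log(n_docs / (1 + df[w]))
            best[w] = max(best.get(w, 0.0), tfidf)
    related = {w for w, v in best.items() if v > theta}
    return related, set(best) - related

docs = [["flash", "screen", "spontaneous", "combustion"],
        ["cause", "find", "cause"],
        ["flash", "screen", "explosion"]]
related, unrelated = tfidf_lexicon(docs, theta=0.05)   # toy threshold
```

With such a tiny corpus, words that appear in most documents (e.g. "flash") receive an idf near zero and land in the unrelated set regardless of their importance — a miniature version of the "charging port" limitation discussed above.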
(3) Dictionary model constructed based on convolutional neural network
In order to verify the experimental effect of this embodiment, a latent Dirichlet allocation (LDA) model and a support vector machine (SVM) model were run on the same corpus as comparison experiments. In addition, an LSTM was substituted for the convolutional layer of the model as a further set of comparison experiments, to verify the effectiveness of the convolutional neural network used in this embodiment. The results of the comparison experiments are shown in Table 4, where the "Noother" rows are statistics computed over only the domain-related and domain-unrelated words.
As can be seen from the data in Table 4, the F1 value of the deep-learning methods is 6-8% higher than that of the traditional text-model methods, and the CNN-based result is a further 4% higher than the LSTM model's. To demonstrate the method's effect on different data sets, experiments were run separately on data set A and data set B, with no significant difference in the results. In addition, to verify the importance of WPE in this experiment, it was treated as a variable in the comparison experiments; the results show that the model fused with WPE improves markedly on all evaluation indices over the original model. Finally, comparing the CNN and LSTM results supports the conclusion drawn above: when position information is incorporated, the convolutional network extracts information more effectively than the sequential network.
Table 4 comparative experimental results
The embodiment of the application provides a convolutional-neural-network-based method for constructing a phrase dictionary in the consumer-product defect domain. First, a large volume of noisy phrase text is built with the adjacent word-frequency phrase mining method, and domain-related phrases are filtered by their phrase-frequency weights; second, a domain-related word bank is constructed with the TF-IDF algorithm, reducing the cost of manual labeling; finally, a domain dictionary model based on a convolutional neural network generates the domain dictionary. The method performs well in constructing the consumer-product domain dictionary, improves the construction effect for the consumer-product defect domain, and can offer effective ideas and solutions for dictionary construction in other domains, achieving high precision, recall, and F1 values.
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. In addition, this application is not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and any descriptions of specific languages are provided above to disclose the best modes of the present application.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
It should be understood that, although the steps in the flowcharts of the figures are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated in the present embodiment, no strict ordering applies, and the steps may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may comprise multiple sub-steps or stages, which need not be completed at the same time but may be performed at different times, and which need not be performed sequentially but may be performed in turns or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The above-mentioned embodiments merely express several embodiments of the present application; their description is specific and detailed, but it should not therefore be construed as limiting the scope of the application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within its scope of protection. Therefore, the protection scope of the present application shall be subject to the appended claims.
Claims (8)
1. A domain phrase dictionary construction method is characterized by comprising the following steps:
mining phrases;
constructing a domain word stock;
and constructing a dictionary model.
2. The method of claim 1, wherein mining phrases comprises:
preprocessing and segmenting the original data, and then extracting all candidate phrase sets that may appear in sentences by applying an adjacent word-frequency phrase mining method to the segmentation results.
3. The method of claim 1, wherein mining phrases comprises:
in a document M, the sentence sequence T = {t0, t1, t2, ..., tn} generates a phrase set by combining adjacent words (t0 with t1, t1 with t2, and so on); while traversing the document M to generate the phrase set, the generated phrases p are counted, the number of occurrences of p1 being recorded as C_p1 and the number of occurrences of pn as C_pn; at the same time, the number of occurrences of tn in the entire dataset is recorded as C_tn; the following formula is designed to calculate the importance of a phrase:
4. The method of claim 1, wherein the constructing a domain lexicon comprises:
and (3) training the phrase set by using a TF-IDF algorithm to obtain words with weights, and dividing the words into field-related words and irrelevant words by using a weight threshold.
5. The method of claim 1, wherein the constructing a domain lexicon comprises:
calculating the TF-IDF value of each word x_i,j over the document set D:
tfidf_i,j = tf_i,j × idf_i
constructing an important-vocabulary dictionary D_tf = {x_i,j | tfidf_i,j > θ} from the tfidf values of the words, where θ is the threshold for deciding whether a word is added to the dictionary: when a word's tfidf value is greater than θ, the word is added to D_tf; the tfidf values of a phrase's component words are then averaged:
training the weight of each word in the sentence sequence T by the TF-IDF algorithm and constructing a phrase tag library from the weight values; candidate words extracted from the corpus that do not match any high-quality domain phrase are placed in a noise-containing unrelated-word library; those that do match are placed in the related-word library.
6. The method of claim 1, wherein the constructing a dictionary model comprises:
constructing a dictionary model based on the convolutional neural network;
the word embedding layer of the CNN-PD model converts the text and the position characteristics into word vectors containing semantic characteristics;
the convolution layer constructs the word vector into a distributed multi-dimensional feature vector H;
and mapping H through a fully connected layer to obtain the score of each word.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the method of any one of claims 1-6.
8. A computer-readable storage medium, on which a computer program is stored, characterized in that the program is executed by a processor to implement the method according to any of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010841791.6A CN111985215A (en) | 2020-08-19 | 2020-08-19 | Domain phrase dictionary construction method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111985215A true CN111985215A (en) | 2020-11-24 |
Family
ID=73442383
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010841791.6A Pending CN111985215A (en) | 2020-08-19 | 2020-08-19 | Domain phrase dictionary construction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111985215A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112597760A (en) * | 2020-12-04 | 2021-04-02 | 光大科技有限公司 | Method and device for extracting domain words in document |
CN112613612A (en) * | 2020-12-29 | 2021-04-06 | 合肥工业大学 | Method and device for constructing green design knowledge base based on patent library |
CN112613612B (en) * | 2020-12-29 | 2022-08-02 | 合肥工业大学 | Method and device for constructing green design knowledge base based on patent library |
CN113705200A (en) * | 2021-08-31 | 2021-11-26 | 中国平安财产保险股份有限公司 | Method, device and equipment for analyzing complaint behavior data and storage medium |
CN113705200B (en) * | 2021-08-31 | 2023-09-15 | 中国平安财产保险股份有限公司 | Analysis method, analysis device, analysis equipment and analysis storage medium for complaint behavior data |
CN115827854A (en) * | 2022-12-28 | 2023-03-21 | 数据堂(北京)科技股份有限公司 | Voice abstract generation model training method, voice abstract generation method and device |
CN115827854B (en) * | 2022-12-28 | 2023-08-11 | 数据堂(北京)科技股份有限公司 | Speech abstract generation model training method, speech abstract generation method and device |
CN116257601A (en) * | 2023-03-01 | 2023-06-13 | 云目未来科技(湖南)有限公司 | Illegal word stock construction method and system based on deep learning |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20201124 |