CN111985215A - Domain phrase dictionary construction method - Google Patents

Domain phrase dictionary construction method Download PDF

Info

Publication number
CN111985215A
CN111985215A (application CN202010841791.6A)
Authority
CN
China
Prior art keywords: phrase, word, words, dictionary, phrases
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010841791.6A
Other languages
Chinese (zh)
Inventor
吕学强
孙宁
张乐
姜肇财
宋黎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Science and Technology University
China National Institute of Standardization
Original Assignee
Beijing Information Science and Technology University
China National Institute of Standardization
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Science and Technology University, China National Institute of Standardization filed Critical Beijing Information Science and Technology University
Priority to CN202010841791.6A priority Critical patent/CN111985215A/en
Publication of CN111985215A publication Critical patent/CN111985215A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis


Abstract

The application discloses a domain phrase dictionary construction method comprising the following steps: mining phrases; constructing a domain word library; and constructing a dictionary model. Mining phrases includes preprocessing and segmenting the original data, then extracting all candidate phrase sets from sentences by applying an adjacent-word-frequency phrase mining method to the segmentation results. Constructing the domain word library includes training the phrase set with the TF-IDF algorithm to obtain weighted words, and dividing the words into domain-related and domain-irrelevant words by a weight threshold. The method quantifies the degree of correlation between phrases and the domain using statistical word frequency and word weight, and combines a deep learning network with domain dictionary construction, significantly improving the robustness of the domain dictionary. It performs well in constructing consumer product domain dictionaries, improves the construction of consumer product defect domain dictionaries, and achieves high precision, recall, and F1 scores.

Description

Domain phrase dictionary construction method
Technical Field
The application relates to the technical field of text processing, in particular to a domain phrase dictionary construction method.
Background
With the development of the modern economy, online shopping has become increasingly popular and consumer products increasingly pervade daily life. While improving quality of life, different kinds of consumer goods can also exhibit various faults. Online shopping websites and apps receive a great deal of consumer product defect clue report information every day, and this defect clue information consists mostly of short texts. Accurately mining product fault description information from clue texts helps to monitor trends in the consumer product defect domain, and constructing a dictionary of the consumer product defect domain is the foundational work toward that goal. A domain dictionary expresses the key information of a professional domain with refined, short words; building one is essentially "information extraction" from text, i.e., extracting domain-related words from large amounts of unordered text and classifying them by topic.
Combined words express text topics more accurately and richly than ordinary single words. For example, compared with the three separate words "fingerprint", "unlocking", and "failure", the phrase "fingerprint unlocking failure" expresses the product fault information more concisely and accurately. Likewise, if "shared bicycle" is split into the two words "shared" and "bicycle", the original meaning is lost entirely. Compound words with such strong internal cohesion are called "domain-related words".
To date, the mainstream industrial method for constructing a domain dictionary is the rule-based expert system: experts manually craft deterministic flow rules and extract domain words from text by pattern matching. The biggest weakness of this method is that the system is difficult to maintain and extend; language constantly evolves, so the workload of manual maintenance and rule adaptation is enormous. Moreover, multiple rules easily conflict with one another, and experts from different fields are needed to resolve such conflicts.
Disclosure of Invention
The application aims to provide a domain phrase dictionary construction method. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
According to an aspect of an embodiment of the present application, there is provided a domain phrase dictionary construction method, including:
mining phrases;
constructing a domain word stock;
and constructing a dictionary model.
Further, mining phrases comprises:
preprocessing and segmenting the original data, and then extracting all candidate phrase sets from sentences by applying an adjacent-word-frequency phrase mining method to the segmentation results.
Further, mining phrases comprises:
in document M, for the word sequence T = {t0, t1, t2, ..., tn}, generating the candidate phrase set by combining adjacent words into binary and ternary phrases (t0 with t1; t1, t2 with t3; and so on). While traversing document M to generate the phrase set, the generated phrases p are counted: the number of occurrences of p1 is recorded as Cp1, ..., and that of pn as Cpn; at the same time, the number of occurrences of word tn in the entire dataset is recorded as Ctn. A formula (rendered as an image in the original) computes the importance E of each phrase from these counts;
a manually defined parameter sets the offset weight between words and phrases; whether phrase p is a high-frequency phrase is determined by its importance E, and when E exceeds a preset threshold, p is added to the candidate word library.
Further, constructing the domain word library comprises:
training the phrase set with the TF-IDF algorithm to obtain weighted words, and dividing the words into domain-related and domain-irrelevant words by a weight threshold.
Further, constructing the domain word library comprises:
computing the TF-IDF value of each vocabulary item xi,j over the document set D:

tf(i,j) = n(i,j) / Σk n(k,j)

idf(i) = log( |D| / |{ j : xi ∈ dj }| )

tfidf(i,j) = tf(i,j) × idf(i)

where n(i,j) is the number of occurrences of term xi in document dj;
constructing an important-vocabulary dictionary Dtf = { xi,j | tfidf(i,j) > θ }, where θ is the threshold for deciding whether a word is added to the dictionary: when a word's tfidf value exceeds θ, it is added to Dtf; the tfidf value of a phrase is then taken as the average of the tfidf values of its constituent words;
training the weight of each word in the sentence sequence T through the TF-IDF algorithm and constructing a phrase label library from the weight values; candidate words extracted from the corpus that do not match any high-quality domain phrase are collected into a noise-containing irrelevant word library; otherwise, they are collected into the related word library.
Further, constructing the dictionary model includes:
constructing a dictionary model based on the convolutional neural network;
the word embedding layer of the CNN-PD model converts the text and the position characteristics into word vectors containing semantic characteristics;
the convolution layer constructs the word vector into a distributed multi-dimensional feature vector H;
and H, mapping through a full connection layer to obtain the score of each word.
According to another aspect of the embodiments of the present application, there is provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the domain phrase dictionary construction method described above.
According to another aspect of embodiments of the present application, there is provided a computer-readable storage medium having a computer program stored thereon, the program being executed by a processor to implement the above-mentioned domain phrase dictionary construction method.
The technical scheme provided by one aspect of the embodiment of the application can have the following beneficial effects:
according to the method for constructing the domain phrase dictionary, the phrase and domain correlation degree is quantified by means of statistical word frequency and word weight, the deep learning network is combined with the domain dictionary constructing direction, the robustness of the domain dictionary is obviously improved, good performance is achieved in construction of the consumer product domain dictionary, construction effects of the consumer product defect domain dictionary are improved, and high accuracy, recall rate and F1 value can be achieved.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application, or may be learned by the practice of the embodiments. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
To more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present application; other drawings can be derived from them by those skilled in the art without creative effort.
FIG. 1 is a flowchart illustrating the steps of constructing a domain phrase dictionary in one embodiment of the present application;
FIG. 2 is a diagram illustrating a method for generating phrases in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a CNN-PD model in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Text mining refers to the process of mining out potential and valuable information from large-scale text data collections. With the emergence of a large amount of text data in the internet, text mining becomes an important research topic in the field of natural language processing.
The current research work focuses on applying machine learning and deep learning algorithms to domain dictionary construction directions. On the basis, the embodiment of the application extracts the syntactic and semantic features of the corpus by using the deep convolutional neural network, and simultaneously fuses text position information to improve the accuracy and generalization capability of the domain dictionary.
First, binary and ternary phrases are combined from the text using the adjacent-word-frequency analysis method, and the degree of correlation between words and the domain is judged by their frequency of occurrence. Second, high-quality phrases are filtered from the phrase results using the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm. Finally, a convolutional neural network extracts contextual semantic relations from the text, and a domain dictionary model, CNN-PD (Convolutional Neural Network-Phrase Dictionary), is constructed by incorporating word position information.
Definitions
In the document set M, a sentence sequence T = {t0, t1, t2, ..., tn} is given, where T is the result of word segmentation. A phrase starting with t0 may be composed of t0 together with t1, t2, ..., tn, giving a set of candidate phrases (formula (1), rendered as an image in the original). This set contains all possible phrases that can be generated from sentence T. The valid phrases in the set are called high-quality phrases, and the remaining phrases are called noise. High-quality phrases must meet the following requirements:
(1) High frequency: the most important feature in determining whether a phrase conveys important information about a topic is its frequency of occurrence in that topic. Phrases that occur infrequently in a topic are unlikely to matter to it; conversely, phrases that appear very frequently in a topic likely play an important role in it.
(2) Collocation: in linguistics, collocation means that a compound word appears in a corpus significantly more often than would be expected by chance. A common example is the pair of candidate collocations "high quality" and "strong quality": one might expect the two to appear with similar frequency, yet in real corpora "high quality" is regarded as the correct form and is used far more often, because it matches mainstream usage.
(3) Integrity: a phrase is considered complete when it can be interpreted as a complete semantic unit in a given document context. Both a containing phrase and its sub-phrases may be considered complete, depending on their contextual information. For example, "relational database systems", "relational databases", and "database systems" may all be valid in particular contexts.
Thus, the task of this embodiment can be stated intuitively as: use a method f to find, within the candidate phrase set, the subset of high-quality phrases satisfying the above conditions (formula (2), rendered as an image in the original). Words meeting the high-quality phrase requirements are called domain-related words; conversely, words that do not are called domain-irrelevant words.
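The candidate-set generation just described can be sketched as follows; since formula (1) is available only as an image, the left-to-right combination into binary and ternary candidates, and the function name `candidate_phrases`, are illustrative assumptions based on the surrounding description:

```python
def candidate_phrases(tokens):
    """Combine adjacent words into binary and ternary candidate phrases.

    Illustrative sketch: the patent gives formula (1) only as an image,
    so this left-to-right adjacent combination is an assumption.
    """
    candidates = []
    for i in range(len(tokens) - 1):
        candidates.append((tokens[i], tokens[i + 1]))                     # binary
        if i < len(tokens) - 2:
            candidates.append((tokens[i], tokens[i + 1], tokens[i + 2]))  # ternary
    return candidates

# Example: a segmented sentence "shared | bicycle | child | seat"
print(candidate_phrases(["shared", "bicycle", "child", "seat"]))
```

High-quality phrases would then be selected from this exhaustive candidate set by the frequency, collocation, and integrity criteria above.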
Constructing a domain phrase dictionary
The primary goal of this embodiment is to build a dictionary library of phrases in the consumer product fault information domain. For the extracted phrases to meet the requirements, the whole task is divided into two main parts. The first part is phrase mining: the input is a corpus, i.e., a character sequence of arbitrary length in a specific language and domain, and the output is a phrase-quality ranking list. The second part is dictionary construction: the input is the phrase library from the phrase mining result, and the output is the set of domain-related phrases.
The main construction steps are shown in Figure 1. First, the original data is preprocessed and segmented, and all candidate phrase sets are extracted from sentences by applying the adjacent-word-frequency phrase mining method to the segmentation results. Then, the phrase set is trained with the TF-IDF algorithm to obtain weighted words, which are divided into domain-related and domain-irrelevant words by a weight threshold. Finally, a domain dictionary classification model is obtained by training a convolutional neural network.
Adjacent word frequency phrase mining method
Owing to the characteristics of Chinese word segmentation, a phrase compound is usually segmented into two or more isolated words, and the segmented words cannot accurately express the original phrase meaning (such as "fingerprint unlocking failure" and "shared bicycle" mentioned in the introduction). The complexity of Chinese word segmentation algorithms, and the fact that segmentation varies with context, makes phrase mining all the more difficult. This embodiment therefore provides a method for mining phrases by analyzing the combination frequency of adjacent words, which effectively addresses these problems. The algorithm is briefly described as follows:
In document M, for the sentence sequence T = {t0, t1, t2, ..., tn}, the candidate phrase set is generated by combining adjacent words into binary and ternary phrases (t0 with t1; t1, t2 with t3; and so on). The generation flow is shown in Figure 2: "shared" and "bicycle" combine into the new word "shared bicycle", "child" and "seat" into "child seat", and so on. While traversing document M to generate the phrase set, the generated phrases p are counted: the number of occurrences of p1 is recorded as Cp1, ..., and that of pn as Cpn; at the same time, the number of occurrences of word tn in the entire dataset is recorded as Ctn. Formula (3) (rendered as an image in the original) computes the importance E of each phrase from these counts. In formula (3), μ is a manually defined parameter that sets the offset weight between words and phrases. Whether phrase p is a high-frequency phrase is judged by its importance E; when E exceeds a preset threshold, p is added to the candidate word library.
The algorithm is based on: the combined phrases of a language are generally composed of adjacent words and are in a form of regression from left to right; thus, the set of all possible phrases in a sentence may be exhausted in a manner that combines the phrases from left to right. On the other hand, in the task of constructing the dictionary, the high frequency is one of the important factors of the domain dictionary, and the combination correctness and the collocating property of the phrases can be ensured through the calculation of the formula (3).
Compared with the traditional text mining algorithm (such as LDA algorithm, TextRank algorithm and the like), the adjacent word frequency analysis method collects and constructs the statistical data of the phrases in the phrase mining task, and guides the segmentation of the whole document set by calculating the weight values of the phrases. Meanwhile, the method utilizes the phrase context and the phrase importance in the construction process to ensure the effectiveness of high-frequency phrases; the algorithm has obvious effect in the task of mining the subject words conforming to the text center, and has the capability of mining new words in the field and rare words in the field.
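The counting and importance-scoring step can be sketched as follows. Formula (3) is available only as an image, so the score used here, the phrase count divided by the μ-weighted sum of its constituent word counts, is an assumed stand-in that reflects the stated intuition (frequent, well-collocated phrases score high); all names are illustrative:

```python
from collections import Counter

def phrase_importance(phrase_count, word_counts, mu=0.5):
    """Assumed stand-in for the patent's image-only formula (3):
    ratio of the phrase count to the mu-weighted counts of its parts."""
    return phrase_count / (mu * sum(word_counts))

# Count words and adjacent pairs over a toy segmented corpus.
corpus = [["shared", "bicycle", "broken"], ["shared", "bicycle", "lock", "broken"]]
word_counts = Counter(w for sent in corpus for w in sent)
pair_counts = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))

p = ("shared", "bicycle")
E = phrase_importance(pair_counts[p], [word_counts[w] for w in p])
print(round(E, 3))
```

A candidate whose score E exceeds the preset threshold would be kept in the candidate word library; the rest are discarded as noise.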
Domain word library construction based on the TF-IDF algorithm
Through experimentation an important conclusion is drawn: although a large number of combined phrases can be obtained by the adjacent-word-frequency phrase mining method, most of them are inferior phrases, i.e., phrases that do not meet the standard, such as "Beijing and" or "they are", which are combinations of content words with function words. In fact, of the large number of candidate phrases, typically only about 10% are high-quality phrases, and fewer still meet all of the high-quality requirements. It is therefore necessary to establish a standard library of domain-related words.
Before the word library is constructed, the TF-IDF value of each word in the text sequence T must be calculated. The TF-IDF algorithm is a common weighting technique in information retrieval and data mining; the TF-IDF values of a text can be used to screen and filter the results of a recommendation algorithm. Specifically, TF (Term Frequency) refers to the frequency with which a given term occurs in a document, and IDF (Inverse Document Frequency) is a measure of the general importance of the term. The TF-IDF value of vocabulary item xi,j over the document set D is calculated as follows:
tf(i,j) = n(i,j) / Σk n(k,j) (4)

idf(i) = log( |D| / |{ j : xi ∈ dj }| ) (5)

tfidf(i,j) = tf(i,j) × idf(i) (6)

where n(i,j) is the number of occurrences of term xi in document dj. According to the tfidf values of the vocabulary, an important-vocabulary dictionary Dtf = { xi,j | tfidf(i,j) > θ } can be constructed, where θ is the threshold for deciding whether a word is added to the dictionary: when a word's tfidf value exceeds θ, it is added to Dtf. The tfidf value of a phrase p is taken as the average over its constituent words, per formula (7):

tfidf(p) = (1/|p|) Σ(t ∈ p) tfidf(t) (7)
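A minimal sketch of formulas (4)-(7), assuming the standard unsmoothed TF-IDF with a natural logarithm (the patent does not specify the log base or any smoothing):

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Per-document tf-idf for every term, following formulas (4)-(6):
    tf = count/doc_length, idf = log(|D| / document frequency)."""
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    n_docs = len(docs)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({t: (c / total) * math.log(n_docs / df[t])
                       for t, c in tf.items()})
    return scores

def phrase_tfidf(phrase, doc_scores):
    """Formula (7): average the tf-idf values of a phrase's words."""
    return sum(doc_scores.get(t, 0.0) for t in phrase) / len(phrase)

docs = [["screen", "cracked", "screen"], ["battery", "swollen"], ["screen", "flicker"]]
s = tfidf_scores(docs)
print(round(phrase_tfidf(("screen", "cracked"), s[0]), 4))
```

A production system would likely use a library implementation (e.g., a vectorizer with sublinear scaling and smoothing), but this form matches the formulas as reconstructed above.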
The weight of each word in the sentence sequence T is trained through the TF-IDF algorithm, and a phrase label library is constructed from the weight values. Candidate words extracted from the corpus that do not match any high-quality domain phrase are collected into a noise-containing irrelevant word library; otherwise, they are collected into the related word library.
Using noisy data as a training set is obviously not a sensible choice, so subtracting a stop-word set achieves a first round of noise reduction. If a phrase p contains any element of the stop-word set, it is discarded; for example, because a stop word appears in the phrase "Beijing and", that phrase is deleted from the candidate word library. After noise removal, the training set of the word library is constructed via the tfidf weights: according to the experimental results, words with tfidf values greater than 0.4 (i.e., θ = 0.4) that pass manual screening are added to the positive word library as positive samples, and words with tfidf values less than 0.4 are added to the negative word library as negative samples. In addition, 500 manually labeled high-quality words and phrases are added to the positive word library to increase sample diversity.
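The stop-word filtering and θ = 0.4 split described above might be sketched as follows (the stop-word set and phrase weights are toy values, and the manual-screening step is omitted):

```python
STOPWORDS = {"and", "the", "of"}   # illustrative stop-word set
THETA = 0.4                        # tfidf threshold from the experiments

def build_lexicons(weighted_phrases):
    """Split weighted candidate phrases into positive/negative libraries.

    `weighted_phrases` maps a phrase (tuple of words) to its averaged
    tfidf value; phrases containing a stop word are discarded first.
    Names are illustrative, not from the patent.
    """
    positive, negative = set(), set()
    for phrase, w in weighted_phrases.items():
        if any(tok in STOPWORDS for tok in phrase):
            continue                          # e.g. "Beijing and" is dropped
        (positive if w > THETA else negative).add(phrase)
    return positive, negative

cands = {("Beijing", "and"): 0.9, ("quality", "problem"): 0.55, ("in", "use"): 0.1}
pos, neg = build_lexicons(cands)
print(pos, neg)
```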
Dictionary model constructed based on convolutional neural network
Model training that relies solely on shallow term-frequency features of the text cannot achieve the expected effect, so this embodiment uses a deep convolutional neural network to extract syntactic and semantic features of the sentence and thereby improve task accuracy. Given a sentence sequence of length N, T = {t0, t1, t2, ..., tN}, the CNN-PD model computes a score for each word in the sentence and determines from that score whether the word is a domain-related word. The model structure is shown in Figure 3 (embedding, convolution, pooling, and score layers). First, the word embedding layer of the CNN-PD model converts text and position features into word vectors containing semantic features; then the convolutional layer builds the word vectors into a distributed multi-dimensional feature vector H; finally, H is mapped through a fully connected layer to obtain the score of each word.
(1) Word embedding layer
Given the sentence sequence T = {t0, t1, t2, ..., tN}, the word embedding layer updates the word vector corresponding to each word by back-propagation through the neural network. With the word vector matrix Wemb ∈ R^(|V|×d), where |V| is the training dictionary size and d is the manually defined word vector dimension, the word embedding layer encodes each input word into vector form by the vector-matrix product in formula (8). Here vp is a one-hot vector with 1 at position p and 0 at the remaining positions.

et = vp · Wemb (8)
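Formula (8) amounts to selecting row p of the embedding matrix; a minimal sketch with toy values (real systems use a learned matrix and an optimized lookup rather than an explicit product):

```python
def one_hot(p, size):
    """One-hot row vector v_p: 1 at position p, 0 elsewhere."""
    return [1.0 if i == p else 0.0 for i in range(size)]

def embed(p, W_emb):
    """Formula (8): e_t = v_p · W_emb, i.e. row p of the embedding matrix."""
    v = one_hot(p, len(W_emb))
    d = len(W_emb[0])
    return [sum(v[i] * W_emb[i][j] for i in range(len(W_emb))) for j in range(d)]

# Toy embedding matrix: |V| = 3 words, d = 2 dimensions (illustrative values).
W_emb = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
print(embed(1, W_emb))   # equals W_emb[1]
```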
(2) Position coding layer
In this task, the relative position of a word is an important feature: the classification boundary of related words can be determined by the positional distance between a word and the target. The relative position information of words is used to track the proximity relationship between the target word and other words, so this embodiment adopts Word Position Embedding (WPE) to improve the model. WPE encodes the relative position information of the target word and other words into vector form so that it can be fused with other features. For example, the relative positions of "in" with respect to the target words "mobile phone" and "explosion" in Figure 3 are [-1, 7]; each position is mapped to a vector of dimension dwpe, a model hyperparameter initialized with random values. The position encoding vector of sentence T is formed from these per-word vectors and then enters the convolutional layer network concatenated with the word embedding vector.
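A sketch of computing the relative positions that WPE encodes; the clipping window is an illustrative assumption, since the patent does not state how out-of-range offsets are handled:

```python
def relative_positions(target_idx, length, clip=10):
    """Relative offset of every token from the target token, clipped to a
    window so the position-embedding table stays finite (the clip value
    is an illustrative assumption, not from the patent)."""
    return [max(-clip, min(clip, i - target_idx)) for i in range(length)]

# Token 3 is the target in a 10-token sentence; e.g. token 2 has offset -1.
print(relative_positions(3, 10))
```

Each offset would then index a learned dwpe-dimensional embedding table, and the result is concatenated to the word embedding before the convolution.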
(3) Convolutional neural network
The convolutional layer performs a dot-product operation between a convolution kernel of size k and the input word vectors. To handle the indexing of words beyond sentence boundaries, the input vectors are padded with zero vectors. The convolutional network computes the j-th component of the output vector as follows:

[ZN]j = max(1≤n≤N) [ f(Wc · embedn + bc) ]j (9)

where Wc is the training weight matrix of the convolutional layer, whose output is the convolution-kernel feature with window size k, and embedn denotes the concatenated input vectors in the window at position n. The max operation in the formula maps the convolution kernel outputs to a vector ZN matched to the sentence. Finally, a softmax operation over the output matrix ZN yields the score vector of the words:

score = Softmax(ZN) (10)
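Formulas (9) and (10) can be sketched end to end as follows; the tanh nonlinearity for f, the toy shapes, and all weight values are assumptions:

```python
import math

def conv_max_softmax(embeds, Wc, bc, k=2):
    """Minimal sketch of formulas (9)-(10): slide a window of k word
    vectors, apply a linear map plus tanh, max-pool each feature over
    the sentence, then softmax. Shapes and nonlinearity are assumed."""
    def window(n):                  # k concatenated d-dim vectors from n
        flat = []
        for v in embeds[n:n + k]:
            flat.extend(v)
        return flat
    feats = []
    for n in range(len(embeds) - k + 1):
        w = window(n)
        feats.append([math.tanh(sum(wij * x for wij, x in zip(row, w)) + b)
                      for row, b in zip(Wc, bc)])
    z = [max(f[j] for f in feats) for j in range(len(Wc))]  # max over positions
    exps = [math.exp(v) for v in z]
    s = sum(exps)
    return [e / s for e in exps]                            # softmax scores

embeds = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]       # 3 words, d = 2 (toy)
Wc = [[0.1, 0.0, 0.2, 0.1], [0.0, 0.1, 0.1, 0.2]]   # 2 filters over k*d = 4 inputs
bc = [0.0, 0.0]
scores = conv_max_softmax(embeds, Wc, bc)
print([round(s, 3) for s in scores])
```

In the real CNN-PD model the input would be the word embedding concatenated with the WPE vector, and the weights would be learned by back-propagation.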
The embodiment of the application provides a relatively complete method for constructing a professional domain phrase dictionary: a method for quantifying the degree of correlation between phrases and the domain using statistical word frequency and word weight, and a method for combining a deep learning network with domain dictionary construction, which significantly improves the robustness of the domain dictionary. The method can effectively resolve many of the problems of manual dictionary construction.
The embodiment of the application extracts a single word, a combined word or a phrase associated with the defect information of the consumer goods from a large amount of defect clue report data, thereby realizing the construction of a dictionary in the defect field of the consumer goods.
The experimental data of another embodiment of the application consists of desensitized consumer product defect clue report data provided by an online shopping website, together with collected internet e-commerce product comment data: about 1.5 million clue report records (data set A) and about 3 million internet records (data set B), about 4.5 million records in total. The data covers electronic appliances, hardware and building materials, children's toys, and other goods, and each record was submitted by a real consumer; some data samples are shown in Table 1. Domain-related words determined by manual screening generally include consumer product names, fault description phrases, and the like, numbering about 1,000; domain-irrelevant words generally consist of place names, person names, and domain-irrelevant verbs, numbering about 4,000.
TABLE 1 corpus sample and artificially defined domain word sample
(Table 1 is rendered as an image in the original document.)
Evaluation criteria
This embodiment provides a method for constructing a phrase dictionary in the consumer product defect domain, divided into three steps: mining based on adjacent-word-frequency phrases; constructing a domain-related word library based on the TF-IDF algorithm; and constructing a domain dictionary model based on a convolutional neural network. Because the first two steps have no unified numerical standard, their effect is demonstrated by displaying experimental results. The third step is essentially a multi-label classification process, so this embodiment evaluates the dictionary mining experiment with Macro Average numerical indices, calculated as in formulas (11)-(13):
Macro_P = (1/|C|) · Σ_{c∈C} TP_c / (TP_c + FP_c)    (11)

Macro_R = (1/|C|) · Σ_{c∈C} TP_c / (TP_c + FN_c)    (12)

Macro_F1 = 2 · Macro_P · Macro_R / (Macro_P + Macro_R)    (13)
where TP_c denotes the number of domain-related words predicted correctly; FP_c denotes the number of words that are actually irrelevant but predicted as related; and FN_c denotes the number of words that are actually related but predicted as irrelevant.
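The macro-averaged metrics defined above can be computed directly from the per-class counts TP_c, FP_c and FN_c. The sketch below is illustrative only: it assumes two classes (domain-related / domain-unrelated) with hypothetical counts, not the experiment's actual numbers.

```python
from typing import Dict, Tuple

def macro_scores(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Macro-averaged precision, recall and F1 from per-class
    (TP_c, FP_c, FN_c) counts, following formulas (11)-(13)."""
    precisions, recalls = [], []
    for tp, fp, fn in counts.values():
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    macro_p = sum(precisions) / len(precisions)
    macro_r = sum(recalls) / len(recalls)
    denom = macro_p + macro_r
    macro_f1 = 2 * macro_p * macro_r / denom if denom else 0.0
    return macro_p, macro_r, macro_f1

# hypothetical per-class counts for the two word classes
scores = macro_scores({"related": (80, 20, 10), "unrelated": (70, 10, 20)})
```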
Results and analysis of the experiments
(1) Word frequency phrase mining
The experiment is based on the segmented corpus; domain phrases are mined from the text with the adjacent-word-frequency phrase mining method. Partial results are shown in Table 2, where entries containing spaces are phrase words: for example, "charging port" is the segmentation result "charging" + "port", whose phrase form is "charging port". E is the normalized result of formula (3). As the table shows, the high-frequency phrases include phrases related to consumer-product failures, such as "quality problem" and "product defect", but also unrelated phrases such as "may" and "in use". Notably, the method performs well on long phrase words such as "quality inspection bureau" and "consumer rights protection law"; its mining of network neologisms such as "shared bicycle" is also satisfactory.
TABLE 2 phrase mining Experimental results
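The adjacent-word-frequency phrase mining step described above can be sketched as follows. The exact importance formula (3) appears only as an image in this text, so the score E below uses a simple assumed normalisation (phrase count over the count of its rarer constituent word); the function name and the threshold value are likewise illustrative.

```python
from collections import Counter
from typing import Dict, List

def mine_phrases(sentences: List[List[str]], min_e: float = 0.1) -> Dict[str, float]:
    """Adjacent-word phrase mining: every pair of neighbouring tokens in a
    segmented sentence is a candidate phrase; a frequency score E (an
    assumed stand-in for formula (3)) filters low-importance candidates."""
    word_count = Counter(w for s in sentences for w in s)
    phrase_count = Counter(
        (s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1)
    )
    result = {}
    for (w1, w2), c in phrase_count.items():
        # assumed normalisation: phrase count over the rarer constituent word
        e = c / min(word_count[w1], word_count[w2])
        if e >= min_e:
            result[f"{w1} {w2}"] = e
    return result

# toy segmented corpus; "charging port" emerges as a high-frequency phrase
corpus = [["charging", "port", "broken"], ["charging", "port", "hot"]]
phrases = mine_phrases(corpus)
```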
(2) Word bank experimental result and analysis constructed based on TF-IDF algorithm
The experimental results of constructing the domain-related lexicon with the TF-IDF algorithm are shown in Table 3, which lists some words with higher tfidf values (the related lexicon) and some with lower values (the unrelated lexicon); the relevance threshold used for the split is 0.4.
TABLE 3 phrase thesaurus construction Experimental results
The table shows that words related to the consumer-product defect domain have higher tfidf values, such as "flash screen", "spontaneous combustion" and "explosion"; by contrast, the tfidf values of domain-independent words are generally small, e.g. "cause" and "find". Meanwhile, although "charging port" appeared frequently in the word-frequency mining experiment, its tfidf value falls below the threshold θ even though it is a domain-related word. Analysis shows this behaviour stems from the TF-IDF algorithm itself: "charging port" occurs in too many clue reports across the corpus, which enlarges the denominator term in formula (4) and so reduces the tfidf value in formula (5). This also illustrates a limitation of the TF-IDF algorithm.
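The TF-IDF split into related and unrelated lexicons can be sketched as follows, using the standard tf × idf formulation with a natural-log idf; the patent's exact formulas (4)-(5) appear only as images here, so details such as the log base and the absence of smoothing are assumptions. The sketch also reproduces the limitation noted above: a word occurring in every document (here "phone") gets idf 0 and falls below θ regardless of how often it appears.

```python
import math
from collections import Counter

def tfidf_split(docs, theta=0.4):
    """Score each word with tf * idf and split the vocabulary at the
    threshold theta: above -> related lexicon, otherwise -> unrelated."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))   # document frequency
    best = {}
    for d in docs:
        tf = Counter(d)
        for w, c in tf.items():
            # standard tf-idf; natural log is an assumption
            val = (c / len(d)) * math.log(n / df[w])
            best[w] = max(best.get(w, 0.0), val)
    related = {w for w, v in best.items() if v > theta}
    unrelated = set(best) - related
    return related, unrelated

# toy documents: "phone" occurs in every document, so its idf is 0
docs = [["explosion", "explosion", "phone"], ["phone", "cause"]]
related, unrelated = tfidf_split(docs, theta=0.4)
```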
(3) Dictionary model constructed based on convolutional neural network
To verify the experimental effect of this embodiment, a latent Dirichlet allocation (LDA) model and a support vector machine (SVM) model were run on the same corpus as baselines. In addition, an LSTM was substituted for the convolutional layer of the model as a further comparison, to verify the effectiveness of the convolutional neural network used in this embodiment. The comparative results are shown in Table 4, where "Noother" denotes statistics computed over only the domain-related and domain-independent words.
As the data in Table 4 show, the F1 value of the deep-learning methods is 6-8% higher than that of the traditional text-model methods, and the CNN-based result is 4% higher than the LSTM-based one. To show the method's behaviour on different data, experiments were run separately on data set A and data set B; the results show no obvious difference. In addition, to verify the importance of WPE in this experiment, it was treated as a variable in the comparison; the results show that the model fused with WPE clearly improves on every evaluation index relative to the original model. Comparing the CNN and LSTM results further supports the earlier conclusion: when combined with position information, the convolutional network extracts information better than the sequential LSTM.
Table 4 comparative experimental results
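A minimal numpy sketch of the forward pass of the CNN dictionary model compared above: word vectors are concatenated with a position feature (a simplified stand-in for WPE), passed through a 1-D convolution with ReLU, max-pooled into the feature vector H, and scored by a fully connected layer. All dimensions and the random weights are illustrative assumptions; a trained model would learn these parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn_score(word_vecs: np.ndarray, positions: np.ndarray,
              n_classes: int = 2) -> np.ndarray:
    """word_vecs: (seq_len, emb_dim); positions: (seq_len,) offsets.
    Returns unnormalised class scores of shape (n_classes,)."""
    seq_len, emb = word_vecs.shape
    # append the position feature to each word vector (simplified WPE)
    x = np.hstack([word_vecs, positions[:, None].astype(float)])
    k, n_filters = 3, 4                     # kernel width, filter count
    w = rng.standard_normal((n_filters, k * (emb + 1))) * 0.1
    # 1-D convolution over the sequence, with ReLU activation
    conv = np.stack([
        np.maximum(w @ x[i:i + k].ravel(), 0.0)
        for i in range(seq_len - k + 1)
    ])                                      # (seq_len - k + 1, n_filters)
    h = conv.max(axis=0)                    # max pooling -> feature vector H
    w_fc = rng.standard_normal((n_classes, n_filters)) * 0.1
    return w_fc @ h                         # fully connected scoring layer

scores = cnn_score(rng.standard_normal((6, 8)), np.arange(6))
```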
The embodiment of the application provides a method for constructing a consumer-product defect-domain phrase dictionary based on a convolutional neural network. First, a large noisy phrase corpus is built with the adjacent-word-frequency phrase mining method, and domain-related phrases are filtered by their phrase-frequency weights; second, a domain-related lexicon is constructed with the TF-IDF algorithm, reducing the cost of manual labelling; finally, a domain dictionary model based on a convolutional neural network generates the domain dictionary. The method performs well in constructing the consumer-product defect-domain dictionary, improving construction quality, and offers an effective approach for dictionary construction in other domains; it achieves high accuracy, recall and F1 values.
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. In addition, this application is not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and any descriptions of specific languages are provided above to disclose the best modes of the present application.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated in the present embodiment, the steps are not bound to a strict order and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may comprise multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which need not be performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The above-mentioned embodiments only express embodiments of the present application; their description is specific and detailed, but should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within its scope of protection. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (8)

1. A domain phrase dictionary construction method is characterized by comprising the following steps:
mining phrases;
constructing a domain word stock;
and constructing a dictionary model.
2. The method of claim 1, wherein mining phrases comprises:
preprocessing and segmenting the original data, and then extracting, from the segmentation results, the set of all phrases that may appear in the sentences by using the adjacent-word-frequency phrase mining method.
3. The method of claim 1, wherein mining phrases comprises:
in a document M, the sentence word sequence T = {t0, t1, t2, ..., tn} generates a phrase set through combinations of adjacent words (t0 with t1; t1, t2 with t3; and so on):

[formula image not reproduced: the candidate phrase set generated from T]

while traversing the document M to generate the phrase set, the generated phrases p are counted: the number of occurrences of p1 is recorded as C_p1, and the number of occurrences of pn as C_pn; at the same time, the number of occurrences of tn in the entire data set is recorded as C_tn; the importance of a phrase is then calculated by the following formula:

[formula image not reproduced: the importance E of phrase p, computed from the phrase count C_p and the constituent word counts C_t]

parameters are defined manually to set offset weights for words and phrases; whether the phrase p is a high-frequency phrase is judged by its importance E, and p is added to the candidate lexicon when E is larger than a given threshold (the threshold symbol appears only in a formula image, not reproduced here).
4. The method of claim 1, wherein the constructing a domain lexicon comprises:
training the phrase set with the TF-IDF algorithm to obtain weighted words, and dividing the words into domain-related words and unrelated words by a weight threshold.
5. The method of claim 1, wherein the constructing a domain lexicon comprises:
calculating the TF-IDF value of a word x_i,j over the document set D:

tf_i,j = n_i,j / Σ_k n_k,j

idf_i = log( |D| / |{ j : x_i ∈ d_j }| )

tfidf_i,j = tf_i,j × idf_i

constructing an important-vocabulary dictionary from the word tfidf values, D_tf = { x_i,j | tfidf_i,j > θ }, where θ is the threshold for judging whether a word is added to the dictionary: when its tfidf value is larger than θ, the word is added to D_tf; the tfidf value of a phrase is then taken as the average of the tfidf values of its constituent words:

tfidf_p = (1/m) Σ tfidf_i,j over the m words of the phrase
the weight of each word in the sentence sequence T is trained by the TF-IDF algorithm, and a phrase tag library is constructed from the weight values; candidate words extracted from the corpus that do not match any high-quality domain phrase are placed into a noisy unrelated-word library; those that do match are placed into the related-word library.
6. The method of claim 1, wherein the constructing a dictionary model comprises:
constructing a dictionary model based on the convolutional neural network;
the word embedding layer of the CNN-PD model converts the text and the position characteristics into word vectors containing semantic characteristics;
the convolution layer constructs the word vector into a distributed multi-dimensional feature vector H;
and the feature vector H is mapped through a fully connected layer to obtain the score of each word.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the method of any one of claims 1-6.
8. A computer-readable storage medium, on which a computer program is stored, characterized in that the program is executed by a processor to implement the method according to any of claims 1-6.
CN202010841791.6A 2020-08-19 2020-08-19 Domain phrase dictionary construction method Pending CN111985215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010841791.6A CN111985215A (en) 2020-08-19 2020-08-19 Domain phrase dictionary construction method


Publications (1)

Publication Number Publication Date
CN111985215A true CN111985215A (en) 2020-11-24

Family

ID=73442383

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010841791.6A Pending CN111985215A (en) 2020-08-19 2020-08-19 Domain phrase dictionary construction method

Country Status (1)

Country Link
CN (1) CN111985215A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597760A (en) * 2020-12-04 2021-04-02 光大科技有限公司 Method and device for extracting domain words in document
CN112613612A (en) * 2020-12-29 2021-04-06 合肥工业大学 Method and device for constructing green design knowledge base based on patent library
CN112613612B (en) * 2020-12-29 2022-08-02 合肥工业大学 Method and device for constructing green design knowledge base based on patent library
CN113705200A (en) * 2021-08-31 2021-11-26 中国平安财产保险股份有限公司 Method, device and equipment for analyzing complaint behavior data and storage medium
CN113705200B (en) * 2021-08-31 2023-09-15 中国平安财产保险股份有限公司 Analysis method, analysis device, analysis equipment and analysis storage medium for complaint behavior data
CN115827854A (en) * 2022-12-28 2023-03-21 数据堂(北京)科技股份有限公司 Voice abstract generation model training method, voice abstract generation method and device
CN115827854B (en) * 2022-12-28 2023-08-11 数据堂(北京)科技股份有限公司 Speech abstract generation model training method, speech abstract generation method and device
CN116257601A (en) * 2023-03-01 2023-06-13 云目未来科技(湖南)有限公司 Illegal word stock construction method and system based on deep learning


Legal Events

Date Code Title Description
PB01 Publication

WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201124