CN111985215A - Domain phrase dictionary construction method - Google Patents
- Publication number
- CN111985215A CN111985215A CN202010841791.6A CN202010841791A CN111985215A CN 111985215 A CN111985215 A CN 111985215A CN 202010841791 A CN202010841791 A CN 202010841791A CN 111985215 A CN111985215 A CN 111985215A
- Authority
- CN
- China
- Prior art keywords
- phrase
- word
- words
- dictionary
- phrases
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F40/00—Handling natural language data; G06F40/20—Natural language analysis; G06F40/237—Lexical tools; G06F40/242—Dictionaries
- G06F40/20—Natural language analysis; G06F40/205—Parsing; G06F40/216—Parsing using statistical methods
- G06F40/20—Natural language analysis; G06F40/279—Recognition of textual entities; G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/00—Handling natural language data; G06F40/30—Semantic analysis
Abstract
The application discloses a domain phrase dictionary construction method comprising the following steps: mining phrases; constructing a domain word library; and constructing a dictionary model. Mining phrases includes: preprocessing and segmenting the original data, then applying an adjacent-word-frequency phrase mining method to the segmentation results to extract the set of all phrases that may appear in the sentences. Constructing the domain word library includes: training the phrase set with the TF-IDF algorithm to obtain weighted words, and dividing the words into domain-related and unrelated words by a weight threshold. The method quantifies the degree of correlation between phrases and the domain using statistical word frequency and word weight, and combines a deep learning network with the domain dictionary construction task, which markedly improves the robustness of the domain dictionary. It performs well in constructing consumer-product domain dictionaries, improves the construction of the consumer-product defect-domain dictionary, and achieves high precision, recall, and F1 scores.
Description
Technical Field
The application relates to the technical field of text processing, in particular to a domain phrase dictionary construction method.
Background
With the development of the modern economy, online shopping has become increasingly popular, and consumer products increasingly pervade everyday life. While improving quality of life, different kinds of consumer products can also exhibit various faults. Online shopping websites and apps receive a great deal of consumer-product defect clue reports every day, and this clue information consists mostly of short texts. Accurately mining product fault descriptions from the clue texts helps monitor trends in the consumer-product defect domain, and constructing a consumer-product defect-domain dictionary is the foundational work toward that goal. A domain dictionary expresses the key information of a professional domain in refined, short terms; its content is essentially "information extraction" from text, that is, extracting domain-related words from large amounts of unstructured text and classifying them by topic.
Combined words express text topics more precisely and richly than ordinary single words. For example, compared with the three words "fingerprint", "unlocking", and "failure", the phrase "fingerprint unlocking failure" expresses the product fault information more concisely and accurately; likewise, "shared bicycle", if split into the two words "shared" and "bicycle", loses the meaning of the original phrase entirely. Such compound words with strong internal cohesion are called "domain-related words".
To date, the mainstream industrial method for constructing a domain dictionary is a rule-based expert system: experts manually formulate deterministic flow rules and extract domain words from text by pattern matching. The biggest weakness of this approach is that the system is difficult to maintain and extend; language constantly develops and changes, so the workload of manual maintenance and rule adaptation is enormous. Moreover, multiple rules easily conflict with one another, and experts from different fields are needed to resolve the conflicts.
Disclosure of Invention
The application aims to provide a domain phrase dictionary construction method. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
According to an aspect of an embodiment of the present application, there is provided a domain phrase dictionary construction method, including:
mining phrases;
constructing a domain word stock;
and constructing a dictionary model.
Further, the mining phrases comprise:
preprocessing and segmenting the original data, and then applying an adjacent-word-frequency phrase mining method to the segmentation results to extract the set of all phrases that may appear in the sentences.
Further, the mining phrases comprise:
in a document M, the sentence sequence T = {t0, t1, t2, ..., tn} generates a phrase set by combining adjacent words, t0 with t1, t1 with t2, t2 with t3, and so on; in the process of traversing the document M to generate the phrase set, the generated phrases p are counted, the number of occurrences of p1 being recorded as Cp1 and of pn as Cpn; at the same time, the number of occurrences of a word tn in the entire dataset is recorded as Ctn; a formula is then designed to calculate the importance of each phrase:
defining a parameter manually to set the offset weight between words and phrases; determining by the phrase importance E whether the phrase p is a high-frequency phrase, and adding it to the candidate word library when E is greater than a threshold ε
Further, the constructing of the domain lexicon comprises:
and (3) training the phrase set by using a TF-IDF algorithm to obtain words with weights, and dividing the words into field-related words and irrelevant words by using a weight threshold.
Further, the constructing of the domain lexicon comprises:
calculating the TF-IDF value of each vocabulary item x_i,j over the document set D:
tfidf_i,j = tf_i,j × idf_i
constructing an important vocabulary dictionary Dtf = {x_i,j | tfidf_i,j > θ} according to the tfidf values of the vocabulary, wherein θ is the threshold for deciding whether a word is added to the dictionary, and a word is added to Dtf when its tfidf value is greater than θ; the tfidf values of the words in a phrase are then averaged:
training the weight of each word in the sentence sequence T with the TF-IDF algorithm, and using the weight values to construct a phrase tag library; candidate words extracted from the corpus that do not match any high-quality domain phrase are collected into a noise-containing unrelated word library; those that do match form the related word library.
Further, the constructing the dictionary model includes:
constructing a dictionary model based on the convolutional neural network;
the word embedding layer of the CNN-PD model converts the text and the position characteristics into word vectors containing semantic characteristics;
the convolution layer constructs the word vector into a distributed multi-dimensional feature vector H;
and H, mapping through a full connection layer to obtain the score of each word.
According to another aspect of the embodiments of the present application, there is provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the domain phrase dictionary construction method described above.
According to another aspect of embodiments of the present application, there is provided a computer-readable storage medium having a computer program stored thereon, the program being executed by a processor to implement the above-mentioned domain phrase dictionary construction method.
The technical scheme provided by one aspect of the embodiment of the application can have the following beneficial effects:
According to the domain phrase dictionary construction method, the degree of correlation between phrases and the domain is quantified by statistical word frequency and word weight, and a deep learning network is combined with the domain dictionary construction task, which markedly improves the robustness of the domain dictionary; the method performs well in constructing consumer-product domain dictionaries, improves the construction of the consumer-product defect-domain dictionary, and achieves high precision, recall, and F1 scores.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application, or may be learned by the practice of the embodiments. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
To explain the embodiments of the present application or the prior-art solutions more clearly, the drawings needed for the description are briefly introduced below. The drawings described below obviously cover only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart illustrating the steps of constructing a domain phrase dictionary in one embodiment of the present application;
FIG. 2 is a diagram illustrating a method for generating phrases in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a CNN-PD model in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Text mining refers to the process of mining out potential and valuable information from large-scale text data collections. With the emergence of a large amount of text data in the internet, text mining becomes an important research topic in the field of natural language processing.
The current research work focuses on applying machine learning and deep learning algorithms to domain dictionary construction directions. On the basis, the embodiment of the application extracts the syntactic and semantic features of the corpus by using the deep convolutional neural network, and simultaneously fuses text position information to improve the accuracy and generalization capability of the domain dictionary.
First, binary and ternary phrases are combined from the text using the adjacent-word-frequency analysis method, and the degree of correlation between words and the domain is judged by their occurrence frequency; second, high-quality phrases are filtered from the phrase results with the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm; finally, a convolutional neural network extracts contextual semantic relations from the text, and a domain dictionary model, CNN-PD (Convolutional Neural Network-Phrase Dictionary), is constructed by combining the position information of words.
Definition of
In a document set M, given a sentence sequence T = {t0, t1, t2, ..., tn}, where T is the result of word segmentation, a phrase starting with t0 may be composed of t0 together with t1, t2, ..., tn, and is noted as in equation (1).
The set of phrases generated this way contains every phrase that sentence T may produce. The valid phrases in this set are called high-quality phrases, and the remaining phrases are called noise. A high-quality phrase must meet the following requirements:
(1) High frequency: the most important feature in judging whether a phrase conveys important information about a topic is its frequency of occurrence within the topic. Phrases that occur rarely in a topic are likely irrelevant to it; conversely, phrases that appear very frequently in a topic are likely to play an important role in it.
(2) Collocation: in linguistics, collocation means that a compound word appears in a corpus significantly more often than it would by chance. A common example compares two candidate collocations such as "high quality" and "strong quality": one might expect the two to appear with similar frequency, but in the corpus "high quality" is the accepted form and is used far more often, so it better matches mainstream usage.
(3) Integrity: a phrase is considered complete when it can be interpreted as a complete semantic unit in the given document context. A phrase and its sub-phrases may each count as complete, depending on their contextual information: for example, "relational database systems", "relational databases", and "database systems" may all be valid in a particular context.
Thus, the task of the present embodiment can be stated as: using a method f, find within the set of candidate phrases the subset of high-quality phrases satisfying the above conditions (as shown in equation (2)). Words that meet the high-quality phrase requirements are called domain-related words; words that do not are called domain-unrelated words.
Constructing a domain phrase dictionary
The primary goal of the present embodiment is to build a dictionary library of consumer-product fault-information domain phrases. So that the extracted phrases meet the requirements, the whole task is divided into two main parts. The first part is phrase mining: its input is a corpus, a character sequence of arbitrary length in a specific language and domain, and its output is a phrase quality ranking. The second part is dictionary construction: its input is the phrase word library from the mining result, and its output is the related phrases of the domain.
The main construction steps are shown in FIG. 1. First, the original data is preprocessed and segmented, and an adjacent-word-frequency phrase mining method is applied to the segmentation results to extract the set of all phrases that may appear in the sentences. Next, the phrase set is trained with the TF-IDF algorithm to obtain weighted words, which are divided into domain-related and unrelated words by a weight threshold. Finally, a domain dictionary classification model is obtained by training a convolutional neural network.
Adjacent word frequency phrase mining method
Due to the characteristics of Chinese word segmentation, a combined phrase is usually segmented into two or more isolated words, and the segmented words cannot accurately express the original phrase meaning (such as "fingerprint unlocking failure" and "shared bicycle" mentioned above). The complexity of Chinese word segmentation algorithms, and their variation with context, make phrase mining harder still. This embodiment therefore provides a method that mines phrases by analyzing the combination frequency of adjacent words, which effectively addresses these problems. The algorithm is briefly described as follows:
In a document M, the sentence sequence T = {t0, t1, t2, ..., tn} generates a phrase set by combining adjacent words: t0 with t1, t1 with t2, t2 with t3, and so on. The generation flow is shown in FIG. 2: "shared" and "bicycle" combine into the new word "shared bicycle", "child" and "seat" into "child seat", and so on. While traversing the document M to generate the phrase set, each generated phrase p is counted: the number of occurrences of p1 is recorded as Cp1, and of pn as Cpn; at the same time, the number of occurrences of a word tn in the entire dataset is recorded as Ctn. Equation (3) calculates the importance of a phrase from these counts:
in equation (3): μ is a manually defined parameter that sets the offset weight for words and phrases. Judging whether the phrase p is a high-frequency phrase according to the phrase importance degree E, and adding the phrase p into the candidate word bank when the E is larger than (E >)The algorithm is based on: the combined phrases of a language are generally composed of adjacent words and are in a form of regression from left to right; thus, the set of all possible phrases in a sentence may be exhausted in a manner that combines the phrases from left to right. On the other hand, in the task of constructing the dictionary, the high frequency is one of the important factors of the domain dictionary, and the combination correctness and the collocating property of the phrases can be ensured through the calculation of the formula (3).
Compared with traditional text mining algorithms (such as LDA and TextRank), the adjacent-word-frequency analysis method collects and builds phrase statistics during the phrase mining task and guides the segmentation of the whole document set by computing phrase weights. At the same time, the method uses phrase context and phrase importance during construction to guarantee the validity of high-frequency phrases. The algorithm is notably effective at mining topic words that match the center of the text, and is capable of mining new domain words and rare domain words.
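The adjacent-word mining step described above can be sketched as follows. The importance formula of equation (3) is not reproduced in the text, so the scoring used here, comparing a pair's count against the counts of its component words, is an illustrative assumption, as are the parameter defaults.

```python
from collections import Counter

def mine_phrases(segmented_sentences, mu=1.0, epsilon=0.5):
    """Combine adjacent words into binary phrase candidates and score them.

    mu plays the role of the manually set offset weight and epsilon the
    importance threshold; the scoring formula itself is an assumption,
    not the patent's equation (3).
    """
    word_counts = Counter()    # C_t: word occurrences over the whole dataset
    phrase_counts = Counter()  # C_p: occurrences of each adjacent-word pair
    for sent in segmented_sentences:
        word_counts.update(sent)
        for a, b in zip(sent, sent[1:]):
            phrase_counts[(a, b)] += 1
    total = sum(word_counts.values())
    candidates = {}
    for (a, b), c_p in phrase_counts.items():
        # importance E is high when the pair co-occurs more often than
        # the frequencies of its parts would suggest by chance
        e = c_p * total / (mu * word_counts[a] * word_counts[b])
        if e > epsilon:
            candidates[(a, b)] = e
    return candidates
```

With three sentences containing "shared bicycle", the pair is retained as a candidate because its co-occurrence count dominates the independent frequencies of "shared" and "bicycle".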
Field word library construction based on TF-IDF algorithm
Experimentation yields an important conclusion: although the adjacent-word-frequency phrase mining method produces a large number of combined phrases, most of them are inferior phrases, that is, phrases that do not meet the standard, for example "Beijing and", "they are", and similar combinations of content words with function words. In fact, of the large number of candidate phrases, typically only about 10% are high-quality phrases, and even fewer meet the domain requirements. It is therefore necessary to establish a standard domain-related word library.
Before constructing the word library, the TF-IDF value of each word in the text sequence T must be calculated. The TF-IDF algorithm is a common weighting technique in information retrieval and data mining, and the TF-IDF values of a text can be used to screen and filter the results of a recommendation algorithm. Specifically, TF (term frequency) is the frequency with which a given word occurs in a document, and IDF (inverse document frequency) is a measure of the general importance of the word. The TF-IDF value of a vocabulary item x_i,j over the document set D is calculated as follows:
tfidf_i,j = tf_i,j × idf_i (6)
According to the tfidf values of the vocabulary, an important vocabulary dictionary Dtf = {x_i,j | tfidf_i,j > θ} can be constructed, where θ is the threshold for deciding whether a word is added to the dictionary: when its tfidf value is greater than θ, the word is added to Dtf. The tfidf value of a phrase is the average over its words, computed with equation (7):
The weight of each word in the sentence sequence T is trained with the TF-IDF algorithm, and the weight values are used to construct a phrase tag library. Candidate words extracted from the corpus that do not match any high-quality domain phrase are collected into a noise-containing unrelated word library; those that do match form the related word library.
Using noisy data as a training set is clearly unwise, so a stop-word removal step provides preliminary noise reduction: if a phrase p contains any element of the stop-word set, it is discarded. For example, "Beijing and" contains a stop word, so the phrase is deleted from the candidate word library. After noise removal, the training set of the word library is built from the tfidf weights. Based on the experimental results, words with tfidf values greater than 0.4 (that is, θ = 0.4) that pass manual screening are added to the positive word library as positive samples, and words with tfidf values less than 0.4 are added to the negative word library as negative samples. In addition, 500 manually labeled high-quality words and phrases are added to the positive word library to increase sample diversity.
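The word library construction described above can be sketched as follows. The exact tf and idf variants are not spelled out in the text, so the per-document tf and smoothed idf used here are assumptions, and the manual-screening step is omitted.

```python
import math
from collections import Counter

def build_word_library(documents, stop_words, theta=0.4):
    """Split candidate words into related (positive) and unrelated
    (negative) libraries by tf-idf weight, after stop-word filtering.

    theta=0.4 follows the threshold reported in the text; the smoothed
    idf log(N / (1 + df)) is an illustrative assumption.
    """
    n_docs = len(documents)
    df = Counter()                       # document frequency of each word
    for doc in documents:
        df.update(set(doc))
    positive, negative = set(), set()
    for doc in documents:
        tf = Counter(doc)
        for word, count in tf.items():
            if word in stop_words:
                continue                 # preliminary noise reduction
            tfidf = (count / len(doc)) * math.log(n_docs / (1 + df[word]))
            (positive if tfidf > theta else negative).add(word)
    return positive, negative
```

A word concentrated in one document lands in the positive library, while a word spread across all documents (idf near zero) lands in the negative one; stop words appear in neither.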
Dictionary model constructed based on convolutional neural network
Model training that relies solely on the shallow term-frequency features of the text cannot achieve the expected effect, so this embodiment uses a deep convolutional neural network to extract syntactic and semantic features of sentences and improve task accuracy. Given a sentence sequence T = {t0, t1, t2, ..., tN} of length N, the CNN-PD model calculates a score for each word in the sentence and decides by that score whether the word is a domain-related word. The model structure is shown in FIG. 3, where embedding, convolution, pooling, and scores denote the corresponding layers. First, the word embedding layer of the CNN-PD model converts the text and position features into word vectors containing semantic features; then, the convolutional layer builds the word vectors into a distributed multi-dimensional feature vector H; finally, H is mapped through a fully connected layer to obtain the score of each word.
(1) Word embedding layer
Given the sentence sequence T = {t0, t1, t2, ..., tN}, the word embedding layer updates the word vector of each word through back-propagation of the neural network. With the word-vector matrix Wemb of size V × d, where V is the training dictionary size and d is the manually defined word-vector dimension, the word embedding layer encodes an input word into vector form by the vector dot product in equation (8), in which vp is a one-hot vector with 1 at position p and 0 at the remaining positions:
et = vp Wemb (8)
(2) Position coding layer
In this task, the relative position of a word is an important feature: the positional distance between a word and the target can determine the classification boundary of related words. The relative position information of words is used to track the proximity between the target word and other words. This embodiment adopts Word Position Embedding (WPE) to improve the model: WPE encodes the relative position of the target word and other words into vector form so that it can be fused with other features. For example, the relative positions of "in" to the target words "mobile phone" and "explosion" in FIG. 3 are [-1, 7]. The position values are mapped to a vector of dimension d_wpe, a model hyper-parameter initialized with random values. The position-encoding vectors of sentence T are then concatenated with the word-embedding vectors and fed into the convolutional layer.
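A minimal sketch of computing the relative-position features that WPE encodes; the token sequence and target words in the usage below are illustrative stand-ins echoing the "mobile phone"/"explosion" example.

```python
def relative_positions(tokens, targets):
    """For each token, its signed offset to every target word.

    These integer offsets are what a WPE layer would subsequently map
    through a d_wpe-dimensional embedding table before concatenation
    with the word vectors.
    """
    idx = {t: tokens.index(t) for t in targets}  # first occurrence of each target
    return [[i - idx[t] for t in targets] for i in range(len(tokens))]
```

For the sentence ["the", "phone", "suddenly", "explosion"] with targets "phone" and "explosion", the word "suddenly" receives the offsets [1, -1].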
(3) Convolutional neural network
The convolutional layer performs the dot product of the vector matrix between a convolution kernel of size k and the input word vectors. To handle the indexing of words beyond sentence boundaries, the input vectors are padded with zero vectors:
[ZN]j = max over 1 < n < N of [f(Wc en + bc)] (9)
where Wc is the training weight matrix of the convolutional layer, whose output is the convolution-kernel feature with window size k. The max operation in the formula maps the convolution-kernel output vectors to a vector ZN with the same length as the sentence. Finally, a softmax operation on the output matrix ZN yields the score vector of the words:
score = Softmax(ZN) (10)
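The layers above can be sketched end to end as a shape-level forward pass. All weights here are random and untrained, and the layer sizes (V, d, d_wpe, k, h) are illustrative values rather than the patent's; the sketch shows how embedding, position features, convolution, max pooling, and softmax fit together, not trained behaviour.

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn_pd_forward(token_ids, pos_feats, V=1000, d=8, d_wpe=4, k=3, h=16):
    """One forward pass in the spirit of CNN-PD, with random weights."""
    N = len(token_ids)
    W_emb = rng.standard_normal((V, d))       # word embedding matrix (eq. 8)
    W_wpe = rng.standard_normal((20, d_wpe))  # position embedding table (WPE)
    # embed words, embed positions, concatenate: N x (d + d_wpe)
    E = np.concatenate([W_emb[token_ids], W_wpe[pos_feats]], axis=1)
    # zero-pad so the window never indexes outside sentence boundaries
    pad = np.zeros((k // 2, d + d_wpe))
    Ep = np.vstack([pad, E, pad])
    W_c = rng.standard_normal((h, k * (d + d_wpe)))  # convolution weights
    b_c = np.zeros(h)
    # convolution: each window of k token vectors -> h features, tanh nonlinearity
    Z = np.array([np.tanh(W_c @ Ep[i:i + k].ravel() + b_c) for i in range(N)])
    H = Z.max(axis=0)                      # max pooling over positions
    return np.exp(H) / np.exp(H).sum()     # softmax score vector (eq. 10)
```

The output is a probability vector over the h pooled features; in the trained model a fully connected layer would map H to per-word scores instead.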
The embodiment of the application provides a relatively complete method for constructing a domain phrase professional dictionary; a method for quantifying the correlation degree of phrases and fields by using statistical word frequency and word weight is provided; a method for combining a deep learning network with the direction of constructing a domain dictionary is provided, and the robustness of the domain dictionary is obviously improved. The method provided by the embodiment of the application can effectively solve a plurality of problems of manual dictionary construction.
The embodiment of the application extracts a single word, a combined word or a phrase associated with the defect information of the consumer goods from a large amount of defect clue report data, thereby realizing the construction of a dictionary in the defect field of the consumer goods.
The experimental data of another embodiment of the application are desensitized consumer-product defect clue reports provided by an online shopping website, together with collected internet e-commerce product review data: about 1.5 million clue reports (data set A) and about 3 million internet records (data set B), roughly 4.5 million records in total. The data cover electronic appliances, hardware and building materials, children's toys, and other goods, and each record was submitted by a real consumer; sample data are shown in Table 1. The domain-related words determined by manual screening mainly comprise consumer-product names and fault-description phrases, numbering about 1,000; the domain-unrelated words mainly consist of place names, person names, and domain-unrelated verbs, numbering about 4,000.
TABLE 1 Corpus samples and manually defined domain word samples
Evaluation criteria
The embodiment provides a method for constructing a phrase dictionary in the consumer-product defect domain, which can be divided into three steps: phrase mining based on adjacent word frequency; construction of a domain-related word bank based on the TF-IDF algorithm; and construction of a domain dictionary model based on a convolutional neural network. Because the first two steps have no unified numerical standard, their effect is demonstrated by displaying experimental results; the third step is essentially a multi-label classification process, so this embodiment evaluates the dictionary mining experiment with macro-averaged numerical indices, calculated as in formulas (11)-(13):
P_macro = (1/|C|) Σ_c TP_c / (TP_c + FP_c) (11)
R_macro = (1/|C|) Σ_c TP_c / (TP_c + FN_c) (12)
F1_macro = 2 · P_macro · R_macro / (P_macro + R_macro) (13)
where TP_c is the number of domain-related words predicted correctly for class c; FP_c is the number of words that are actually unrelated but predicted as related; and FN_c is the number of words that are actually related but predicted as unrelated.
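The macro-averaged indices can be computed directly from the per-class counts defined above; a minimal sketch (the label names and toy predictions are illustrative, not from the source):

```python
def macro_prf(y_true, y_pred, labels):
    """Macro-averaged precision, recall, and F1 following formulas
    (11)-(13): per-class P and R from TP_c, FP_c, FN_c, then an
    unweighted mean over the classes."""
    ps, rs = [], []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        ps.append(tp / (tp + fp) if tp + fp else 0.0)
        rs.append(tp / (tp + fn) if tp + fn else 0.0)
    p, r = sum(ps) / len(ps), sum(rs) / len(rs)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Illustrative labels: "rel" = domain-related word, "irr" = unrelated.
y_true = ["rel", "rel", "irr", "irr", "rel"]
y_pred = ["rel", "irr", "irr", "rel", "rel"]
p, r, f1 = macro_prf(y_true, y_pred, ["rel", "irr"])
```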
Results and analysis of the experiments
(1) Word frequency phrase mining
The experiment is based on the segmented corpus; domain phrases are mined from the text with the adjacent word-frequency phrase mining method. Partial results are shown in Table 2, where entries containing a space are phrase words: for example, "charging port" denotes the segmentation result "charging" + "port", whose phrase form is "charging port". E is the normalized result of formula (3). As the table shows, the high-frequency phrases include phrases related to consumer-product failures, such as "quality problem" and "product defect", but also unrelated phrases such as "may" and "in use". Notably, the method performs well on long phrase words such as "quality inspection bureau" and "consumer rights protection law"; the mining results for network neologisms such as "shared bicycle" are also satisfactory.
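The adjacent word-frequency mining step can be sketched as follows (a toy illustration: the adjacent-pair counting follows the description above, but the normalization is a simple stand-in, since formula (3) is not reproduced in this excerpt, and the corpus words are hypothetical):

```python
from collections import Counter

def mine_adjacent_phrases(segmented_sentences, min_count=2):
    """Count candidate phrases formed from adjacent segmented words and
    score each pair by a simple normalized frequency (a stand-in for the
    source's formula (3), which is not reproduced in this excerpt)."""
    word_counts = Counter()
    phrase_counts = Counter()
    for sent in segmented_sentences:
        word_counts.update(sent)
        phrase_counts.update(zip(sent, sent[1:]))   # adjacent word pairs
    scores = {}
    for (w1, w2), c in phrase_counts.items():
        if c < min_count:
            continue                                # drop rare candidates
        # Normalize pair frequency by the component word frequencies.
        scores[(w1, w2)] = c / (word_counts[w1] * word_counts[w2]) ** 0.5
    return scores

corpus = [["charging", "port", "broken"],
          ["charging", "port", "loose"],
          ["port", "broken"]]
scores = mine_adjacent_phrases(corpus)
```

Pairs seen fewer than min_count times are discarded, so only "charging port" and "port broken" survive in this toy corpus.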
TABLE 2 phrase mining Experimental results
(2) Word bank experimental result and analysis constructed based on TF-IDF algorithm
The experimental results of constructing the domain-related word bank based on the TF-IDF algorithm are shown in Table 3. The table lists some words with higher tfidf values (forming the related-word lexicon) and some with lower tfidf values (forming the unrelated-word lexicon); the relevance threshold used for the split is 0.4.
TABLE 3 phrase thesaurus construction Experimental results
From the above table it can be concluded that words related to the consumer-product defect domain have higher tfidf values, such as "flash screen", "spontaneous combustion", and "explosion"; in contrast, the tfidf values of domain-unrelated words are generally small, such as "cause", "find", and "cause a problem". Meanwhile, although "charging port" occurred frequently in the word-frequency mining experiment, its tfidf value falls below the threshold θ even though it is a domain-related word. Observation and analysis show that this is mainly due to the TF-IDF algorithm itself: the number of clue reports containing "charging port" in the corpus is so large that the denominator term in formula (4) grows, lowering the tfidf value in formula (5). This also illustrates a limitation of the TF-IDF algorithm.
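The thresholded TF-IDF split described above can be sketched as follows (the source's exact formulas (4)-(5) are not reproduced in this excerpt, so the idf variant, the toy θ, and the corpus words below are all illustrative assumptions):

```python
import math
from collections import Counter

def tfidf_lexicon(docs, theta):
    """Score each word by tf * idf and split the vocabulary at the
    threshold theta into domain-related and unrelated word sets.
    The idf variant log(N / (1 + df)) is an illustrative choice; the
    source's exact formulas (4)-(5) are not reproduced here."""
    n_docs = len(docs)
    df = Counter()                              # document frequency per word
    for doc in docs:
        df.update(set(doc))
    best = {}                                   # best tfidf seen per word
    for doc in docs:
        tf = Counter(doc)
        for w, c in tf.items():
            tfidf = (c / len(doc)) * math.log(n_docs / (1 + df[w]))
            best[w] = max(best.get(w, 0.0), tfidf)
    related = {w for w, v in best.items() if v > theta}
    return related, set(best) - related

docs = [["flash", "screen", "spontaneous", "combustion"],
        ["cause", "find", "cause"],
        ["flash", "screen", "explosion"]]
related, unrelated = tfidf_lexicon(docs, theta=0.05)   # toy threshold
```

With such a tiny corpus, words that appear in most documents (e.g. "flash") receive an idf near zero and land in the unrelated set regardless of their importance — a miniature version of the "charging port" limitation discussed above.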
(3) Dictionary model constructed based on convolutional neural network
In order to verify the experimental effect of this embodiment, a latent Dirichlet allocation (LDA) model and a support vector machine (SVM) model were run on the same corpus as comparison experiments. In addition, an LSTM was substituted for the convolutional layer of the model as a further set of comparison experiments, to verify the effectiveness of the convolutional neural network used in this embodiment. The results of the comparison experiments are shown in Table 4, where the "Noother" rows are statistics computed over only the domain-related and domain-unrelated words.
As can be seen from the data in Table 4, the F1 value of the deep-learning methods is 6-8% higher than that of the traditional text-model methods, and the CNN-based result is a further 4% higher than the LSTM model's. To demonstrate the method's effect on different data sets, experiments were run separately on data set A and data set B, with no significant difference in the results. In addition, to verify the importance of WPE in this experiment, it was treated as a variable in the comparison experiments; the results show that the model fused with WPE improves markedly on all evaluation indices over the original model. Finally, comparing the CNN and LSTM results supports the conclusion drawn above: when position information is incorporated, the convolutional network extracts information more effectively than the sequential network.
Table 4 comparative experimental results
The embodiment of the application provides a convolutional-neural-network-based method for constructing a phrase dictionary in the consumer-product defect domain. First, a large volume of noisy phrase text is built with the adjacent word-frequency phrase mining method, and domain-related phrases are filtered by their phrase-frequency weights; second, a domain-related word bank is constructed with the TF-IDF algorithm, reducing the cost of manual labeling; finally, a domain dictionary model based on a convolutional neural network generates the domain dictionary. The method performs well in constructing the consumer-product domain dictionary, improves the construction effect for the consumer-product defect domain, and can offer effective ideas and solutions for dictionary construction in other domains, achieving high precision, recall, and F1 values.
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. In addition, this application is not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and any descriptions of specific languages are provided above to disclose the best modes of the present application.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
It should be understood that, although the steps in the flowcharts of the figures are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated in the present embodiment, no strict ordering applies, and the steps may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may comprise multiple sub-steps or stages, which need not be completed at the same time but may be performed at different times, and which need not be performed sequentially but may be performed in turns or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The above-mentioned embodiments merely express several embodiments of the present application; their description is specific and detailed, but it should not therefore be construed as limiting the scope of the application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within its scope of protection. Therefore, the protection scope of the present application shall be subject to the appended claims.
Claims (8)
1. A domain phrase dictionary construction method is characterized by comprising the following steps:
mining phrases;
constructing a domain word stock;
and constructing a dictionary model.
2. The method of claim 1, wherein mining phrases comprises:
preprocessing and segmenting the original data, and then extracting all candidate phrase sets that may appear in sentences by applying an adjacent word-frequency phrase mining method to the segmentation results.
3. The method of claim 1, wherein mining phrases comprises:
in a document M, the sentence sequence T = {t0, t1, t2, ..., tn} generates a phrase set by combining adjacent words (t0 with t1, t1 with t2, and so on); while traversing the document M to generate the phrase set, the generated phrases p are counted, the number of occurrences of p1 being recorded as C_p1 and the number of occurrences of pn as C_pn; at the same time, the number of occurrences of tn in the entire dataset is recorded as C_tn; the following formula is designed to calculate the importance of a phrase:
4. The method of claim 1, wherein the constructing a domain lexicon comprises:
and (3) training the phrase set by using a TF-IDF algorithm to obtain words with weights, and dividing the words into field-related words and irrelevant words by using a weight threshold.
5. The method of claim 1, wherein the constructing a domain lexicon comprises:
calculating the TF-IDF value of each word x_i,j over the document set D:
tfidf_i,j = tf_i,j × idf_i
constructing an important-vocabulary dictionary D_tf = {x_i,j | tfidf_i,j > θ} from the tfidf values of the words, where θ is the threshold for deciding whether a word is added to the dictionary: when a word's tfidf value is greater than θ, the word is added to D_tf; the tfidf values of a phrase's component words are then averaged:
training the weight of each word in the sentence sequence T by the TF-IDF algorithm and constructing a phrase tag library from the weight values; candidate words extracted from the corpus that do not match any high-quality domain phrase are placed in a noise-containing unrelated-word library; those that do match are placed in the related-word library.
6. The method of claim 1, wherein the constructing a dictionary model comprises:
constructing a dictionary model based on the convolutional neural network;
the word embedding layer of the CNN-PD model converts the text and the position characteristics into word vectors containing semantic characteristics;
the convolution layer constructs the word vector into a distributed multi-dimensional feature vector H;
and mapping H through a fully connected layer to obtain the score of each word.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the method of any one of claims 1-6.
8. A computer-readable storage medium, on which a computer program is stored, characterized in that the program is executed by a processor to implement the method according to any of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010841791.6A CN111985215A (en) | 2020-08-19 | 2020-08-19 | Domain phrase dictionary construction method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111985215A true CN111985215A (en) | 2020-11-24 |
Family
ID=73442383
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010841791.6A Pending CN111985215A (en) | 2020-08-19 | 2020-08-19 | Domain phrase dictionary construction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111985215A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112597760A (en) * | 2020-12-04 | 2021-04-02 | 光大科技有限公司 | Method and device for extracting domain words in document |
CN112613612A (en) * | 2020-12-29 | 2021-04-06 | 合肥工业大学 | Method and device for constructing green design knowledge base based on patent library |
CN112613612B (en) * | 2020-12-29 | 2022-08-02 | 合肥工业大学 | Method and device for constructing green design knowledge base based on patent library |
CN113705200A (en) * | 2021-08-31 | 2021-11-26 | 中国平安财产保险股份有限公司 | Method, device and equipment for analyzing complaint behavior data and storage medium |
CN113705200B (en) * | 2021-08-31 | 2023-09-15 | 中国平安财产保险股份有限公司 | Analysis method, analysis device, analysis equipment and analysis storage medium for complaint behavior data |
CN115827854A (en) * | 2022-12-28 | 2023-03-21 | 数据堂(北京)科技股份有限公司 | Voice abstract generation model training method, voice abstract generation method and device |
CN115827854B (en) * | 2022-12-28 | 2023-08-11 | 数据堂(北京)科技股份有限公司 | Speech abstract generation model training method, speech abstract generation method and device |
CN116257601A (en) * | 2023-03-01 | 2023-06-13 | 云目未来科技(湖南)有限公司 | Illegal word stock construction method and system based on deep learning |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20201124 |