CN111325018A - Domain dictionary construction method based on web retrieval and new word discovery

Info

Publication number: CN111325018A
Application number: CN202010068095.6A
Authority: CN
Inventors: 杜梦豪; 赵琨; 刘杰鹏; 丁健; 梁栋彬; 袁显峰
Original assignee: Shanghai Hengqi Education And Training Co ltd
Current assignee: Shanghai Hengqi Education And Training Co ltd
Priority date: 2020-01-21
Filing date: 2020-01-21
Publication date: 2020-06-23
Anticipated expiration: 2040-01-21
Also published as: CN111325018B

Abstract

The invention discloses a domain dictionary construction method based on web retrieval and new word discovery, proposed in view of the diversity and richness of text data (including network data and literature data) and the fact that domain words occur among new words. The method consists of two parts: crawling network data based on a seed dictionary and extracting domain words with self-defined extraction patterns; and learning the degree of freedom and the degree of adhesion between characters from mutual information and left-right entropy, then realizing new word discovery with BiLstm-CRF. Compared with the prior art, the invention has the following advantages: the degree of adhesion and the degree of freedom between characters are learned from mutual information and left-right entropy, the context information of the text is learned by a BiLstm-CRF model, and the recognition rate of low-frequency words is improved overall; the extracted words and the discovered new words are verified by retrieval and statistical methods, so manual verification is omitted and the quality of the extracted domain words is improved.

Description

Domain dictionary construction method based on web retrieval and new word discovery

Technical Field

The invention relates to the fields of natural language processing, deep learning and web retrieval, and in particular to a domain dictionary construction method based on web retrieval and new word discovery.

Background

Since the beginning of the 21st century, "artificial intelligence" has entered public view and is constantly changing our lives. For example, Apple's Siri, JD.com's warehouse robots, Alipay's QR codes, and the chatbots of Baidu, Microsoft, Alibaba and others are all embodiments of artificial intelligence that bring convenience to daily life. Natural language processing is one of the typical fields of artificial intelligence technology and, after many years of research, has achieved remarkable results in areas such as machine translation, question answering systems, reading comprehension and text generation. These tasks all depend on dictionaries to varying degrees: feature words extracted with a dictionary are used as input, and machine translation, question analysis and the like are then carried out with a machine learning model or algorithm, which greatly improves the reliability of the analysis. In domain-specific knowledge question answering and reading comprehension in particular, a large, general-purpose dictionary cannot achieve good analysis results. Therefore, constructing a domain dictionary quickly and effectively is an important task in natural language processing.

With the explosive development of communication and internet technology and the arrival of the Web 3.0 era, national governments, enterprise groups and internet users all participate in different forms: the state through news media, enterprise groups through enterprise websites, and internet users through discussion platforms and forums, sharing information in different fields. This information covers different domain topics, and a large number of new words emerge from it. The internet is therefore an important information source for domain dictionary construction. Documents in a specific field, such as books and periodicals, are another important source. Acquiring domain-specific data from the web and from books is thus one of the effective approaches.

There are many feasible methods for domain dictionary construction, and a good deal of related research. Yi Wen medicine et al. proposed a domain dictionary construction method based on the Wikipedia link structure graph that integrates the LSI algorithm and the CPMw algorithm. However, the LSI model has many disadvantages: it is difficult to select an appropriate number of topics, the SVD computation is time-consuming, and the results are hard to interpret intuitively because LSI is not a probabilistic model and lacks a statistical basis. Another method calculates semantic similarity using the extended edition of the Synonym Forest (Tongyici Cilin) and the Word2Vec tool on the basis of a large amount of labeled data, summarizes the features, and constructs a product feature dictionary accordingly. However, the data must be annotated, which is time-consuming and labor-intensive. Moreover, these dictionary construction methods do not take the appearance of new words into account. Therefore, a new word discovery module needs to be added when constructing a domain dictionary.

Traditional new word discovery relies mainly on statistical methods and rule-based methods. Statistical methods are widely applicable, but their accuracy is low, they require a large corpus, and they are computationally expensive. Rule-based methods build a rule base from linguistic features, so their accuracy is high, but constructing the rules is complex and they transfer poorly across domains. Computing the correlation of two adjacent characters from information entropy and left-right entropy determines word adhesion and word boundaries more reliably. A BiLstm-CRF deep neural network model learns the text context better and can recognize more low-frequency new words. Therefore, the invention first extracts domain words based on web retrieval and the seed dictionary, then discovers new domain words based on BiLstm-CRF, the seed dictionary, mutual information and statistical rules, and finally constructs the domain dictionary.

Disclosure of Invention

The technical problem to be solved by the invention is to overcome the technical defects and provide a domain dictionary construction method based on web retrieval and new word discovery.

In order to solve the problems, the technical scheme of the invention is as follows: a domain dictionary construction method based on web retrieval and new word discovery comprises the following implementation steps:

(1) constructing a seed dictionary;

(2) searching the seed words in Baidu encyclopedia, network news and forum respectively based on the seed dictionary to obtain web field data;

(3) extracting domain words from Baidu Encyclopedia based on rules; performing word segmentation and dependency parsing on the network news data and forum data using the word segmentation and dependency parsing methods in LTP, and then extracting domain words based on specific rules;

(4) evaluating and analyzing the extracted domain words;

(5) performing OCR on domain-specific books to obtain text documents, removing cases, formulas, charts and the like from the recognized text to obtain domain-specific text documents, and storing them in a text database;

(6) performing word segmentation, part-of-speech tagging and dependency parsing on the domain-specific book data using the corresponding methods in LTP, then extracting domain words based on specific rules, and then executing step (4) on the extracted new words;

(7) using the ending segmentation with the unrecognized-word dictionary and the seed dictionary loaded, segmenting and labeling the domain-specific book data;

(8) performing statistical analysis on the data using mutual information and left-right entropy, and using the result as the weight of the word vectors input to the sequence model; then training on the labeled data with the sequence model BiLstm-CRF, evaluating the model by accuracy, recall and F value, predicting new words with the trained model, and finally executing step (4).
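The statistics named in step (8) can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes two-character candidates only, uses pointwise mutual information as the degree of adhesion and the entropy of the left and right neighbouring characters as the degree of freedom, and the function names `entropy` and `bigram_statistics` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (in bits) of a Counter of neighbouring characters."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counter.values())


def bigram_statistics(sentences):
    """Return {bigram: (pmi, left_entropy, right_entropy)} for adjacent character pairs."""
    chars, bigrams = Counter(), Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    for s in sentences:
        chars.update(s)
        for i in range(len(s) - 1):
            bg = s[i:i + 2]
            bigrams[bg] += 1
            if i > 0:                      # character to the left of the pair
                left[bg][s[i - 1]] += 1
            if i + 2 < len(s):             # character to the right of the pair
                right[bg][s[i + 2]] += 1
    n_chars, n_bigrams = sum(chars.values()), sum(bigrams.values())
    stats = {}
    for bg, count in bigrams.items():
        p_xy = count / n_bigrams
        p_x, p_y = chars[bg[0]] / n_chars, chars[bg[1]] / n_chars
        pmi = math.log2(p_xy / (p_x * p_y))   # adhesion: high when the pair co-occurs often
        stats[bg] = (pmi, entropy(left[bg]), entropy(right[bg]))
    return stats


stats = bigram_statistics(["增值税很高", "增值税较低", "交增值税吧"])
```

A pair with high mutual information and high entropy on both sides is a plausible word; a pair such as "增值" whose right neighbour is always the same character is more likely a fragment of a longer word.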

As an improvement, the step (1) of constructing the seed dictionary comprises: firstly, a domain scope is given, seed words are added by experts in a specific domain or a part of domain words are obtained from the Weiwei website and are added into a seed dictionary as the seed words.

As an improvement, the method for acquiring the web domain data in the step (3) comprises the following steps: based on the seed dictionary, two crawler methods, namely a focus type web crawler method and an increment type web crawler method, are used.

As an improvement, the word segmentation and dependency parsing in step (3) and step (6) are implemented as follows: word segmentation and dependency parsing are performed with the LTP tool (the Language Technology Platform of Harbin Institute of Technology).

As an improvement, the evaluation method in step (4) is as follows: the evaluation mainly comprises two parts, one being the evaluation of the extraction patterns and the other the statistical analysis of the extracted domain words; the evaluation of the extraction patterns assesses the quality of the extracted domain words according to the accuracy and the quantity of the extracted information.

(4.1) calculating the extraction accuracy, i.e. the relevance between the information extracted by an extraction pattern and the domain, with the following formula:

relevance rate_i = rel-freq_i / total-freq_i

where rel-freq_i is the amount of information extracted by extraction pattern i in the relevant text, and total-freq_i is the amount of information extraction pattern i extracts in the whole training corpus (including both relevant and irrelevant text); the ratio can therefore be used to measure relevance (the relevance rate).

(4.2) calculating the number of extractions by the following formula:

log₂(frequency)

where frequency is the amount of information the extraction pattern extracts from the relevant text.

(4.3) combining the formulas of step (4.1) and step (4.2) to score the extraction pattern as relevance rate * log₂(frequency); when the relevance rate < 0.5, the score is 0, because the pattern is not considered relevant to the relevant text.

(4.4) carrying out statistical analysis on the extracted words according to the selected extraction patterns. Given a text corpus and the extraction patterns that occur in it, and given a seed word set (for example value-added tax, accounting, additional tax and the like), all extraction patterns capable of extracting the seed words are found, each extraction pattern is scored according to the correlation between its extracted words and the seed words, and the extracted information is scored with the following formula:

score(NP_i) = Σ_{k=1}^{N_i} (1 + 0.01 * score(pattern_k))

where score(pattern_k) is the score of pattern k, which can be calculated by the extraction pattern evaluation method. The extracted information NP_i can be extracted by N_i extraction patterns; its score is the sum of the scores of all extraction patterns that can extract it, multiplied by 0.01, plus N_i.

(4.5) score(pattern_i) is calculated as follows:

score(pattern_i)＝R_i*log(F_i)

where F_i is the de-duplicated number of seed-set words extracted by pattern i, N_i is the de-duplicated number of all words extracted by pattern i, and

R_i = F_i / N_i

indicates the relevance of the extracted words to the seed set.
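The scoring in steps (4.1) to (4.5) can be sketched in Python as follows. This is an illustrative reading rather than the patented code: it assumes the relevance rate is rel-freq / total-freq, that R_i is F_i / N_i, and that the logarithm in step (4.5) is base 2 (the text does not state the base). The function names are hypothetical.

```python
import math


def pattern_score(rel_freq, total_freq):
    """Steps (4.1)-(4.3): relevance rate * log2(frequency), 0 when the rate is below 0.5."""
    if total_freq == 0 or rel_freq == 0:
        return 0.0
    relevance_rate = rel_freq / total_freq
    if relevance_rate < 0.5:
        return 0.0
    return relevance_rate * math.log2(rel_freq)


def rlogf(extracted_words, seed_words):
    """Step (4.5): R_i * log(F_i) over de-duplicated words (log base 2 is an assumption)."""
    extracted = set(extracted_words)
    f = len(extracted & set(seed_words))   # F_i: seed words the pattern extracts
    n = len(extracted)                     # N_i: all words the pattern extracts
    if f == 0:
        return 0.0
    return (f / n) * math.log2(f)


def word_score(pattern_scores):
    """Step (4.4): N_i plus 0.01 times the summed scores of the N_i extracting patterns."""
    return len(pattern_scores) + 0.01 * sum(pattern_scores)


s1 = pattern_score(8, 10)
s2 = rlogf(["增值税", "会计", "发票", "税率"], {"增值税", "会计"})
total = word_score([s1, s2])
```

Under this reading, a word extracted by many patterns always outranks a word extracted by fewer, and the pattern scores only break ties, because of the 0.01 factor.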

As a refinement, step (5) comprises: performing OCR on domain-specific books using tesserocr, a third-party Python module, and obtaining the corresponding text documents, which are stored in a text database.

As a refinement, step (8) comprises: realizing new word discovery based on mutual information and the sequence model BiLstm-CRF. For each input X = (x₁, x₂, …, x_n) there is a predicted label sequence y = (y₁, y₂, …, y_n), whose prediction score is formulated as follows:

score(X, y) = Σ_{i=1}^{n} P_{i, y_i} + Σ_{i=0}^{n} A_{y_i, y_{i+1}}

where P_{i, y_i} is the softmax output probability that the i-th position has label y_i, and A_{y_i, y_{i+1}} is the transition probability from y_i to y_{i+1}; when the number of tags (B-person, B-location, …) is n, the transition probability matrix is (n+2) × (n+2).
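As an illustration of how such a prediction score is usually computed in a BiLSTM-CRF, the sketch below sums per-position emission scores and tag-to-tag transition scores for one label sequence. It assumes the two extra rows and columns of the (n+2) × (n+2) matrix are a start tag and an end tag; the names `path_score`, `emissions` and `transitions` are hypothetical.

```python
def path_score(emissions, transitions, tags):
    """Score of one tag sequence: transition scores plus emission scores.

    emissions[i][t]   : score of tag t at position i (the per-position network output)
    transitions[a][b] : score of moving from tag a to tag b; the last two indices
                        are the assumed START and END tags, so the matrix is
                        (n_tags + 2) x (n_tags + 2)
    tags              : list of tag indices, one per position
    """
    n_tags = len(transitions) - 2
    start, end = n_tags, n_tags + 1
    path = [start] + list(tags) + [end]
    score = sum(transitions[a][b] for a, b in zip(path, path[1:]))
    score += sum(emissions[i][t] for i, t in enumerate(tags))
    return score


# Two real tags (0 and 1), so the transition matrix is 4 x 4.
emissions = [[0.5, 0.1], [0.2, 0.7]]
transitions = [
    [0.1, 0.2, 0.0, 0.3],
    [0.4, 0.1, 0.0, 0.5],
    [0.6, 0.2, 0.0, 0.0],   # row of the START tag
    [0.0, 0.0, 0.0, 0.0],   # row of the END tag (never left)
]
example = path_score(emissions, transitions, [0, 1])
```

At prediction time the sequence with the highest score is selected, typically with the Viterbi algorithm rather than by scoring every sequence.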

Compared with the prior art, the invention has the advantages that:

the degree of adhesion and the degree of freedom between characters are learned from mutual information and left-right entropy, the context information of the text is learned by a BiLstm-CRF model, and the recognition rate of low-frequency words is improved overall; the extracted words and the discovered new words are verified by retrieval and statistical methods, so manual verification is omitted and the quality of the extracted domain words is improved.

Drawings

FIG. 1 is a flowchart of a domain dictionary construction framework according to the present invention, which includes a seed dictionary construction, a domain word extraction method based on a seed dictionary and web search, and a domain new word discovery method based on a seed dictionary and a sequence model.

FIG. 2 is a flow chart of domain word extraction based on seed dictionary and web search according to the present invention, and domain word extraction is realized through the flow chart.

FIG. 3 is a flow chart of the present invention for discovering new words in the field based on a seed dictionary and a sequence model, and the new words in the field are discovered through the flow chart.

FIG. 4 is a diagram of a BiLstm-CRF new word discovery model in the present invention.

Fig. 5 shows the evaluation results of the new word finding results in terms of accuracy, recall rate, and F value in different sequence models according to the embodiment of the present invention.

FIG. 6 is a flow chart of the present invention.

Detailed Description

The present invention is further described below by way of specific examples, but it is not limited to these examples. Variations, combinations or substitutions of the invention that fall within its spirit and scope will be apparent to those skilled in the art and are within the scope of the invention.

As shown in FIG. 1 to FIG. 5, the main steps of the invention are domain seed dictionary construction, Web retrieval data acquisition, domain word extraction based on the seed dictionary and Web retrieval, domain literature data acquisition, and domain new word discovery based on the seed dictionary and a sequence model. The steps are as follows:

ST 1: Construct the domain seed dictionary. The seed dictionary provides the reference basis for the domain word extraction model and the domain new word discovery model, so its size and quality affect the construction of the final domain dictionary. The domain seed dictionary can be built in two ways: by experts, or by taking a portion of words from professional books. The seed dictionary should be of moderate size (about fifty words), with diverse seed word types (for example, an entity dictionary contains person names, place names, organization names and so on) and diverse parts of speech (such as verbs and nouns).

ST 2: Retrieve Web data based on the seed dictionary.

The domain seed dictionary is obtained from ST1; the seed words are then used as search terms in Baidu Encyclopedia, network news and forums to obtain the retrieved content; finally, the retrieved content is preprocessed to remove non-text information such as web page formatting, numbers, letters, underscores and emoticons. The specific process is as follows.

(1) Input: seed dictionary

(2) Process:

for each seed word in the seed dictionary:

in Baidu Encyclopedia, search for the seed word using the focused web crawler method and the incremental web crawler method, and obtain the search results;

in network news, search for the seed word using the focused web crawler method and the incremental web crawler method, and obtain the search results;

in forums, search for the seed word using the focused web crawler method and the incremental web crawler method, and obtain the search results;

preprocess the search results, remove the non-text information, and store the cleaned text as text data.

Once this is completed, the seed word is updated.

The retrieval process ends.
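The retrieval loop and the preprocessing step can be sketched as follows. This is a simplified illustration, not the crawler of the invention: the `fetch` callable stands in for the focused and incremental crawlers and is hypothetical, as are the names `clean_page` and `retrieve`, and the regular expressions only cover the kinds of non-text information listed above.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")                                   # web page markup
NON_TEXT_RE = re.compile(r"[0-9A-Za-z_]+")                        # numbers, letters, underscores
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")     # common emoticon blocks
SPACE_RE = re.compile(r"\s+")


def clean_page(raw):
    """Remove non-text information from one retrieved page and normalise whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = NON_TEXT_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def retrieve(seed_words, sources, fetch):
    """For every seed word and source, fetch the page text and keep the cleaned result.

    fetch(source, word) is a hypothetical callable that returns raw page content.
    """
    return {(word, source): clean_page(fetch(source, word))
            for word in seed_words
            for source in sources}


cleaned = clean_page("<p>增值税_VAT 2020 😀 是一种流转税</p>")
pages = retrieve(["增值税"], ["encyclopedia"], lambda source, word: "<b>" + word + "</b> 123")
```

Passing the crawler in as a callable keeps the cleaning logic testable without network access.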

ST 3: Extract domain words based on rules, and then evaluate the extracted words using the statistical method and the retrieval method; the domain word extraction flow chart is shown in FIG. 2.

First, the domain seed dictionary and the data retrieved from the web are obtained from ST1 and ST2. Second, the extraction patterns are constructed and activated using the seed dictionary as the activation condition. Finally, retrieval and statistics determine whether an extracted word is put into the domain seed dictionary.

The extraction patterns are defined as follows:

Part of speech of seed word	Dependency relation between seed word and new word	Part of speech of extracted word
Noun, verb	Subject-verb relation	Noun or verb
Noun, verb	Verb-object relation	Noun or verb
Noun, verb	Indirect-object relation	Noun or verb
Noun, verb	Attributive (modifier-head) relation	Noun
Noun, verb	Relationship between aspects	Noun

With the seed words as activation conditions, whenever a dependency relation in an extraction pattern is satisfied, new words are extracted to form a new word bank. The words are then ranked using the retrieval method and the extraction pattern evaluation method of step 4. The overall extraction method is as follows.

(1) Input: seed dictionary, web retrieval data

(2) Process:

perform word segmentation, part-of-speech tagging and dependency parsing on the web retrieval data using LTP

for each seed word in the seed dictionary:

the seed word activates the extraction patterns above

extract new words according to the part of speech of the seed word and the part of speech of the new word

store the seed word, the extraction pattern and the new words in the text database

Take each new word as a keyword and search Baidu Encyclopedia; if results exist, put the new word directly into the seed dictionary. Then rank the new words using the extraction pattern evaluation of step 4, take the top 5 words, put the first 3 into the seed dictionary and the remaining 2 into the unrecognized dictionary.

(3) Output: a seed dictionary and an unrecognized dictionary.
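The extraction procedure above can be sketched on already-parsed sentences. The sketch does not call LTP; it assumes each token is a hypothetical tuple (word, part of speech, head index, relation) and that four of the table's relations correspond to the dependency labels SBV, VOB, IOB and ATT, which is an assumption about the parser's label set. The function names are illustrative.

```python
SEED_POS = {"n", "v"}                      # seed words must be nouns or verbs
PATTERNS = {                               # relation -> allowed part of speech of the new word
    "SBV": {"n", "v"},                     # subject-verb (assumed label)
    "VOB": {"n", "v"},                     # verb-object (assumed label)
    "IOB": {"n", "v"},                     # indirect-object (assumed label)
    "ATT": {"n"},                          # attributive (assumed label)
}


def extract_candidates(tokens, seeds):
    """Return (seed word, relation, new word) triples activated by seed words.

    tokens: list of (word, pos, head, relation); head is a 0-based index, -1 for the root.
    """
    found = []
    for word, pos, head, rel in tokens:
        if head < 0 or rel not in PATTERNS:
            continue
        head_word, head_pos = tokens[head][0], tokens[head][1]
        allowed = PATTERNS[rel]
        # A seed word at either end of the dependency arc activates the pattern.
        if word in seeds and pos in SEED_POS and head_word not in seeds and head_pos in allowed:
            found.append((word, rel, head_word))
        elif head_word in seeds and head_pos in SEED_POS and word not in seeds and pos in allowed:
            found.append((head_word, rel, word))
    return found


def split_top5(ranked_words):
    """Take the 5 best-ranked words: first 3 for the seed dictionary, next 2 for the unrecognized one."""
    top = ranked_words[:5]
    return top[:3], top[3:]


parsed = [("企业", "n", 1, "SBV"), ("缴纳", "v", -1, "HED"), ("增值税", "n", 1, "VOB")]
candidates = extract_candidates(parsed, {"增值税"})
to_seed, to_unrecognized = split_top5(list("abcdefg"))
```

Keeping the patterns in a dictionary makes it straightforward to add the remaining table row once its parser label is known.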

ST 4: There are two main methods of acquiring domain literature data: one is to crawl documents from the network to obtain text data; the other is to perform OCR on domain-specific material to obtain domain-specific text data.

ST 5: Discover new words based on the BiLstm-CRF model and the seed dictionary; the new word discovery flow chart is shown in FIG. 3.

First, an unrecognized dictionary is constructed for the data acquired in ST4 by the method of ST3; next, the degree of adhesion and the degree of freedom between characters are computed from mutual information and used as the weights of the input word vectors; then the data is labeled using the seed dictionary and the unrecognized dictionary; finally, a BiLstm-CRF model is trained on the data to discover new domain words. The overall method for discovering domain words is as follows.

(1) Input: seed dictionary, unrecognized dictionary, document data

(2) Train the domain word vectors X.

(3) Compute the degree of adhesion and the degree of freedom W from mutual information and left-right entropy, and use WX as the model input.

(4) Using the seed dictionary and the unrecognized dictionary, label the training data with the 4 labels (B-New, I-New, E-New, O).

(5) Train the BiLstm-CRF network model.

(6) Evaluate the trained model.

(7) Label the corpus with the trained model to obtain new words.

(8) Take each new word as a keyword and search Baidu Encyclopedia; if results exist, put the new word directly into the seed dictionary. Then rank the new words using the extraction pattern evaluation of step 4, take the top 5 words, put the first 3 into the seed dictionary and the remaining 2 into the unrecognized dictionary.

(9) Output: a seed dictionary and an unrecognized dictionary.
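The labeling step with the four labels (B-New, I-New, E-New, O) can be illustrated with a short sketch. It assumes the text has already been segmented into words, that a word found in either dictionary is labeled character by character as begin, inside and end, and that a single-character dictionary word receives B-New; these choices and the name `label_characters` are assumptions for illustration.

```python
def label_characters(words, lexicon):
    """Turn segmented words into (character, label) pairs for sequence-model training.

    words   : list of segmented words
    lexicon : set of words from the seed dictionary and the unrecognized dictionary
    """
    labels = []
    for w in words:
        if w not in lexicon:
            labels.extend((ch, "O") for ch in w)          # outside any dictionary word
        elif len(w) == 1:
            labels.append((w, "B-New"))                    # assumed handling of 1-character words
        else:
            labels.append((w[0], "B-New"))                 # first character
            labels.extend((ch, "I-New") for ch in w[1:-1])  # middle characters
            labels.append((w[-1], "E-New"))                # last character
    return labels


labeled = label_characters(["企业", "缴纳", "增值税"], {"增值税"})
```

The resulting character and label pairs are the kind of training data a BiLstm-CRF sequence labeler consumes.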

The first embodiment is as follows:

a domain dictionary construction method based on web retrieval and new word discovery.

Some of the experimental data used here is domain data obtained from the internet on the basis of text literature.

The experimental environment was as follows: 16 GB of memory, an Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz processor, the Windows 10 operating system, and Python as the programming language. The basic indexes for testing the effectiveness of new word discovery are accuracy, recall and the F1 value, and the experiment is evaluated on these three indexes.

The method of Example 1 uses different sequence models and is evaluated for new word discovery in terms of accuracy, recall and F value:

In this example, the sequence models comprise four models, including CRF, LSTM-CRF and BI-LSTM-CRF. The four models are each trained on the labeled data and used to predict new words, and the sequence models are then compared and analyzed.

The experimental results are shown in FIG. 5. Comparing the results in FIG. 5, the CRF sequence model alone has the lowest accuracy, recall and F value, because it only learns a simple language model; the LSTM long short-term memory model breaks the limit on text sequence length, so its results improve. The accuracy, recall and F value of the BI-LSTM-CRF model used by the invention are 84.32, 80.67 and 82.45 respectively; the model learns the relationship between preceding and following characters and breaks the limit on text sequence length, so the effect is clearly improved.
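The reported F value is consistent with the usual harmonic mean of accuracy (precision) and recall, as the following small check shows. The function name `f_value` is hypothetical, and the use of the standard F1 definition is an assumption, since the text does not spell out the formula.

```python
def f_value(precision, recall):
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    return 2 * precision * recall / (precision + recall)


reported = round(f_value(84.32, 80.67), 2)
```

Rounded to two decimals, the result matches the 82.45 reported for the BI-LSTM-CRF model.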

The present invention and its embodiments have been described above, and the description is not intended to be limiting, and the drawings are only one embodiment of the present invention, and the actual structure is not limited thereto. In summary, those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A domain dictionary construction method based on web retrieval and new word discovery is characterized by comprising the following steps: the method comprises the following implementation steps:

(1) constructing a seed dictionary;

(4) evaluating and analyzing the extracted domain words;

2. The domain dictionary construction method based on web retrieval and new word discovery as claimed in claim 1, wherein: the step (1) of constructing the seed dictionary comprises the following steps: firstly, a domain scope is given, seed words are added by experts in a specific domain or a part of domain words are obtained from the Weiwei website and are added into a seed dictionary as the seed words.

3. The domain dictionary construction method based on web retrieval and new word discovery as claimed in claim 1, wherein: the method for acquiring the web domain data in the step (3) comprises the following steps: based on the seed dictionary, two crawler methods, namely a focus type web crawler method and an increment type web crawler method, are used.

4. The domain dictionary construction method based on web retrieval and new word discovery as claimed in claim 1, wherein: the word segmentation and dependency parsing in step (3) and step (6) are implemented as follows: word segmentation and dependency parsing are performed with the LTP tool (the Language Technology Platform of Harbin Institute of Technology).

5. The domain dictionary construction method based on web retrieval and new word discovery as claimed in claim 1, wherein: the evaluation method in step (4) is as follows: the evaluation mainly comprises two parts, one being the evaluation of the extraction patterns and the other the statistical analysis of the extracted domain words; the evaluation of the extraction patterns assesses the quality of the extracted domain words according to the accuracy and the quantity of the extracted information.

(4.2) calculating the number of extractions by the following formula:

log₂(frequency)

(4.5) score(pattern_i) is calculated as follows:

score(pattern_i)＝R_i*log(F_i)

indicating the relevance of the extracted word to the seed set.

6. The domain dictionary construction method based on web retrieval and new word discovery as claimed in claim 1, wherein: step (5) comprises: performing OCR on domain-specific books using tesserocr, a third-party Python module, and obtaining the corresponding text documents, which are stored in a text database.

7. The domain dictionary construction method based on web retrieval and new word discovery as claimed in claim 1, wherein: step (8) comprises: realizing new word discovery based on mutual information and the sequence model BiLstm-CRF. For each input X = (x₁, x₂, …, x_n) there is a predicted label sequence y = (y₁, y₂, …, y_n), whose prediction score is formulated as follows:

wherein