CN111325018B - Domain dictionary construction method based on web retrieval and new word discovery

Info

Publication number: CN111325018B
Application number: CN202010068095.6A
Authority: CN
Inventors: 杜梦豪; 赵琨; 刘杰鹏; 丁健; 梁栋彬; 袁显峰
Original assignee: Shanghai Hengqi Education And Training Co ltd
Current assignee: Shanghai Hengqi Education And Training Co ltd
Priority date: 2020-01-21
Filing date: 2020-01-21
Publication date: 2023-08-11
Anticipated expiration: 2040-01-21
Also published as: CN111325018A

Abstract

The invention discloses a domain dictionary construction method based on web retrieval and new word discovery. The method addresses the diversity and richness of text data (including network data and document data) and the fact that domain words also occur among new words. It comprises two parts: first, network data is crawled based on a seed dictionary and domain words are extracted using custom extraction patterns; second, the degree of freedom and degree of adhesion between words are learned from mutual information and left and right entropy, and new word discovery is then performed with a BiLSTM-CRF model. Compared with the prior art, the invention learns the adhesion and degree of freedom between words from mutual information and left and right entropy, then learns the context of the text with the BiLSTM-CRF model, and thereby improves the overall recognition rate of low-frequency words. It also verifies the extracted and newly discovered words by retrieval and statistical methods, which removes the need for manual verification and improves the quality of the extracted domain words.

Description

Domain dictionary construction method based on web retrieval and new word discovery

Technical Field

The invention relates to the fields of natural language processing, deep learning and web retrieval, and in particular to a domain dictionary construction method based on web retrieval and new word discovery.

Background

In the 21st century, artificial intelligence has entered public view and is constantly changing our lives. Apple's Siri, JD.com's warehouse robots, Alipay's QR codes, and the chatbots of Baidu, Microsoft, Alibaba and others are all applications of artificial intelligence that bring convenience to daily life. Natural language processing is one of the typical fields of artificial intelligence, and years of research have produced remarkable results in it, such as machine translation, question answering systems, reading comprehension and text generation. These results all depend on dictionaries to varying degrees. Feature words extracted with a dictionary are used as input, and machine translation, question analysis and the like are then performed with a machine learning model or algorithm, which greatly improves the reliability of the analysis. In domain-specific knowledge question answering and reading comprehension in particular, a large general-purpose dictionary cannot achieve good analysis results. A method for constructing a domain dictionary quickly and effectively is therefore an important task in natural language processing.

With the explosive development of communication and internet technology and the advent of the Web 3.0 era, governments, enterprises and organizations, and internet users all participate in different forms. Governments use news media, enterprises and organizations use corporate websites, and internet users use discussion platforms and forums to share information in different domains. This information covers different domain topics, and a large number of new words emerge from it. The internet is thus an important information source for domain dictionary construction. Books, journals and other documents in specific domains are another important source. Obtaining domain-specific data from the web and from books is therefore one of the effective approaches.

Many methods for domain dictionary construction are possible, and there are many related studies. Yin Wenke et al. proposed a domain dictionary construction method based on the Wikipedia link structure graph, combined with the LSI algorithm and the CPMw algorithm. However, the LSI model has several disadvantages: it is difficult to choose a suitable number of topics, the SVD computation is very time-consuming, and because LSI is not a probabilistic model it lacks a statistical basis, so its results are hard to interpret intuitively. Li Weiqing et al. proposed a method for building a product feature dictionary, which starts from a large amount of labeled data, computes semantic similarity using synonym thesaurus ("synonym forest") expansion and the Word2Vec tool, and groups the features to build the dictionary. However, the data must be annotated, which is time-consuming and labor-intensive. Moreover, these dictionary construction methods do not take the emergence of new words into account. A new word discovery module therefore needs to be added when building a domain dictionary.

Traditional new word discovery relies mainly on statistics-based methods and rule-based methods. Statistics-based methods are widely applicable, but their accuracy is low and they require a large corpus and heavy computation. Rule-based methods build a rule base from linguistic features; their accuracy is high, but constructing the rules is complex and they transfer poorly across domains. Using mutual information together with left and right entropy, the correlation between two adjacent words can be calculated, so that the adhesion of words and the word boundaries are determined more reliably. A BiLSTM-CRF deep neural network model learns the text context better and can identify more low-frequency new words. The invention therefore first extracts domain words based on web retrieval and a seed dictionary, then discovers new domain words using BiLSTM-CRF, the seed dictionary, mutual information and statistical rules, and finally constructs the domain dictionary.

Disclosure of Invention

The technical problem to be solved by the invention is to overcome the above technical defects and to provide a domain dictionary construction method based on web retrieval and new word discovery.

To solve the above problem, the technical solution of the invention is a domain dictionary construction method based on web retrieval and new word discovery, comprising the following implementation steps:

(1) Constructing a seed dictionary;

(2) Searching for the seed words in Baidu Baike, online news and forums respectively, based on the seed dictionary, to acquire web domain data;

(3) Extracting domain words from Baidu Baike based on rules; performing word segmentation and dependency parsing on the online news data and forum data using the word segmentation and dependency parsing methods in LTP, and then extracting domain words based on specific rules;

(4) Evaluating and analyzing the extracted domain words;

(5) Performing OCR on domain-specific books to obtain text documents, removing cases, formulas, charts and the like from the recognized text to obtain text documents of domain-specific data, and storing them in a text database;

(6) Performing word segmentation, part-of-speech tagging and dependency parsing on the domain-specific book data using the corresponding methods in LTP, extracting domain words based on specific rules, and executing step (4) on the extracted new words;

(7) Using jieba word segmentation with the unidentified-word dictionary and the seed dictionary loaded, segmenting and labeling the domain-specific book data;

(8) Statistically analyzing the data using mutual information and left and right entropy, and using the result as the weight of the word vectors input to the sequence model; training on the labeled data with the sequence model BiLSTM-CRF, evaluating the model by precision, recall and F value, predicting new words with the trained model, and executing step (4).
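By way of illustration only, the statistics named in step (8) could be computed as in the following Python sketch. It assumes the standard definitions, which the text does not spell out: adhesion as pointwise mutual information, log2(p(xy) / (p(x) p(y))), and degree of freedom as the smaller of the left and right neighbour entropies. The function name `adhesion_and_freedom` and the use of pre-segmented tokens as input are hypothetical choices for the example.

```python
import math
from collections import Counter, defaultdict


def adhesion_and_freedom(tokens):
    """Return {(w1, w2): (pmi, freedom)} for every adjacent token pair.

    Assumed definitions (not given explicitly in the source):
      pmi     = log2(p(w1 w2) / (p(w1) * p(w2)))
      freedom = min(left-neighbour entropy, right-neighbour entropy)
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    left = defaultdict(Counter)   # pair -> counts of tokens seen just before it
    right = defaultdict(Counter)  # pair -> counts of tokens seen just after it
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i + 1])
        if i > 0:
            left[pair][tokens[i - 1]] += 1
        if i + 2 < len(tokens):
            right[pair][tokens[i + 2]] += 1

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log2(c / total) for c in counter.values())

    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    stats = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))
        freedom = min(entropy(left[(w1, w2)]), entropy(right[(w1, w2)]))
        stats[(w1, w2)] = (pmi, freedom)
    return stats


if __name__ == "__main__":
    corpus = ["x", "增值", "税", "y", "z", "增值", "税", "w"]
    print(adhesion_and_freedom(corpus)[("增值", "税")])
```

A pair with high mutual information and high neighbour entropy on both sides behaves like a single word with free boundaries, which is the intuition behind using these values as weights.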

As an improvement, the seed dictionary in step (1) is constructed as follows: a domain scope is first given; seed words are then added by experts in the specific domain, or a set of domain words is obtained from an authoritative website and added to the seed dictionary as seed words.

As an improvement, the web domain data in step (3) is acquired as follows: based on the seed dictionary, a focused web crawler method and an incremental web crawler method are used.

As an improvement, word segmentation and dependency parsing in step (3) and step (6) are implemented using the LTP (Language Technology Platform of Harbin Institute of Technology) tools.

As an improvement, the evaluation in step (4) consists of two parts: evaluation of the extraction patterns, and statistical analysis of the extracted domain words. Evaluating an extraction pattern by the accuracy and the quantity of the information it extracts allows the quality of the extracted domain words to be assessed.

(4.1) Calculating the extraction accuracy, i.e. the relevance between the information extracted by an extraction pattern and the domain, using the formula:

$$\text{relevance rate}_i = \frac{\text{rel-freq}_i}{\text{total-freq}_i}$$

where $\text{rel-freq}_i$ is the amount of information that extraction pattern $i$ extracts from the relevant text, and $\text{total-freq}_i$ is the amount of information that extraction pattern $i$ extracts from the training corpus (including relevant and irrelevant text); the ratio can therefore be used to measure relevance (the relevance rate).

(4.2) Calculating the extraction quantity, using the formula:

$$\log_2(\text{frequency})$$

where frequency is the amount of information that the extraction pattern extracts from the relevant text.

(4.3) Combining the formulas of step (4.1) and step (4.2), the extraction pattern is scored by $\text{relevance rate} \times \log_2(\text{frequency})$; when the relevance rate is less than or equal to 0.5 the score is 0, because the pattern is then considered irrelevant to the domain text.

(4.4) For the selected extraction patterns, statistical analysis is performed on the extracted words. A corpus is given, together with the extraction patterns that occur in it and a set of seed words (for example value-added tax, accounting and surtax). All extraction patterns that can extract the seed words are found, each pattern is scored according to the relevance between the words it extracts and the seed words, and each extracted item is scored by the following formula:

$$\text{score}(NP_i) = \sum_{k=1}^{N_i} \bigl(1 + 0.01 \times \text{score}(\text{pattern}_k)\bigr)$$

where $\text{score}(\text{pattern}_k)$ is the score of pattern $k$, calculated by the extraction-pattern evaluation method. If the extracted item $NP_i$ can be extracted by $N_i$ extraction patterns, its score is $N_i$ plus 0.01 times the sum of the scores of all the patterns that can extract it.

(4.5) The formula for calculating $\text{score}(\text{pattern}_k)$ is as follows:

$$\text{score}(\text{pattern}_i) = R_i \times \log(F_i), \qquad R_i = \frac{F_i}{N_i}$$

where $F_i$ is the number of distinct (de-duplicated) seed-set words extracted by pattern $i$, $N_i$ is the number of distinct words of any kind extracted by pattern $i$, and $R_i$ represents the relevance of the extracted words to the seed set.
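The scoring rules of steps (4.1) to (4.5) can be illustrated with a short Python sketch. It is only one possible reading of the text: the function names are hypothetical, the frequency in (4.2) is taken to be rel-freq as the text defines it, and because the base of the logarithm in (4.5) is not stated, the natural logarithm is assumed there.

```python
import math


def pattern_score(rel_freq, total_freq):
    """Steps (4.1)-(4.3): relevance rate * log2(frequency).

    Returns 0 when the relevance rate is at or below 0.5.
    """
    if rel_freq <= 0 or total_freq <= 0:
        return 0.0
    relevance_rate = rel_freq / total_freq
    if relevance_rate <= 0.5:
        return 0.0
    return relevance_rate * math.log2(rel_freq)


def seed_pattern_score(extracted_words, seed_words):
    """Step (4.5): R_i * log(F_i) with R_i = F_i / N_i.

    Assumption: natural logarithm, since the source gives no base.
    """
    distinct = set(extracted_words)
    f_i = len(distinct & set(seed_words))  # distinct seed words extracted
    if f_i == 0:
        return 0.0
    r_i = f_i / len(distinct)              # N_i = all distinct extracted words
    return r_i * math.log(f_i)


def word_score(scores_of_extracting_patterns):
    """Step (4.4): sum over the N_i patterns of 1 + 0.01 * score(pattern_k)."""
    return sum(1 + 0.01 * s for s in scores_of_extracting_patterns)


if __name__ == "__main__":
    print(pattern_score(rel_freq=8, total_freq=10))
    print(seed_pattern_score(["a", "b", "c", "d"], {"a", "b"}))
    print(word_score([2.4, 0.0]))
```

Because each pattern contributes at least 1 to `word_score`, a word extracted by many patterns outranks a word extracted by a single high-scoring pattern; the 0.01 factor only breaks ties.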

As an improvement, step (5) comprises: performing OCR on domain-specific books using tesserocr, a third-party Python module, obtaining the corresponding text documents, and storing them in a text database.

As an improvement, step (8) comprises: new word discovery is achieved based on mutual information and the sequence model BiLSTM-CRF. For each input $X = (x_1, x_2, \ldots, x_n)$ there is a predicted label sequence $y = (y_1, y_2, \ldots, y_n)$, and the prediction score is given by:

$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

where $P_{i, y_i}$ is the softmax output probability of label $y_i$ at position $i$, and $A_{y_i, y_{i+1}}$ is the transition probability from $y_i$ to $y_{i+1}$, with $y_0$ and $y_{n+1}$ denoting the start and end tags. When the number of tags (B-person, B-location, …) is n, the transition probability matrix therefore has size (n+2)×(n+2).
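As a minimal sketch of how such a prediction score is evaluated for one candidate label sequence, the following Python function adds the emission terms and the transition terms, including the transitions from a start tag and into an end tag. The function name, the plain nested-list representation and the tag indexing (real tags 0..k-1, start = k, end = k+1) are assumptions made for the example.

```python
def path_score(emissions, transitions, tags, start_tag, end_tag):
    """Score of one label sequence under a linear-chain CRF layer.

    emissions[i][t]   : score of tag t at position i (from the BiLSTM output)
    transitions[a][b] : score of moving from tag a to tag b; the matrix has
                        (k + 2) rows and columns for k real tags plus start/end
    tags              : the candidate label sequence, one tag index per position
    """
    path = [start_tag] + list(tags) + [end_tag]
    transition_total = sum(transitions[a][b] for a, b in zip(path, path[1:]))
    emission_total = sum(emissions[i][t] for i, t in enumerate(tags))
    return transition_total + emission_total


if __name__ == "__main__":
    # Two real tags (0 and 1); index 2 is the start tag, index 3 the end tag.
    emissions = [[1.0, 0.5], [0.2, 2.0]]
    transitions = [[0.0] * 4 for _ in range(4)]
    transitions[2][0] = 0.1  # start -> tag 0
    transitions[0][1] = 0.3  # tag 0 -> tag 1
    transitions[1][3] = 0.4  # tag 1 -> end
    print(path_score(emissions, transitions, [0, 1], start_tag=2, end_tag=3))
```

During decoding the highest-scoring sequence is normally found with the Viterbi algorithm rather than by scoring every path; the sketch covers only the score itself.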

Compared with the prior art, the invention has the advantages that:

The invention learns the adhesion and degree of freedom between words from mutual information and left and right entropy, and then learns the context of the text with the BiLSTM-CRF model, which improves the overall recognition rate of low-frequency words. It also verifies the extracted and newly discovered words by retrieval and statistical methods, which removes the need for manual verification and improves the quality of the extracted domain words.

Drawings

FIG. 1 is a flow chart of a domain dictionary construction framework of the present invention, which includes a seed dictionary construction, a domain word extraction method based on a seed dictionary and web retrieval, and a domain new word discovery method based on a seed dictionary and a sequence model.

Fig. 2 is a flow chart of extracting domain words based on seed dictionary and web retrieval, through which the extraction of domain words is realized.

FIG. 3 is a flow chart of new word discovery in the field based on a seed dictionary and a sequence model, through which the new word discovery in the field is realized.

FIG. 4 is a diagram of the BiLSTM-CRF new word discovery model of the present invention.

FIG. 5 shows the evaluation of new word discovery results in terms of precision, recall and F value for different sequence models according to an embodiment of the present invention.

Fig. 6 is a flow chart of the present invention.

Detailed Description

The present invention is further described below by way of specific examples, but the present invention is not limited to the following examples only. Modifications, combinations, or substitutions of the present invention within the scope of the invention or without departing from the spirit and scope of the invention will be apparent to those skilled in the art and are included within the scope of the invention.

As shown in FIGS. 1 to 5, the invention mainly comprises: construction of the domain seed dictionary, acquisition of web retrieval data, domain word extraction based on the seed dictionary and web retrieval, acquisition of domain document data, and domain new word discovery based on the seed dictionary and a sequence model. The steps are as follows:

ST1: Construct the domain seed dictionary. The seed dictionary provides the reference standard for the domain word extraction model and the domain new word discovery model, so its size and quality affect the construction of the final domain dictionary. It can be built in two ways: by domain experts, or by taking a portion of the words from professional books. The seed dictionary should be of adequate size (around fifty words), the types of seed words should be diverse (for example, when building an entity dictionary it should cover person names, place names, organization names and so on as far as possible), and the parts of speech should be varied (for example, verbs and nouns).

ST2: Retrieve web data based on the seed dictionary.

The domain seed dictionary is obtained from ST1. Each seed word is then used as a search term in Baidu Baike, online news and forums to obtain the search content. Finally, the search content is preprocessed to remove non-text information such as web page formatting, numbers, letters, underscores and emoticons. The specific flow is as follows.

(1) Input: seed dictionary

(2) Process:

for each seed word in the seed dictionary:

    search for the seed word in Baidu Baike using the focused web crawler and incremental web crawler methods, and collect the search results;

    search for the seed word in online news using the focused web crawler and incremental web crawler methods, and collect the search results;

    search for the seed word in forums using the focused web crawler and incremental web crawler methods, and collect the search results;

    preprocess the search results, remove non-text information, and store the remaining text in the text database.

After each pass is completed, update the seed words.

The retrieval process ends.
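The retrieval flow above might be sketched in Python as follows. The sketch deliberately contains no real crawler: `search` is a hypothetical callable, supplied by the caller, that returns `(url, html)` pairs for a source and a seed word. The set of already seen URLs stands in for the incremental crawling behaviour, and the regular expressions are one assumed way of removing markup, digits, Latin letters, underscores and emoji.

```python
import re

_TAG = re.compile(r"<[^>]+>")                             # web page markup
_ALNUM = re.compile(r"[A-Za-z0-9_]+")                     # letters, digits, underscores
_EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # common emoji ranges
_SPACE = re.compile(r"\s+")


def clean_text(raw):
    """Strip non-text information from one retrieved page."""
    text = _TAG.sub(" ", raw)
    text = _ALNUM.sub(" ", text)
    text = _EMOJI.sub(" ", text)
    return _SPACE.sub(" ", text).strip()


def retrieve_web_text(seed_words, sources, search, seen_urls=None):
    """Collect cleaned text for every seed word from every source.

    `search(source, seed)` is a hypothetical hook returning (url, html) pairs;
    pages whose URL is already in `seen_urls` are skipped (incremental crawl).
    """
    seen_urls = set() if seen_urls is None else seen_urls
    documents = []
    for seed in seed_words:
        for source in sources:
            for url, html in search(source, seed):
                if url in seen_urls:
                    continue
                seen_urls.add(url)
                text = clean_text(html)
                if text:
                    documents.append(text)
    return documents


if __name__ == "__main__":
    print(clean_text("<p>增值税 VAT_2020 是一种税😀</p>"))
```

Passing the same `seen_urls` set into later passes, after the seed words have been updated, keeps repeated runs from storing the same page twice.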

ST3: Domain words are extracted based on rules, and the extracted words are then evaluated by statistical and retrieval methods. The domain word extraction flow chart is shown in FIG. 2.

First, the domain seed dictionary and the web-retrieved data are obtained from ST1 and ST2. Second, extraction patterns are constructed, with the seed dictionary serving as the activation condition that triggers an extraction pattern. Finally, retrieval and statistics determine whether an extracted word is added to the domain seed dictionary.

The extraction patterns are defined in the following table:

Part of speech of seed word	Dependency between seed word and new word	Part of speech of extracted word
Noun, verb	Subject-verb relation	Noun or verb
Noun, verb	Verb-object relation	Noun or verb
Noun, verb	"Guest-guest" (object-type) relation	Noun or verb
Noun, verb	Attributive (modifier-head) relation	Noun
Noun, verb	Adverbial (adverbial-head) relation	Noun
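As a hedged illustration of how such patterns could be applied, the following Python sketch works on sentences that have already been segmented, tagged and parsed; it does not call LTP. The relation labels `SBV`, `VOB`, `ATT` and `ADV` and the part-of-speech tags `n` and `v` are assumed LTP-style labels for the subject-verb, verb-object, attributive and adverbial rows. The remaining object-type row is omitted because its label cannot be determined from the text. `Token` and `extract_candidates` are hypothetical names.

```python
from collections import namedtuple

# One parsed token: surface word, POS tag, index of its head (-1 for the root),
# and the dependency relation linking it to that head.
Token = namedtuple("Token", "word pos head rel")

# Assumed encoding of the table: relation label -> allowed POS of the new word.
PATTERNS = {
    "SBV": {"n", "v"},  # subject-verb
    "VOB": {"n", "v"},  # verb-object
    "ATT": {"n"},       # attributive (modifier-head)
    "ADV": {"n"},       # adverbial (adverbial-head)
}
SEED_POS = {"n", "v"}   # seed words must be nouns or verbs


def extract_candidates(tokens, seed_words):
    """Return (seed word, relation, new word) triples found in one sentence."""
    found = []
    for token in tokens:
        if token.head < 0 or token.rel not in PATTERNS:
            continue
        head = tokens[token.head]
        # The seed may sit at either end of the dependency arc.
        for seed, other in ((token, head), (head, token)):
            if (seed.word in seed_words and seed.pos in SEED_POS
                    and other.word not in seed_words
                    and other.pos in PATTERNS[token.rel]):
                found.append((seed.word, token.rel, other.word))
    return found


if __name__ == "__main__":
    sentence = [
        Token("企业", "n", 1, "SBV"),
        Token("缴纳", "v", -1, "HED"),
        Token("增值税", "n", 1, "VOB"),
    ]
    print(extract_candidates(sentence, {"增值税"}))
```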

With a seed word as the activation condition, when a dependency relation in an extraction pattern is satisfied, the new word is extracted and added to the new word store. The words are then ranked using the retrieval method and the extraction-pattern evaluation method of step 4. The whole extraction method is as follows.

(1) Input: seed dictionary, web retrieval data

(2) Process:

Perform word segmentation, part-of-speech tagging and dependency parsing on the web retrieval data using LTP

for each seed word in the seed dictionary:

    the seed word activates the extraction patterns

    extract new words according to the seed word's part of speech and the new word's part of speech

    store the seed word, the extraction pattern and the new words in the text database

Use the new words as keywords to search Baidu Baike, and put the results directly into the seed dictionary. Based on the extraction-pattern evaluation of step 4, rank the new words and take the top 5: the first 3 are put into the seed dictionary and the last 2 into the unidentified dictionary.

(3) Output: seed dictionary and unidentified dictionary.
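The final ranking step of this flow can be sketched as follows, assuming each new word already carries a score from the step 4 evaluation. The function name `promote_new_words` and the use of Python sets for the two dictionaries are illustrative choices only.

```python
def promote_new_words(scored_words, seed_dict, unidentified_dict,
                      top_k=5, seed_count=3):
    """Rank new words by score and distribute the best ones.

    The top `top_k` words are kept; the first `seed_count` go into the seed
    dictionary and the rest into the unidentified dictionary.
    """
    ranked = sorted(scored_words.items(), key=lambda item: item[1], reverse=True)
    top_words = [word for word, _ in ranked[:top_k]]
    seed_dict.update(top_words[:seed_count])
    unidentified_dict.update(top_words[seed_count:])
    return top_words


if __name__ == "__main__":
    seeds, unidentified = {"增值税"}, set()
    scores = {"a": 5.0, "b": 4.0, "c": 3.0, "d": 2.0, "e": 1.0, "f": 0.5}
    print(promote_new_words(scores, seeds, unidentified))
    print(sorted(seeds), sorted(unidentified))
```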

ST4: Domain document data is acquired mainly in two ways: crawling documents from the web to obtain text data, or performing OCR on domain-specific materials to obtain domain-specific text data.

ST5: New word discovery is performed based on the BiLSTM-CRF model and the seed dictionary. The new word discovery flow chart is shown in FIG. 3.

First, the unidentified dictionary is built from the data obtained in ST3 and ST4. Next, the adhesion and degree of freedom between words are computed from mutual information and used as the weights of the input word vectors. The data is then labeled using the seed dictionary and the unidentified dictionary. Finally, the BiLSTM-CRF model is trained on the data to discover new domain words. The overall method of finding domain words is as follows.

(1) Input: seed dictionary, unidentified dictionary, document data

(2) Train the domain word vectors X.

(3) Compute the adhesion and degree of freedom W from mutual information and left and right entropy, and use WX as the model input.

(4) Label the training data using the seed dictionary and the unidentified dictionary, with the four labels B-New, I-New, E-New and O.

(5) Train the BiLSTM-CRF network model.

(6) Evaluate the trained model.

(7) Label the corpus with the trained model to obtain new words.

(8) Use the new words as keywords to search Baidu Baike, and put the results directly into the seed dictionary. Based on the extraction-pattern evaluation of step 4, rank the new words and take the top 5: the first 3 are put into the seed dictionary and the last 2 into the unidentified dictionary.

(9) Output: seed dictionary and unidentified dictionary.
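The dictionary-based labeling of the training data can be illustrated by the sketch below. It assumes character-level labeling and greedy longest-match lookup, neither of which is stated in the text. Words shorter than two characters are skipped because the label set has no single-character tag. The function name `label_sentence` is hypothetical.

```python
def label_sentence(sentence, lexicon):
    """Tag each character with B-New, I-New, E-New or O.

    `lexicon` is the union of the seed dictionary and the unidentified
    dictionary; matching is greedy, longest word first (an assumption).
    """
    words = sorted((w for w in lexicon if len(w) >= 2), key=len, reverse=True)
    tags = ["O"] * len(sentence)
    i = 0
    while i < len(sentence):
        match = next((w for w in words if sentence.startswith(w, i)), None)
        if match is None:
            i += 1
            continue
        last = i + len(match) - 1
        tags[i] = "B-New"
        for j in range(i + 1, last):
            tags[j] = "I-New"
        tags[last] = "E-New"
        i = last + 1
    return list(zip(sentence, tags))


if __name__ == "__main__":
    for char, tag in label_sentence("缴纳增值税", {"增值税"}):
        print(char, tag)
```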

Embodiment one:

a domain dictionary construction method based on web retrieval and new word discovery.

The experimental data used here comprises domain data obtained from the internet and from text documents.

The experimental environment is as follows: 16 GB of memory, an Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz processor, the Windows 10 operating system, and the Python programming language. The basic metrics for testing the effectiveness of new word discovery are precision, recall and F1 value, and the experiment is evaluated on these three metrics.

In this embodiment, different sequence models are used for new word discovery and are evaluated in terms of precision, recall and F value:

The sequence models in this example are CRF, LSTM, LSTM-CRF and BI-LSTM-CRF. Each of the four models is trained on the labeled data and used to predict new words, and the models are then compared.

The experimental results are shown in FIG. 5. The comparison shows that the CRF model alone has the lowest precision, recall and F value, because it learns only simple language patterns. The LSTM (long short-term memory) model overcomes the limitation on text sequence length, so its results improve. The BI-LSTM-CRF model used in the invention reaches a precision of 84.32, a recall of 80.67 and an F value of 82.45; it learns the relationship between preceding and following context and also overcomes the sequence-length limitation, so its performance is clearly better.
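The reported F value is consistent with the usual harmonic mean of precision and recall, as the following small Python check shows. The formula F = 2PR / (P + R) is the standard F1 definition and is assumed here, since the text does not state it.

```python
def f_value(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Figures reported for the BI-LSTM-CRF model.
    print(round(f_value(84.32, 80.67), 2))  # 82.45
```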

The invention and its embodiments have been described above; the description is not limiting, and what is shown in the drawings is only one embodiment of the invention, to which the actual construction is not limited. In summary, if a person of ordinary skill in the art, inspired by this disclosure and without departing from the gist of the invention, devises structures and embodiments similar to this technical solution without inventive effort, these shall fall within the scope of protection of the invention.

Claims

1. A domain dictionary construction method based on web retrieval and new word discovery, characterized in that the method comprises the following implementation steps:

(1) Constructing a seed dictionary;

(3) Extracting domain words from Baidu Baike based on rules; performing word segmentation and dependency parsing on the online news data and forum data using the word segmentation and dependency parsing methods in LTP, and then extracting domain words based on the rules;

(4) Evaluating and analyzing the extracted domain words, wherein the evaluation comprises two parts, one being the evaluation of the extraction patterns and the other being the statistical analysis of the extracted domain words, and the evaluation of an extraction pattern assesses the quality of the extracted domain words according to the accuracy and the quantity of the extracted information;

(4.1) calculating the extraction accuracy, i.e. the relevance between the information extracted by an extraction pattern and the domain, using the formula:

$$\text{relevance rate}_i = \frac{\text{rel-freq}_i}{\text{total-freq}_i}$$

wherein $\text{rel-freq}_i$ is the amount of information that extraction pattern $i$ extracts from the relevant text, and $\text{total-freq}_i$ is the amount of information that extraction pattern $i$ extracts from the training corpus, so that the ratio can be used to measure relevance;

(4.2) calculating the extraction quantity, using the formula:

$$\log_2(\text{frequency})$$

wherein frequency is the amount of information that the extraction pattern extracts from the relevant text;

(4.3) combining the formulas of step (4.1) and step (4.2), scoring the extraction pattern by $\text{relevance rate} \times \log_2(\text{frequency})$, wherein when the relevance rate is less than or equal to 0.5 the score is 0, because the pattern is then considered irrelevant to the domain text;

(4.4) for the selected extraction patterns, performing statistical analysis on the extracted words: given a corpus, the extraction patterns that occur in the corpus, and a set of seed words such as value-added tax, accounting and surtax, finding all extraction patterns that can extract the seed words, scoring each extraction pattern according to the relevance between the words it extracts and the seed words, and scoring each extracted item by the following formula:

$$\text{score}(NP_i) = \sum_{k=1}^{N_i} \bigl(1 + 0.01 \times \text{score}(\text{pattern}_k)\bigr)$$

wherein $\text{score}(\text{pattern}_k)$ is the score of pattern $k$, which can be calculated by the extraction-pattern evaluation method; if the extracted item $NP_i$ can be extracted by $N_i$ extraction patterns, its score is $N_i$ plus 0.01 times the sum of the scores of all the extraction patterns that can extract it;

(4.5) the formula for calculating $\text{score}(\text{pattern}_k)$ is as follows:

$$\text{score}(\text{pattern}_i) = R_i \times \log(F_i), \qquad R_i = \frac{F_i}{N_i}$$

wherein $F_i$ is the number of distinct (de-duplicated) seed-set words extracted by pattern $i$, $N_i$ is the number of distinct words of any kind extracted by pattern $i$, and $R_i$ represents the relevance of the extracted words to the seed set;

(5) Performing OCR on domain-specific books to obtain text documents, removing cases, formulas and charts from the recognized text to obtain text documents of domain-specific data, and storing them in a text database;

2. The domain dictionary construction method based on web retrieval and new word discovery according to claim 1, wherein the seed dictionary in step (1) is constructed as follows: a domain scope is first given; seed words are then added by experts in the specific domain, or a set of domain words is obtained from an authoritative website and added to the seed dictionary as seed words.

3. The domain dictionary construction method based on web retrieval and new word discovery according to claim 1, wherein: the method for acquiring the web domain data in the step (3) comprises the following steps: based on the seed dictionary, a focused web crawler method and an incremental web crawler method are used.

4. The domain dictionary construction method based on web retrieval and new word discovery according to claim 1, wherein word segmentation and dependency parsing in step (3) and step (6) are implemented using the LTP tools.

5. The domain dictionary construction method based on web retrieval and new word discovery according to claim 1, wherein step (5) comprises: performing OCR on domain-specific books using tesserocr, a third-party Python module, obtaining the corresponding text documents, and storing them in a text database.

6. The domain dictionary construction method based on web retrieval and new word discovery according to claim 1, wherein step (8) comprises: new word discovery is achieved based on mutual information and the sequence model BiLSTM-CRF; for each input $X = (x_1, x_2, \ldots, x_n)$ there is a predicted label sequence $y = (y_1, y_2, \ldots, y_n)$, wherein the prediction score formula is as follows: