CN111325018B - Domain dictionary construction method based on web retrieval and new word discovery - Google Patents

Domain dictionary construction method based on web retrieval and new word discovery

Info

Publication number
CN111325018B
CN111325018B CN202010068095.6A CN202010068095A CN111325018B CN 111325018 B CN111325018 B CN 111325018B CN 202010068095 A CN202010068095 A CN 202010068095A CN 111325018 B CN111325018 B CN 111325018B
Authority
CN
China
Prior art keywords
words
domain
seed
dictionary
extracted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010068095.6A
Other languages
Chinese (zh)
Other versions
CN111325018A (en)
Inventor
杜梦豪
赵琨
刘杰鹏
丁健
梁栋彬
袁显峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Hengqi Education And Training Co ltd
Original Assignee
Shanghai Hengqi Education And Training Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Hengqi Education And Training Co ltd
Priority to CN202010068095.6A
Publication of CN111325018A
Application granted
Publication of CN111325018B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention discloses a domain dictionary construction method based on web retrieval and new word discovery. It addresses the diversity and richness of text data (both network data and document data) and the fact that domain words also occur among new words. The method comprises two parts: crawling network data based on a seed dictionary and then extracting domain words with custom extraction patterns; and learning the cohesion and degree of freedom between words from mutual information and left and right entropy, then discovering new words with a BiLSTM-CRF model. Compared with the prior art, the invention has the following advantages: it learns the cohesion and degree of freedom between words from mutual information and left and right entropy, then learns the context information of the text with the BiLSTM-CRF model, which improves the overall recognition rate of low-frequency words; and it verifies the extracted words and the discovered new words by retrieval and statistical methods, which removes the need for manual verification and can improve the quality of the extracted domain words.

Description

Domain dictionary construction method based on web retrieval and new word discovery
Technical Field
The invention relates to the fields of natural language processing, deep learning and web retrieval, and in particular to a domain dictionary construction method based on web retrieval and new word discovery.
Background
In the 21st century, "artificial intelligence" has entered public view and is constantly changing our lives. For example, Apple's Siri, JD.com's warehouse robots, Alipay's QR codes, and the chatbots of Baidu, Microsoft, Alibaba and others all rely on artificial intelligence and bring convenience to daily life. Natural language processing is one of the typical fields of artificial intelligence technology, and years of research have produced remarkable results in it, such as machine translation, question answering systems, reading comprehension and text generation. These results depend on dictionaries to varying degrees. Feature words extracted with a dictionary are used as input to a machine learning model or algorithm for machine translation, question analysis and the like, which greatly improves the reliability of the analysis. In domain-specific knowledge question answering and reading comprehension in particular, a large general-purpose dictionary cannot achieve good analysis results. Constructing a domain dictionary quickly and effectively is therefore an important task in natural language processing.
With the explosive development of communication and internet technology and the advent of the Web 3.0 era, governments, enterprises and organizations, and internet users all participate in different forms. Governments take the form of news media, enterprises and organizations take the form of corporate websites, and internet users share information from different fields through discussion platforms and forums. This information covers different domain topics, so a large number of new words keep emerging. The internet is therefore an important information source for domain dictionary construction. Books, journals and other documents in a specific field are another important information source. Obtaining domain-specific data from the web and from books is thus one of the effective approaches.
Many methods for domain dictionary construction are possible, and there are many related studies. Yin Wenke et al. proposed a domain dictionary construction method based on the Wikipedia link structure graph, combining the LSI algorithm and the CPMw algorithm. However, the LSI model has many disadvantages: it is difficult to select a suitable number of topics, the SVD computation is very time-consuming, and, because LSI is not a probabilistic model, it lacks a statistical basis and its results are difficult to interpret intuitively. Li Weiqing et al. proposed a method for building a product feature dictionary which, on the basis of a large amount of labeled data, computes semantic similarity using an extended synonym thesaurus and the Word2Vec tool and groups the features to build the dictionary. However, the data must be annotated, which is time-consuming and labor-intensive. Moreover, these dictionary construction methods do not take the appearance of new words into account. A new word discovery module therefore needs to be added when building a domain dictionary.
Traditional new word discovery relies mainly on statistical methods and rule-based methods. Statistical methods apply to a wide range of fields but have low accuracy, require a large corpus and are computationally expensive. Rule-based methods build a rule base from linguistic features; they are highly accurate, but constructing the rules is complex and they transfer poorly across domains. Using mutual information together with left and right entropy, the correlation between two adjacent words is calculated, so that the cohesion of words and the word boundaries are determined more reliably. A BiLSTM-CRF deep neural network model learns the text context better and can identify more low-frequency new words. The method therefore first extracts domain words based on web retrieval and a seed dictionary, then discovers new domain words based on BiLSTM-CRF, the seed dictionary, mutual information and statistical rules, and finally constructs the domain dictionary.
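The statistics mentioned here can be illustrated with a small sketch. It assumes the text is already segmented into tokens and computes, for each adjacent token pair, pointwise mutual information (a cohesion measure) and the entropy of its left and right neighbours (a freedom measure). The function name `cohesion_and_freedom` and the use of base-2 logarithms are illustrative choices, not details taken from the patent.

```python
import math
from collections import Counter, defaultdict


def cohesion_and_freedom(tokens):
    """Return {(w1, w2): (pmi, left_entropy, right_entropy)} for adjacent pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())

    # Collect the tokens seen immediately left and right of every pair.
    left_ctx = defaultdict(Counter)
    right_ctx = defaultdict(Counter)
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i + 1])
        if i > 0:
            left_ctx[pair][tokens[i - 1]] += 1
        if i + 2 < len(tokens):
            right_ctx[pair][tokens[i + 2]] += 1

    def entropy(counter):
        n = sum(counter.values())
        return -sum(c / n * math.log2(c / n) for c in counter.values()) if n else 0.0

    stats = {}
    for pair, count in bigrams.items():
        p_xy = count / total_bi
        p_x = unigrams[pair[0]] / total_uni
        p_y = unigrams[pair[1]] / total_uni
        pmi = math.log2(p_xy / (p_x * p_y))  # high PMI: the pair sticks together
        stats[pair] = (pmi, entropy(left_ctx[pair]), entropy(right_ctx[pair]))
    return stats
```

A pair with high mutual information and high entropy on both sides is a plausible word candidate: its parts co-occur strongly, while its surroundings vary freely.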
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the above technical defects by providing a domain dictionary construction method based on web retrieval and new word discovery.
To solve the above problem, the technical solution of the invention is a domain dictionary construction method based on web retrieval and new word discovery, comprising the following steps:
(1) Constructing a seed dictionary;
(2) Based on the seed dictionary, searching for the seed words in Baidu Baike (Baidu Encyclopedia), online news and forums respectively to acquire web domain data;
(3) Extracting domain words from Baidu Baike based on rules; performing word segmentation and dependency parsing on the online news data and forum data using the word segmentation and dependency parsing methods in LTP, and then extracting domain words based on specific rules;
(4) Evaluating and analyzing the extracted domain words;
(5) Performing OCR on domain-specific books to obtain text documents, removing cases, formulas, charts and the like from the recognized text to obtain text documents of domain-specific data, and storing them in a text database;
(6) Performing word segmentation, part-of-speech tagging and dependency parsing on the domain-specific book data using the part-of-speech tagging and dependency parsing methods in LTP, extracting domain words based on specific rules, and executing step (4) on the extracted new words;
(7) Using jieba word segmentation with the unidentified-word dictionary and the seed dictionary loaded, segmenting and labeling the domain-specific book data;
(8) Performing statistical analysis on the data using mutual information and left and right entropy, and using the result as the weight of the word vectors input to the sequence model; training on the labeled data with the sequence model BiLSTM-CRF, evaluating the model by accuracy, recall and F value, predicting new words with the trained model, and executing step (4).
As an improvement, the construction of the seed dictionary in step (1) comprises: first specifying the scope of the domain, then having experts in that domain add seed words, or obtaining some domain words from authoritative websites, and adding these domain words to the seed dictionary as seed words.
As an improvement, the method for acquiring web domain data in step (3) comprises: using a focused web crawler method and an incremental web crawler method based on the seed dictionary.
As an improvement, word segmentation and dependency parsing in step (3) and step (6) are implemented with the LTP (Language Technology Platform of Harbin Institute of Technology) tools.
As an improvement, the evaluation method in step (4) consists of two parts: one evaluates the extraction patterns, and the other performs statistical analysis on the extracted domain words. The evaluation of an extraction pattern assesses the quality of the extracted domain words according to the accuracy and the quantity of the extracted information.
(4.1) Calculating the accuracy of extraction, that is, the relevance between the information extracted by the extraction pattern and the domain, with the following formula:
$$\text{relevance rate}_i = \frac{\text{rel-freq}_i}{\text{total-freq}_i}$$
where $\text{rel-freq}_i$ is the amount of information that extraction pattern i extracts from the relevant text, and $\text{total-freq}_i$ is the amount of information that extraction pattern i extracts from the whole training corpus (including relevant and irrelevant text); the ratio can therefore be used to measure relevance (the relevance rate).
(4.2) Calculating the quantity of extraction, with the following formula:
$$\log_2(\text{frequency})$$
where frequency is the amount of information that the extraction pattern extracts from the relevant text.
(4.3) Combining the formulas of step (4.1) and step (4.2), the extraction pattern is scored with $\text{relevance rate} \times \log_2(\text{frequency})$. When the relevance rate is ≤ 0.5 the score is 0, because such a pattern is considered irrelevant to the relevant text.
(4.4) For the selected extraction patterns, performing statistical analysis on the extracted words. Given a corpus, the extraction patterns that occur in the text corpus are used. Given a set of seed words, such as value-added tax, accounting and additional tax, all extraction patterns that can extract the seed words are found, and each extraction pattern is scored according to the relevance between the words it extracts and the seed words. The score of an extracted item is calculated as follows:
$$\text{score}(NP_i) = \sum_{k=1}^{N_i} \left(1 + 0.01 \times \text{score}(\text{pattern}_k)\right)$$
where $\text{score}(\text{pattern}_k)$ is the score of pattern k, calculated by the evaluation method for extraction patterns. If the extracted information $NP_i$ can be extracted by $N_i$ extraction patterns, its score is the sum of the scores of all the extraction patterns that can extract it, multiplied by 0.01, plus $N_i$.
(4.5) The formula for calculating the pattern score $\text{score}(\text{pattern}_k)$, written here for a pattern with index i, is as follows:
$$\text{score}(\text{pattern}_i) = R_i \times \log(F_i)$$
where $F_i$ is the number of distinct seed-set words extracted by pattern i, $N_i$ is the number of distinct words extracted by pattern i in total, and $R_i = F_i / N_i$ represents the relevance of the extracted words to the seed set.
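The scoring rules of steps (4.1) to (4.5) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the patented implementation: the function names are hypothetical, the ratio $R_i = F_i / N_i$ is assumed from the definitions of $F_i$ and $N_i$, the unspecified logarithm in step (4.5) is taken as the natural logarithm, and patterns that extract nothing relevant are given a score of 0.

```python
import math


def pattern_score(rel_freq, total_freq):
    """Steps (4.1)-(4.3): relevance rate * log2(frequency), 0 if the rate is <= 0.5."""
    if total_freq == 0 or rel_freq == 0:
        return 0.0  # assumption: an empty pattern scores 0
    relevance_rate = rel_freq / total_freq
    if relevance_rate <= 0.5:
        return 0.0
    return relevance_rate * math.log2(rel_freq)


def seed_pattern_score(extracted_words, seed_words):
    """Step (4.5): R_i * log(F_i), with R_i = F_i / N_i (assumed) and natural log (assumed)."""
    unique = set(extracted_words)
    n_i = len(unique)                     # distinct words the pattern extracts
    f_i = len(unique & set(seed_words))   # distinct seed words among them
    if f_i == 0:
        return 0.0
    return (f_i / n_i) * math.log(f_i)


def word_score(pattern_scores):
    """Step (4.4): N_i plus 0.01 times the summed scores of the N_i extracting patterns."""
    return len(pattern_scores) + 0.01 * sum(pattern_scores)
```

For instance, a pattern that extracts 8 items from relevant text out of 10 in the whole corpus has a relevance rate of 0.8 and a score of 0.8 × log2(8) = 2.4. A word extracted by two patterns scoring 2.4 and 1.0 receives 2 + 0.01 × 3.4 = 2.034, so the number of supporting patterns dominates and the pattern scores act as a tie-breaker.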
As an improvement, step (5) comprises: performing OCR on domain-specific books with tesserocr, a third-party Python module, obtaining the corresponding text documents, and storing them in a text database.
As an improvement, step (8) comprises: implementing new word discovery based on mutual information and the sequence model BiLSTM-CRF. For each input $X = (x_1, x_2, \dots, x_n)$ there is a predicted label sequence $y = (y_1, y_2, \dots, y_n)$, whose prediction score is:
$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$
where $P_{i, y_i}$ is the softmax output probability of label $y_i$ at the i-th position, $A_{y_i, y_{i+1}}$ is the transition probability from $y_i$ to $y_{i+1}$, and $y_0$ and $y_{n+1}$ denote the added start and end tags. When the number of tags (B-person, B-location, …) is n, the transition probability matrix has size (n+2) × (n+2).
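To make the prediction score concrete, the following sketch adds up the transition terms (including the added start and end tags, which account for the two extra rows and columns of the transition matrix) and the per-position output terms for one candidate label sequence. The names `path_score`, `emissions` and `transitions` are illustrative, and the numbers in the usage note are made up for the example.

```python
def path_score(emissions, transitions, tags, start, end):
    """Score one label sequence.

    emissions[i][t]   : output score of tag t at position i
    transitions[a][b] : score of moving from tag a to tag b
    start, end        : indices of the two extra tags in the transition matrix
    """
    padded = [start] + list(tags) + [end]
    transition_total = sum(transitions[a][b] for a, b in zip(padded, padded[1:]))
    emission_total = sum(emissions[i][t] for i, t in enumerate(tags))
    return transition_total + emission_total


# Two real tags (0 and 1) plus start (2) and end (3): a 4 x 4 transition matrix.
example_transitions = [[0.0] * 4 for _ in range(4)]
example_transitions[2][0] = 0.5   # start -> tag 0
example_transitions[0][1] = 0.3   # tag 0 -> tag 1
example_transitions[1][3] = 0.4   # tag 1 -> end
example_emissions = [[0.9, 0.1], [0.2, 0.8]]
example_score = path_score(example_emissions, example_transitions, [0, 1], start=2, end=3)
```

Here the transitions contribute 0.5 + 0.3 + 0.4 and the outputs contribute 0.9 + 0.8, so `example_score` is 2.9 up to floating-point rounding. At prediction time a CRF layer searches for the label sequence that maximizes this score.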
Compared with the prior art, the invention has the advantages that:
The invention learns the cohesion and degree of freedom between words from mutual information and left and right entropy, then learns the context information of the text with the BiLSTM-CRF model, which improves the overall recognition rate of low-frequency words. It also verifies the extracted words and the discovered new words by retrieval and statistical methods, which removes the need for manual verification and can improve the quality of the extracted domain words.
Drawings
FIG. 1 is a flow chart of the domain dictionary construction framework of the invention, which comprises seed dictionary construction, a domain word extraction method based on a seed dictionary and web retrieval, and a domain new word discovery method based on a seed dictionary and a sequence model.
FIG. 2 is a flow chart of domain word extraction based on a seed dictionary and web retrieval.
FIG. 3 is a flow chart of domain new word discovery based on a seed dictionary and a sequence model.
FIG. 4 is a diagram of the BiLSTM-CRF new word discovery model of the invention.
FIG. 5 shows the evaluation of new word discovery results in terms of accuracy, recall and F value for different sequence models in an embodiment of the invention.
FIG. 6 is a flow chart of the invention.
Detailed Description
The invention is further described below through specific examples, but it is not limited to these examples. Modifications, combinations or substitutions made within the scope of the invention, or without departing from its spirit and scope, will be apparent to those skilled in the art and are included within the scope of the invention.
As shown in FIGS. 1 to 5, the main steps of the invention are domain seed dictionary construction, web retrieval data acquisition, domain word extraction based on the seed dictionary and web retrieval, domain document data acquisition, and domain new word discovery based on the seed dictionary and a sequence model. The steps are as follows:
ST1: Construct the domain seed dictionary. The seed dictionary provides a reference standard for the domain word extraction model and the domain new word discovery model, so its size and quality affect the construction of the final domain dictionary. There are two ways to build it: construction by experts, or taking a portion of the words from professional books. The seed dictionary should be of moderate size (around fifty words), its seed words should be of diverse types (for example, when building an entity dictionary, it should cover person names, place names, organization names and so on as far as possible), and its parts of speech should be diverse (for example, verbs and nouns).
ST2: Retrieve web data based on the seed dictionary.
The domain seed dictionary is obtained from ST1. Each seed word is then used as a search term in Baidu Baike, online news and forums to obtain search content. Finally, the search content is preprocessed to remove non-text information such as web page formatting, numbers, letters, underscores and emoticons. The specific flow is as follows.
(1) Input: seed dictionary
(2) Process:
for each seed word in the seed dictionary:
    in Baidu Baike, search for the seed word using the focused and incremental web crawler methods and obtain the search results;
    in online news, search for the seed word using the focused and incremental web crawler methods and obtain the search results;
    in forums, search for the seed word using the focused and incremental web crawler methods and obtain the search results;
    preprocess the search results, remove the non-text information, and store the cleaned text as text data.
After each pass is completed, update the seed words.
End the retrieval process.
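The retrieval loop above can be sketched as follows. The sketch deliberately leaves out real network access: each source is represented by a hypothetical `fetch(seed_word)` callable that returns `(url, raw_html)` pairs, so the names `retrieve`, `preprocess` and `sources` are assumptions made for illustration. "Focused" is approximated by keeping only pages that still mention the seed word after cleaning, and "incremental" by skipping URLs that were already stored; emoticon removal is omitted.

```python
import re


def preprocess(raw_html):
    """Strip markup, then ASCII letters, digits and underscores, as ST2 describes."""
    text = re.sub(r"<[^>]+>", " ", raw_html)      # web page formatting
    text = re.sub(r"[A-Za-z0-9_]+", " ", text)    # letters, numbers, underscores
    return re.sub(r"\s+", " ", text).strip()


def retrieve(seed_words, sources, seen=None):
    """sources maps a source name to fetch(seed_word) -> list of (url, raw_html)."""
    seen = set() if seen is None else seen
    documents = []
    for seed in seed_words:
        for name, fetch in sources.items():
            for url, raw_html in fetch(seed):
                if url in seen:       # incremental: skip pages already stored
                    continue
                seen.add(url)
                cleaned = preprocess(raw_html)
                if seed in cleaned:   # focused: keep only pages mentioning the seed
                    documents.append(
                        {"seed": seed, "source": name, "url": url, "text": cleaned}
                    )
    return documents
```

Passing the same `seen` set to later calls lets a second pass, run after the seed words have been updated, fetch only pages that were not stored before.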
ST3: Extract domain words based on rules, then evaluate the extracted words by statistical methods and retrieval. The domain word extraction flow chart is shown in FIG. 2.
First, the domain seed dictionary and the data retrieved from the web are obtained from ST1 and ST2. Second, extraction patterns are constructed; the seed dictionary serves as the activation condition that activates an extraction pattern. Finally, retrieval and statistics determine whether an extracted word is added to the domain seed dictionary.
The extraction patterns are defined in the following table:
Part of speech of seed word | Dependency between seed word and new word | Part of speech of extracted word
Noun, verb | Subject-verb relation | Noun or verb
Noun, verb | Verb-object relation | Noun or verb
Noun, verb | Object-object relation | Noun or verb
Noun, verb | Attributive (modifier-head) relation | Noun
Noun, verb | Adverbial (adverbial-head) relation | Noun
With a seed word as the activation condition, a new word is extracted whenever the dependency relation of an extraction pattern is satisfied, and the extracted words form a new word store. The words are then ranked using retrieval and the extraction pattern evaluation method of step (4). The whole extraction method is as follows.
(1) Input: seed dictionary, web retrieval data
(2) Process:
perform word segmentation, part-of-speech tagging and dependency parsing on the web retrieval data with LTP
for each seed word in the seed dictionary:
    the seed word activates the extraction patterns
    extract new words according to the part of speech of the seed word and of the new word
    store the seed word, the extraction pattern and the new words in the text database
use the new words as keywords to search Baidu Baike, and put the results directly into the seed dictionary. Rank the new words by the extraction pattern evaluation of step (4), take the top 5 words, put the first 3 into the seed dictionary and the last 2 into the unidentified-word dictionary.
(3) Output: seed dictionary and unidentified-word dictionary.
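The activation step can be sketched on input that has already been parsed. The sketch does not call LTP; it assumes each sentence arrives as a list of `(word, pos, head_index, relation)` tuples, and the tag names (`n`, `v`, `SBV`, `VOB`, `ATT`, `ADV`) and the `PATTERNS` table are hypothetical stand-ins for whatever label set the parser actually emits.

```python
# Hypothetical pattern table: relation label -> POS tags allowed for the extracted word.
PATTERNS = {
    "SBV": {"n", "v"},   # subject-verb
    "VOB": {"n", "v"},   # verb-object
    "ATT": {"n"},        # attributive
    "ADV": {"n"},        # adverbial
}
SEED_POS = {"n", "v"}    # a seed word must be a noun or a verb to activate a pattern


def extract_candidates(sentence, seeds):
    """sentence: list of (word, pos, head_index, relation); head_index is -1 for the root."""
    found = []
    for word, pos, head, rel in sentence:
        if head < 0 or rel not in PATTERNS:
            continue
        head_word, head_pos = sentence[head][0], sentence[head][1]
        # Either end of the dependency arc may be the seed; the other end is the candidate.
        for seed, seed_pos, other, other_pos in (
            (word, pos, head_word, head_pos),
            (head_word, head_pos, word, pos),
        ):
            if (seed in seeds and seed_pos in SEED_POS
                    and other not in seeds and other_pos in PATTERNS[rel]):
                found.append((seed, rel, other))
    return found
```

Each returned triple records the seed word, the activated relation and the extracted candidate, which corresponds to the three items the procedure stores in the text database.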
ST4: Acquire domain document data. There are two main methods: one crawls documents from the web to obtain text data; the other performs OCR on domain-specific materials to obtain domain-specific text data.
ST5: Discover new words based on the BiLSTM-CRF model and the seed dictionary. The new word discovery flow chart is shown in FIG. 3.
First, the unidentified-word dictionary is constructed from the data obtained in ST3 and ST4. Then the cohesion and degree of freedom between words are computed from mutual information and used as the weight of the input word vectors. Next, the data is labeled based on the seed dictionary and the unidentified-word dictionary. Finally, the BiLSTM-CRF model is trained on the data and new domain words are discovered. The overall method is as follows.
(1) Input: seed dictionary, unidentified-word dictionary, document data
(2) Train the domain word vectors X.
(3) Calculate the cohesion and degree of freedom W from mutual information and left and right entropy, and use WX as the model input.
(4) Label the training data based on the seed dictionary and the unidentified-word dictionary; each label is one of the four labels B-New, I-New, E-New and O.
(5) Train the BiLSTM-CRF network model.
(6) Evaluate the trained model.
(7) Label the corpus with the trained model to obtain new words.
(8) Use the new words as keywords to search Baidu Baike, and put the results directly into the seed dictionary. Rank the new words by the extraction pattern evaluation of step (4), take the top 5 words, put the first 3 into the seed dictionary and the last 2 into the unidentified-word dictionary.
(9) Output: seed dictionary and unidentified-word dictionary.
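The dictionary-based labeling step can be illustrated with a short sketch. It assumes character-level tagging and greedy longest-match lookup against the combined seed and unidentified-word dictionaries; the patent does not state either detail, and `label_characters` is a hypothetical name. Because the label set has no single-character tag, one-character dictionary entries are left as O.

```python
def label_characters(text, dictionary):
    """Tag each character with B-New / I-New / E-New / O by longest dictionary match."""
    labels = ["O"] * len(text)
    max_len = max((len(w) for w in dictionary), default=0)
    i = 0
    while i < len(text):
        match = 0
        # Try the longest possible word first; words shorter than 2 characters are skipped.
        for length in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + length] in dictionary:
                match = length
                break
        if match:
            labels[i] = "B-New"
            for j in range(i + 1, i + match - 1):
                labels[j] = "I-New"
            labels[i + match - 1] = "E-New"
            i += match
        else:
            i += 1
    return labels
```

For the text 缴纳增值税 and a dictionary containing 增值税, the result is O, O, B-New, I-New, E-New. Sequences labeled this way can serve as training targets for the sequence model.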
Embodiment one:
a domain dictionary construction method based on web retrieval and new word discovery.
Some of the experimental data used here is domain data obtained from the internet and from text documents.
The experimental environment is as follows: 16 GB of memory, an Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz processor, the Windows 10 operating system, and Python as the programming language. The basic indicators for testing the effectiveness of new word discovery are accuracy, recall and F1 value, and the experiment is evaluated on these three indicators.
In this embodiment, different sequence models are used for new word discovery and are evaluated in terms of accuracy, recall and F value:
The sequence models in this example are CRF, LSTM, LSTM-CRF and BI-LSTM-CRF. The four models are trained on the labeled data and used to predict new words, and the sequence models are then compared.
The experimental results are shown in FIG. 5. The comparison in FIG. 5 shows that the CRF sequence model used alone has the lowest accuracy, recall and F value, since it learns only simple language patterns. The LSTM (long short-term memory) model removes the limitation on text sequence length, so the results improve. The BI-LSTM-CRF model used in the invention reaches an accuracy, recall and F value of 84.32, 80.67 and 82.45 respectively; it learns the relationship between preceding and following context and removes the limitation on text sequence length, so the results improve markedly.
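The three reported figures can be checked against each other. Assuming the value reported as accuracy plays the role of precision and the F value is the usual harmonic mean (F1), a minimal sketch is shown below; the function names are illustrative.

```python
def f_value(precision, recall):
    """Harmonic mean of precision and recall (the F1 value)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted, gold):
    """Precision, recall and F1 over sets of predicted and reference new words."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_value(precision, recall)
```

With 84.32 and 80.67 as inputs, `f_value` gives about 82.45, which agrees with the F value reported for the BI-LSTM-CRF model.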
The invention and its embodiments have been described above without limitation, and what is shown in the drawings is only one embodiment of the invention; the actual construction is not limited to it. In summary, structures and embodiments similar to this technical solution that a person of ordinary skill in the art, inspired by this disclosure, devises without creative effort and without departing from the gist of the invention shall fall within the protection scope of the invention.

Claims (6)

1. A domain dictionary construction method based on web retrieval and new word discovery, characterized by comprising the following steps:
(1) Constructing a seed dictionary;
(2) Based on the seed dictionary, searching for the seed words in Baidu Baike (Baidu Encyclopedia), online news and forums respectively to acquire web domain data;
(3) Extracting domain words from Baidu Baike based on rules; performing word segmentation and dependency parsing on the online news data and forum data using the word segmentation and dependency parsing methods in LTP, and then extracting domain words based on the rules;
(4) Evaluating and analyzing the extracted domain words, wherein the evaluation method comprises two parts: one evaluates the extraction patterns, and the other performs statistical analysis on the extracted domain words; the evaluation of an extraction pattern assesses the quality of the extracted domain words according to the accuracy and the quantity of the extracted information;
(4.1) calculating the accuracy of extraction, that is, the relevance between the information extracted by the extraction pattern and the domain, with the following formula:
$$\text{relevance rate}_i = \frac{\text{rel-freq}_i}{\text{total-freq}_i}$$
where $\text{rel-freq}_i$ is the amount of information that extraction pattern i extracts from the relevant text, and $\text{total-freq}_i$ is the amount of information that extraction pattern i extracts from the training corpus, so that the ratio can be used to measure relevance;
(4.2) calculating the quantity of extraction, with the following formula:
$$\log_2(\text{frequency})$$
where frequency is the amount of information that the extraction pattern extracts from the relevant text;
(4.3) combining the formulas of step (4.1) and step (4.2), scoring the extraction pattern with $\text{relevance rate} \times \log_2(\text{frequency})$, where the score is 0 when the relevance rate is ≤ 0.5, because such a pattern is considered irrelevant to the relevant text;
(4.4) for the selected extraction patterns, performing statistical analysis on the extracted words: given a corpus, using the extraction patterns that occur in the text corpus; given a set of seed words, such as value-added tax, accounting and additional tax, finding all extraction patterns that can extract the seed words, and scoring each extraction pattern according to the relevance between the words it extracts and the seed words, with the following formula:
$$\text{score}(NP_i) = \sum_{k=1}^{N_i} \left(1 + 0.01 \times \text{score}(\text{pattern}_k)\right)$$
where $\text{score}(\text{pattern}_k)$ is the score of pattern k, which can be calculated by the evaluation method for extraction patterns; if the extracted information $NP_i$ can be extracted by $N_i$ extraction patterns, its score is the sum of the scores of all the extraction patterns that can extract it, multiplied by 0.01, plus $N_i$;
(4.5) calculating the pattern score $\text{score}(\text{pattern}_k)$, written here for a pattern with index i, with the following formula:
$$\text{score}(\text{pattern}_i) = R_i \times \log(F_i)$$
where $F_i$ is the number of distinct seed-set words extracted by pattern i, $N_i$ is the number of distinct words extracted by pattern i in total, and $R_i = F_i / N_i$ represents the relevance of the extracted words to the seed set;
(5) Performing OCR on domain-specific books to obtain text documents, removing cases, formulas and charts from the recognized text to obtain text documents of domain-specific data, and storing them in a text database;
(6) Performing word segmentation, part-of-speech tagging and dependency parsing on the domain-specific book data using the part-of-speech tagging and dependency parsing methods in LTP, extracting domain words based on specific rules, and executing step (4) on the extracted new words;
(7) Using jieba word segmentation with the unidentified-word dictionary and the seed dictionary loaded, segmenting and labeling the domain-specific book data;
(8) Performing statistical analysis on the data using mutual information and left and right entropy, and using the result as the weight of the word vectors input to the sequence model; training on the labeled data with the sequence model BiLSTM-CRF, evaluating the model by accuracy, recall and F value, predicting new words with the trained model, and executing step (4).
2. The domain dictionary construction method based on web retrieval and new word discovery according to claim 1, wherein the construction of the seed dictionary in step (1) comprises: first specifying the scope of the domain, then having experts in that domain add seed words, or obtaining some domain words from authoritative websites, and adding these domain words to the seed dictionary as seed words.
3. The domain dictionary construction method based on web retrieval and new word discovery according to claim 1, wherein the method for acquiring web domain data in step (3) comprises: using a focused web crawler method and an incremental web crawler method based on the seed dictionary.
4. The domain dictionary construction method based on web retrieval and new word discovery according to claim 1, wherein word segmentation and dependency parsing in step (3) and step (6) are implemented with the LTP tools.
5. The domain dictionary construction method based on web retrieval and new word discovery according to claim 1, wherein step (5) comprises: performing OCR on domain-specific books with tesserocr, a third-party Python module, obtaining the corresponding text documents, and storing them in a text database.
6. The domain dictionary construction method based on web retrieval and new word discovery according to claim 1, wherein step (8) comprises: implementing new word discovery based on mutual information and the sequence model BiLSTM-CRF, where for each input $X = (x_1, x_2, \dots, x_n)$ there is a predicted label sequence $y = (y_1, y_2, \dots, y_n)$, whose prediction score is:
$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$
where $P_{i, y_i}$ is the softmax output probability of label $y_i$ at the i-th position, $A_{y_i, y_{i+1}}$ is the transition probability from $y_i$ to $y_{i+1}$, and $y_0$ and $y_{n+1}$ denote the added start and end tags; when the number of tags (B-person, B-location, …) is n, the transition probability matrix has size (n+2) × (n+2).
CN202010068095.6A 2020-01-21 2020-01-21 Domain dictionary construction method based on web retrieval and new word discovery Active CN111325018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010068095.6A CN111325018B (en) 2020-01-21 2020-01-21 Domain dictionary construction method based on web retrieval and new word discovery

Publications (2)

Publication Number Publication Date
CN111325018A CN111325018A (en) 2020-06-23
CN111325018B CN111325018B (en) 2023-08-11

Family

ID=71166965

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010068095.6A Active CN111325018B (en) 2020-01-21 2020-01-21 Domain dictionary construction method based on web retrieval and new word discovery

Country Status (1)

Country Link
CN (1) CN111325018B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931501B (en) * 2020-09-22 2021-01-08 腾讯科技(深圳)有限公司 Text mining method based on artificial intelligence, related device and equipment
CN112364142A (en) * 2020-11-09 2021-02-12 上海恒企教育培训有限公司 Question matching method and device for vertical field, terminal and readable storage medium
CN114611486B (en) * 2022-03-09 2022-12-16 上海弘玑信息技术有限公司 Method and device for generating information extraction engine and electronic equipment
CN116108834A (en) * 2023-04-10 2023-05-12 中国民用航空飞行学院 Interactive user dictionary construction method, device and equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102360383A (en) * 2011-10-15 2012-02-22 西安交通大学 Method for extracting text-oriented field term and term relationship
CN103544246A (en) * 2013-10-10 2014-01-29 清华大学 Method and system for constructing multi-emotion dictionary for internet
CN106156286A (en) * 2016-06-24 2016-11-23 广东工业大学 Type extraction system and method towards technical literature knowledge entity
CN106407235A (en) * 2015-08-03 2017-02-15 北京众荟信息技术有限公司 A semantic dictionary establishing method based on comment data
CN107632974A (en) * 2017-08-08 2018-01-26 夏振宇 Suitable for multi-field Chinese analysis platform
CN107797991A (en) * 2017-10-23 2018-03-13 南京云问网络技术有限公司 A kind of knowledge mapping extending method and system based on interdependent syntax tree
CN110297913A (en) * 2019-06-12 2019-10-01 中电科大数据研究院有限公司 A kind of electronic government documents entity abstracting method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
黄文明 ; 杨柳青青 ; 任冲.结合信息量和深度学习的领域新词发现.《计算机工程与设计》.2019, *

Also Published As

Publication number Publication date
CN111325018A (en) 2020-06-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant