CN111325018A - Domain dictionary construction method based on web retrieval and new word discovery - Google Patents

Domain dictionary construction method based on web retrieval and new word discovery

Info

Publication number
CN111325018A
Authority
CN
China
Prior art keywords
words
extracted
seed
domain
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010068095.6A
Other languages
Chinese (zh)
Other versions
CN111325018B (en)
Inventor
杜梦豪
赵琨
刘杰鹏
丁健
梁栋彬
袁显峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Hengqi Education And Training Co ltd
Original Assignee
Shanghai Hengqi Education And Training Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Hengqi Education And Training Co ltd
Priority to CN202010068095.6A
Publication of CN111325018A
Application granted
Publication of CN111325018B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention discloses a domain dictionary construction method based on web retrieval and new word discovery. The method addresses the diversity and richness of text data (including network data and literature data) and the fact that domain words also occur among new words. It consists of two parts: crawling network data based on a seed dictionary and extracting domain words with self-defined extraction patterns; and learning the degree of freedom and the degree of adhesion between characters from mutual information and left-right entropy, then discovering new words with a BiLstm-CRF model. Compared with the prior art, the method learns the adhesion and freedom between characters from mutual information and left-right entropy and learns the context of the text with the BiLstm-CRF model, which improves the overall recognition rate of low-frequency words. The extracted words and the discovered new words are verified by retrieval and statistical methods, so manual verification is not needed and the quality of the extracted domain words can be improved.

Description

Domain dictionary construction method based on web retrieval and new word discovery
Technical Field
The invention relates to the fields of natural language processing, deep learning and web retrieval, and in particular to a domain dictionary construction method based on web retrieval and new word discovery.
Background
Since the beginning of the 21st century, artificial intelligence has entered public view and has steadily changed daily life. Apple's Siri, JD.com's warehouse robots, Alipay's QR codes, and the chat robots of Baidu, Microsoft and Alibaba are all applications of artificial intelligence that make daily life more convenient. Natural language processing is one of the typical fields of artificial intelligence, and years of research have produced remarkable results in machine translation, question answering systems, reading comprehension, text generation and similar tasks. These tasks depend on dictionaries to varying degrees: feature words extracted with a dictionary are used as input to a machine learning model or algorithm for machine translation, question analysis and so on, which greatly improves the reliability of the analysis. In domain-specific knowledge question answering and reading comprehension in particular, a large general-purpose dictionary cannot achieve good results. Constructing a domain dictionary quickly and effectively is therefore an important task in natural language processing.
With the rapid development of communication and internet technology and the arrival of the Web 3.0 era, governments, enterprises and internet users all participate online in different forms: the state through news media, enterprises through corporate websites, and internet users through discussion platforms and forums. The information they share covers different domain topics, and a large number of new words emerge from it. The internet is therefore an important information source for domain dictionary construction. Documents in a specific field, such as books and periodicals, are another important source. Acquiring domain-specific data from both the network and books is thus an effective approach.
Many methods for domain dictionary construction are feasible, and there is a good deal of related research. Yi Wen medicine et al. proposed a domain dictionary construction method based on the Wikipedia link structure graph, which combines the LSI algorithm and the CPMw algorithm. However, the LSI model has several disadvantages: it is difficult to choose an appropriate number of topics, the SVD computation is time-consuming, and the results are hard to interpret intuitively because LSI is not a probabilistic model and lacks a statistical basis. Another method computes semantic similarity using the extended edition of the synonym forest and the Word2Vec tool on the basis of a large amount of labeled data, summarizes the features, and constructs a product feature dictionary from them. However, the data must be annotated, which is time-consuming and labor-intensive. Moreover, these dictionary construction methods do not account for the appearance of new words. A new word discovery module therefore needs to be added when constructing a domain dictionary.
Traditional new word discovery relies mainly on statistical methods and rule-based methods. Statistical methods apply to a wide range of fields but have low accuracy, require a large corpus and are computationally expensive. Rule-based methods build a rule base from language features and are accurate, but constructing the rules is complex and the rules transfer poorly to other fields. Computing the correlation of two adjacent characters from information entropy and left-right entropy determines word adhesion and word boundaries more reliably. A BiLstm-CRF deep neural network model learns the text context better and can recognize more low-frequency new words. The invention therefore first extracts domain words based on web retrieval and the seed dictionary, then discovers new domain words based on BiLstm-CRF, the seed dictionary, mutual information and statistical rules, and finally constructs the domain dictionary.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the above technical defects and to provide a domain dictionary construction method based on web retrieval and new word discovery.
To solve the above problems, the technical scheme of the invention is a domain dictionary construction method based on web retrieval and new word discovery, comprising the following steps:
(1) constructing a seed dictionary;
(2) based on the seed dictionary, searching for the seed words in Baidu encyclopedia, network news and forums respectively to obtain web domain data;
(3) extracting domain words from the Baidu encyclopedia data based on rules; performing word segmentation and dependency syntax analysis on the network news data and forum data with the word segmentation and dependency parsing methods in LTP, and then extracting domain words based on specific rules;
(4) evaluating and analyzing the extracted domain words;
(5) performing OCR recognition on books in the specific field to obtain text documents, removing cases, formulas, charts and the like from the recognized text documents to obtain text documents of domain-specific data, and storing them in a text database;
(6) performing word segmentation, part-of-speech tagging and dependency syntax analysis on the domain-specific book data with the part-of-speech tagging and dependency parsing methods in LTP, then extracting domain words based on specific rules, and then executing step (4) on the extracted new words;
(7) using jieba word segmentation, loading the unrecognized-word dictionary and the seed dictionary, and segmenting and labeling the domain-specific book data;
(8) performing statistical analysis on the data with mutual information and left-right entropy, and using the result as the weight of the word vectors input to the sequence model; then training the sequence model BiLstm-CRF on the labeled data, evaluating the model by accuracy, recall and F value, predicting new words with the trained model, and finally executing step (4).
As an improvement, constructing the seed dictionary in step (1) comprises: first specifying the domain scope, and then adding seed words to the seed dictionary, where the seed words are either provided by experts in the specific domain or are a set of domain words obtained from the Weiwei website.
As an improvement, the web domain data in step (3) is acquired as follows: based on the seed dictionary, two crawling methods are used, namely a focused web crawler and an incremental web crawler.
As an improvement, the word segmentation and dependency syntax analysis in step (3) and step (6) are implemented with the LTP tool (the Language Technology Platform of Harbin Institute of Technology).
As an improvement, the evaluation method in step (4) consists of two parts: evaluation of the extraction patterns, and statistical analysis of the extracted domain words. The evaluation of an extraction pattern assesses the quality of the extracted domain words from the accuracy and the quantity of the information the pattern extracts.
(4.1) calculating the extraction accuracy, namely the relevance of the information extracted by an extraction pattern to the domain, by the following formula:
$$\text{relevance rate}_i = \frac{\text{rel-freq}_i}{\text{total-freq}_i}$$
where $\text{rel-freq}_i$ is the amount of information that extraction pattern $i$ extracts from the relevant text, and $\text{total-freq}_i$ is the amount of information that extraction pattern $i$ extracts from the whole training corpus (both relevant and irrelevant text); the ratio can therefore be used to measure relevance (the relevance rate).
(4.2) calculating the extraction quantity by the following formula:
$$\log_2(\text{frequency})$$
where frequency is the amount of information the extraction pattern extracts from the relevant text.
(4.3) combining the formulas of step (4.1) and step (4.2) into $\text{relevance rate} \times \log_2(\text{frequency})$ to score the extraction pattern; when the relevance rate is below 0.5, the score is 0 because the pattern is not considered relevant to the relevant text.
(4.4) performing statistical analysis on the extracted words according to the selected extraction patterns. Given a corpus, the extraction patterns present in the text corpus are used. Given a set of seed words, such as value-added tax, accounting and additional tax, all extraction patterns that can extract the seed words are found, and the extraction patterns are scored according to the correlation between their extracted words and the seed words. The score of an extracted item is calculated as follows:
$$\text{score}(NP_i) = \sum_{k=1}^{N_i} \bigl(1 + 0.01 \times \text{score}(\text{pattern}_k)\bigr)$$
where $\text{score}(\text{pattern}_k)$ is the score of pattern $k$, calculated by the extraction pattern evaluation method. An extracted item $NP_i$ can be extracted by $N_i$ extraction patterns, so its score is the sum of the scores of all extraction patterns that can extract it, multiplied by 0.01, plus $N_i$.
(4.5) the formula for calculating $\text{score}(\text{pattern}_i)$ is as follows:
$$\text{score}(\text{pattern}_i) = R_i \times \log(F_i)$$
where $F_i$ is the de-duplicated number of seed-set words extracted by pattern $i$, $N_i$ is the de-duplicated number of all words extracted by pattern $i$, and
$$R_i = \frac{F_i}{N_i}$$
indicates the relevance of the extracted words to the seed set.
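The scoring rules of steps (4.1) to (4.5) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the relevance rate is taken to be rel-freq divided by total-freq, R_i is assumed to be F_i / N_i, and the logarithm in step (4.5) is assumed to be the natural logarithm because the text does not state a base.

```python
import math
from typing import Sequence


def relevance_rate(rel_freq: int, total_freq: int) -> float:
    """Step (4.1): share of a pattern's extractions that come from relevant text."""
    return rel_freq / total_freq if total_freq else 0.0


def pattern_score(rel_freq: int, total_freq: int) -> float:
    """Steps (4.2)-(4.3): relevance rate * log2(frequency), set to 0 below 0.5."""
    rate = relevance_rate(rel_freq, total_freq)
    if rate < 0.5 or rel_freq == 0:
        return 0.0
    return rate * math.log2(rel_freq)


def rlogf(seed_hits: int, all_hits: int) -> float:
    """Step (4.5): R_i * log(F_i), assuming R_i = F_i / N_i and a natural log."""
    if seed_hits == 0 or all_hits == 0:
        return 0.0
    return (seed_hits / all_hits) * math.log(seed_hits)


def word_score(pattern_scores: Sequence[float]) -> float:
    """Step (4.4): N_i plus 0.01 times the summed scores of the N_i patterns."""
    return len(pattern_scores) + 0.01 * sum(pattern_scores)
```

For instance, a pattern with 8 extractions from relevant text out of 10 in total has a relevance rate of 0.8 and a score of 0.8 × log2(8) = 2.4, while a pattern with 4 out of 10 falls below the 0.5 threshold and scores 0.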
As an improvement, step (5) comprises: performing OCR recognition on books in the specific field with tesserocr, a third-party Python module, and storing the resulting text documents in a text database.
As an improvement, step (8) comprises: discovering new words based on mutual information and the sequence model BiLstm-CRF. For each input $X = (x_1, x_2, \ldots, x_n)$ there is a predicted label sequence $y = (y_1, y_2, \ldots, y_n)$, whose prediction score is defined as follows:
$$\text{score}(X, y) = \sum_{i=1}^{n} P_{i,y_i} + \sum_{i=0}^{n} A_{y_i,y_{i+1}}$$
where $P_{i,y_i}$ is the softmax output probability of label $y_i$ at the $i$-th position, and $A_{y_i,y_{i+1}}$ is the transition probability from $y_i$ to $y_{i+1}$. When the number of tags (B-person, B-location, …) is n, the transition probability matrix has size (n+2) × (n+2), the two extra rows and columns corresponding to the start and end tags $y_0$ and $y_{n+1}$.
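To make the prediction score concrete, the following sketch scores a single tag sequence. It assumes the usual BiLSTM-CRF form, in which the score is the sum of the per-position emission terms and the tag-to-tag transition terms, and it assumes that the two extra rows and columns of the (n+2) × (n+2) matrix are start and end tags. The function name and the index layout are hypothetical.

```python
from typing import Sequence


def path_score(emissions: Sequence[Sequence[float]],
               transitions: Sequence[Sequence[float]],
               tags: Sequence[int]) -> float:
    """Score one tag sequence: emission terms plus transition terms.

    emissions[i][t]   -- score of tag t at position i (k real tags per position)
    transitions[a][b] -- score of moving from tag a to tag b; indices k and
                         k + 1 are the assumed START and END tags, so the
                         matrix is (k + 2) x (k + 2)
    """
    k = len(emissions[0])
    start, end = k, k + 1
    padded = [start, *tags, end]
    emit = sum(emissions[i][t] for i, t in enumerate(tags))
    trans = sum(transitions[a][b] for a, b in zip(padded, padded[1:]))
    return emit + trans
```

With two real tags the transition matrix is 4 × 4. Training a CRF layer additionally needs the normalizing sum over all tag sequences, which this sketch leaves out.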
Compared with the prior art, the invention has the advantages that:
The method learns the degree of adhesion and the degree of freedom between characters from mutual information and left-right entropy, and learns the context of the text with the BiLstm-CRF model, which improves the overall recognition rate of low-frequency words. The extracted words and the discovered new words are verified by retrieval and statistical methods, so manual verification is not needed and the quality of the extracted domain words can be improved.
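As an illustration of the two statistics named above, the sketch below computes the pointwise mutual information of two adjacent characters (a measure of adhesion) and the left and right neighbour entropy of a candidate word (a measure of freedom) over a raw string. The function names and the plain character-count probability estimates are assumptions made for this example; the text does not specify how the statistics are estimated.

```python
import math
from collections import Counter


def pmi(text: str, left: str, right: str) -> float:
    """Pointwise mutual information of the adjacent character pair left+right."""
    chars = Counter(text)
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    p_pair = pairs[left + right] / max(len(text) - 1, 1)
    if p_pair == 0:
        return float("-inf")  # the pair never occurs together
    p_left = chars[left] / len(text)
    p_right = chars[right] / len(text)
    return math.log2(p_pair / (p_left * p_right))


def _entropy(counts: Counter) -> float:
    """Shannon entropy (base 2) of a neighbour-character distribution."""
    total = sum(counts.values())
    if not total:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def boundary_entropy(text: str, word: str) -> tuple:
    """Left and right neighbour entropy of every occurrence of word in text."""
    left, right = Counter(), Counter()
    start = text.find(word)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(word)
        if end < len(text):
            right[text[end]] += 1
        start = text.find(word, start + 1)
    return _entropy(left), _entropy(right)
```

A candidate whose characters have high mutual information and whose neighbours vary widely on both sides (high left and right entropy) behaves like an independent word, which is the intuition behind using these values as input weights.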
Drawings
FIG. 1 is a flowchart of the domain dictionary construction framework of the invention, comprising seed dictionary construction, domain word extraction based on the seed dictionary and web retrieval, and domain new word discovery based on the seed dictionary and a sequence model.
FIG. 2 is a flowchart of domain word extraction based on the seed dictionary and web retrieval according to the invention.
FIG. 3 is a flowchart of domain new word discovery based on the seed dictionary and a sequence model according to the invention.
FIG. 4 is a diagram of the BiLstm-CRF new word discovery model of the invention.
FIG. 5 shows the accuracy, recall and F value of new word discovery with different sequence models in the embodiment of the invention.
FIG. 6 is a flow chart of the present invention.
Detailed Description
The invention is further described below by way of specific examples, but it is not limited to these examples. Variations, combinations or substitutions that fall within the spirit and scope of the invention will be apparent to those skilled in the art and are within the scope of the invention.
As shown in FIG. 1 to FIG. 5, the main steps of the invention are domain seed dictionary construction, Web retrieval data acquisition, domain word extraction based on the seed dictionary and Web retrieval, domain literature data acquisition, and domain new word discovery based on the seed dictionary and a sequence model. The steps are as follows:
ST1: Construct the domain seed dictionary. The seed dictionary provides the reference basis for the domain word extraction model and the domain new word discovery model, so its size and quality affect the construction of the final domain dictionary. The seed dictionary can be built in two ways: by experts, or by taking words from a professional book. The seed dictionary should be of moderate size (about fifty words), with diverse types of seed words (for example, an entity dictionary should contain person names, place names, institution names and so on) and diverse parts of speech (such as verbs and nouns).
ST2: Retrieve Web data based on the seed dictionary.
The domain seed dictionary is obtained from ST1. The seed words are then used as search terms in Baidu encyclopedia, network news and forums, the retrieved content is collected, and finally the content is preprocessed to remove non-text information such as webpage markup, numbers, letters, underscores and emoticons. The specific process is as follows.
(1) Input: seed dictionary
(2) Process:
for each seed word in the seed dictionary:
    in Baidu encyclopedia, search for the seed word with the focused web crawler and incremental web crawler methods to obtain search results;
    in network news, search for the seed word with the focused web crawler and incremental web crawler methods to obtain search results;
    in forums, search for the seed word with the focused web crawler and incremental web crawler methods to obtain search results;
    preprocess the search results, remove non-text information, and store the remaining text as text data.
Once completed, update the seed words.
The retrieval process ends.
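The preprocessing step of this process could look roughly like the sketch below, which strips webpage markup, Latin letters, digits, underscores and common emoticon characters with regular expressions. The patterns, the emoji code point ranges and the function name `clean` are assumptions chosen for illustration and are not taken from the patent.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")                # HTML/XML tags
ASCII_NOISE_RE = re.compile(r"[A-Za-z0-9_]+")  # letters, digits, underscores
# Common emoji and symbol blocks (assumed ranges, not exhaustive).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
SPACE_RE = re.compile(r"\s+")


def clean(raw: str) -> str:
    """Remove markup and non-text noise from one retrieved page fragment."""
    text = TAG_RE.sub(" ", raw)
    text = ASCII_NOISE_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    # Collapse the gaps left by the removals into single spaces.
    return SPACE_RE.sub(" ", text).strip()
```

A production crawler would more likely use an HTML parser to drop scripts and styles as well; the regular expressions here only cover the kinds of noise the text lists.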
ST3: Extract domain words based on rules, then evaluate the extracted words with the statistical method and by retrieval. The domain word extraction flowchart is shown in FIG. 2.
First, the domain seed dictionary and the data retrieved from the web are obtained from ST1 and ST2. Second, the extraction patterns are constructed and activated with the seed dictionary as the activation condition. Finally, retrieval and statistics determine whether an extracted word is put into the domain seed dictionary.
The extraction patterns are defined as follows:
Part of speech of seed word | Dependency relation between seed word and new word | Part of speech of extracted word
Noun, verb | Subject-verb relation | Noun or verb
Noun, verb | Verb-object relation | Noun or verb
Noun, verb | Indirect-object relation | Noun or verb
Noun, verb | Attributive-head relation | Noun
Noun, verb | Relationship between aspects | Noun
With a seed word as the activation condition, a new word is extracted whenever the dependency relation of an extraction pattern is satisfied, and the extracted words form a new word bank. The words are then ranked by retrieval and by the extraction pattern evaluation method of step 4. The overall extraction method is as follows.
(1) Input: seed dictionary, web retrieval data
(2) Process:
perform word segmentation, part-of-speech tagging and dependency parsing on the web retrieval data with LTP
for each seed word in the seed dictionary:
    the seed word activates the extraction patterns above
    extract new words according to the part of speech of the seed word and the part of speech of the new word
    store the seed word, the extraction pattern and the new word in the text database
take each new word as a keyword and search for it in Baidu encyclopedia; if results exist, put the new word directly into the seed dictionary. Rank the new words by the extraction pattern evaluation of step 4 and take the top 5: put the first 3 into the seed dictionary and the remaining 2 into the unrecognized dictionary.
(3) Output: a seed dictionary and an unrecognized dictionary.
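The extraction procedure above can be sketched on sentences that have already been segmented, tagged and parsed. The sketch does not call LTP; it assumes each token carries a part-of-speech tag, the index of its head and a relation label, and it uses the labels SBV, VOB, IOB and ATT and the tags `n` and `v` as assumed stand-ins for four of the relations and the two parts of speech in the pattern table. All names are hypothetical, and further relations can be added to the mapping.

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class Token(NamedTuple):
    word: str
    pos: str   # part-of-speech tag, e.g. "n" (noun) or "v" (verb)
    head: int  # index of the head token, -1 for the root
    rel: str   # dependency relation to the head


# relation -> parts of speech allowed for the extracted word (assumed labels)
PATTERNS: Dict[str, Set[str]] = {
    "SBV": {"n", "v"},
    "VOB": {"n", "v"},
    "IOB": {"n", "v"},
    "ATT": {"n"},
}
SEED_POS = {"n", "v"}


def extract(sentence: List[Token], seeds: Set[str]) -> List[Tuple[str, str, str]]:
    """Return (seed word, relation, new word) triples activated by seed words."""
    found = []
    for tok in sentence:
        if tok.head < 0 or tok.rel not in PATTERNS:
            continue
        head = sentence[tok.head]
        # The seed word may sit on either end of the dependency arc.
        for seed, other in ((tok, head), (head, tok)):
            if (seed.word in seeds and seed.pos in SEED_POS
                    and other.word not in seeds
                    and other.pos in PATTERNS[tok.rel]):
                found.append((seed.word, tok.rel, other.word))
    return found


def split_top(ranked: List[str]) -> Tuple[List[str], List[str]]:
    """Top 5 ranked words: first 3 to the seed dictionary, next 2 to the unrecognized dictionary."""
    top = ranked[:5]
    return top[:3], top[3:]
```

Storing the triple rather than only the new word keeps the seed word and the pattern that produced it, which is what the process writes to the text database and what the pattern scoring of step 4 needs.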
ST4: Acquire domain literature data. There are two main methods: crawling documents from the network to obtain text data, and performing OCR recognition on domain-specific materials to obtain domain-specific text data.
ST5: Discover new words based on the BiLstm-CRF model and the seed dictionary. The new word discovery flowchart is shown in FIG. 3.
First, an unrecognized dictionary is constructed for the data acquired in ST4 by following ST3. Next, the degree of adhesion and the degree of freedom between characters are computed from mutual information and used as the weights of the input word vectors. The data is then labeled based on the seed dictionary and the unrecognized dictionary, and finally the BiLstm-CRF model is trained on the data to discover new domain words. The overall method for discovering domain words is as follows.
(1) Input: seed dictionary, unrecognized dictionary, document data
(2) Train the domain word vectors X.
(3) Calculate the adhesion and degree of freedom W from mutual information and left-right entropy, and use WX as the model input.
(4) Label the training data with the four labels (B-New, I-New, E-New, O) based on the seed dictionary and the unrecognized dictionary.
(5) Train the BiLstm-CRF network model. The training process is as follows:
[Figure: training process of the BiLstm-CRF network model]
(6) Evaluate the trained model.
(7) Label the corpus with the trained model to obtain new words.
(8) Take each new word as a keyword and search for it in Baidu encyclopedia; if results exist, put the new word directly into the seed dictionary. Rank the new words by the extraction pattern evaluation of step 4 and take the top 5: put the first 3 into the seed dictionary and the remaining 2 into the unrecognized dictionary.
(9) Output: a seed dictionary and an unrecognized dictionary.
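The labeling step of this procedure can be illustrated with a small dictionary-based tagger that assigns one of the four labels to every character. The sketch assumes greedy longest-match lookup against the combined seed and unrecognized dictionaries, and it tags single-character entries as O because the label set has no single-character label. Both choices, like the function name, are assumptions made for this example.

```python
from typing import Iterable, List


def label_chars(sentence: str, dictionary: Iterable[str]) -> List[str]:
    """Tag each character with B-New, I-New, E-New or O by longest-match lookup."""
    # Longest entries first, so a longer word wins over its own prefix.
    words = sorted({w for w in dictionary if len(w) >= 2}, key=len, reverse=True)
    labels = ["O"] * len(sentence)
    i = 0
    while i < len(sentence):
        match = next((w for w in words if sentence.startswith(w, i)), None)
        if match is None:
            i += 1
            continue
        last = i + len(match) - 1
        labels[i] = "B-New"
        for j in range(i + 1, last):
            labels[j] = "I-New"
        labels[last] = "E-New"
        i = last + 1
    return labels
```

The resulting label sequences, paired with the weighted character vectors, form the training data that the sequence model consumes.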
Embodiment 1:
A domain dictionary construction method based on web retrieval and new word discovery.
Part of the experimental data used here is domain data obtained from the internet on the basis of text literature.
The experimental environment is as follows: 16 GB of memory, an Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz processor, the Windows 10 operating system, and Python as the programming language. The basic metrics for testing the effectiveness of new word discovery are accuracy, recall and the F1 value, and the experiment is evaluated on these three metrics.
Embodiment 1 evaluates new word discovery with different sequence models in terms of accuracy, recall and F value:
In this example, four sequence models are compared, including CRF, LSTM-CRF and BI-LSTM-CRF. Each model is trained on the labeled data and used to predict new words, and the models are then compared and analyzed.
The experimental results are shown in FIG. 5. Comparing the results in FIG. 5, the CRF sequence model alone has the lowest accuracy, recall and F value because it learns only a simple language model. The LSTM (long short-term memory) model removes the limit on text sequence length, so the results improve. The BI-LSTM-CRF model used by the invention achieves an accuracy, recall and F value of 84.32, 80.67 and 82.45 respectively; it learns the relationship between preceding and following characters and removes the limit on text sequence length, so the effect is markedly better.
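The three metrics can be computed as in the sketch below, which assumes that "accuracy" here means precision (the share of predicted new words that are correct) and that the F value is the harmonic mean of precision and recall. The function names and the set-based comparison are illustrative assumptions.

```python
def f_value(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def evaluate(predicted: set, gold: set) -> tuple:
    """Precision, recall and F value of a set of predicted new words."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_value(precision, recall)
```

Under this reading the reported figures are mutually consistent: the harmonic mean of 84.32 and 80.67 is about 82.45, which matches the F value given for the BI-LSTM-CRF model.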
The present invention and its embodiments have been described above, and the description is not intended to be limiting, and the drawings are only one embodiment of the present invention, and the actual structure is not limited thereto. In summary, those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A domain dictionary construction method based on web retrieval and new word discovery, characterized by comprising the following steps:
(1) constructing a seed dictionary;
(2) based on the seed dictionary, searching for the seed words in Baidu encyclopedia, network news and forums respectively to obtain web domain data;
(3) extracting domain words from the Baidu encyclopedia data based on rules; performing word segmentation and dependency syntax analysis on the network news data and forum data with the word segmentation and dependency parsing methods in LTP, and then extracting domain words based on specific rules;
(4) evaluating and analyzing the extracted domain words;
(5) performing OCR recognition on books in the specific field to obtain text documents, removing cases, formulas, charts and the like from the recognized text documents to obtain text documents of domain-specific data, and storing them in a text database;
(6) performing word segmentation, part-of-speech tagging and dependency syntax analysis on the domain-specific book data with the part-of-speech tagging and dependency parsing methods in LTP, then extracting domain words based on specific rules, and then executing step (4) on the extracted new words;
(7) using jieba word segmentation, loading the unrecognized-word dictionary and the seed dictionary, and segmenting and labeling the domain-specific book data;
(8) performing statistical analysis on the data with mutual information and left-right entropy, and using the result as the weight of the word vectors input to the sequence model; then training the sequence model BiLstm-CRF on the labeled data, evaluating the model by accuracy, recall and F value, predicting new words with the trained model, and finally executing step (4).
2. The domain dictionary construction method based on web retrieval and new word discovery as claimed in claim 1, wherein constructing the seed dictionary in step (1) comprises: first specifying the domain scope, and then adding seed words to the seed dictionary, where the seed words are either provided by experts in the specific domain or are a set of domain words obtained from the Weiwei website.
3. The domain dictionary construction method based on web retrieval and new word discovery as claimed in claim 1, wherein the web domain data in step (3) is acquired as follows: based on the seed dictionary, two crawling methods are used, namely a focused web crawler and an incremental web crawler.
4. The domain dictionary construction method based on web retrieval and new word discovery as claimed in claim 1, wherein the word segmentation and dependency syntax analysis in step (3) and step (6) are implemented with the LTP tool (the Language Technology Platform of Harbin Institute of Technology).
5. The domain dictionary construction method based on web retrieval and new word discovery as claimed in claim 1, wherein the evaluation method in step (4) consists of two parts: evaluation of the extraction patterns, and statistical analysis of the extracted domain words. The evaluation of an extraction pattern assesses the quality of the extracted domain words from the accuracy and the quantity of the information the pattern extracts.
(4.1) calculating the extraction accuracy, namely the relevance of the information extracted by an extraction pattern to the domain, by the following formula:
$$\text{relevance rate}_i = \frac{\text{rel-freq}_i}{\text{total-freq}_i}$$
where $\text{rel-freq}_i$ is the amount of information that extraction pattern $i$ extracts from the relevant text, and $\text{total-freq}_i$ is the amount of information that extraction pattern $i$ extracts from the whole training corpus (both relevant and irrelevant text); the ratio can therefore be used to measure relevance (the relevance rate).
(4.2) calculating the extraction quantity by the following formula:
$$\log_2(\text{frequency})$$
where frequency is the amount of information the extraction pattern extracts from the relevant text.
(4.3) combining the formulas of step (4.1) and step (4.2) into $\text{relevance rate} \times \log_2(\text{frequency})$ to score the extraction pattern; when the relevance rate is below 0.5, the score is 0 because the pattern is not considered relevant to the relevant text.
(4.4) performing statistical analysis on the extracted words according to the selected extraction patterns. Given a corpus, the extraction patterns present in the text corpus are used. Given a set of seed words, such as value-added tax, accounting and additional tax, all extraction patterns that can extract the seed words are found, and the extraction patterns are scored according to the correlation between their extracted words and the seed words. The score of an extracted item is calculated as follows:
$$\text{score}(NP_i) = \sum_{k=1}^{N_i} \bigl(1 + 0.01 \times \text{score}(\text{pattern}_k)\bigr)$$
where $\text{score}(\text{pattern}_k)$ is the score of pattern $k$, calculated by the extraction pattern evaluation method. An extracted item $NP_i$ can be extracted by $N_i$ extraction patterns, so its score is the sum of the scores of all extraction patterns that can extract it, multiplied by 0.01, plus $N_i$.
(4.5) the formula for calculating $\text{score}(\text{pattern}_i)$ is as follows:
$$\text{score}(\text{pattern}_i) = R_i \times \log(F_i)$$
where $F_i$ is the de-duplicated number of seed-set words extracted by pattern $i$, $N_i$ is the de-duplicated number of all words extracted by pattern $i$, and
$$R_i = \frac{F_i}{N_i}$$
indicates the relevance of the extracted words to the seed set.
6. The domain dictionary construction method based on web retrieval and new word discovery as claimed in claim 1, wherein step (5) comprises: performing OCR recognition on books in the specific field with tesserocr, a third-party Python module, and storing the resulting text documents in a text database.
7. The domain dictionary construction method based on web retrieval and new word discovery as claimed in claim 1, wherein: the step (8) comprises: realizing new word discovery based on mutual information and the sequence model BiLSTM-CRF. For each input $X = (x_1, x_2, \dots, x_n)$ there is a predicted label sequence $y = (y_1, y_2, \dots, y_n)$, wherein the prediction score is formulated as follows:
Figure FDA0002376561940000023
wherein
Figure FDA0002376561940000031
is the probability that the softmax output at the $i$-th position is $y_i$, and
Figure FDA0002376561940000032
is the transition probability from $y_i$ to $y_{i+1}$; when the number of tags (B-person, B-location, …) is n, the transition probability matrix has size (n+2) × (n+2).
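The score of one label sequence can be sketched as below. The claim gives the formula only as a figure, so this follows the commonly used BiLSTM-CRF path score: the per-position tag scores summed with the transition scores along the path, including a start and an end state, which is what makes the matrix (n+2) × (n+2). Placing the start and end states at indices n and n+1, and the name `path_score`, are assumptions of this sketch.

```python
from typing import List, Sequence


def path_score(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
    tags: Sequence[int],
) -> float:
    """Score a tag sequence under a CRF layer.

    emissions[i][t]   : score of tag t at position i (n columns).
    transitions[a][b] : score of moving from tag a to tag b; the matrix is
                        (n+2) x (n+2), with index n = START and n+1 = END
                        (index placement is an assumption of this sketch).
    tags              : one tag index per position.
    """
    if len(tags) != len(emissions):
        raise ValueError("tags and emissions must have the same length")
    n_tags = len(transitions) - 2
    start, end = n_tags, n_tags + 1

    # Emission part: score of the chosen tag at every position.
    emit = sum(emissions[i][t] for i, t in enumerate(tags))

    # Transition part: START -> y_1 -> ... -> y_n -> END.
    path: List[int] = [start, *tags, end]
    trans = sum(transitions[a][b] for a, b in zip(path, path[1:]))
    return emit + trans


if __name__ == "__main__":
    # Two tags, so the transition matrix is 4 x 4.
    T = [[0.0] * 4 for _ in range(4)]
    T[2][0] = 0.1   # START -> tag 0
    T[0][1] = 0.2   # tag 0 -> tag 1
    T[1][3] = 0.3   # tag 1 -> END
    E = [[0.6, 0.4], [0.3, 0.7]]
    print(round(path_score(E, T, [0, 1]), 6))
```

At prediction time the label sequence with the highest such score is selected, typically with the Viterbi algorithm, which this sketch does not implement.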
CN202010068095.6A 2020-01-21 2020-01-21 Domain dictionary construction method based on web retrieval and new word discovery Active CN111325018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010068095.6A CN111325018B (en) 2020-01-21 2020-01-21 Domain dictionary construction method based on web retrieval and new word discovery

Publications (2)

Publication Number Publication Date
CN111325018A true CN111325018A (en) 2020-06-23
CN111325018B CN111325018B (en) 2023-08-11

Family

ID=71166965

Country Status (1)

Country Link
CN (1) CN111325018B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102360383A (en) * 2011-10-15 2012-02-22 西安交通大学 Method for extracting text-oriented field term and term relationship
CN103544246A (en) * 2013-10-10 2014-01-29 清华大学 Method and system for constructing multi-emotion dictionary for internet
CN106407235A (en) * 2015-08-03 2017-02-15 北京众荟信息技术有限公司 A semantic dictionary establishing method based on comment data
CN106156286A (en) * 2016-06-24 2016-11-23 广东工业大学 Type extraction system and method towards technical literature knowledge entity
CN107632974A (en) * 2017-08-08 2018-01-26 夏振宇 Suitable for multi-field Chinese analysis platform
CN107797991A (en) * 2017-10-23 2018-03-13 南京云问网络技术有限公司 A kind of knowledge mapping extending method and system based on interdependent syntax tree
CN110297913A (en) * 2019-06-12 2019-10-01 中电科大数据研究院有限公司 A kind of electronic government documents entity abstracting method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
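Shi Yuxin; Yang Zeqing; Zhao Zhibin; Yao Lan: "A domain dictionary construction method for mining product review targets", Software Engineering (《软件工程》) *
Hu Jiaheng; Cen Yonghua; Wu Chengyao: "Automatic construction of domain sentiment dictionaries based on deep learning: the financial domain as an example", Data Analysis and Knowledge Discovery (《数据分析与知识发现》) *
Xing Enjun; Zhao Fuqiang: "A new word discovery method based on contextual word-frequency and vocabulary-size indicators", Computer Applications and Software (《计算机应用与软件》) *
Huang Wenming; Yang Liuqingqing; Ren Chong: "Domain new word discovery combining information content and deep learning", Computer Engineering and Design (《计算机工程与设计》) *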

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931501A (en) * 2020-09-22 2020-11-13 腾讯科技(深圳)有限公司 Text mining method based on artificial intelligence, related device and equipment
CN111931501B (en) * 2020-09-22 2021-01-08 腾讯科技(深圳)有限公司 Text mining method based on artificial intelligence, related device and equipment
CN112364142A (en) * 2020-11-09 2021-02-12 上海恒企教育培训有限公司 Question matching method and device for vertical field, terminal and readable storage medium
CN114611486A (en) * 2022-03-09 2022-06-10 上海弘玑信息技术有限公司 Information extraction engine generation method and device and electronic equipment
CN114611486B (en) * 2022-03-09 2022-12-16 上海弘玑信息技术有限公司 Method and device for generating information extraction engine and electronic equipment
CN116108834A (en) * 2023-04-10 2023-05-12 中国民用航空飞行学院 Interactive user dictionary construction method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant