WO2020062770A1 - Method and apparatus for constructing domain dictionary, and device and storage medium - Google Patents

Method and apparatus for constructing domain dictionary, and device and storage medium

Info

Publication number
WO2020062770A1
WO2020062770A1 PCT/CN2019/075956 CN2019075956W WO2020062770A1 WO 2020062770 A1 WO2020062770 A1 WO 2020062770A1 CN 2019075956 W CN2019075956 W CN 2019075956W WO 2020062770 A1 WO2020062770 A1 WO 2020062770A1
Authority
WO
WIPO (PCT)
Prior art keywords
domain
word vector
dictionary
word
seed
Prior art date
Application number
PCT/CN2019/075956
Other languages
French (fr)
Chinese (zh)
Inventor
李坚强
颜果开
傅向华
李赛玲
Original Assignee
深圳大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳大学
Publication of WO2020062770A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/247: Thesauruses; Synonyms

Definitions

  • The invention belongs to the technical field of natural language processing, and particularly relates to a method, an apparatus, a device, and a storage medium for constructing a domain dictionary.
  • The domain vocabulary reflects and carries the core knowledge of a subject area.
  • Changes in the vocabulary reflect, to a certain extent, the development and change of that subject area.
  • The domain vocabulary therefore has important theoretical and practical significance for understanding the development status and future trends of a subject area. With the continuous expansion of the field of natural language processing, the demand for domain dictionaries is becoming more and more urgent.
  • Existing word-vector-based algorithms for constructing a domain dictionary take a single general corpus or domain corpus from the Internet, segment it directly with a Chinese word segmentation tool, train a general word vector model or a domain word vector model on the result, and then construct the domain dictionary by calculating the semantic similarity between words in that model.
  • However, the general word vector model does not take into account how strongly a restricted domain depends on its own domain corpus, and the domain word vector model does not take into account the problem of insufficient corpus in a restricted domain.
  • In addition, these algorithms do not take into account that the Chinese word segmentation tool may fail to segment domain vocabulary or unknown words in the restricted domain correctly, which results in a domain dictionary with insufficient coverage and inaccurate domain vocabulary.
  • The purpose of the present invention is to provide a method, an apparatus, a device, and a storage medium for constructing a domain dictionary, aiming to solve the problem that the prior art cannot provide an effective method for constructing a domain dictionary, which results in insufficient and inaccurate domain vocabulary in the domain dictionary.
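  • In one aspect, the present invention provides a method for constructing a domain dictionary. The method includes the following steps:
  • training word vector models on a selected general corpus and domain corpus, respectively, to obtain a corresponding general word vector space model and domain word vector space model;
  • calculating the word semantic similarity between the general word vectors and domain word vectors in these models and the seed word vectors in a preset initial domain seed dictionary;
  • selecting, according to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors to expand the initial domain seed dictionary and obtain a corresponding domain dictionary; and
  • filtering out the unformed words in the domain dictionary by a new word discovery algorithm to complete the construction of the domain dictionary.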
  • The step of calculating the word semantic similarity between the general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the preset initial domain seed dictionary includes: calculating the word semantic similarity by a preset vector cosine similarity formula, $S(V_1, V_2) = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert}$, where $V_1$ is the general word vector or the domain word vector, $V_2$ is the seed word vector, and $S(V_1, V_2)$ is the word semantic similarity.
  • The step of selecting the corresponding general word vector or domain word vector to expand the initial domain seed dictionary includes: when the calculated word semantic similarity is greater than a preset domain keyword threshold, adding the general word vector or domain word vector corresponding to that word semantic similarity to the initial domain seed dictionary, so that the initial domain seed dictionary is expanded.
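  • The method further includes: determining whether the current number of iterations has reached a preset number of cross-iterations; if so, proceeding to the step of filtering out the unformed words in the domain dictionary by the new word discovery algorithm; otherwise, increasing the current number of iterations by 1, setting the domain dictionary as the initial domain seed dictionary, and returning to the step of calculating the word semantic similarity.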
  • the present invention provides a device for constructing a domain dictionary.
  • the device includes:
  • a model training unit configured to train word vector models on the selected general corpus and domain corpus, respectively, to obtain a corresponding general word vector space model and domain word vector space model;
  • a similarity calculation unit configured to calculate the word semantic similarity between the general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in a preset initial domain seed dictionary;
  • a dictionary expansion unit configured to select, according to the calculated word semantic similarity, the corresponding general word vector or domain word vector to expand the initial domain seed dictionary and obtain a corresponding domain dictionary; and
  • an unformed word filtering unit configured to filter out the unformed words in the domain dictionary by a new word discovery algorithm to complete the construction of the domain dictionary.
  • the similarity calculation unit includes:
  • a similarity calculation subunit configured to calculate the word semantic similarity between the general word vector or domain word vector and the seed word vector by a preset vector cosine similarity formula.
  • The vector cosine similarity formula is $S(V_1, V_2) = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert}$, where $V_1$ is the general word vector or the domain word vector, $V_2$ is the seed word vector, and $S(V_1, V_2)$ is the word semantic similarity.
  • the dictionary expansion unit includes:
  • a dictionary expansion subunit configured to add, when the calculated word semantic similarity is greater than a preset domain keyword threshold, the general word vector or domain word vector corresponding to that word semantic similarity to the initial domain seed dictionary, so as to expand the initial domain seed dictionary.
  • the device further comprises:
  • an iteration number judging unit configured to judge whether the current number of iterations has reached a preset number of cross-iterations; if so, to trigger the unformed word filtering unit to filter out the unformed words in the domain dictionary by the new word discovery algorithm; otherwise, to increase the current number of iterations by 1, set the domain dictionary as the initial domain seed dictionary, and trigger the similarity calculation unit to calculate again the word semantic similarity between the general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the initial domain seed dictionary.
  • the present invention also provides a computing device, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • When the processor executes the computer program, the steps described in the above method for constructing a domain dictionary are implemented.
  • The present invention also provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the steps described in the above method for constructing a domain dictionary are implemented.
  • The invention trains word vector models on the selected general corpus and domain corpus, respectively, to obtain a corresponding general word vector space model and domain word vector space model; calculates the word semantic similarity between the general word vectors and domain word vectors in these models and the seed word vectors in the preset initial domain seed dictionary; selects, according to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors to expand the initial domain seed dictionary and obtain a corresponding domain dictionary; and filters out the unformed words in the domain dictionary by a new word discovery algorithm to complete the construction of the domain dictionary. This expands the vocabulary of the domain dictionary and improves the accuracy of its domain vocabulary, and thus the accuracy of the domain dictionary.
  • FIG. 1 is an implementation flowchart of a method for constructing a domain dictionary provided by Embodiment 1 of the present invention
  • FIG. 2 is an implementation flowchart of a method for constructing a domain dictionary provided by Embodiment 2 of the present invention
  • FIG. 3 is a schematic structural diagram of a device for constructing a domain dictionary according to a third embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a device for constructing a domain dictionary according to a fourth embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a computing device according to a fifth embodiment of the present invention.
  • FIG. 1 shows an implementation flow of a method for constructing a domain dictionary provided in Embodiment 1 of the present invention. For convenience of explanation, only the parts related to the embodiment of the present invention are shown, and the details are as follows:
  • In step S101, word vector models are trained on the selected general corpus and domain corpus, respectively, to obtain a corresponding general word vector space model and domain word vector space model.
  • The embodiments of the present invention are applicable to computing devices, such as personal computers and servers.
  • The general corpus and the domain corpus selected in the embodiments of the present invention are relative rather than absolute.
  • The general corpus is one level of abstraction above the domain corpus (a superordinate concept relative to it), and is not necessarily a large and complete set of corpora.
  • For example, when constructing a medical domain dictionary, a large and comprehensive general corpus (for example, the Wikipedia Chinese corpus) may be combined with a medical corpus (for example, maternal and infant question-and-answer data); when constructing a dictionary for the field of Chinese medicine, the medical corpus should instead be regarded as the general corpus and combined with a Chinese medicine corpus.
  • The word vector models are trained on the selected general corpus and domain corpus, respectively, using the Skip-Gram model, thereby reducing the complexity of word vector model training and improving its accuracy.
  • As a result, the vocabulary corresponding to the obtained word vectors better reflects the real meaning of the text.
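  • To make the Skip-Gram idea concrete, the following is a minimal sketch of how (center word, context word) training pairs can be generated from a segmented sentence. The function name `skipgram_pairs` and the `window` parameter are illustrative assumptions and do not come from the patent; the actual model training (embedding updates, negative sampling, and so on) is deliberately left out.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs as used to train a Skip-Gram model.

    `window` (an assumed hyperparameter) is the maximum distance between
    the center word and a context word on either side.
    """
    pairs: List[Tuple[str, str]] = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


if __name__ == "__main__":
    # A toy, already-segmented sentence standing in for one corpus line.
    print(skipgram_pairs(["新生儿", "黄疸", "治疗"], window=1))
```

Under this sketch, the same pair generation would be run once over the general corpus and once over the domain corpus, yielding two separately trained vector spaces.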
  • In step S102, the word semantic similarity between the general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the preset initial domain seed dictionary is calculated.
  • Specifically, the word semantic similarity between each general word vector in the general word vector space model and each seed word vector in the preset initial domain seed dictionary is calculated, and the word semantic similarity between each domain word vector in the domain word vector space model and each seed word vector in the initial domain seed dictionary is calculated.
  • The initial domain seed dictionary is composed of one or more domain seed words, and a seed word vector is the vector representation of the corresponding domain seed word in the initial domain seed dictionary.
  • Before the similarity is calculated, the domain to which the domain dictionary belongs is divided into a number of different categories, and a domain seed word is created for each category.
  • The initial domain seed dictionary is formed from the domain seed words corresponding to these categories.
  • For example, based on the selected question-and-answer corpus in the maternal and infant field and combined with medical disease classification, the corpus is divided into five different categories; the label of each category is then used as a keyword, and these category keywords form the initial medical domain seed dictionary.
  • The word semantic similarity between the general word vector or domain word vector and the seed word vector is calculated by a preset vector cosine similarity formula, $S(V_1, V_2) = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert}$, where $V_1$ is a general word vector or a domain word vector, $V_2$ is a seed word vector, and $S(V_1, V_2)$ is the word semantic similarity, thereby improving the accuracy of the word semantic similarity calculation.
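  • The cosine similarity used in step S102 can be sketched in a few lines of standard-library Python. The function name `cosine_similarity` and the choice to return 0.0 for a zero-length vector are assumptions made for this illustration; the patent does not say how degenerate vectors are handled.

```python
import math
from typing import Sequence


def cosine_similarity(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Cosine of the angle between v1 and v2: (v1 . v2) / (|v1| * |v2|)."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (norm1 * norm2)


if __name__ == "__main__":
    print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))
```

Here `v1` plays the role of a general or domain word vector and `v2` the role of a seed word vector.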
  • In step S103, according to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors are selected to expand the initial domain seed dictionary and obtain a corresponding domain dictionary.
  • Specifically, a general word vector or domain word vector that is similar or identical to a seed word vector is selected from the general word vector space model or the domain word vector space model according to the calculated word semantic similarity.
  • The selected general word vector or domain word vector is converted into the corresponding general word or domain word, which is then added to the initial domain seed dictionary to expand it.
  • The corresponding domain dictionary is obtained from the expanded initial domain seed dictionary.
  • When the calculated word semantic similarity is greater than a preset domain keyword threshold, the general word vector or domain word vector corresponding to that word semantic similarity is added to the initial domain seed dictionary.
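  • One pass of the threshold-based expansion in step S103 might look like the sketch below. It assumes each vector space model can be represented as a plain `dict` from word to vector, and that a candidate is accepted when its similarity to at least one seed exceeds the threshold; `expand_seed_dictionary`, the toy vectors, and the threshold value are all hypothetical.

```python
import math
from typing import Dict, Iterable, List, Set

Vector = List[float]


def _cosine(v1: Vector, v2: Vector) -> float:
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def expand_seed_dictionary(
    seeds: Set[str],
    models: Iterable[Dict[str, Vector]],
    threshold: float,
) -> Set[str]:
    """Add every word whose similarity to some seed exceeds `threshold`.

    `models` holds the general and the domain vector space; similarities
    are only compared inside one space, since the two are trained separately.
    """
    expanded = set(seeds)
    for model in models:
        seed_vectors = [model[s] for s in seeds if s in model]
        for word, vec in model.items():
            if word in expanded:
                continue
            if any(_cosine(vec, sv) > threshold for sv in seed_vectors):
                expanded.add(word)
    return expanded


if __name__ == "__main__":
    general = {"fever": [1.0, 0.0], "cough": [0.9, 0.1], "car": [0.0, 1.0]}
    domain = {"fever": [1.0, 0.0], "jaundice": [0.8, 0.2]}
    print(sorted(expand_seed_dictionary({"fever"}, [general, domain], 0.9)))
```

Keeping the seed set fixed during a pass means newly added words only act as seeds in the next iteration, which matches the iterative scheme described for Embodiment 2.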
  • In step S104, the unformed words in the domain dictionary are filtered out by a new word discovery algorithm to complete the construction of the domain dictionary.
  • Specifically, the words in the domain dictionary are pre-processed first: numbers, English letters, punctuation, English words, personal names, stop words, and other non-domain words are filtered out of the domain dictionary.
  • The mutual information values of the word vectors corresponding to each pair of adjacent words in the pre-processed domain dictionary are then calculated to generate a candidate new word set.
  • The left and right adjacent entropies are used to filter the candidate new word set, yielding the new word set and the set of filtered-out unformed words.
  • The unformed word set is then removed from the pre-processed domain dictionary to complete the construction of the domain dictionary, thereby improving the accuracy of the domain dictionary.
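  • The two statistics named in step S104 can be illustrated as follows. This sketch swaps in the common count-based formulation over a tokenized corpus (pointwise mutual information of adjacent tokens, and the entropy of the tokens seen to the left or right of a candidate pair); the patent text itself speaks of mutual information of word vectors and does not give formulas, so the function names and the exact definitions here are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Sentence = List[str]
Pair = Tuple[str, str]


def pointwise_mutual_information(sentences: List[Sentence]) -> Dict[Pair, float]:
    """PMI (base 2) of every adjacent token pair: log2(p(a,b) / (p(a) p(b)))."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for s in sentences:
        unigrams.update(s)
        bigrams.update(zip(s, s[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    return {
        (a, b): math.log2((c / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        for (a, b), c in bigrams.items()
    }


def neighbor_entropy(sentences: List[Sentence], pair: Pair, side: str) -> float:
    """Entropy of the tokens adjacent to `pair` on the given side ("left"/"right").

    High entropy on both sides suggests the pair is used freely in many
    contexts and is therefore more likely to be a complete word.
    """
    neighbors: Counter = Counter()
    for s in sentences:
        for i in range(len(s) - 1):
            if (s[i], s[i + 1]) != pair:
                continue
            if side == "left" and i > 0:
                neighbors[s[i - 1]] += 1
            elif side == "right" and i + 2 < len(s):
                neighbors[s[i + 2]] += 1
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())


if __name__ == "__main__":
    corpus = [["a", "x", "y", "b"], ["c", "x", "y", "d"]]
    print(pointwise_mutual_information(corpus)[("x", "y")])
    print(neighbor_entropy(corpus, ("x", "y"), "left"))
```

In a full pipeline, pairs with high PMI would become candidate new words, and candidates whose left or right entropy falls below a chosen cutoff would be treated as unformed words; both cutoffs are tuning choices the source does not specify.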
  • In this embodiment of the present invention, word vector models are trained on the general corpus and the domain corpus, respectively, to obtain a corresponding general word vector space model and domain word vector space model; the word semantic similarity between the general word vectors and domain word vectors in these models and the seed word vectors in the initial domain seed dictionary is calculated; according to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors are selected to expand the initial domain seed dictionary and obtain a corresponding domain dictionary; and the unformed words in the domain dictionary are filtered out by a new word discovery algorithm to complete the construction of the domain dictionary. This expands the vocabulary of the domain dictionary and improves the accuracy of its domain vocabulary, and thus the accuracy of the domain dictionary.
  • FIG. 2 shows an implementation process of a method for constructing a domain dictionary provided in Embodiment 2 of the present invention. For convenience of explanation, only the parts related to the embodiment of the present invention are shown, and the details are as follows:
  • In step S201, word vector models are trained on the selected general corpus and domain corpus, respectively, to obtain a corresponding general word vector space model and domain word vector space model.
  • In step S202, the word semantic similarity between the general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the preset initial domain seed dictionary is calculated.
  • In step S203, according to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors are selected to expand the initial domain seed dictionary and obtain a corresponding domain dictionary.
  • For specific implementations of steps S201 to S203, reference may be made to the description of steps S101 to S103 in Embodiment 1; details are not repeated here.
  • In step S204, it is determined whether the current number of iterations has reached a preset number of cross-iterations. If yes, step S206 is performed; otherwise, step S205 is performed.
  • In step S205, the current number of iterations is increased by one, and the domain dictionary is set as the initial domain seed dictionary.
  • That is, when the current number of iterations has not reached the preset number of cross-iterations, the current number of iterations is increased by one and the domain dictionary is set as the initial domain seed dictionary, so that the domain dictionary obtained in the current iteration serves as the input for the next round of domain seed word expansion. The process then jumps back to step S202 and continues to calculate the word semantic similarity in the general word vector space model and the domain word vector space model in order to expand the initial domain seed dictionary.
  • In step S206, the unformed words in the domain dictionary are filtered out by a new word discovery algorithm to complete the construction of the domain dictionary.
  • For the specific implementation of step S206, reference may be made to the description of step S104 in Embodiment 1; details are not repeated here.
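  • The control flow of steps S202 to S206 can be sketched as a simple loop. The expansion and filtering steps are passed in as callables so that the sketch stays self-contained; `build_domain_dictionary`, `expand_once`, and `filter_unformed` are hypothetical names, and starting the iteration counter at 1 is an assumption, since the source does not state its initial value.

```python
from typing import Callable, Set


def build_domain_dictionary(
    seed_dictionary: Set[str],
    expand_once: Callable[[Set[str]], Set[str]],
    filter_unformed: Callable[[Set[str]], Set[str]],
    max_iterations: int,
) -> Set[str]:
    """Run the cross-iteration loop of steps S202 to S206."""
    current_seeds = set(seed_dictionary)
    iteration = 1  # assumed starting value of the iteration counter
    while True:
        # S202 + S203: similarity calculation and seed dictionary expansion.
        domain_dictionary = expand_once(current_seeds)
        # S204: stop once the preset number of cross-iterations is reached.
        if iteration >= max_iterations:
            break
        # S205: count the iteration and feed the result back in as new seeds.
        iteration += 1
        current_seeds = domain_dictionary
    # S206: remove unformed words to finish the dictionary.
    return filter_unformed(domain_dictionary)


if __name__ == "__main__":
    # Toy stand-ins: each expansion adds one synthetic word, filtering is a no-op.
    result = build_domain_dictionary(
        {"seed"},
        expand_once=lambda d: d | {f"w{len(d)}"},
        filter_unformed=lambda d: d,
        max_iterations=3,
    )
    print(sorted(result))
```

Passing the steps in as functions keeps the loop independent of how the similarity or new word discovery is actually implemented.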
  • In this embodiment of the present invention, word vector models are trained on the selected general corpus and domain corpus to obtain a general word vector space model and a domain word vector space model.
  • Over multiple cross-iterations, the word semantic similarity to each seed word vector in the initial domain seed dictionary is calculated in order to expand the seed words of the initial domain seed dictionary, which improves the accuracy of the domain vocabulary in the resulting domain dictionary and expands its vocabulary. The unformed words in the domain dictionary are then filtered out by the new word discovery algorithm to complete the construction of the domain dictionary, thereby improving the accuracy of the domain dictionary.
  • FIG. 3 shows a structure of a device for constructing a domain dictionary provided in Embodiment 3 of the present invention. For ease of description, only parts related to the embodiment of the present invention are shown, including:
  • a model training unit 31 is configured to perform word vector model training on the selected general corpus and domain corpus, respectively, to obtain corresponding general word vector space models and domain word vector space models.
  • the embodiments of the present invention are applicable to computing devices, such as personal computers, servers, and the like.
  • the general corpus and the domain corpus selected in the embodiments of the present invention are relative rather than absolute.
  • the general corpus is a layer of abstraction or superordinate concept relative to the domain corpus, and is not necessarily a large and complete set of corpora.
  • For example, when constructing a medical domain dictionary, a large and comprehensive general corpus (for example, the Wikipedia Chinese corpus) may be combined with a medical corpus (for example, maternal and infant question-and-answer data); when constructing a dictionary for the field of Chinese medicine, the medical corpus should instead be regarded as the general corpus and combined with a Chinese medicine corpus.
  • The word vector models are trained on the selected general corpus and domain corpus, respectively, using the Skip-Gram model, thereby reducing the complexity of word vector model training and improving its accuracy.
  • As a result, the vocabulary corresponding to the obtained word vectors better reflects the real meaning of the text.
  • The similarity calculation unit 32 is configured to calculate the word semantic similarity between the general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the preset initial domain seed dictionary.
  • Specifically, the word semantic similarity between each general word vector in the general word vector space model and each seed word vector in the preset initial domain seed dictionary is calculated, and the word semantic similarity between each domain word vector in the domain word vector space model and each seed word vector in the initial domain seed dictionary is calculated.
  • The initial domain seed dictionary is composed of one or more domain seed words, and a seed word vector is the vector representation of the corresponding domain seed word in the initial domain seed dictionary.
  • Before the similarity is calculated, the domain to which the domain dictionary belongs is divided into a number of different categories, and a domain seed word is created for each category.
  • The initial domain seed dictionary is formed from the domain seed words corresponding to these categories.
  • For example, based on the selected question-and-answer corpus in the maternal and infant field and combined with medical disease classification, the corpus is divided into five different categories; the keywords of these categories form the initial medical domain seed dictionary.
  • The word semantic similarity between the general word vector or domain word vector and the seed word vector is calculated by a preset vector cosine similarity formula, $S(V_1, V_2) = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert}$, where $V_1$ is a general word vector or a domain word vector, $V_2$ is a seed word vector, and $S(V_1, V_2)$ is the word semantic similarity, thereby improving the accuracy of the word semantic similarity calculation.
  • the dictionary expansion unit 33 is configured to select the corresponding general word vector or domain word vector to expand the initial domain seed dictionary according to the calculated word semantic similarity to obtain a corresponding domain dictionary.
  • Specifically, a general word vector or domain word vector that is similar or identical to a seed word vector is selected from the general word vector space model or the domain word vector space model according to the calculated word semantic similarity.
  • The selected general word vector or domain word vector is converted into the corresponding general word or domain word, which is then added to the initial domain seed dictionary to expand it.
  • The corresponding domain dictionary is obtained from the expanded initial domain seed dictionary.
  • When the calculated word semantic similarity is greater than a preset domain keyword threshold, the general word vector or domain word vector corresponding to that word semantic similarity is added to the initial domain seed dictionary.
  • the unformed word filtering unit 34 is configured to filter the unformed words in the domain dictionary through a new word discovery algorithm to complete the construction of the domain dictionary.
  • Specifically, the words in the domain dictionary are pre-processed first: numbers, English letters, punctuation, English words, personal names, stop words, and other non-domain words are filtered out of the domain dictionary.
  • The mutual information values of the word vectors corresponding to each pair of adjacent words in the pre-processed domain dictionary are then calculated to generate a candidate new word set.
  • The left and right adjacent entropies are used to filter the candidate new word set, yielding the new word set and the set of filtered-out unformed words.
  • The unformed word set is then removed from the pre-processed domain dictionary to complete the construction of the domain dictionary, thereby improving the accuracy of the domain dictionary.
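  • The pre-processing performed before new word discovery can be sketched with the standard library alone. The function `preprocess_dictionary` and the stop-word set passed to it are illustrative assumptions; filtering personal names would need a name list or a named-entity recognizer and is intentionally not attempted here.

```python
import re
import unicodedata
from typing import Iterable, List, Set

_ASCII_LETTER = re.compile(r"[A-Za-z]")


def _has_punctuation(word: str) -> bool:
    # Unicode general categories starting with "P" cover ASCII and CJK punctuation.
    return any(unicodedata.category(ch).startswith("P") for ch in word)


def preprocess_dictionary(words: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Drop entries containing digits, English letters, or punctuation, and stop words."""
    kept: List[str] = []
    for word in words:
        if not word or word in stop_words:
            continue
        if any(ch.isdigit() for ch in word):
            continue
        if _ASCII_LETTER.search(word):
            continue
        if _has_punctuation(word):
            continue
        kept.append(word)
    return kept


if __name__ == "__main__":
    entries = ["发热", "abc", "123", "的", "咳嗽,", "黄疸"]
    print(preprocess_dictionary(entries, stop_words={"的"}))
```

The order of the surviving entries is preserved, so later statistics over adjacent words are not disturbed by this step.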
  • each unit of the device for constructing the domain dictionary may be implemented by corresponding hardware or software units.
  • Each unit may be an independent software or hardware unit, or the units may be integrated into one software or hardware unit.
  • Embodiment 4:
  • FIG. 4 shows the structure of a device for constructing a domain dictionary provided in Embodiment 4 of the present invention. For ease of description, only parts related to the embodiment of the present invention are shown, including:
  • a model training unit 41 is configured to perform word vector model training on the selected general corpus and domain corpus, respectively, to obtain corresponding general word vector space models and domain word vector space models;
  • a similarity calculation unit 42 for calculating a word semantic similarity between a corresponding general word vector and a field word vector in the universal word vector space model and the domain word vector space model and a seed word vector in a preset initial domain seed dictionary;
  • a dictionary expansion unit 43 is configured to select the corresponding general word vector or domain word vector to expand the initial domain seed dictionary according to the calculated word semantic similarity to obtain a corresponding domain dictionary;
  • an iteration number judging unit 44 configured to judge whether the current number of iterations has reached a preset number of cross-iterations; if so, to trigger the unformed word filtering unit 45 to filter out the unformed words in the domain dictionary by a new word discovery algorithm; otherwise, to increase the current number of iterations by 1, set the domain dictionary as the initial domain seed dictionary, and trigger the similarity calculation unit 42 to calculate again the word semantic similarity between the general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the initial domain seed dictionary; and
  • the unformed word filtering unit 45 is configured to filter the unformed words in the domain dictionary through a new word discovery algorithm to complete the construction of the domain dictionary.
  • the similarity calculation unit 42 includes:
  • a similarity calculation subunit 421 configured to calculate the word semantic similarity between the general word vector or domain word vector and the seed word vector by a preset vector cosine similarity formula.
  • The vector cosine similarity formula is $S(V_1, V_2) = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert}$, where $V_1$ is a general word vector or a domain word vector, $V_2$ is a seed word vector, and $S(V_1, V_2)$ is the word semantic similarity.
  • The dictionary expansion unit 43 includes:
  • a dictionary expansion subunit 431 configured to add, when the calculated word semantic similarity is greater than a preset domain keyword threshold, the general word vector or domain word vector corresponding to that word semantic similarity to the initial domain seed dictionary, so that the initial domain seed dictionary is expanded.
  • each unit of the device for constructing the domain dictionary may be implemented by corresponding hardware or software units.
  • Each unit may be an independent software or hardware unit, or the units may be integrated into one software or hardware unit.
  • Embodiment 5:
  • FIG. 5 shows the structure of a computing device provided in Embodiment 5 of the present invention. For ease of description, only parts related to the embodiment of the present invention are shown.
  • The computing device 5 includes a processor 50, a memory 51, and a computer program 52 stored in the memory 51 and executable on the processor 50.
  • When the processor 50 executes the computer program 52, the steps in the above embodiments of the method for constructing a domain dictionary are implemented, for example, steps S101 to S104 shown in FIG. 1.
  • Alternatively, when the processor 50 executes the computer program 52, the functions of the units in the foregoing device embodiments are implemented, for example, the functions of units 31 to 34 shown in FIG. 3.
  • In this embodiment of the present invention, word vector models are trained on the selected general corpus and domain corpus to obtain a corresponding general word vector space model and domain word vector space model; the word semantic similarity between the general word vectors and domain word vectors in these models and the seed word vectors in the preset initial domain seed dictionary is calculated; according to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors are selected to expand the initial domain seed dictionary and obtain the corresponding domain dictionary; and the unformed words in the domain dictionary are filtered out by the new word discovery algorithm to complete the construction of the domain dictionary. This expands the vocabulary of the domain dictionary and improves the accuracy of its domain vocabulary, which in turn improves the accuracy of the domain dictionary.
  • the computing device in the embodiment of the present invention may be a personal computer or a server.
  • For the steps implemented when the processor 50 in the computing device 5 executes the computer program 52 to carry out the method for constructing the domain dictionary, reference may be made to the description of the foregoing method embodiments; details are not repeated here.
  • Embodiment 6:
  • A computer-readable storage medium stores a computer program; when the computer program is executed by a processor, the steps in the above embodiments of the method for constructing a domain dictionary are implemented, for example, steps S101 to S104 shown in FIG. 1.
  • Alternatively, when the computer program is executed by the processor, the functions of each unit in the foregoing device embodiments are implemented, for example, the functions of units 31 to 34 shown in FIG. 3.
  • In this embodiment of the present invention, word vector models are trained on the selected general corpus and domain corpus to obtain a corresponding general word vector space model and domain word vector space model; the word semantic similarity between the general word vectors and domain word vectors in these models and the seed word vectors in the preset initial domain seed dictionary is calculated; according to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors are selected to expand the initial domain seed dictionary and obtain the corresponding domain dictionary; and the unformed words in the domain dictionary are filtered out by the new word discovery algorithm to complete the construction of the domain dictionary. This expands the vocabulary of the domain dictionary and improves the accuracy of its domain vocabulary, which in turn improves the accuracy of the domain dictionary.
  • the computer-readable storage medium of the embodiment of the present invention may include any entity or device capable of carrying computer program code, a recording medium, for example, a memory such as a ROM / RAM, a magnetic disk, an optical disk, a flash memory, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention is applicable to the technical field of natural language processing. Provided are a method and apparatus for constructing a domain dictionary, and a device and a storage medium. The method comprises: training a word vector model for a selected general corpus and domain corpus, respectively, and obtaining a corresponding general word vector space model and domain word vector space model; calculating a word semantic similarity between a corresponding general word vector and domain word vector in the general word vector space model and domain word vector space model and a seed word vector in an initial domain seed dictionary; selecting, according to the calculated word semantic similarity, the corresponding general word vector or domain word vector to expand the initial domain seed dictionary, so as to obtain a corresponding domain dictionary; and filtering out unformed words in the domain dictionary by means of a new word discovery algorithm so as to complete the construction of the domain dictionary. Thus, the quantity of vocabulary of the domain dictionary is expanded, and the accuracy of the domain vocabulary in the domain dictionary is improved, thereby improving the accuracy of the domain dictionary.

Description

Method, apparatus, device and storage medium for constructing a domain dictionary
Technical field
The invention belongs to the technical field of natural language processing, and in particular relates to a method, an apparatus, a device and a storage medium for constructing a domain dictionary.
Background
With the continuous progress of science, technology and society, language keeps changing. In recent years in particular, new theories, concepts, materials, technologies and processes have emerged continuously, and new domain vocabulary has emerged with them. Domain vocabulary concentrates and carries the core knowledge of a subject area, and changes in vocabulary reflect, to a certain extent, the development of that subject area. Domain vocabulary therefore has important theoretical and practical significance for understanding and grasping the current state and future trends of a subject area. As the application areas of natural language processing keep expanding, the demand for domain vocabulary dictionaries is becoming increasingly urgent.
Existing word-vector-based algorithms for constructing a domain dictionary use only a general corpus or only a domain corpus from the Internet: a segmented corpus is obtained directly with a Chinese word segmentation tool, a general word vector model or a domain word vector model is built from it, and the semantic similarity between words in that model is then calculated to construct the domain dictionary. However, the general word vector model does not take into account that constructing a domain dictionary for a restricted domain depends on domain corpus, and the domain word vector model does not take into account that corpus for a restricted domain is scarce. In addition, these algorithms do not take into account that Chinese word segmentation tools cannot correctly segment unknown words, such as domain vocabulary or new words, in a restricted domain. As a result, the obtained domain dictionary has insufficient coverage and inaccurate domain vocabulary.
Summary of the invention
The purpose of the present invention is to provide a method, an apparatus, a device and a storage medium for constructing a domain dictionary, aiming to solve the problem that, because the prior art cannot provide an effective method for constructing a domain dictionary, the domain vocabulary in a domain dictionary is insufficient and inaccurate.
In one aspect, the present invention provides a method for constructing a domain dictionary, the method comprising the following steps:
training a word vector model on a selected general corpus and a selected domain corpus, respectively, to obtain a corresponding general word vector space model and domain word vector space model;
calculating the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in a preset initial domain seed dictionary;
selecting, according to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors to expand the initial domain seed dictionary, so as to obtain a corresponding domain dictionary; and
filtering out unformed words in the domain dictionary by means of a new word discovery algorithm, so as to complete the construction of the domain dictionary.
Preferably, the step of calculating the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the preset initial domain seed dictionary comprises:
calculating the word semantic similarity between the general word vector or the domain word vector and the seed word vector by a preset vector cosine similarity formula, the vector cosine similarity formula (Figure PCTCN2019075956-appb-000001) being
$$S(V_1, V_2) = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert}$$
where $V_1$ is the general word vector or the domain word vector, $V_2$ is the seed word vector, and $S(V_1, V_2)$ is the word semantic similarity.
Preferably, the step of selecting the corresponding general word vectors or domain word vectors to expand the initial domain seed dictionary comprises:
when the calculated word semantic similarity is greater than a preset domain keyword threshold, adding the general word vector or domain word vector corresponding to that word semantic similarity to the initial domain seed dictionary, so as to expand the initial domain seed dictionary.
Preferably, before the step of filtering out unformed words in the domain dictionary by means of the new word discovery algorithm, the method further comprises:
determining whether the current number of iterations has reached a preset number of cross iterations;
if so, proceeding to the step of filtering out unformed words in the domain dictionary by means of the new word discovery algorithm;
otherwise, incrementing the current number of iterations by 1, setting the domain dictionary as the initial domain seed dictionary, and returning to the step of calculating the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the preset initial domain seed dictionary.
In another aspect, the present invention provides an apparatus for constructing a domain dictionary, the apparatus comprising:
a model training unit, configured to train a word vector model on a selected general corpus and a selected domain corpus, respectively, to obtain a corresponding general word vector space model and domain word vector space model;
a similarity calculation unit, configured to calculate the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in a preset initial domain seed dictionary;
a dictionary expansion unit, configured to select, according to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors to expand the initial domain seed dictionary, so as to obtain a corresponding domain dictionary; and
an unformed word filtering unit, configured to filter out unformed words in the domain dictionary by means of a new word discovery algorithm, so as to complete the construction of the domain dictionary.
Preferably, the similarity calculation unit comprises:
a similarity calculation subunit, configured to calculate the word semantic similarity between the general word vector or the domain word vector and the seed word vector by a preset vector cosine similarity formula, the vector cosine similarity formula (Figure PCTCN2019075956-appb-000002) being
$$S(V_1, V_2) = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert}$$
where $V_1$ is the general word vector or the domain word vector, $V_2$ is the seed word vector, and $S(V_1, V_2)$ is the word semantic similarity.
Preferably, the dictionary expansion unit comprises:
a dictionary expansion subunit, configured to add, when the calculated word semantic similarity is greater than a preset domain keyword threshold, the general word vector or domain word vector corresponding to that word semantic similarity to the initial domain seed dictionary, so as to expand the initial domain seed dictionary.
Preferably, the apparatus further comprises:
an iteration count judging unit, configured to determine whether the current number of iterations has reached a preset number of cross iterations; if so, to trigger the unformed word filtering unit to filter out unformed words in the domain dictionary by means of the new word discovery algorithm; otherwise, to increment the current number of iterations by 1, set the domain dictionary as the initial domain seed dictionary, and trigger the similarity calculation unit to calculate the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the preset initial domain seed dictionary.
In another aspect, the present invention further provides a computing device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the above method for constructing a domain dictionary.
In another aspect, the present invention further provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the above method for constructing a domain dictionary.
The present invention trains a word vector model on the selected general corpus and domain corpus, respectively, to obtain a corresponding general word vector space model and domain word vector space model; calculates the word semantic similarity between the corresponding general word vectors and domain word vectors in these models and the seed word vectors in a preset initial domain seed dictionary; selects, according to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors to expand the initial domain seed dictionary and obtain the corresponding domain dictionary; and filters out unformed words in the domain dictionary by means of a new word discovery algorithm to complete the construction of the domain dictionary. This expands the vocabulary of the domain dictionary and improves the accuracy of the domain vocabulary in it, which in turn improves the accuracy of the domain dictionary.
Brief description of the drawings
FIG. 1 is an implementation flowchart of the method for constructing a domain dictionary provided by Embodiment 1 of the present invention;
FIG. 2 is an implementation flowchart of the method for constructing a domain dictionary provided by Embodiment 2 of the present invention;
FIG. 3 is a schematic structural diagram of the apparatus for constructing a domain dictionary provided by Embodiment 3 of the present invention;
FIG. 4 is a schematic structural diagram of the apparatus for constructing a domain dictionary provided by Embodiment 4 of the present invention; and
FIG. 5 is a schematic structural diagram of the computing device provided by Embodiment 5 of the present invention.
Detailed description
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present invention and not to limit it.
The specific implementation of the present invention is described in detail below with reference to specific embodiments:
Embodiment 1:
FIG. 1 shows the implementation flow of the method for constructing a domain dictionary provided in Embodiment 1 of the present invention. For ease of description, only the parts related to the embodiment of the present invention are shown, detailed as follows:
In step S101, word vector model training is performed on the selected general corpus and domain corpus, respectively, to obtain a corresponding general word vector space model and domain word vector space model.
The embodiments of the present invention are applicable to computing devices such as personal computers and servers. The general corpus and the domain corpus selected in the embodiments of the present invention are defined relative to each other rather than absolutely: the general corpus is one level of abstraction above the domain corpus (a superordinate concept) and is not necessarily a large, all-encompassing corpus. For example, to construct a medical domain dictionary, a large and comprehensive general corpus (for example, the Chinese Wikipedia corpus) and a medical domain corpus (for example, a maternal and infant question-and-answer corpus) are selected and used together; to construct only a Chinese medicine domain dictionary, the medical domain corpus should instead be treated as the general corpus and combined with a Chinese medicine domain corpus.
In the embodiment of the present invention, preferably, the word vector models for the selected general corpus and domain corpus are each trained with the Skip-Gram model, which reduces the complexity of word vector model training and improves its accuracy, so that the words corresponding to the obtained word vectors better reflect the true meaning of the text.
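The text names the Skip-Gram model but gives no training details. As a minimal sketch of the input side of Skip-Gram only (not the full training procedure), the following shows how (center word, context word) pairs could be drawn from a segmented sentence. The function name `skipgram_pairs`, the window size and the sample sentence are illustrative assumptions, not part of the patent.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for every token within `window` positions.

    Hypothetical helper: Skip-Gram learns to predict each context word
    from its center word, so these pairs are its training examples.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical segmented sentence from a domain corpus.
sentence = ["infant", "fever", "needs", "care"]
pairs = skipgram_pairs(sentence, window=1)
```

The same pair extraction would be run separately over the general corpus and the domain corpus, since the method trains one word vector space per corpus.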
In step S102, the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the preset initial domain seed dictionary is calculated.
In the embodiment of the present invention, the word semantic similarity between each general word vector in the general word vector space model and each seed word vector in the preset initial domain seed dictionary is calculated, and the word semantic similarity between each domain word vector in the domain word vector space model and each seed word vector in the initial domain seed dictionary is calculated. The initial domain seed dictionary consists of one or more domain seed words, and a seed word vector is the vector representation of the corresponding domain seed word in the initial domain seed dictionary.
In the embodiment of the present invention, before the word semantic similarity is calculated, preferably, the domain to which the domain dictionary to be created belongs is divided into a number of different categories, one domain seed word is created for each category, and the domain seed words of these categories form the initial domain seed dictionary, which provides reference samples for the word semantic similarity calculation of the general word vectors and domain word vectors.
As an example, to create a medical domain dictionary, the selected maternal and infant question-and-answer corpus is divided into five different categories according to the classification of medical diseases, and the label of each category is then used to create an initial medical domain seed dictionary that contains only the keywords of these categories.
In the embodiment of the present invention, preferably, the word semantic similarity between the general word vector or the domain word vector and the seed word vector is calculated by a preset vector cosine similarity formula, the vector cosine similarity formula (Figure PCTCN2019075956-appb-000003) being
$$S(V_1, V_2) = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert}$$
where $V_1$ is the general word vector or the domain word vector, $V_2$ is the seed word vector, and $S(V_1, V_2)$ is the word semantic similarity, thereby improving the precision and accuracy of the word semantic similarity calculation.
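The cosine similarity between a candidate word vector and a seed word vector can be sketched in a few lines. This is a minimal illustration using only the standard library; the function name, the zero-vector convention and the sample vectors are assumptions made for the example.

```python
import math
from typing import Sequence


def cosine_similarity(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Cosine of the angle between v1 (candidate) and v2 (seed)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm1 * norm2)


# Hypothetical 3-dimensional vectors; real word vectors have many more dimensions.
candidate = [1.0, 2.0, 2.0]
seed = [2.0, 4.0, 4.0]          # same direction as `candidate`
orthogonal = [2.0, -1.0, 0.0]   # perpendicular to `candidate`

s_parallel = cosine_similarity(candidate, seed)
s_orthogonal = cosine_similarity(candidate, orthogonal)
```

Because the measure depends only on direction, vectors of different lengths that point the same way score 1, while perpendicular vectors score 0.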
In step S103, according to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors are selected to expand the initial domain seed dictionary, so as to obtain the corresponding domain dictionary.
In the embodiment of the present invention, according to the calculated word semantic similarity, general word vectors or domain word vectors whose meaning is close to or the same as that of a seed word vector are selected from the general word vector space model or the domain word vector space model. The selected general word vectors or domain word vectors are converted into the corresponding general words or domain words, and these words are added to the initial domain seed dictionary to expand it. The corresponding domain dictionary is obtained from the expanded initial domain seed dictionary.
In the embodiment of the present invention, preferably, when the calculated word semantic similarity is greater than a preset domain keyword threshold, the general word vector or domain word vector corresponding to that word semantic similarity is added to the initial domain seed dictionary to expand it, thereby improving the accuracy of the domain vocabulary.
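The threshold rule can be sketched as follows. This is an illustrative reading, not the patent's implementation: it assumes each word vector space is a plain dictionary from word to vector, that a candidate is added as soon as its similarity to any one seed exceeds the threshold, and that seeds are looked up in each space separately because vectors from different spaces are not comparable. All names and vectors are hypothetical.

```python
import math
from typing import Dict, List, Set

Vector = List[float]


def cosine(v1: Vector, v2: Vector) -> float:
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def expand_seed_dictionary(
    seeds: Set[str], spaces: List[Dict[str, Vector]], threshold: float
) -> Set[str]:
    """Add every word whose similarity to some seed exceeds `threshold`."""
    expanded = set(seeds)
    for space in spaces:
        # Seed vectors are taken from the same space as the candidates.
        seed_vectors = [space[s] for s in seeds if s in space]
        for word, vec in space.items():
            if word in expanded:
                continue
            if any(cosine(vec, sv) > threshold for sv in seed_vectors):
                expanded.add(word)
    return expanded


# Hypothetical toy spaces standing in for the trained models.
general_space = {"fever": [1.0, 0.0], "cough": [0.9, 0.1], "football": [0.0, 1.0]}
domain_space = {"fever": [1.0, 0.2], "jaundice": [0.9, 0.3], "weather": [-1.0, 0.5]}

domain_dictionary = expand_seed_dictionary(
    {"fever"}, [general_space, domain_space], threshold=0.9
)
```

A higher threshold admits fewer, more closely related words; the appropriate value is a tuning choice that the text leaves as a preset parameter.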
In step S104, the unformed words in the domain dictionary are filtered out by the new word discovery algorithm to complete the construction of the domain dictionary.
In the embodiment of the present invention, when unformed words in the domain dictionary are filtered out by the new word discovery algorithm, preferably, the words in the domain dictionary are first preprocessed to filter out non-domain words such as digits, English letters, punctuation marks, English words, personal names, stop words and prohibited words. Next, the mutual information of the word vectors corresponding to each pair of adjacent words in the preprocessed domain dictionary is calculated to generate a candidate new word set. The candidate new word set is then filtered using left and right adjacency entropy to obtain a new word set and a set of filtered-out unformed words. Finally, the set of unformed words is removed from the preprocessed domain dictionary to complete the construction of the domain dictionary, thereby improving the accuracy of the domain dictionary.
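The two statistics named above can be illustrated with a count-based sketch. The text does not specify how mutual information and adjacency entropy are computed, so the following assumes the common formulation over a segmented token sequence: pointwise mutual information of an adjacent pair, and the entropy of the tokens seen immediately to its left and right. Function names and the toy token sequence are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def pointwise_mutual_information(tokens: List[str], pair: Tuple[str, str]) -> float:
    """log2( p(x, y) / (p(x) * p(y)) ) for an adjacent token pair."""
    if len(tokens) < 2:
        return float("-inf")
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_xy = bigrams[pair] / (len(tokens) - 1)
    if p_xy == 0.0:
        return float("-inf")
    p_x = unigrams[pair[0]] / len(tokens)
    p_y = unigrams[pair[1]] / len(tokens)
    return math.log2(p_xy / (p_x * p_y))


def _entropy(counts: Counter) -> float:
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def adjacency_entropy(tokens: List[str], pair: Tuple[str, str]) -> Tuple[float, float]:
    """Entropy of the left and right neighbours of each occurrence of `pair`."""
    left, right = Counter(), Counter()
    for i in range(len(tokens) - 1):
        if (tokens[i], tokens[i + 1]) == pair:
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + 2 < len(tokens):
                right[tokens[i + 2]] += 1
    return _entropy(left), _entropy(right)


# Hypothetical segmented text in which "new born" recurs in varied contexts.
tokens = ["a", "new", "born", "b", "new", "born", "c", "new", "born", "d"]
pmi_new_born = pointwise_mutual_information(tokens, ("new", "born"))
left_h, right_h = adjacency_entropy(tokens, ("new", "born"))
left_h_other, _ = adjacency_entropy(tokens, ("born", "b"))
```

A pair with high mutual information and high entropy on both sides behaves like a free-standing word; a candidate with low adjacency entropy tends to be a fragment of a longer expression and would be a natural one to filter out as unformed. The cut-off values are not given in the text.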
In the embodiment of the present invention, word vector model training is performed on the general corpus and the domain corpus, respectively, to obtain a corresponding general word vector space model and domain word vector space model. The word semantic similarity between the corresponding general word vectors and domain word vectors in these models and the seed word vectors in the initial domain seed dictionary is calculated. According to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors are selected to expand the initial domain seed dictionary and obtain the corresponding domain dictionary. The unformed words in the domain dictionary are then filtered out by the new word discovery algorithm to complete the construction of the domain dictionary. This expands the vocabulary of the domain dictionary and improves the accuracy of the domain vocabulary in it, which in turn improves the accuracy of the domain dictionary.
Embodiment 2:
FIG. 2 shows the implementation flow of the method for constructing a domain dictionary provided in Embodiment 2 of the present invention. For ease of description, only the parts related to the embodiment of the present invention are shown, detailed as follows:
In step S201, word vector model training is performed on the selected general corpus and domain corpus, respectively, to obtain a corresponding general word vector space model and domain word vector space model.
In step S202, the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the preset initial domain seed dictionary is calculated.
In step S203, according to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors are selected to expand the initial domain seed dictionary, so as to obtain the corresponding domain dictionary.
In the embodiment of the present invention, for the specific implementation of steps S201 to S203, reference may be made to the description of steps S101 to S103 in Embodiment 1; details are not repeated here.
In step S204, it is determined whether the current number of iterations has reached the preset number of cross iterations. If so, step S206 is performed; otherwise, step S205 is performed.
In step S205, the current number of iterations is incremented by 1, and the domain dictionary is set as the initial domain seed dictionary.
In the embodiment of the present invention, when the current number of iterations has not reached the preset number of cross iterations, the current number of iterations is incremented by 1 and the domain dictionary is set as the initial domain seed dictionary, so that the domain dictionary obtained in the current iteration serves as the input of the next round of domain seed word expansion. The flow then returns to step S202 and continues to perform the word semantic similarity calculation in the general word vector space model and the domain word vector space model to expand the initial domain seed dictionary.
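The loop formed by steps S202 to S205 can be sketched as below. This is a hedged illustration: it assumes the iteration counter starts at 1, that a word is added when its cosine similarity to any current seed exceeds the threshold, and that each space is a plain word-to-vector dictionary. The names `expand_once` and `cross_iterate` and all vectors are hypothetical.

```python
import math
from typing import Dict, List, Set

Vector = List[float]


def cosine(v1: Vector, v2: Vector) -> float:
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def expand_once(seeds: Set[str], spaces: List[Dict[str, Vector]], threshold: float) -> Set[str]:
    """One pass of steps S202-S203 over every word vector space."""
    expanded = set(seeds)
    for space in spaces:
        seed_vectors = [space[s] for s in seeds if s in space]
        for word, vec in space.items():
            if word not in expanded and any(cosine(vec, sv) > threshold for sv in seed_vectors):
                expanded.add(word)
    return expanded


def cross_iterate(seeds: Set[str], spaces: List[Dict[str, Vector]],
                  threshold: float, cross_iterations: int) -> Set[str]:
    dictionary = expand_once(seeds, spaces, threshold)       # S202-S203
    iteration = 1                                            # assumed starting value
    while iteration < cross_iterations:                      # S204
        iteration += 1                                       # S205
        seeds = dictionary                                   # dictionary becomes the seed dictionary
        dictionary = expand_once(seeds, spaces, threshold)   # back to S202
    return dictionary


# Hypothetical toy spaces: "cough" is reachable from "fever" only in the
# general space, and "phlegm" is reachable from "cough" only in the domain space.
general_space = {"fever": [1.0, 0.0], "cough": [0.8, 0.6], "football": [-1.0, 0.0]}
domain_space = {"cough": [0.6, 0.8], "phlegm": [0.0, 1.0]}
spaces = [general_space, domain_space]

after_one = cross_iterate({"fever"}, spaces, threshold=0.75, cross_iterations=1)
after_two = cross_iterate({"fever"}, spaces, threshold=0.75, cross_iterations=2)
```

In this toy run the second pass picks up a domain word that was unreachable from the original seed, which is the effect the cross iteration is meant to achieve. The unformed-word filtering of step S206 would then be applied to the returned dictionary.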
在步骤S206中,通过新词发现算法对领域词典中的未成词词汇进行筛除,以完成领域词典的构建。In step S206, the unformed words in the domain dictionary are filtered out by a new word discovery algorithm to complete the construction of the domain dictionary.
在本发明实施例中,步骤S206的具体实施方式可参考实施例一的步骤S104的描述,在此不再赘述。In the embodiment of the present invention, for the specific implementation of step S206, reference may be made to the description of step S104 in Embodiment 1, and details are not described herein again.
在本发明实施例中,对选取的通用语料库和领域语料库分别进行词向量模型训练,得到通用词向量空间模型和领域词向量空间模型,通过在通用词向量空间模型和领域词向量空间模型上进行多次交叉迭代计算初始领域种子词典中每个种子词向量的词语语义相似度,来对初始领域种子词典的种子词进行扩展,从而提高得到的领域词典中领域词汇的准确度,以及扩大了领域词典中的词汇 量,再通过新词发现算法对领域词典中的未成词词汇进行筛除,以完成领域词典的构建,从而提高领域词典的准确率。In the embodiment of the present invention, word vector model training is performed on the selected general corpus and domain corpus to obtain a general word vector space model and a domain word vector space model. Multiple cross-iterations calculate the word semantic similarity of each seed word vector in the initial domain seed dictionary to expand the seed words of the initial domain seed dictionary, thereby improving the accuracy of the domain vocabulary in the obtained domain dictionary and expanding the domain The vocabulary in the dictionary, and then the new word discovery algorithm to filter out unformed words in the domain dictionary to complete the construction of the domain dictionary, thereby improving the accuracy of the domain dictionary.
实施例三:Embodiment three:
图3示出了本发明实施例三提供的领域词典的构建装置的结构,为了便于说明,仅示出了与本发明实施例相关的部分,其中包括:FIG. 3 shows a structure of a device for constructing a domain dictionary provided in Embodiment 3 of the present invention. For ease of description, only parts related to the embodiment of the present invention are shown, including:
模型训练单元31,用于对选取的通用语料库和领域语料库分别进行词向量模型训练,获得对应的通用词向量空间模型和领域词向量空间模型。A model training unit 31 is configured to perform word vector model training on the selected general corpus and domain corpus, respectively, to obtain corresponding general word vector space models and domain word vector space models.
本发明实施例适用于计算设备,例如,个人计算机、服务器等。本发明实施例中选取的通用语料库和领域语料库是相对关系而非绝对关系,通用语料库是相对于领域语料库的一层抽象或者上位概念,并非一定是大而全的一套语料,例如,若要构建一套医疗领域词典,则选取大而全的一套通用语料(例如,维基百科中文语料)和医疗领域语料(例如,母婴领域问答语料)来共同完成;若只要构建一套中药领域词典,则医疗领域语料应被视为通用语料,再结合中药领域语料进行中药领域词典的构建。The embodiments of the present invention are applicable to computing devices, such as personal computers, servers, and the like. The general corpus and the domain corpus selected in the embodiments of the present invention are relative rather than absolute. The general corpus is a layer of abstraction or superordinate concept relative to the domain corpus, and is not necessarily a large and complete set of corpora. For example, if you want to To build a medical dictionary, a large and comprehensive set of common corpora (for example, Wikipedia Chinese corpus) and medical corpus (for example, maternal and infant quiz) should be jointly completed; if only a set of Chinese medicine field dictionary Then, the corpus in the medical field should be regarded as a general corpus, and the dictionary in the field of Chinese medicine should be constructed in combination with the corpus in the field of Chinese medicine.
在本发明实施例中,优选地,通过Skip-Gram模型对选取的通用语料库和领域语料库分别进行词向量模型训练,从而降低词向量模型训练的复杂度,且提高词向量模型训练的准确度,使得获得的词向量对应的词汇更能反映真实的文本含义。In the embodiment of the present invention, preferably, the selected general corpus and the domain corpus are respectively trained with the word vector model through the Skip-Gram model, thereby reducing the complexity of the word vector model training and improving the accuracy of the word vector model training. The vocabulary corresponding to the obtained word vector can better reflect the real text meaning.
A similarity calculation unit 32, configured to calculate the word semantic similarity between the general word vectors in the general word vector space model and the domain word vectors in the domain word vector space model, on the one hand, and the seed word vectors in a preset initial domain seed dictionary, on the other.
In this embodiment, the word semantic similarity is calculated between each general word vector in the general word vector space model and each seed word vector in the preset initial domain seed dictionary, and likewise between each domain word vector in the domain word vector space model and each seed word vector in the initial domain seed dictionary. The initial domain seed dictionary consists of one or more domain seed words, and a seed word vector is the vector representation of the corresponding domain seed word in the initial domain seed dictionary.
In this embodiment, before this word semantic similarity is calculated, the domain to which the domain dictionary to be created belongs is preferably divided into a number of different categories, one domain seed word is created for each category, and the domain seed words of all categories together form the initial domain seed dictionary. This provides reference samples for calculating the word semantic similarity of the general word vectors and domain word vectors.
As an example, if a medical domain dictionary is to be created, the selected maternal and infant question-and-answer corpus is divided into five categories according to the classification of medical diseases, and the label of each category is then used to create an initial medical domain seed dictionary that contains only the keywords of these categories.
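The seed dictionary construction described above can be sketched as a simple de-duplication of category labels. The function name `build_seed_dictionary` and the idea of passing labels as plain strings are assumptions for illustration; the source does not list the five category labels, so none are given here.

```python
from typing import Iterable, List


def build_seed_dictionary(category_labels: Iterable[str]) -> List[str]:
    """Build an initial seed dictionary with one seed word per category label.

    Hypothetical helper: labels are trimmed, empty labels are dropped,
    and duplicates are removed while keeping first-seen order.
    """
    seen = set()
    seeds = []
    for label in category_labels:
        word = label.strip()
        if word and word not in seen:
            seen.add(word)
            seeds.append(word)
    return seeds
```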
In this embodiment, the word semantic similarity between a general word vector or domain word vector and a seed word vector is preferably calculated using a preset vector cosine similarity formula:

$$S(V_1, V_2) = \frac{V_1 \cdot V_2}{\|V_1\| \, \|V_2\|}$$

where $V_1$ is a general word vector or a domain word vector, $V_2$ is a seed word vector, and $S(V_1, V_2)$ is the word semantic similarity. Using this formula improves the precision and accuracy of the word semantic similarity calculation.
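Cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. A minimal sketch follows; the function name `cosine_similarity` and the handling of zero vectors (returning 0.0) are assumptions, since the source does not say how degenerate vectors are treated.

```python
import math
from typing import Sequence


def cosine_similarity(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Cosine of the angle between v1 (word vector) and v2 (seed word vector)."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm1 * norm2)
```

The result lies in the range [-1, 1]; values close to 1 indicate that the candidate word points in nearly the same direction as the seed word in the vector space.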
A dictionary expansion unit 33, configured to select, according to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors to expand the initial domain seed dictionary, to obtain a corresponding domain dictionary.
In this embodiment, according to the calculated word semantic similarity, general word vectors or domain word vectors whose semantics are similar or identical to those of the seed word vectors are selected from the general word vector space model or the domain word vector space model. The selected vectors are converted into the corresponding general or domain words, which are then added to the initial domain seed dictionary to expand it. The corresponding domain dictionary is obtained from the expanded initial domain seed dictionary.
In this embodiment, when the calculated word semantic similarity is greater than a preset domain keyword threshold, the general word vector or domain word vector corresponding to that similarity is preferably added to the initial domain seed dictionary to expand it, which improves the accuracy of the domain vocabulary.
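The threshold rule described above can be sketched as follows. The function name `expand_seed_dictionary`, the dictionary-of-vectors data layout, and the choice to accept a candidate when it exceeds the threshold against any single seed are assumptions; the source only states that a vector is added when its similarity exceeds the preset domain keyword threshold.

```python
import math
from typing import Dict, List


def _cosine(v1: List[float], v2: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def expand_seed_dictionary(
    seeds: Dict[str, List[float]],
    candidates: Dict[str, List[float]],
    threshold: float,
) -> Dict[str, List[float]]:
    """Return the seeds plus every candidate similar enough to some seed."""
    expanded = dict(seeds)
    for word, vector in candidates.items():
        if word in expanded:
            continue  # already in the dictionary
        # Assumption: one seed above the threshold is enough to admit the word.
        if any(_cosine(vector, seed) > threshold for seed in seeds.values()):
            expanded[word] = vector
    return expanded
```

In this sketch `candidates` would hold words from either the general or the domain vector space, so the same function covers both sources of expansion.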
An unformed-word filtering unit 34, configured to filter out unformed words in the domain dictionary using a new word discovery algorithm, to complete the construction of the domain dictionary.
In this embodiment, when unformed words in the domain dictionary are filtered out by the new word discovery algorithm, the following steps are preferably performed. First, the words in the domain dictionary are preprocessed to remove non-domain vocabulary such as digits, English letters, punctuation marks, English words, personal names, stop words, and banned words. Next, the mutual information of the word vectors corresponding to each pair of adjacent words in the preprocessed domain dictionary is calculated to generate a candidate new word set. The candidate new word set is then filtered using left and right adjacency entropy, yielding a new word set and a set of filtered-out unformed words. Finally, the unformed words are removed from the preprocessed domain dictionary to complete the construction of the domain dictionary, which improves its accuracy.
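The two statistics named above can be sketched on a token sequence. This is a count-based stand-in: the source speaks of mutual information over the word vectors of adjacent words and does not give formulas, so the use of pointwise mutual information from corpus counts, the function names `pmi` and `neighbor_entropy`, and the `side` parameter are all assumptions.

```python
import math
from collections import Counter
from typing import List


def pmi(tokens: List[str], left: str, right: str) -> float:
    """Pointwise mutual information (base 2) of the adjacent pair (left, right)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair_count = bigrams[(left, right)]
    if pair_count == 0:
        return float("-inf")  # the pair never occurs together
    p_pair = pair_count / (len(tokens) - 1)
    p_left = unigrams[left] / len(tokens)
    p_right = unigrams[right] / len(tokens)
    return math.log2(p_pair / (p_left * p_right))


def neighbor_entropy(tokens: List[str], left: str, right: str, side: str) -> float:
    """Entropy of the tokens adjacent to the pair on the given side ("left" or "right")."""
    neighbors = Counter()
    for i in range(len(tokens) - 1):
        if tokens[i] == left and tokens[i + 1] == right:
            j = i - 1 if side == "left" else i + 2
            if 0 <= j < len(tokens):
                neighbors[tokens[j]] += 1
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())
```

A pair with high mutual information and high entropy on both sides tends to behave like a free-standing word; a pair with low entropy on one side is usually a fragment of a longer unit and would be a candidate for removal.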
In this embodiment, each unit of the apparatus for constructing a domain dictionary may be implemented by a corresponding hardware or software unit. The units may be independent software or hardware units or may be integrated into a single software or hardware unit; this is not intended to limit the present invention.
Embodiment 4:
FIG. 4 shows the structure of an apparatus for constructing a domain dictionary according to Embodiment 4 of the present invention. For ease of description, only the parts relevant to this embodiment are shown, including:
A model training unit 41, configured to train word vector models separately on a selected general corpus and a selected domain corpus, to obtain a corresponding general word vector space model and domain word vector space model;
A similarity calculation unit 42, configured to calculate the word semantic similarity between the general word vectors in the general word vector space model and the domain word vectors in the domain word vector space model, on the one hand, and the seed word vectors in a preset initial domain seed dictionary, on the other;
A dictionary expansion unit 43, configured to select, according to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors to expand the initial domain seed dictionary, to obtain a corresponding domain dictionary;
An iteration count judging unit 44, configured to determine whether the current iteration count has reached a preset number of cross-iterations. If so, it triggers the unformed-word filtering unit 45 to filter out unformed words in the domain dictionary using the new word discovery algorithm. Otherwise, it increments the current iteration count by 1, sets the domain dictionary as the initial domain seed dictionary, and triggers the similarity calculation unit 42 to calculate again the word semantic similarity between the general and domain word vectors and the seed word vectors in the initial domain seed dictionary; and
An unformed-word filtering unit 45, configured to filter out unformed words in the domain dictionary using a new word discovery algorithm, to complete the construction of the domain dictionary.
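The control flow of units 42 to 45 can be sketched as a loop in which each round's expanded dictionary becomes the next round's seed dictionary. The function name `iterate_expansion`, the `expand_once` callback standing in for the similarity and expansion steps, and the counter starting at 1 are assumptions; the source does not state the initial value of the iteration count.

```python
from typing import Callable, Iterable, Set


def iterate_expansion(
    seed_words: Iterable[str],
    expand_once: Callable[[Set[str]], Set[str]],
    max_iterations: int,
) -> Set[str]:
    """Repeat seed-dictionary expansion until the preset iteration count is reached."""
    dictionary = set(seed_words)
    iteration = 1  # assumption: the counter starts at 1
    while True:
        # Similarity calculation and expansion (units 42 and 43).
        dictionary = expand_once(dictionary)
        if iteration >= max_iterations:
            break  # hand over to unformed-word filtering (unit 45)
        # Otherwise the expanded dictionary becomes the new seed dictionary.
        iteration += 1
    return dictionary
```

With a counter that starts at 1, `max_iterations` equals the number of expansion rounds; each additional round lets words that are close to newly added words, rather than to the original seeds, enter the dictionary.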
Preferably, the similarity calculation unit 42 includes:
A similarity calculation subunit 421, configured to calculate the word semantic similarity between a general word vector or domain word vector and a seed word vector using a preset vector cosine similarity formula:

$$S(V_1, V_2) = \frac{V_1 \cdot V_2}{\|V_1\| \, \|V_2\|}$$

where $V_1$ is a general word vector or a domain word vector, $V_2$ is a seed word vector, and $S(V_1, V_2)$ is the word semantic similarity.
Preferably, the dictionary expansion unit 43 includes:
A dictionary expansion subunit 431, configured to, when the calculated word semantic similarity is greater than a preset domain keyword threshold, add the general word vector or domain word vector corresponding to that similarity to the initial domain seed dictionary, to expand the initial domain seed dictionary.
In this embodiment, each unit of the apparatus for constructing a domain dictionary may be implemented by a corresponding hardware or software unit. The units may be independent software or hardware units or may be integrated into a single software or hardware unit; this is not intended to limit the present invention. For the specific implementation of each unit, refer to the description of the foregoing method embodiments; details are not repeated here.
Embodiment 5:
FIG. 5 shows the structure of a computing device according to Embodiment 5 of the present invention. For ease of description, only the parts relevant to this embodiment are shown.
The computing device 5 of this embodiment includes a processor 50, a memory 51, and a computer program 52 stored in the memory 51 and executable on the processor 50. When executing the computer program 52, the processor 50 implements the steps of the foregoing embodiments of the method for constructing a domain dictionary, for example, steps S101 to S104 shown in FIG. 1. Alternatively, when executing the computer program 52, the processor 50 implements the functions of the units in the foregoing apparatus embodiments, for example, the functions of units 31 to 34 shown in FIG. 3.
In this embodiment, word vector models are trained separately on the selected general corpus and domain corpus to obtain the corresponding general word vector space model and domain word vector space model. The word semantic similarity between the general word vectors and domain word vectors in these models and the seed word vectors in the preset initial domain seed dictionary is then calculated. According to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors are selected to expand the initial domain seed dictionary, yielding the corresponding domain dictionary. Finally, unformed words in the domain dictionary are filtered out by a new word discovery algorithm to complete its construction. This enlarges the vocabulary of the domain dictionary and improves the accuracy of its domain words, which in turn improves the accuracy of the domain dictionary as a whole.
The computing device of this embodiment may be a personal computer or a server. For the steps implemented when the processor 50 of the computing device 5 executes the computer program 52 to carry out the method for constructing a domain dictionary, refer to the description of the foregoing method embodiments; details are not repeated here.
Embodiment 6:
This embodiment provides a computer-readable storage medium that stores a computer program. When executed by a processor, the computer program implements the steps of the foregoing embodiments of the method for constructing a domain dictionary, for example, steps S101 to S104 shown in FIG. 1. Alternatively, when executed by a processor, the computer program implements the functions of the units in the foregoing apparatus embodiments, for example, the functions of units 31 to 34 shown in FIG. 3.
In this embodiment, word vector models are trained separately on the selected general corpus and domain corpus to obtain the corresponding general word vector space model and domain word vector space model. The word semantic similarity between the general word vectors and domain word vectors in these models and the seed word vectors in the preset initial domain seed dictionary is then calculated. According to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors are selected to expand the initial domain seed dictionary, yielding the corresponding domain dictionary. Finally, unformed words in the domain dictionary are filtered out by a new word discovery algorithm to complete its construction. This enlarges the vocabulary of the domain dictionary and improves the accuracy of its domain words, which in turn improves the accuracy of the domain dictionary as a whole.
The computer-readable storage medium of this embodiment may include any entity or apparatus capable of carrying computer program code, or a recording medium, for example, memory such as ROM/RAM, a magnetic disk, an optical disc, or flash memory.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

  1. A method for constructing a domain dictionary, characterized in that the method comprises the following steps:
    training word vector models separately on a selected general corpus and a selected domain corpus, to obtain a corresponding general word vector space model and domain word vector space model;
    calculating the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in a preset initial domain seed dictionary;
    selecting, according to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors to expand the initial domain seed dictionary, to obtain a corresponding domain dictionary; and
    filtering out unformed words in the domain dictionary using a new word discovery algorithm, to complete the construction of the domain dictionary.
  2. The method according to claim 1, characterized in that the step of calculating the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the preset initial domain seed dictionary comprises:
    calculating the word semantic similarity between the general word vector or the domain word vector and the seed word vector using a preset vector cosine similarity formula, the vector cosine similarity formula being

    $$S(V_1, V_2) = \frac{V_1 \cdot V_2}{\|V_1\| \, \|V_2\|}$$

    where $V_1$ is the general word vector or the domain word vector, $V_2$ is the seed word vector, and $S(V_1, V_2)$ is the word semantic similarity.
  3. The method according to claim 1, characterized in that the step of selecting the corresponding general word vectors or domain word vectors to expand the initial domain seed dictionary comprises:
    when the calculated word semantic similarity is greater than a preset domain keyword threshold, adding the general word vector or domain word vector corresponding to the word semantic similarity to the initial domain seed dictionary, to expand the initial domain seed dictionary.
  4. The method according to claim 1, characterized in that, before the step of filtering out unformed words in the domain dictionary using a new word discovery algorithm, the method further comprises:
    determining whether the current iteration count has reached a preset number of cross-iterations;
    if so, proceeding to the step of filtering out unformed words in the domain dictionary using a new word discovery algorithm;
    otherwise, incrementing the current iteration count by 1, setting the domain dictionary as the initial domain seed dictionary, and returning to the step of calculating the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the preset initial domain seed dictionary.
  5. An apparatus for constructing a domain dictionary, characterized in that the apparatus comprises:
    a model training unit, configured to train word vector models separately on a selected general corpus and a selected domain corpus, to obtain a corresponding general word vector space model and domain word vector space model;
    a similarity calculation unit, configured to calculate the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in a preset initial domain seed dictionary;
    a dictionary expansion unit, configured to select, according to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors to expand the initial domain seed dictionary, to obtain a corresponding domain dictionary; and
    an unformed-word filtering unit, configured to filter out unformed words in the domain dictionary using a new word discovery algorithm, to complete the construction of the domain dictionary.
  6. The apparatus according to claim 5, characterized in that the similarity calculation unit comprises:
    a similarity calculation subunit, configured to calculate the word semantic similarity between the general word vector or the domain word vector and the seed word vector using a preset vector cosine similarity formula, the vector cosine similarity formula being

    $$S(V_1, V_2) = \frac{V_1 \cdot V_2}{\|V_1\| \, \|V_2\|}$$

    where $V_1$ is the general word vector or the domain word vector, $V_2$ is the seed word vector, and $S(V_1, V_2)$ is the word semantic similarity.
  7. The apparatus according to claim 5, characterized in that the dictionary expansion unit comprises:
    a dictionary expansion subunit, configured to, when the calculated word semantic similarity is greater than a preset domain keyword threshold, add the general word vector or domain word vector corresponding to the word semantic similarity to the initial domain seed dictionary, to expand the initial domain seed dictionary.
  8. The apparatus according to claim 5, characterized in that the apparatus further comprises:
    an iteration count judging unit, configured to determine whether the current iteration count has reached a preset number of cross-iterations; if so, to trigger the unformed-word filtering unit to filter out unformed words in the domain dictionary using the new word discovery algorithm; otherwise, to increment the current iteration count by 1, set the domain dictionary as the initial domain seed dictionary, and trigger the similarity calculation unit to calculate the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the preset initial domain seed dictionary.
  9. A computing device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 4.
  10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 4.
PCT/CN2019/075956 2018-09-27 2019-02-22 Method and apparatus for constructing domain dictionary, and device and storage medium WO2020062770A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811133186.2 2018-09-27
CN201811133186.2A CN109284397A (en) 2018-09-27 2018-09-27 A kind of construction method of domain lexicon, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2020062770A1 true WO2020062770A1 (en) 2020-04-02

Family

ID=65181584

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/075956 WO2020062770A1 (en) 2018-09-27 2019-02-22 Method and apparatus for constructing domain dictionary, and device and storage medium

Country Status (2)

Country Link
CN (1) CN109284397A (en)
WO (1) WO2020062770A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095736A (en) * 2016-06-07 2016-11-09 华东师范大学 A kind of method of field neologisms extraction
WO2017217661A1 (en) * 2016-06-15 2017-12-21 울산대학교 산학협력단 Word sense embedding apparatus and method using lexical semantic network, and homograph discrimination apparatus and method using lexical semantic network and word embedding
CN108334495A (en) * 2018-01-30 2018-07-27 国家计算机网络与信息安全管理中心 Short text similarity calculating method and system
CN108491462A (en) * 2018-03-05 2018-09-04 昆明理工大学 A kind of semantic query expansion method and device based on word2vec
CN109284397A (en) * 2018-09-27 2019-01-29 深圳大学 A kind of construction method of domain lexicon, device, equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100755677B1 (en) * 2005-11-02 2007-09-05 삼성전자주식회사 Apparatus and method for dialogue speech recognition using topic detection
CN101901249A (en) * 2009-05-26 2010-12-01 复旦大学 Text-based query expansion and sort method in image retrieval
CN106445906A (en) * 2015-08-06 2017-02-22 北京国双科技有限公司 Generation method and apparatus for medium-and-long phrase in domain lexicon
CN108563635A (en) * 2018-04-04 2018-09-21 北京理工大学 A kind of sentiment dictionary fast construction method based on emotion wheel model

Also Published As

Publication number Publication date
CN109284397A (en) 2019-01-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19867000

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 08/07/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19867000

Country of ref document: EP

Kind code of ref document: A1