WO2020062770A1

WO2020062770A1 - Method and apparatus for constructing domain dictionary, and device and storage medium

Info

Publication number: WO2020062770A1
Application number: PCT/CN2019/075956
Authority: WO
Inventors: 李坚强; 颜果开; 傅向华; 李赛玲
Original assignee: 深圳大学
Priority date: 2018-09-27
Filing date: 2019-02-22
Publication date: 2020-04-02
Also published as: CN109284397A

Abstract

The invention is applicable to the technical field of natural language processing. Provided are a method and apparatus for constructing a domain dictionary, and a device and a storage medium. The method comprises: training a word vector model for a selected general corpus and domain corpus, respectively, and obtaining a corresponding general word vector space model and domain word vector space model; calculating a word semantic similarity between a corresponding general word vector and domain word vector in the general word vector space model and domain word vector space model and a seed word vector in an initial domain seed dictionary; selecting, according to the calculated word semantic similarity, the corresponding general word vector or domain word vector to expand the initial domain seed dictionary, so as to obtain a corresponding domain dictionary; and filtering out unformed words in the domain dictionary by means of a new word discovery algorithm so as to complete the construction of the domain dictionary. Thus, the quantity of vocabulary of the domain dictionary is expanded, and the accuracy of the domain vocabulary in the domain dictionary is improved, thereby improving the accuracy of the domain dictionary.

Description

Method, device, equipment and storage medium for constructing domain dictionary

Technical field

The invention belongs to the technical field of natural language processing, and particularly relates to a method, an apparatus, a device, and a storage medium for constructing a domain dictionary.

Background art

With the continuous progress of science, technology, and society, language is constantly changing. In recent years especially, new theories, concepts, materials, technologies, and processes have emerged continuously, and new domain vocabulary has emerged along with them. Domain vocabulary reflects and carries the core knowledge of a subject area, and changes in that vocabulary reflect, to a certain extent, the development of the subject area. Domain vocabulary therefore has important theoretical and practical significance for understanding the current state and future trends of a subject area. As the field of natural language processing continues to expand, the demand for domain lexicons is becoming increasingly urgent.

Existing word-vector-based domain dictionary construction algorithms take a single general corpus or domain corpus from the Internet, segment it directly with a Chinese word segmentation tool, build a general word vector model or a domain word vector model from the segmented corpus, and then calculate the semantic similarity between words in that model to construct a domain dictionary. However, the general word vector model does not account for the dependence of domain vocabulary on the domain corpus in a restricted domain, and the domain word vector model does not account for the shortage of corpus material in a restricted domain. These algorithms also do not account for the inability of the Chinese word segmentation tool to correctly segment domain words or unknown words in the restricted domain. The result is a domain dictionary with insufficient vocabulary and inaccurate domain words.

Summary of the Invention

The purpose of the present invention is to provide a method, an apparatus, a device, and a storage medium for constructing a domain dictionary, aiming to solve the problem that the prior art cannot provide an effective method for constructing a domain dictionary, which results in insufficient domain vocabulary in the domain dictionary and inaccurate domain vocabulary.

In one aspect, the present invention provides a method for constructing a domain dictionary. The method includes the following steps:

Performing word vector model training on a selected general corpus and a selected domain corpus respectively, to obtain a corresponding general word vector space model and domain word vector space model;

Calculating the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and a seed word vector in a preset initial domain seed dictionary;

Selecting, according to the calculated word semantic similarity, the corresponding general word vector or domain word vector to expand the initial domain seed dictionary, so as to obtain a corresponding domain dictionary;

Filtering out unformed words in the domain dictionary by means of a new word discovery algorithm, so as to complete the construction of the domain dictionary.

Preferably, the step of calculating the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and a seed word vector in a preset initial domain seed dictionary includes:

Calculating the word semantic similarity between the general word vector or the domain word vector and the seed word vector through a preset vector cosine similarity formula, where the vector cosine similarity formula is

$$S(V_1, V_2) = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert}$$

where V₁ is the general word vector or the domain word vector, V₂ is the seed word vector, and S(V₁, V₂) is the word semantic similarity.

Preferably, the step of selecting the corresponding general word vector or domain word vector to expand the initial domain seed dictionary includes:

When the calculated word semantic similarity is greater than a preset domain keyword threshold, adding the general word vector or domain word vector corresponding to that word semantic similarity to the initial domain seed dictionary, so as to expand the initial domain seed dictionary.

Preferably, before the step of filtering unformed words in the domain dictionary by a new word discovery algorithm, the method further includes:

Determine whether the current number of iterations reaches a preset number of cross iterations;

If yes, jump to the step of filtering unformed words in the domain dictionary by using a new word discovery algorithm;

Otherwise, increase the current number of iterations by 1, set the domain dictionary as the initial domain seed dictionary, and jump back to the step of calculating the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and a seed word vector in the preset initial domain seed dictionary.

In another aspect, the present invention provides a device for constructing a domain dictionary. The device includes:

A model training unit, configured to train word vectors on the selected general corpus and domain corpus respectively to obtain corresponding general word vector space models and domain word vector space models;

A similarity calculation unit, configured to calculate the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and a seed word vector in a preset initial domain seed dictionary;

A dictionary expansion unit, configured to select the corresponding general word vector or domain word vector to expand the initial domain seed dictionary according to the calculated semantic similarity of the words to obtain a corresponding domain dictionary; and

The unformed word filtering unit is used for filtering unformed words in the domain dictionary through a new word discovery algorithm to complete the construction of the domain dictionary.

Preferably, the similarity calculation unit includes:

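A similarity calculation subunit, configured to calculate the word semantic similarity between the general word vector or the domain word vector and the seed word vector by using a preset vector cosine similarity formula, where the vector cosine similarity formula is

$$S(V_1, V_2) = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert}$$

where V₁ is the general word vector or the domain word vector, V₂ is the seed word vector, and S(V₁, V₂) is the word semantic similarity.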

Preferably, the dictionary expansion unit includes:

A dictionary expansion subunit, configured to add the general word vector or domain word vector corresponding to the word semantic similarity to the initial domain seed dictionary when the calculated word semantic similarity is greater than a preset domain keyword threshold, so as to expand the initial domain seed dictionary.

Preferably, the device further comprises:

An iteration number judging unit, configured to judge whether the current number of iterations reaches a preset number of cross-iterations; if yes, to trigger the unformed word filtering unit to filter out unformed words in the domain dictionary by means of the new word discovery algorithm; otherwise, to increase the current number of iterations by 1, set the domain dictionary as the initial domain seed dictionary, and trigger the similarity calculation unit to calculate the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vector in the preset initial domain seed dictionary.

In another aspect, the present invention also provides a computing device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the steps of the above method for constructing a domain dictionary are implemented.

In another aspect, the present invention also provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps of the above method for constructing a domain dictionary are implemented.

In the present invention, word vector model training is performed on the selected general corpus and domain corpus respectively to obtain the corresponding general word vector space model and domain word vector space model. The word semantic similarity between the corresponding general word vectors and domain word vectors in the two models and the seed word vectors in the preset initial domain seed dictionary is calculated. According to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors are selected to expand the initial domain seed dictionary and obtain the corresponding domain dictionary. Unformed words in the domain dictionary are then filtered out by a new word discovery algorithm to complete the construction of the domain dictionary. This expands the vocabulary of the domain dictionary and improves the accuracy of its domain vocabulary, thereby improving the accuracy of the domain dictionary.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an implementation flowchart of a method for constructing a domain dictionary provided by Embodiment 1 of the present invention;

FIG. 2 is an implementation flowchart of a method for constructing a domain dictionary provided by Embodiment 2 of the present invention;

FIG. 3 is a schematic structural diagram of a device for constructing a domain dictionary provided by Embodiment 3 of the present invention;

FIG. 4 is a schematic structural diagram of a device for constructing a domain dictionary provided by Embodiment 4 of the present invention; and

FIG. 5 is a schematic structural diagram of a computing device provided by Embodiment 5 of the present invention.

Detailed description

In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention and are not intended to limit the present invention.

The following describes the specific implementation of the present invention in detail with reference to specific embodiments:

Embodiment 1:

FIG. 1 shows an implementation flow of a method for constructing a domain dictionary provided in Embodiment 1 of the present invention. For convenience of explanation, only the parts related to the embodiment of the present invention are shown, and the details are as follows:

In step S101, word vector model training is performed on the selected general corpus and domain corpus to obtain corresponding general word vector space models and domain word vector space models.

The embodiments of the present invention are applicable to computing devices such as personal computers and servers. The general corpus and the domain corpus selected in the embodiments of the present invention are relative rather than absolute: the general corpus is one level of abstraction above the domain corpus (a superordinate concept relative to it) and is not necessarily a large, comprehensive corpus. For example, to build a medical dictionary, a large and comprehensive general corpus (for example, the Chinese Wikipedia corpus) and a medical corpus (for example, a maternal and infant question-and-answer corpus) should be used together. If only a dictionary for the Chinese medicine domain is to be built, the corpus of the medical domain should be regarded as the general corpus and combined with the corpus of the Chinese medicine domain to construct the Chinese medicine domain dictionary.

In the embodiment of the present invention, preferably, word vector model training is performed on the selected general corpus and domain corpus respectively through the Skip-Gram model. This reduces the complexity of word vector model training and improves its accuracy, so that the words corresponding to the obtained word vectors better reflect the real meaning of the text.
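As an illustration only, the following minimal sketch shows how a Skip-Gram model forms its training examples: each word in a segmented sentence is paired with the words inside a context window around it. The function name `skipgram_pairs`, the window size, and the sample sentence are assumptions made for this sketch; the source does not specify a training library or hyperparameters. In practice the same training would typically be run once on the general corpus and once on the domain corpus to obtain the two word vector space models.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for every token within `window` positions.

    `window` is an assumed hyperparameter; the source does not give a value.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical, already-segmented sentence from one of the two corpora.
sentence = ["the", "patient", "has", "a", "fever"]
for center, context in skipgram_pairs(sentence, window=1):
    print(center, "->", context)
```

A Skip-Gram network is then trained to predict the context word from the center word, and the learned input weights serve as the word vectors.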

In step S102, the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the preset initial domain seed dictionary is calculated.

In the embodiment of the present invention, the word semantic similarity between each general word vector in the general word vector space model and each seed word vector in the preset initial domain seed dictionary is calculated, as is the word semantic similarity between each domain word vector in the domain word vector space model and each seed word vector in the initial domain seed dictionary. The initial domain seed dictionary consists of one or more domain seed words, and a seed word vector is the vector representation of the corresponding domain seed word in the initial domain seed dictionary.

In the embodiment of the present invention, before the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the preset initial domain seed dictionary is calculated, preferably, the domain to which the domain dictionary to be created belongs is first divided into several different categories, and domain seed words are created for each category. The domain seed words of all categories form the initial domain seed dictionary, which provides comparison samples for the word semantic similarity calculation of the general word vectors and the domain word vectors.

As an example, if a dictionary for the medical domain is to be created, the selected question-and-answer corpus from the maternal and infant domain is divided into five different categories in combination with medical disease classification. The label of each category is then used to create keywords for that category, and these category keywords form the initial medical-domain seed dictionary.
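One possible way to represent such a seed dictionary is a mapping from category label to seed words that is flattened into a single set. The sketch below is only an assumption about the data layout; the category labels and seed words are hypothetical placeholders, because the source does not list the five categories or their keywords.

```python
from typing import Dict, List, Set

# Hypothetical placeholders: the source does not name the categories or keywords.
category_seed_words: Dict[str, List[str]] = {
    "category_1": ["seed_word_a", "seed_word_b"],
    "category_2": ["seed_word_c"],
}


def build_initial_seed_dictionary(by_category: Dict[str, List[str]]) -> Set[str]:
    """Merge the seed words of every category into one initial seed dictionary."""
    return {word for words in by_category.values() for word in words}


initial_seed_dictionary = build_initial_seed_dictionary(category_seed_words)
print(sorted(initial_seed_dictionary))
```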

In the embodiment of the present invention, preferably, the word semantic similarity between the general word vector or the domain word vector and the seed word vector is calculated by a preset vector cosine similarity formula, and the vector cosine similarity formula is

$$S(V_1, V_2) = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert}$$

where V₁ is a general word vector or a domain word vector, V₂ is a seed word vector, and S(V₁, V₂) is the word semantic similarity, thereby improving the accuracy of the word semantic similarity calculation.
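Cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. A minimal sketch of that computation follows; the function name `cosine_similarity` and the handling of zero vectors (returning 0.0) are assumptions of this sketch and are not stated in the source.

```python
import math
from typing import Sequence


def cosine_similarity(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Return the cosine of the angle between v1 and v2."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption: a zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm1 * norm2)


# v1 plays the role of a general or domain word vector, v2 of a seed word vector.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

The value lies between -1 and 1, and vectors pointing in the same direction score close to 1 regardless of their length.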

In step S103, according to the calculated semantic similarity of the words, a corresponding general word vector or domain word vector is selected to expand the initial domain seed dictionary to obtain a corresponding domain dictionary.

In the embodiment of the present invention, according to the calculated word semantic similarity, the general word vectors or domain word vectors that are similar or identical to a seed word vector are selected from the general word vector space model or the domain word vector space model. The selected general word vectors or domain word vectors are converted into the corresponding general words or domain words, and these words are added to the initial domain seed dictionary to expand it. The corresponding domain dictionary is obtained from the expanded initial domain seed dictionary.

In the embodiment of the present invention, preferably, when the calculated word semantic similarity is greater than a preset domain keyword threshold, the general word vector or domain word vector corresponding to that word semantic similarity is added to the initial domain seed dictionary to expand the initial domain seed dictionary, thereby improving the accuracy of the domain vocabulary.
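The threshold-based expansion can be sketched as follows, under several assumptions: each word vector space model is represented as a plain dictionary from word to vector, a candidate is accepted if its similarity to any seed word exceeds the threshold, and the results from the two models are merged by set union. The function names, the toy vectors, and the threshold of 0.8 are all hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Set

Vector = List[float]


def cosine(v1: Vector, v2: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0


def expand_seed_dictionary(seeds: Set[str], model: Dict[str, Vector],
                           threshold: float) -> Set[str]:
    """Add every word of `model` whose similarity to some seed exceeds `threshold`."""
    seed_vectors = [model[s] for s in seeds if s in model]
    expanded = set(seeds)
    for word, vector in model.items():
        if word in expanded:
            continue
        if any(cosine(vector, sv) > threshold for sv in seed_vectors):
            expanded.add(word)
    return expanded


# Toy two-dimensional vectors standing in for the two trained models.
general_model = {"fever": [1.0, 0.0], "cough": [0.9, 0.1], "car": [0.0, 1.0]}
domain_model = {"fever": [1.0, 0.2], "jaundice": [0.9, 0.3], "stroller": [-0.2, 1.0]}
seeds = {"fever"}

domain_dictionary = (expand_seed_dictionary(seeds, general_model, 0.8)
                     | expand_seed_dictionary(seeds, domain_model, 0.8))
print(sorted(domain_dictionary))
```

A higher threshold admits fewer but more closely related words, so the threshold trades dictionary size against precision.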

In step S104, the unformed words in the domain dictionary are screened out by a new word discovery algorithm to complete the construction of the domain dictionary.

In the embodiment of the present invention, when the unformed words in the domain dictionary are filtered out by the new word discovery algorithm, preferably, the words in the domain dictionary are first pre-processed to filter out non-domain words such as numbers, English letters, punctuation, English words, personal names, and stop words. The mutual information of each pair of adjacent words in the pre-processed domain dictionary is then calculated to generate a candidate new word set. Next, left and right adjacent entropy is used to filter the candidate new word set, yielding a new word set and a set of filtered-out unformed words. Finally, the words in the unformed word set are removed from the pre-processed domain dictionary to complete the construction of the domain dictionary, thereby improving the accuracy of the domain dictionary.
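The source does not give formulas for the mutual information or the adjacent entropy. The sketch below therefore assumes the commonly used definitions: pointwise mutual information log(p(x, y) / (p(x) p(y))) over adjacent token pairs, and the Shannon entropy of the tokens seen immediately to the left and right of a pair. All function names, thresholds, and the token sequence are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def pmi_scores(tokens: List[str]) -> Dict[Pair, float]:
    """Pointwise mutual information of each adjacent token pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    return {
        (x, y): math.log((c / n_bi) / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))
        for (x, y), c in bigrams.items()
    }


def entropy(counts: Counter) -> float:
    """Shannon entropy (natural log) of a frequency table."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def adjacent_entropies(tokens: List[str], pair: Pair) -> Tuple[float, float]:
    """Entropy of the left neighbours and of the right neighbours of `pair`."""
    left, right = Counter(), Counter()
    for i in range(len(tokens) - 1):
        if (tokens[i], tokens[i + 1]) == pair:
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + 2 < len(tokens):
                right[tokens[i + 2]] += 1
    return entropy(left), entropy(right)


def discover_new_words(tokens: List[str], min_pmi: float,
                       min_entropy: float) -> Tuple[List[Pair], List[Pair]]:
    """Split PMI candidates into accepted new words and unformed words."""
    new_words, unformed = [], []
    for pair, score in pmi_scores(tokens).items():
        if score < min_pmi:
            continue  # not cohesive enough to be a candidate
        left_h, right_h = adjacent_entropies(tokens, pair)
        if min(left_h, right_h) >= min_entropy:
            new_words.append(pair)
        else:
            unformed.append(pair)
    return new_words, unformed


tokens = ["x", "new", "york", "y", "z", "new", "york", "w"]
new_words, unformed = discover_new_words(tokens, min_pmi=1.0, min_entropy=0.5)
print(new_words)
```

A pair with high mutual information but low adjacent entropy always appears in the same surroundings, which suggests it is a fragment of a longer expression rather than a complete word; such pairs end up in the unformed set and would be removed from the dictionary.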

In the embodiment of the present invention, word vector model training is performed on the general corpus and the domain corpus respectively to obtain the corresponding general word vector space model and domain word vector space model. The word semantic similarity between the corresponding general word vectors and domain word vectors in the two models and the seed word vectors in the initial domain seed dictionary is calculated. According to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors are selected to expand the initial domain seed dictionary and obtain the corresponding domain dictionary. Unformed words in the domain dictionary are then filtered out by the new word discovery algorithm to complete the construction of the domain dictionary. This expands the vocabulary of the domain dictionary and improves the accuracy of its domain vocabulary, thereby improving the accuracy of the domain dictionary.

Embodiment 2:

FIG. 2 shows an implementation process of a method for constructing a domain dictionary provided in Embodiment 2 of the present invention. For convenience of explanation, only the parts related to the embodiment of the present invention are shown, and the details are as follows:

In step S201, word vector model training is performed on the selected general corpus and domain corpus to obtain corresponding general word vector space models and domain word vector space models.

In step S202, the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the preset initial domain seed dictionary is calculated.

In step S203, according to the calculated semantic similarity of the words, the corresponding general word vector or domain word vector is selected to expand the initial domain seed dictionary to obtain a corresponding domain dictionary.

In the embodiment of the present invention, for specific implementations of steps S201 to S203, reference may be made to the description of steps S101 to S103 in Embodiment 1, and details are not described herein again.

In step S204, it is determined whether the current number of iterations reaches a preset number of cross-iterations. If yes, step S206 is performed; otherwise, step S205 is performed.

In step S205, the current number of iterations is increased by one, and the domain dictionary is set as the initial domain seed dictionary.

In the embodiment of the present invention, when the current number of iterations has not reached the preset number of cross-iterations, the current number of iterations is increased by one and the domain dictionary is set as the initial domain seed dictionary, so that the domain dictionary obtained in the current iteration serves as the input for the next round of domain seed word expansion. The process then jumps to step S202 and continues to perform the word semantic similarity calculation in the general word vector space model and the domain word vector space model to expand the initial domain seed dictionary.
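The control flow of steps S202 to S205 can be sketched as a loop in which one round of expansion is passed in as a function. This is only an illustration: the name `build_with_cross_iterations`, the `expand_once` callback, the assumption that the iteration counter starts at 1, and the toy neighbour table are not taken from the source.

```python
from typing import Callable, Dict, Set


def build_with_cross_iterations(initial_seeds: Set[str],
                                expand_once: Callable[[Set[str]], Set[str]],
                                max_iterations: int) -> Set[str]:
    """Repeat seed expansion until the preset number of cross-iterations is reached."""
    seeds = set(initial_seeds)
    iteration = 1  # assumption: the counter starts at 1
    while True:
        domain_dictionary = expand_once(seeds)   # steps S202 and S203
        if iteration >= max_iterations:          # step S204
            return domain_dictionary             # continue with step S206
        iteration += 1                           # step S205
        seeds = domain_dictionary                # dictionary becomes the new seeds


# Toy stand-in for similarity-based expansion: each word has fixed "similar" words.
neighbours: Dict[str, Set[str]] = {
    "fever": {"cough"}, "cough": {"flu"}, "flu": {"virus"},
}


def expand_once(seeds: Set[str]) -> Set[str]:
    expanded = set(seeds)
    for word in seeds:
        expanded |= neighbours.get(word, set())
    return expanded


print(sorted(build_with_cross_iterations({"fever"}, expand_once, max_iterations=2)))
```

Each additional iteration reaches words that are similar to previously added words rather than to the original seeds, which enlarges the dictionary but may also let it drift away from the domain, so the preset iteration count bounds that drift.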

In step S206, the unformed words in the domain dictionary are filtered out by a new word discovery algorithm to complete the construction of the domain dictionary.

In the embodiment of the present invention, for the specific implementation of step S206, reference may be made to the description of step S104 in Embodiment 1, and details are not described herein again.

In the embodiment of the present invention, word vector model training is performed on the selected general corpus and domain corpus to obtain a general word vector space model and a domain word vector space model. Through multiple cross-iterations, the word semantic similarity to each seed word vector in the initial domain seed dictionary is calculated to expand the seed words of the initial domain seed dictionary, thereby improving the accuracy of the domain vocabulary in the obtained domain dictionary and expanding the vocabulary of the domain dictionary. Unformed words in the domain dictionary are then filtered out by the new word discovery algorithm to complete the construction of the domain dictionary, thereby improving the accuracy of the domain dictionary.

Embodiment 3:

FIG. 3 shows a structure of a device for constructing a domain dictionary provided in Embodiment 3 of the present invention. For ease of description, only parts related to the embodiment of the present invention are shown, including:

A model training unit 31 is configured to perform word vector model training on the selected general corpus and domain corpus, respectively, to obtain corresponding general word vector space models and domain word vector space models.

In the embodiment of the present invention, preferably, word vector model training is performed on the selected general corpus and domain corpus respectively through the Skip-Gram model. This reduces the complexity of word vector model training and improves its accuracy, so that the words corresponding to the obtained word vectors better reflect the real meaning of the text.

The similarity calculation unit 32 is configured to calculate the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the preset initial domain seed dictionary.

As an example, if a dictionary for the medical domain is to be created, the selected question-and-answer corpus from the maternal and infant domain is divided into five different categories in combination with medical disease classification, and the keywords of these categories form the initial medical-domain seed dictionary.

The dictionary expansion unit 33 is configured to select the corresponding general word vector or domain word vector to expand the initial domain seed dictionary according to the calculated word semantic similarity to obtain a corresponding domain dictionary.

The unformed word filtering unit 34 is configured to filter the unformed words in the domain dictionary through a new word discovery algorithm to complete the construction of the domain dictionary.

In the embodiment of the present invention, each unit of the device for constructing the domain dictionary may be implemented by a corresponding hardware or software unit. Each unit may be an independent software or hardware unit, or the units may be integrated into a single software or hardware unit, which is not intended to limit the present invention.

Embodiment 4:

FIG. 4 shows the structure of a device for constructing a domain dictionary provided in Embodiment 4 of the present invention. For ease of description, only parts related to the embodiment of the present invention are shown, including:

A model training unit 41 is configured to perform word vector model training on the selected general corpus and domain corpus, respectively, to obtain corresponding general word vector space models and domain word vector space models;

A similarity calculation unit 42 for calculating a word semantic similarity between a corresponding general word vector and a field word vector in the universal word vector space model and the domain word vector space model and a seed word vector in a preset initial domain seed dictionary;

A dictionary expansion unit 43 is configured to select the corresponding general word vector or domain word vector to expand the initial domain seed dictionary according to the calculated word semantic similarity to obtain a corresponding domain dictionary;

An iteration number judging unit 44, configured to judge whether the current number of iterations reaches a preset number of cross-iterations; if yes, to trigger the unformed word filtering unit 45 to filter out unformed words in the domain dictionary by means of the new word discovery algorithm; otherwise, to increase the current number of iterations by 1, set the domain dictionary as the initial domain seed dictionary, and trigger the similarity calculation unit 42 to calculate the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the preset initial domain seed dictionary; and

The unformed word filtering unit 45 is configured to filter the unformed words in the domain dictionary through a new word discovery algorithm to complete the construction of the domain dictionary.

Preferably, the similarity calculation unit 42 includes:

The similarity calculation subunit 421 is configured to calculate the word semantic similarity between the general word vector or the domain word vector and the seed word vector through a preset vector cosine similarity formula. The vector cosine similarity formula is

$$S(V_1, V_2) = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert}$$

where V₁ is a general word vector or a domain word vector, V₂ is a seed word vector, and S(V₁, V₂) is the word semantic similarity.

Preferably, the dictionary expansion unit 43 includes:

The dictionary expansion subunit 431 is configured to add the general word vector or domain word vector corresponding to the word semantic similarity to the initial domain seed dictionary when the calculated word semantic similarity is greater than a preset domain keyword threshold, so as to expand the initial domain seed dictionary.

In the embodiment of the present invention, each unit of the device for constructing the domain dictionary may be implemented by a corresponding hardware or software unit. Each unit may be an independent software or hardware unit, or the units may be integrated into a single software or hardware unit, which is not intended to limit the present invention. For the specific implementation of each unit, reference may be made to the description of the foregoing method embodiments, and details are not described herein again.

Embodiment 5:

FIG. 5 shows the structure of a computing device provided in Embodiment 5 of the present invention. For ease of description, only parts related to the embodiment of the present invention are shown.

The computing device 5 according to the embodiment of the present invention includes a processor 50, a memory 51, and a computer program 52 stored in the memory 51 and executable on the processor 50. When the processor 50 executes the computer program 52, the steps in the foregoing embodiments of the method for constructing a domain dictionary are implemented, for example, steps S101 to S104 shown in FIG. 1. Alternatively, when the processor 50 executes the computer program 52, the functions of the units in the foregoing device embodiments are implemented, for example, the functions of the units 31 to 34 shown in FIG. 3.

In the embodiment of the present invention, word vector model training is performed on the selected general corpus and domain corpus respectively to obtain the corresponding general word vector space model and domain word vector space model. The word semantic similarity between the corresponding general word vectors and domain word vectors in the two models and the seed word vectors in the preset initial domain seed dictionary is calculated. According to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors are selected to expand the initial domain seed dictionary and obtain the corresponding domain dictionary. Unformed words in the domain dictionary are then filtered out by the new word discovery algorithm to complete the construction of the domain dictionary. This expands the vocabulary of the domain dictionary and improves the accuracy of its domain vocabulary, thereby improving the accuracy of the domain dictionary.

The computing device in the embodiment of the present invention may be a personal computer or a server. For steps implemented when the processor 50 in the computing device 5 executes the computer program 52 to implement the method of constructing the domain dictionary, reference may be made to the description of the foregoing method embodiments, and details are not described herein again.

Embodiment 6:

In the embodiment of the present invention, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in the embodiment of a method for constructing a dictionary in the foregoing field are implemented, for example, Steps S101 to S104 shown in FIG. 1. Alternatively, when the computer program is executed by a processor, the functions of each unit in the foregoing device embodiments are implemented, for example, the functions of units 31 to 34 shown in FIG. 3.

The computer-readable storage medium of the embodiment of the present invention may include any entity or device capable of carrying computer program code, or a recording medium, for example, a memory such as a ROM/RAM, a magnetic disk, an optical disk, or a flash memory.

The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the scope of protection of the present invention.

Claims

A method for constructing a domain dictionary, wherein the method includes the following steps:

Training word vector models separately on the selected general corpus and domain corpus to obtain a corresponding general word vector space model and domain word vector space model;

Calculating the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in a preset initial domain seed dictionary;

Selecting, according to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors to expand the initial domain seed dictionary, so as to obtain a corresponding domain dictionary;

Filtering out unformed words in the domain dictionary by a new word discovery algorithm to complete the construction of the domain dictionary.
The method according to claim 1, wherein the step of calculating the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the preset initial domain seed dictionary comprises:

Calculating the word semantic similarity between the general word vector and the domain word vector and the seed word vector through a preset vector cosine similarity formula, where the vector cosine similarity formula is

$$S(V_1, V_2) = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert}$$

wherein $V_1$ is the general word vector or the domain word vector, $V_2$ is the seed word vector, and $S(V_1, V_2)$ is the word semantic similarity.
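By way of illustration only, the vector cosine similarity named in this claim can be sketched as follows. The function name `cosine_similarity`, the use of plain Python lists as vectors, and the choice to return 0.0 for a zero-length vector are assumptions made for the sketch and are not taken from the specification.

```python
import math
from typing import Sequence


def cosine_similarity(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Return S(V1, V2) = (V1 . V2) / (|V1| * |V2|)."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm1 * norm2)
```

Vectors pointing in the same direction give a similarity of 1 regardless of their length, and orthogonal vectors give 0, so the measure depends only on direction.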
The method of claim 1, wherein the step of selecting a corresponding general word vector or domain word vector to expand the initial domain seed dictionary comprises:

When the calculated word semantic similarity is greater than a preset domain keyword threshold, adding the general word vector or domain word vector corresponding to that word semantic similarity to the initial domain seed dictionary, so as to expand the initial domain seed dictionary.
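The threshold-based expansion in this claim might be sketched as below. This is a minimal illustration under stated assumptions: a word vector space is represented as a hypothetical `dict` mapping each word to its vector, a candidate is added when its similarity to any one seed word exceeds the threshold, and the names `expand_seed_dictionary` and `cosine` as well as the toy vectors are invented for the example.

```python
import math
from typing import Dict, List, Set


def cosine(v1: List[float], v2: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0


def expand_seed_dictionary(
    seed_words: Set[str],
    space: Dict[str, List[float]],  # hypothetical layout: word -> vector
    threshold: float,               # the preset domain keyword threshold
) -> Set[str]:
    """Add every word whose similarity to some seed word exceeds the threshold."""
    seed_vectors = [space[w] for w in seed_words if w in space]
    expanded = set(seed_words)
    for word, vector in space.items():
        if word in expanded:
            continue
        # Assumption: exceeding the threshold for any single seed is enough.
        if any(cosine(vector, sv) > threshold for sv in seed_vectors):
            expanded.add(word)
    return expanded


# Toy vectors, invented for illustration only.
vectors = {
    "diabetes": [1.0, 0.0],
    "insulin": [0.9, 0.1],
    "football": [0.0, 1.0],
}
print(sorted(expand_seed_dictionary({"diabetes"}, vectors, 0.8)))
# prints ['diabetes', 'insulin']
```

A higher threshold admits fewer words and keeps the dictionary closer to the seeds; a lower threshold admits more candidates, leaving more work for the later filtering step.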
The method according to claim 1, wherein before the step of filtering unformed words in the domain dictionary by a new word discovery algorithm, the method further comprises:

Determining whether the current number of iterations has reached a preset number of cross iterations;

If so, jumping to the step of filtering out unformed words in the domain dictionary by a new word discovery algorithm;

Otherwise, incrementing the current number of iterations by 1, setting the domain dictionary as the initial domain seed dictionary, and jumping to the step of calculating the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the preset initial domain seed dictionary.
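The iteration control in this claim can be illustrated with the following sketch. Several details are assumptions rather than statements of the claimed method: the iteration counter is taken to start at 1, each round draws candidates from both the general and the domain space using the same seed set, word vector spaces are hypothetical `dict` objects mapping words to vectors, and `filter_unformed` is a placeholder callable standing in for the new word discovery algorithm, which is not reproduced here.

```python
import math
from typing import Callable, Dict, List, Set

Vectors = Dict[str, List[float]]  # hypothetical layout: word -> vector


def cosine(v1: List[float], v2: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0


def similar_words(seeds: Set[str], space: Vectors, threshold: float) -> Set[str]:
    """Words in `space` whose similarity to some seed exceeds the threshold."""
    seed_vecs = [space[w] for w in seeds if w in space]
    return {w for w, v in space.items()
            if any(cosine(v, s) > threshold for s in seed_vecs)}


def build_domain_dictionary(
    seeds: Set[str],
    general_space: Vectors,
    domain_space: Vectors,
    threshold: float,
    max_iterations: int,  # the preset number of cross iterations
    filter_unformed: Callable[[Set[str]], Set[str]],  # placeholder filter
) -> Set[str]:
    dictionary = set(seeds)
    iteration = 1  # assumption: counting starts at 1
    while True:
        # Expand the current seed dictionary from both vector spaces.
        dictionary |= (similar_words(dictionary, general_space, threshold)
                       | similar_words(dictionary, domain_space, threshold))
        if iteration >= max_iterations:
            break  # preset count reached: go on to filtering
        iteration += 1  # the expanded dictionary becomes the next seed set
    return filter_unformed(dictionary)
```

Because each round reuses the expanded dictionary as its seed set, words that are not close to any original seed can still be reached through intermediate words, which is why the preset iteration count bounds how far the dictionary can drift from the initial seeds.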
A device for constructing a domain dictionary, wherein the device includes:

A model training unit, configured to train word vector models separately on the selected general corpus and domain corpus to obtain a corresponding general word vector space model and domain word vector space model;

A similarity calculation unit, configured to calculate the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in a preset initial domain seed dictionary;

A dictionary expansion unit, configured to select, according to the calculated word semantic similarity, the corresponding general word vectors or domain word vectors to expand the initial domain seed dictionary, so as to obtain a corresponding domain dictionary; and

An unformed word filtering unit, configured to filter out unformed words in the domain dictionary by a new word discovery algorithm to complete the construction of the domain dictionary.
The apparatus according to claim 5, wherein the similarity calculation unit comprises:

A similarity calculation subunit, configured to calculate the word semantic similarity between the general word vector and the domain word vector and the seed word vector by using a preset vector cosine similarity formula, where the vector cosine similarity formula is

$$S(V_1, V_2) = \frac{V_1 \cdot V_2}{\lVert V_1 \rVert \, \lVert V_2 \rVert}$$

wherein $V_1$ is the general word vector or the domain word vector, $V_2$ is the seed word vector, and $S(V_1, V_2)$ is the word semantic similarity.
The apparatus according to claim 5, wherein the dictionary expansion unit comprises:

A dictionary expansion subunit, configured to, when the calculated word semantic similarity is greater than a preset domain keyword threshold, add the general word vector or domain word vector corresponding to that word semantic similarity to the initial domain seed dictionary, so as to expand the initial domain seed dictionary.
The apparatus according to claim 5, further comprising:

An iteration number judging unit, configured to determine whether the current number of iterations has reached a preset number of cross iterations; if so, to trigger the unformed word filtering unit to filter out unformed words in the domain dictionary by the new word discovery algorithm; otherwise, to increment the current number of iterations by 1, set the domain dictionary as the initial domain seed dictionary, and trigger the similarity calculation unit to calculate the word semantic similarity between the corresponding general word vectors and domain word vectors in the general word vector space model and the domain word vector space model and the seed word vectors in the preset initial domain seed dictionary.
A computing device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, the processor implements the steps of the method according to any one of claims 1 to 4.
A computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 4 are implemented.