CN114020868A - Domain corpus generation method, device and equipment - Google Patents

Domain corpus generation method, device and equipment

Info

Publication number
CN114020868A
CN114020868A
Authority
CN
China
Prior art keywords
target
vocabulary
domain
encyclopedia
field
Prior art date
Legal status
Pending
Application number
CN202111302097.8A
Other languages
Chinese (zh)
Inventor
秦海龙
郑伟
Current Assignee
Beijing Qury Technology Co ltd
Shandong Kurui Technology Co ltd
Original Assignee
Beijing Qury Technology Co ltd
Shandong Kurui Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Qury Technology Co ltd, Shandong Kurui Technology Co ltd
Priority to CN202111302097.8A
Publication of CN114020868A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure relates to a method, an apparatus, and a device for generating domain corpora. The method includes: acquiring at least one initial keyword of a target domain, and adding the at least one initial keyword to a target domain vocabulary set; determining a plurality of corresponding near-synonyms for each word in the target domain vocabulary set; determining, from the plurality of near-synonyms, the target near-synonyms that match encyclopedia entries, and adding the target near-synonyms to the target domain vocabulary set; repeatedly performing the above steps based on the words in the target domain vocabulary set until a preset stop condition is met; and then using the words in the target domain vocabulary set as encyclopedia entries to extract the corresponding encyclopedia texts as the domain corpus of the target domain. The technical solution of the present disclosure alleviates the lack of training corpora for domain NLP algorithms and allows domain corpora to be acquired quickly and accurately.

Description

Domain corpus generation method, device and equipment
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, and a device for generating domain corpora.
Background
In the field of natural language processing, data differ across domains, so the same model performs differently in different domains. To guarantee model performance, a dedicated model is usually trained independently for a given domain.
At present, some domains have scarce data, and algorithm training for natural language processing in those domains suffers from data sparseness, so a large amount of domain corpus needs to be acquired to guarantee the training effect. How to acquire such domain corpora is therefore an urgent problem to be solved.
Disclosure of Invention
In order to solve, or at least partially solve, the above technical problem, the present disclosure provides a method, an apparatus, and a device for generating domain corpora.
In a first aspect, an embodiment of the present disclosure provides a method for generating domain corpora, including:
step S1, obtaining at least one initial keyword of a target domain, and adding the at least one initial keyword to a target domain vocabulary set;
step S2, determining a plurality of corresponding near-synonyms for each word in the target domain vocabulary set;
step S3, matching the plurality of corresponding near-synonyms against encyclopedia entries, so as to determine, from the plurality of near-synonyms, the target near-synonyms that match the encyclopedia entries, and adding the target near-synonyms to the target domain vocabulary set;
step S4, repeatedly performing step S2 and step S3 based on the words in the target domain vocabulary set until a preset stop condition is met;
and step S5, using the words in the target domain vocabulary set as encyclopedia entries to extract the corresponding encyclopedia texts, where the corresponding encyclopedia texts are the domain corpus of the target domain.
Optionally, the determining a plurality of corresponding near-synonyms for each word in the target domain vocabulary set includes: performing near-synonym recall on each word in the target domain vocabulary set based on a pre-trained word vector model, so as to determine the plurality of near-synonyms corresponding to each word.
Optionally, the method further includes: crawling encyclopedia corpus with a web crawler as training samples, so as to train the word vector model on the training samples.
Optionally, the using the words in the target domain vocabulary set as encyclopedia entries to extract the corresponding encyclopedia texts includes: using the words in the target domain vocabulary set as encyclopedia entries to extract corresponding first encyclopedia texts; processing the first encyclopedia texts based on a new word discovery algorithm to obtain at least one candidate word in the first encyclopedia texts; and using the at least one candidate word as encyclopedia entries to extract corresponding second encyclopedia texts, where the first encyclopedia texts and the second encyclopedia texts serve as the domain corpus of the target domain.
Optionally, the meeting of the preset stop condition includes: comparing the target near-synonyms with the words in the target domain vocabulary set, and if the target near-synonyms are already present in the target domain vocabulary set, determining that the stop condition is met.
Optionally, there are a plurality of target domains, and the method further includes: performing the above steps S1 to S5 for each target domain to obtain the domain corpus of that target domain; and determining, among all encyclopedia texts, the target encyclopedia texts that do not belong to the domain corpus of any target domain, and using the target encyclopedia texts as a general corpus.
In a second aspect, an embodiment of the present disclosure provides a domain corpus generating device, including:
an obtaining module, configured to perform step S1 of obtaining at least one initial keyword of a target domain and adding the at least one initial keyword to a target domain vocabulary set;
a determining module, configured to perform step S2 of determining a plurality of corresponding near-synonyms for each word in the target domain vocabulary set;
a matching module, configured to perform step S3 of matching the plurality of corresponding near-synonyms against encyclopedia entries, determining, from the plurality of near-synonyms, the target near-synonyms that match the encyclopedia entries, and adding the target near-synonyms to the target domain vocabulary set;
an executing module, configured to perform step S4 of repeatedly performing step S2 and step S3 based on the words in the target domain vocabulary set until a preset stop condition is met;
and a generating module, configured to perform step S5 of using the words in the target domain vocabulary set as encyclopedia entries to extract the corresponding encyclopedia texts, where the corresponding encyclopedia texts are the domain corpus of the target domain.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: a processor; and a memory for storing instructions executable by the processor; where the processor is configured to read the executable instructions from the memory and execute the instructions to implement the domain corpus generating method according to the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium, where the storage medium stores a computer program, and the computer program, when executed by a processor, implements the domain corpus generating method according to the first aspect.
Compared with the prior art, the technical solutions provided by the embodiments of the present disclosure have the following advantages: at least one initial keyword of a target domain is acquired and added to a target domain vocabulary set; a plurality of corresponding near-synonyms is determined for each word in the target domain vocabulary set; the target near-synonyms that match encyclopedia entries are then determined from the plurality of near-synonyms and added to the target domain vocabulary set; the above steps are repeated based on the words in the target domain vocabulary set until a preset stop condition is met; and the words in the target domain vocabulary set are finally used as encyclopedia entries to extract the corresponding encyclopedia texts as the domain corpus of the target domain. The technical solution of the present disclosure alleviates the lack of training corpora for domain NLP (Natural Language Processing) algorithms and allows domain corpora to be acquired quickly and accurately.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. For those of ordinary skill in the art, other drawings can be derived from these drawings without creative effort.
Fig. 1 is a schematic flow chart of a domain corpus generating method according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a domain corpus generating device according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
Fig. 1 is a schematic flow diagram of a domain corpus generating method according to an embodiment of the present disclosure, where the method according to the embodiment of the present disclosure may be executed by a domain corpus generating device, and the device may be implemented by software and/or hardware, and may be integrated on any electronic device with computing capability, such as a user terminal, e.g., a smart phone, a tablet computer, and the like.
As shown in fig. 1, a method for generating domain corpus provided in an embodiment of the present disclosure may include:
Step 101, obtaining at least one initial keyword of a target domain, and adding the at least one initial keyword to a target domain vocabulary set.
The target domain is a domain for which a domain corpus is to be generated; for example, the target domain may be the finance vertical domain, the music vertical domain, the travel vertical domain, the education vertical domain, the medical vertical domain, or the like.
In this embodiment, a professional dictionary of the target vertical domain may be queried to obtain one or more seed words of that vertical domain; the obtained seed words are used as the initial keywords and added to the target domain vocabulary set. Taking the finance vertical domain as an example, several seed words found in a professional finance dictionary or in encyclopedia entries, such as names of financial derivatives and bank names, may be selected and added to the finance vertical domain vocabulary set as the initial keywords.
Step 102, determining a plurality of corresponding near-synonyms for each word in the target domain vocabulary set.
In this embodiment, the target domain vocabulary set initially contains one or more words, and for each word, a plurality of near-synonyms or synonyms corresponding to that word is determined.
As a possible implementation, a word vector model may be trained in advance; the model takes a word as input and outputs a vector. Optionally, encyclopedia corpus crawled by a web crawler is used as training samples to train the word vector model, and the training method includes, but is not limited to, Word2Vec, GloVe, and the like. One function of the word vector model is to recall several near-synonyms of a word. Therefore, in this embodiment, determining a plurality of corresponding near-synonyms for each word in the target domain vocabulary set includes: performing near-synonym recall on each word in the target domain vocabulary set based on the pre-trained word vector model, so as to determine the plurality of near-synonyms corresponding to each word.
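The disclosure does not prescribe a particular toolkit for this step. As a minimal sketch only, assuming the gensim library for Word2Vec training and jieba for Chinese word segmentation (neither library is named in the disclosure), and assuming the crawled articles are already available in memory as `encyclopedia_articles`, near-synonym recall by cosine similarity could look like this:

```python
# Minimal sketch, assuming gensim >= 4.0 and jieba; the library choice,
# the hyper-parameters, and the variable `encyclopedia_articles` are
# illustrative assumptions, not part of the disclosure.
import jieba
from gensim.models import Word2Vec

# Segment each crawled encyclopedia article into a token list.
sentences = [jieba.lcut(article) for article in encyclopedia_articles]

# Train the word vector model on the encyclopedia corpus.
model = Word2Vec(sentences, vector_size=200, window=5, min_count=5, workers=4)

def recall_near_synonyms(word, topn=20):
    """Recall candidate near-synonyms of `word` by cosine similarity in vector space."""
    if word not in model.wv:
        return []
    return [w for w, _score in model.wv.most_similar(word, topn=topn)]
```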
Step 103, matching the plurality of corresponding near-synonyms against encyclopedia entries, so as to determine, from the plurality of near-synonyms, the target near-synonyms that match the encyclopedia entries, and adding the target near-synonyms to the target domain vocabulary set.
In this embodiment, each near-synonym determined in the preceding step is matched against a preset set of encyclopedia entries. For each near-synonym, if an entry identical to that near-synonym exists among the encyclopedia entries, the near-synonym is determined to match the encyclopedia entries and is added to the target domain vocabulary set.
The encyclopedia data may be acquired in advance and includes encyclopedia entries and encyclopedia texts, that is, each encyclopedia entry corresponds to one article in the encyclopedia data. By matching the plurality of near-synonyms against the encyclopedia entries, domain words are screened out of the near-synonyms according to the encyclopedia entries, thereby expanding the domain vocabulary.
Step 104, repeatedly performing step 102 and step 103 based on the words in the target domain vocabulary set until a preset stop condition is met.
In this embodiment, the number of words in the target domain vocabulary set grows as the above steps are repeated.
As an example, after target near-synonyms B and C are added to the target domain vocabulary set, a plurality of corresponding near-synonyms is determined for the target near-synonym B, the near-synonyms that match encyclopedia entries are then determined from among the near-synonyms corresponding to B, and the matching near-synonyms are added to the target domain vocabulary set; the target near-synonym C is processed in the same way as B, which is not repeated here. It is then judged whether the preset stop condition is met; if so, the repetition stops, and if not, the above steps continue to be performed.
When the stop condition is met, no more words are added to the target domain vocabulary set, and the domain corpus is then extracted according to the words in the set. The stop condition may be set according to the actual scenario. Optionally, after each round in which the target near-synonyms matching the encyclopedia entries are determined, the target near-synonyms are compared with the words in the target domain vocabulary set, and if the target near-synonyms are already present in the set, the stop condition is determined to be met. Alternatively, the stop condition may be a specified number of iterations: when the number of iterations reaches the specified number, the stop condition is determined to be met.
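Putting steps 102 to 104 together, the expansion loop can be sketched in plain Python as follows. The function and variable names, the iteration cap, and the use of a "no new word added" test are illustrative assumptions consistent with, but not prescribed by, the description above:

```python
def expand_domain_vocabulary(seed_keywords, entry_titles, recall_near_synonyms, max_rounds=10):
    """Iteratively grow the target domain vocabulary set (steps 101 to 104).

    seed_keywords        -- initial keywords of the target domain (step 101)
    entry_titles         -- set of encyclopedia entry titles used for matching (step 103)
    recall_near_synonyms -- callable mapping a word to candidate near-synonyms (step 102)
    """
    domain_vocab = set(seed_keywords)
    frontier = set(seed_keywords)             # words added in the previous round
    for _ in range(max_rounds):               # hard cap as an alternative stop condition
        candidates = set()
        for word in frontier:
            candidates.update(recall_near_synonyms(word))
        # Keep only the near-synonyms that exactly match an encyclopedia entry.
        matched = {w for w in candidates if w in entry_titles}
        new_words = matched - domain_vocab
        if not new_words:                      # stop condition: nothing new to add
            break
        domain_vocab.update(new_words)
        frontier = new_words
    return domain_vocab
```

With the word vector sketch above, `recall_near_synonyms` would simply be the recall function defined there.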
Step 105, using the words in the target domain vocabulary set as encyclopedia entries to extract the corresponding encyclopedia texts.
In this embodiment of the present disclosure, every word in the target domain vocabulary set matches an encyclopedia entry; the words in the target domain vocabulary set are therefore used as encyclopedia entries, the corresponding encyclopedia texts are extracted, and the extracted encyclopedia texts serve as the domain corpus of the target domain.
In an embodiment of the present disclosure, using the words in the target domain vocabulary set as encyclopedia entries to extract the corresponding encyclopedia texts includes: using the words in the target domain vocabulary set as encyclopedia entries to extract corresponding first encyclopedia texts; processing the first encyclopedia texts based on a new word discovery algorithm to obtain at least one candidate word in the first encyclopedia texts; and using the at least one candidate word as encyclopedia entries to extract corresponding second encyclopedia texts, where the first encyclopedia texts and the second encyclopedia texts serve as the domain corpus of the target domain.
In this embodiment, the existing vertical-domain corpus texts are processed with a new word discovery algorithm, and a plurality of candidate words can be obtained from each existing vertical-domain corpus text. As an example, a discovered candidate word is formed from words of the existing vocabulary that co-occur frequently in the text (usually two or three adjacent words): indicators such as maximum entropy and mutual information between the words are computed, and if an indicator exceeds a threshold, the two words are merged into a new word. Each candidate word is then matched against the encyclopedia entries, and if it matches an entry, the encyclopedia text corresponding to that entry is added to the domain corpus of the target domain. In this way, the new word discovery algorithm further supplements the domain vocabulary, guarantees the amount of generated domain corpus, and alleviates the lack of training corpora for domain NLP algorithms.
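The disclosure does not spell out the exact scoring, so the following is only a rough sketch of the mutual-information half of such a new word discovery step on already-segmented texts; the thresholds are arbitrary assumptions, and the entropy check mentioned above is omitted for brevity:

```python
import math
from collections import Counter

def discover_new_words(tokenized_texts, min_count=5, pmi_threshold=3.0):
    """Merge adjacent token pairs whose pointwise mutual information exceeds a threshold.

    tokenized_texts -- iterable of token lists (e.g. segmented first encyclopedia texts)
    Returns candidate new words sorted by descending PMI.
    """
    unigrams, bigrams, total = Counter(), Counter(), 0
    for tokens in tokenized_texts:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
        total += len(tokens)
    candidates = []
    for (w1, w2), n in bigrams.items():
        if n < min_count:
            continue
        # PMI of the adjacent pair relative to the frequencies of its parts.
        pmi = math.log((n / total) / ((unigrams[w1] / total) * (unigrams[w2] / total)))
        if pmi > pmi_threshold:
            candidates.append((w1 + w2, pmi))
    return sorted(candidates, key=lambda item: -item[1])
```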
According to the technical solution of the embodiments of the present disclosure, at least one initial keyword of a target domain is acquired and added to a target domain vocabulary set; a plurality of corresponding near-synonyms is determined for each word in the target domain vocabulary set; the target near-synonyms that match encyclopedia entries are then determined from the plurality of near-synonyms and added to the target domain vocabulary set; the above steps are repeated based on the words in the target domain vocabulary set until a preset stop condition is met; and the words in the target domain vocabulary set are finally used as encyclopedia entries to extract the corresponding encyclopedia texts as the domain corpus of the target domain. The technical solution of the present disclosure alleviates the lack of training corpora for domain NLP algorithms and allows domain corpora to be acquired quickly and accurately.
Based on the above embodiment, the domain corpus of each domain can be acquired. In an embodiment of the present disclosure, there are a plurality of target domains, so the above steps 101 to 105 may be performed for each target domain to obtain its domain corpus; for example, for the finance vertical domain, the music vertical domain, and the travel vertical domain, the domain corpora of the finance vertical domain, the music vertical domain, and the travel vertical domain are obtained respectively. Among all encyclopedia texts, the target encyclopedia texts that do not belong to the domain corpus of any target domain are then determined and used as a general corpus. In this embodiment, if an encyclopedia entry is not contained in any target domain vocabulary set, the encyclopedia text corresponding to that entry is determined to be general corpus. This domain corpus extraction scheme based on encyclopedia data thus avoids the sparse domain data and the high cleaning cost caused by the uncertain format quality of data crawled from arbitrary websites, reduces the data cleaning cost, and allows domain corpora to be acquired quickly and accurately.
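Assuming the encyclopedia data is held as a mapping from entry title to article text and each expanded domain vocabulary set is available, the split into domain corpora and a general corpus reduces to simple set membership. The data layout below is an assumption for illustration, not a structure defined in the disclosure:

```python
def split_corpora(encyclopedia, domain_vocab_sets):
    """Split encyclopedia texts into per-domain corpora and a general corpus.

    encyclopedia      -- dict: entry title -> article text (assumed layout)
    domain_vocab_sets -- dict: domain name -> set of entry titles in that domain
    """
    domain_corpora = {name: [] for name in domain_vocab_sets}
    all_domain_entries = set().union(*domain_vocab_sets.values())
    general_corpus = []
    for title, text in encyclopedia.items():
        if title in all_domain_entries:
            for name, vocab in domain_vocab_sets.items():
                if title in vocab:
                    domain_corpora[name].append(text)
        else:
            general_corpus.append(text)   # entry belongs to no target domain
    return domain_corpora, general_corpus
```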
Furthermore, a given domain corpus together with the general corpus can be used as the data set of a domain NLP algorithm and used to train that algorithm. For example, for a travel-related intelligent question answering robot, the word vector model can be trained on the domain corpus of the travel vertical domain, which alleviates the lack of training corpora for domain NLP algorithms and improves model accuracy.
Fig. 2 is a schematic structural diagram of a domain corpus generating device according to an embodiment of the present disclosure. As shown in fig. 2, the domain corpus generating device includes: an obtaining module 21, a determining module 22, a matching module 23, an executing module 24, and a generating module 25.
The obtaining module 21 is configured to perform step S1 of obtaining at least one initial keyword of a target domain and adding the at least one initial keyword to a target domain vocabulary set.
The determining module 22 is configured to perform step S2 of determining a plurality of corresponding near-synonyms for each word in the target domain vocabulary set.
The matching module 23 is configured to perform step S3 of matching the plurality of corresponding near-synonyms against encyclopedia entries, determining, from the plurality of near-synonyms, the target near-synonyms that match the encyclopedia entries, and adding the target near-synonyms to the target domain vocabulary set.
The executing module 24 is configured to perform step S4 of repeatedly performing step S2 and step S3 based on the words in the target domain vocabulary set until a preset stop condition is met.
The generating module 25 is configured to perform step S5 of using the words in the target domain vocabulary set as encyclopedia entries to extract the corresponding encyclopedia texts, where the corresponding encyclopedia texts are the domain corpus of the target domain.
In an embodiment of the present disclosure, the determining module 22 is specifically configured to: perform near-synonym recall on each word in the target domain vocabulary set based on a pre-trained word vector model, so as to determine the plurality of near-synonyms corresponding to each word.
In an embodiment of the present disclosure, the domain corpus generating device further includes: a training module, configured to crawl encyclopedia corpus with a web crawler as training samples, so as to train the word vector model on the training samples.
In an embodiment of the present disclosure, the generating module 25 is specifically configured to: use the words in the target domain vocabulary set as encyclopedia entries to extract corresponding first encyclopedia texts; process the first encyclopedia texts based on a new word discovery algorithm to obtain at least one candidate word in the first encyclopedia texts; and use the at least one candidate word as encyclopedia entries to extract corresponding second encyclopedia texts, where the first encyclopedia texts and the second encyclopedia texts serve as the domain corpus of the target domain.
In an embodiment of the present disclosure, the meeting of the preset stop condition includes: comparing the target near-synonyms with the words in the target domain vocabulary set, and if the target near-synonyms are already present in the target domain vocabulary set, determining that the stop condition is met.
In an embodiment of the present disclosure, there are a plurality of target domains, and the domain corpus generating device further includes: a dividing module, configured to perform the above steps S1 to S5 for each target domain to obtain the domain corpus of that target domain, determine, among all encyclopedia texts, the target encyclopedia texts that do not belong to the domain corpus of any target domain, and use the target encyclopedia texts as a general corpus.
The domain corpus generating device provided by the embodiments of the present disclosure can perform any domain corpus generating method provided by the embodiments of the present disclosure, and has the corresponding functional modules and beneficial effects of the method it performs. For details not described in the device embodiments of the present disclosure, reference may be made to the description of any method embodiment of the present disclosure.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 3, the electronic device 600 includes one or more processors 601 and memory 602.
The processor 601 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 600 to perform desired functions.
The memory 602 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, Random Access Memory (RAM), cache memory, and the like. Non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 601 to implement the methods of the embodiments of the present disclosure described above and/or other desired functionality. Various contents such as an input signal, a signal component, a noise component, and the like may also be stored in the computer-readable storage medium.
In one example, the electronic device 600 may further include an input device 603 and an output device 604, which are interconnected by a bus system and/or another form of connection mechanism (not shown). The input device 603 may include, for example, a keyboard, a mouse, and the like. The output device 604 may output various information, including the determined distance information, direction information, and the like, to the outside. The output device 604 may include, for example, a display, speakers, a printer, and a communication network and the remote output devices connected to it, among others.
Of course, for simplicity, only some of the components of the electronic device 600 relevant to the present disclosure are shown in fig. 3, omitting components such as buses, input/output interfaces, and the like. In addition, electronic device 600 may include any other suitable components depending on the particular application.
In addition to the methods and apparatus described above, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform any of the methods provided by embodiments of the present disclosure.
The computer program product may write program code for performing the operations of the embodiments of the present disclosure in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform any of the methods provided by the embodiments of the present disclosure.
A computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A domain corpus generating method, characterized by comprising the following steps:
step S1, obtaining at least one initial keyword of a target domain, and adding the at least one initial keyword to a target domain vocabulary set;
step S2, determining a plurality of corresponding near-synonyms for each word in the target domain vocabulary set;
step S3, matching the plurality of corresponding near-synonyms against encyclopedia entries, so as to determine, from the plurality of near-synonyms, the target near-synonyms that match the encyclopedia entries, and adding the target near-synonyms to the target domain vocabulary set;
step S4, repeatedly performing step S2 and step S3 based on the words in the target domain vocabulary set until a preset stop condition is met;
and step S5, using the words in the target domain vocabulary set as encyclopedia entries to extract the corresponding encyclopedia texts, wherein the corresponding encyclopedia texts are the domain corpus of the target domain.
2. The method of claim 1, wherein the determining a plurality of corresponding near-synonyms for each word in the target domain vocabulary set comprises:
performing near-synonym recall on each word in the target domain vocabulary set based on a pre-trained word vector model, so as to determine the plurality of near-synonyms corresponding to each word.
3. The method of claim 2, further comprising:
crawling encyclopedia corpus with a web crawler as training samples, so as to train the word vector model on the training samples.
4. The method of claim 1, wherein the using the words in the target domain vocabulary set as encyclopedia entries to extract the corresponding encyclopedia texts comprises:
using the words in the target domain vocabulary set as encyclopedia entries to extract corresponding first encyclopedia texts;
processing the first encyclopedia texts based on a new word discovery algorithm to obtain at least one candidate word in the first encyclopedia texts;
and using the at least one candidate word as encyclopedia entries to extract corresponding second encyclopedia texts, wherein the first encyclopedia texts and the second encyclopedia texts serve as the domain corpus of the target domain.
5. The method of claim 1, wherein the meeting of the preset stop condition comprises:
comparing the target near-synonyms with the words in the target domain vocabulary set, and if the target near-synonyms are already present in the target domain vocabulary set, determining that the stop condition is met.
6. The method of claim 1, wherein there are a plurality of target domains, the method further comprising:
performing the above steps S1 to S5 for each target domain to obtain the domain corpus of that target domain;
and determining, among all encyclopedia texts, the target encyclopedia texts that do not belong to the domain corpus of any target domain, and using the target encyclopedia texts as a general corpus.
7. A domain corpus generating device, comprising:
an obtaining module, configured to perform step S1 of obtaining at least one initial keyword of a target domain and adding the at least one initial keyword to a target domain vocabulary set;
a determining module, configured to perform step S2 of determining a plurality of corresponding near-synonyms for each word in the target domain vocabulary set;
a matching module, configured to perform step S3 of matching the plurality of corresponding near-synonyms against encyclopedia entries, determining, from the plurality of near-synonyms, the target near-synonyms that match the encyclopedia entries, and adding the target near-synonyms to the target domain vocabulary set;
an executing module, configured to perform step S4 of repeatedly performing step S2 and step S3 based on the words in the target domain vocabulary set until a preset stop condition is met;
and a generating module, configured to perform step S5 of using the words in the target domain vocabulary set as encyclopedia entries to extract the corresponding encyclopedia texts, wherein the corresponding encyclopedia texts are the domain corpus of the target domain.
8. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the domain corpus generating method according to any one of claims 1 to 6.
9. A computer-readable storage medium, wherein the storage medium stores a computer program which, when executed by a processor, implements the domain corpus generating method according to any one of claims 1 to 6.
CN202111302097.8A 2021-11-04 2021-11-04 Domain corpus generation method, device and equipment Pending CN114020868A (en)

Priority Applications (1)

Application Number: CN202111302097.8A; Priority Date: 2021-11-04; Filing Date: 2021-11-04; Title: Domain corpus generation method, device and equipment (published as CN114020868A (en))

Applications Claiming Priority (1)

Application Number: CN202111302097.8A; Priority Date: 2021-11-04; Filing Date: 2021-11-04; Title: Domain corpus generation method, device and equipment (published as CN114020868A (en))

Publications (1)

Publication Number: CN114020868A (en); Publication Date: 2022-02-08

Family

ID=80061144

Family Applications (1)

Application Number: CN202111302097.8A; Status: Pending; Priority Date: 2021-11-04; Filing Date: 2021-11-04; Title: Domain corpus generation method, device and equipment (CN114020868A (en))

Country Status (1)

Country Link
CN (1) CN114020868A (en)

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination