CN115859948A - Method, device and storage medium for mining domain vocabulary based on correlation analysis algorithm

Info

Publication number: CN115859948A
Application number: CN202210673281.1A
Authority: CN
Inventors: 王军华; 蒋宁; 李宽
Original assignee: Beijing Zhongguancun Kejin Technology Co Ltd
Current assignee: Beijing Zhongguancun Kejin Technology Co Ltd
Priority date: 2022-06-14
Filing date: 2022-06-14
Publication date: 2023-03-28

Abstract

The application discloses a method, a device and a storage medium for mining domain vocabulary based on an association analysis algorithm, wherein the method comprises the following steps: acquiring a domain text corpus from which domain vocabulary is to be extracted; performing word segmentation on the domain text corpus, calculating the degree of aggregation and the degree of freedom of each word obtained, and determining a candidate word set according to the degree of aggregation and the degree of freedom; performing word segmentation on the domain text corpus with an open-source word segmentation tool, calculating the TF-IDF weight value of each word obtained, and determining a seed vocabulary set according to the TF-IDF weight values; and calculating the association value between each candidate word in the candidate word set and each seed word in the seed vocabulary set, and screening out the candidate words whose association value is greater than a preset threshold as domain vocabulary.

Description

Method, device and storage medium for mining domain vocabulary based on correlation analysis algorithm

Technical Field

The present application relates to the field of information mining technologies, and in particular, to a method, an apparatus, and a storage medium for mining a domain vocabulary based on an association analysis algorithm.

Background

Domain vocabulary consists of words that appear in a corpus but are not included in a general dictionary and that are strongly distinctive of a domain; its emergence is often tied to a particular time, place or region. With the development of the internet and of scientific research, network vocabulary and the professional vocabulary of specific fields appear far faster than scholars in those fields can absorb them. Professional vocabulary generally appears within its own field and is specialized in nature; at the same time, domain vocabulary tends to emerge quickly and die out quickly as time and place change.

Currently, domain vocabulary mining mostly relies on either a statistics-based mining algorithm or a mining algorithm based on the open-source tool SmoothNLP. However, the precision and recall of statistics-based algorithms are not high, and a large number of manual filtering rules must additionally be formulated, which is time-consuming and labor-intensive and still cannot guarantee the quality of the resulting domain vocabulary. The algorithm based on SmoothNLP likewise has low precision and recall, and when manual rules must be added for a specific deployment scenario to improve its results, it is inconvenient and inflexible to use.

No effective solution has yet been proposed for the technical problems of low precision, low recall and inflexible use of domain vocabulary mining in the prior art.

Disclosure of Invention

The embodiment of the invention provides a method, a device and a storage medium for mining field vocabularies based on an association analysis algorithm, and at least solves the technical problems that the field vocabularies are low in mining accuracy rate, low in recall rate and incapable of being used flexibly in the prior art.

According to an aspect of the embodiments of the present invention, there is provided a method for mining a domain vocabulary based on an association analysis algorithm, including: acquiring a field text corpus of a field vocabulary to be extracted; performing word segmentation on the field text corpus, calculating the degree of aggregation and the degree of freedom of each word obtained through the word segmentation, and determining a candidate word set from each word obtained through the word segmentation according to the degree of aggregation and the degree of freedom; utilizing an open source word segmentation tool to perform word segmentation on the field text corpus, calculating TF-IDF weight values of all words obtained through word segmentation, and determining a seed vocabulary set from all words obtained through word segmentation according to the TF-IDF weight values; and respectively calculating the association values between each candidate word in the candidate word set and each seed word in the seed word set, and screening out the candidate words with the association values larger than a preset threshold value from the candidate word set as the field words.

Optionally, obtaining a domain text corpus of a domain vocabulary to be extracted includes: acquiring text corpora in the field from a database; and preprocessing the text corpus in the field to obtain the field text corpus of the field vocabulary to be extracted.

Optionally, preprocessing the text corpus in the field includes: performing sentence segmentation on the text corpus in the field, and recording the document name of each sentence.

Optionally, performing word segmentation processing on the domain text corpus, including: determining a word cutting length interval; and performing word segmentation processing on the left sentence and the right sentence in the field text corpus by using a word segmentation algorithm with the word segmentation length interval.

Optionally, calculating the degree of aggregation and the degree of freedom of each word obtained through the word segmentation processing, and determining a candidate word set from each word obtained through the word segmentation processing according to the degree of aggregation and the degree of freedom, including: calculating the degree of aggregation of each word obtained by word segmentation according to a preset aggregation calculation formula; calculating the degree of freedom of each word obtained by word segmentation according to a preset degree of freedom calculation formula; determining first candidate words with the degree of aggregation larger than a preset degree of aggregation threshold value from all words obtained through word segmentation processing, and determining second candidate words with the degree of freedom larger than a preset degree of freedom threshold value from all words obtained through word segmentation processing; and taking the intersection of the first candidate word and the second candidate word to obtain the candidate word set.

Optionally, determining a seed vocabulary set from each word obtained through the word segmentation processing according to the TF-IDF weight value, including: according to the TF-IDF weighted value, performing descending processing on each word obtained by word segmentation processing; and extracting a preset number of words from each word after the descending processing as seed words to obtain the seed word set.

Optionally, after the candidate words with the relevance values larger than the preset threshold are screened from the candidate word set as the domain vocabulary, the method further includes: adding the screened field vocabularies to the seed vocabulary set to update the seed vocabulary set; and respectively calculating the association value between each candidate word in the candidate word set and each seed word in the updated seed word set, and screening out the candidate words with the association value larger than a preset threshold value from the candidate word set as the field words.

According to another aspect of the embodiments of the present invention, there is also provided a storage medium including a stored program, wherein the method of any one of the above is performed by a processor when the program is run.

According to another aspect of the embodiments of the present invention, there is also provided an apparatus for mining a domain vocabulary based on a correlation analysis algorithm, including: a corpus acquiring module, used for acquiring a domain text corpus of a domain vocabulary to be extracted; a candidate word set determining module, used for performing word segmentation on the domain text corpus, calculating the degree of aggregation and the degree of freedom of each word obtained through the word segmentation, and determining a candidate word set from each word obtained through the word segmentation according to the degree of aggregation and the degree of freedom; a seed vocabulary set determining module, used for performing word segmentation on the domain text corpus with an open-source word segmentation tool, calculating TF-IDF (Term Frequency-Inverse Document Frequency) weight values of all words obtained through word segmentation, and determining a seed vocabulary set from all words obtained through word segmentation according to the TF-IDF weight values; and a domain vocabulary determining module, used for respectively calculating the association values between each candidate word in the candidate word set and each seed word in the seed vocabulary set, and screening out candidate words with association values larger than a preset threshold value from the candidate word set as domain vocabulary.

According to another aspect of the embodiments of the present invention, there is also provided an apparatus for mining a domain vocabulary based on a correlation analysis algorithm, including: a processor; and a memory coupled to the processor for providing instructions to the processor for processing the following processing steps: acquiring a domain text corpus of a domain vocabulary to be extracted; performing word segmentation on the field text corpus, calculating the degree of aggregation and the degree of freedom of each word obtained through the word segmentation, and determining a candidate word set from each word obtained through the word segmentation according to the degree of aggregation and the degree of freedom; utilizing an open source word segmentation tool to perform word segmentation on the field text corpus, calculating TF-IDF weight values of all words obtained through word segmentation, and determining a seed vocabulary set from all words obtained through word segmentation according to the TF-IDF weight values; and respectively calculating the association values between each candidate word in the candidate word set and each seed word in the seed word set, and screening out the candidate words with the association values larger than a preset threshold value from the candidate word set as the field words.

In the embodiment of the invention, a domain text corpus from which domain vocabulary is to be extracted is first obtained. Word segmentation is then performed on the domain text corpus, the degree of aggregation and the degree of freedom of each resulting word are calculated, and a candidate word set is determined from these words according to the degree of aggregation and the degree of freedom. Next, the domain text corpus is segmented with an open-source word segmentation tool, the TF-IDF weight value of each resulting word is calculated, and a seed vocabulary set is determined from these words according to the TF-IDF weight values. Finally, the association value between each candidate word in the candidate word set and each seed word in the seed vocabulary set is calculated, and the candidate words whose association value is greater than a preset threshold are screened out of the candidate word set as domain vocabulary. In this way, a small seed vocabulary is compiled with the help of an open-source word segmentation tool, association values between the candidate words obtained through word segmentation and each seed word are calculated, and a certain amount of domain vocabulary can be mined accurately and efficiently by filtering on the association values. Supervised-learning mining algorithms require a large amount of manual labeling, are inefficient and do not transfer across domains; by comparison, this method adapts quickly to different domains, mines domain vocabulary simply and efficiently with only a small seed vocabulary, and has practical value for industrial deployment. The technical problems of low precision, low recall and inflexible use of domain vocabulary mining in the prior art are thereby solved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

Fig. 1 is a hardware configuration block diagram of a computing apparatus for implementing the method according to embodiment 1 of the present invention;

Fig. 2 is a flowchart illustrating a method for mining a domain vocabulary based on a correlation analysis algorithm according to a first aspect of embodiment 1 of the present invention;

Fig. 3 is a schematic overall flowchart of a method for mining a domain vocabulary based on a correlation analysis algorithm according to embodiment 1 of the present invention;

Fig. 4 is a schematic diagram of an apparatus for mining a domain vocabulary based on a correlation analysis algorithm according to embodiment 2 of the present invention; and

Fig. 5 is a schematic diagram of an apparatus for mining a domain vocabulary based on a correlation analysis algorithm according to embodiment 3 of the present invention.

Detailed Description

In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not necessarily all exemplary embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

First, some of the terms appearing in the description of the embodiments of the present invention are explained as follows:

Information entropy: a measure of the amount of information needed to remove uncertainty, i.e., the amount of information an unknown event may contain.

PMI (pointwise mutual information, degree of aggregation): a measure of how tightly the parts within a word cohere.

Degree of freedom (left-right adjacent entropy): a measure of the richness of the words appearing to the left and right of a word. In domain vocabulary discovery, the richer these neighbors are, the more likely the word is a domain word.

TF-IDF (Term Frequency-Inverse Document Frequency): a statistical method for evaluating the importance of a word to one document in a document set or corpus.

Association rule algorithm: a family of algorithms that includes the FP-growth algorithm, a method for efficiently finding frequent item sets.

Example 1

In accordance with the present embodiments, there is provided an embodiment of a method for mining domain vocabulary based on associative analysis algorithms, it being noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than presented herein.

The method embodiments provided by the present embodiment may be executed in a server or similar computing device. Fig. 1 illustrates a block diagram of a hardware architecture of a computing device for implementing a method for mining domain vocabulary based on an association analysis algorithm. As shown in fig. 1, the computing device may include one or more processors (which may include, but are not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory for storing data, and a transmission device for communication functions. In addition, the computing device may also include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computing device may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.

It should be noted that the one or more processors and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuitry may be a single, stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computing device. As referred to in the embodiments of the invention, the data processing circuit acts as a processor control (e.g. selection of a variable resistance termination path connected to the interface).

The memory may be configured to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the method for mining a domain vocabulary based on an associative analysis algorithm in an embodiment of the present invention, and the processor executes various functional applications and data processing by operating the software programs and modules stored in the memory, so as to implement the method for mining a domain vocabulary based on an associative analysis algorithm of an application program. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory located remotely from the processor, which may be connected to the computing device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device is used for receiving or transmitting data via a network. Specific examples of such networks may include wireless networks provided by communication providers of the computing devices. In one example, the transmission device includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.

The display may be, for example, a touch screen-type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computing device.

It should be noted here that in some alternative embodiments, the computing device shown in fig. 1 described above may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be noted that FIG. 1 is only one example of a particular specific example and is intended to illustrate the types of components that may be present in a computing device as described above.

In the operating environment, according to the first aspect of the embodiment, a method for mining a domain vocabulary based on a correlation analysis algorithm is provided. Fig. 2 shows a flow diagram of the method, which, with reference to fig. 2, comprises:

S202: acquiring a domain text corpus from which domain vocabulary is to be extracted.

In the embodiment of the present invention, as shown in fig. 3, the domain text corpus is first prepared: text corpora in the domain are obtained from a database (for example, but not limited to, a business knowledge base). The acquired text corpus is then preprocessed.

In the embodiment of the invention, the text corpus in the domain is split into sentences, that is, the text is cut at 6 separator punctuation marks (among them ; and ,), and the document name of each sentence is recorded.
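The preprocessing step can be sketched as follows. This is only an illustration: the separator set `SEPARATORS`, the function name `split_sentences` and the document name `faq_001.txt` are assumptions made for the example, since the full list of six punctuation marks is not legible in this text.

```python
import re

# Hypothetical separator set: the text names six marks without listing them
# legibly, so these six are an assumption for illustration only.
SEPARATORS = ".!?;,:"

def split_sentences(doc_name, text, separators=SEPARATORS):
    """Split text at separator punctuation and tag each piece with its document name."""
    pattern = "[" + re.escape(separators) + "]"
    pieces = re.split(pattern, text)
    # Drop empty fragments and keep (document name, sentence) pairs.
    return [(doc_name, piece.strip()) for piece in pieces if piece.strip()]

records = split_sentences(
    "faq_001.txt",
    "hyaluronic acid is used for filling; the effect is natural, recovery is quick.",
)
for record in records:
    print(record)
```

Keeping the document name next to each sentence makes it possible to compute document-level statistics such as TF-IDF later without re-reading the corpus.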

S204: performing word segmentation on the domain text corpus, calculating the degree of aggregation and the degree of freedom of each word obtained through the word segmentation, and determining a candidate word set from each word obtained through the word segmentation according to the degree of aggregation and the degree of freedom.

Optionally, performing word segmentation processing on the domain text corpus, including: determining a word cutting length interval; and performing word segmentation processing on left and right sentences in the field text corpus by using a word segmentation algorithm with the word segmentation length interval.

In the embodiment of the present invention, the word segmentation length interval of the word segmentation algorithm (for example, but not limited to, an Ngram algorithm), that is, the value range of N, may be determined based on knowledge of the words in the domain; it is generally a length of 2 to 6 words. The left and right sentences in the domain text corpus are then segmented by the Ngram algorithm.
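As a minimal sketch of this segmentation step, the function below enumerates every contiguous span of a sentence whose length falls in the chosen interval. The function name `ngrams` and the choice to treat each character as one unit (natural for Chinese text) are assumptions for illustration.

```python
def ngrams(sentence, min_n=2, max_n=6):
    """Return every contiguous character span whose length lies in [min_n, max_n]."""
    spans = []
    for n in range(min_n, max_n + 1):
        for start in range(len(sentence) - n + 1):
            spans.append(sentence[start:start + n])
    return spans

# "玻尿酸填充" (hyaluronic acid filling) has 5 characters.
candidates = ngrams("玻尿酸填充", min_n=2, max_n=3)
print(candidates)
```

Every span is only a raw candidate at this point; the aggregation and degree-of-freedom filters decide which spans are kept.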

Optionally, calculating the degree of aggregation and the degree of freedom of each word obtained through the word segmentation processing, and determining a candidate word set from each word obtained through the word segmentation processing according to the degree of aggregation and the degree of freedom, including: calculating the degree of aggregation of each word obtained by word segmentation according to a preset degree-of-aggregation calculation formula; calculating the degree of freedom of each word obtained by word segmentation according to a preset degree-of-freedom calculation formula; determining first candidate words with the degree of aggregation larger than a preset degree of aggregation threshold value from all words obtained through word segmentation processing, and determining second candidate words with the degree of freedom larger than a preset degree of freedom threshold value from all words obtained through word segmentation processing; and taking the intersection of the first candidate word and the second candidate word to obtain the candidate word set.

In the embodiment of the present invention, the degree of aggregation (PMI) represents the probability that the sub-words of a word occur together. For example, the word for hyaluronic acid is composed of 3 sub-words; in this example, the product of the occurrence counts of the 3 sub-words is taken as C1, the occurrence count of the whole word is taken as C2, and the ratio of C1 to C2 is used as the degree of aggregation (PMI) of the word.

The preset degree-of-aggregation calculation formula is: PMI = p(x, y) / (p(x) × p(y)), where p(x, y) represents the probability of the word x and the word y occurring in the same sentence, p(x) represents the probability of the word x occurring, and p(y) represents the probability of the word y occurring.
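The formula can be illustrated with a small sketch that estimates each probability as the fraction of sentences containing the word. The function name `aggregation`, the sentence-level estimate of p(x) and p(y), and the two-part split of the example word are assumptions for illustration; note also that many PMI implementations take a logarithm of this ratio, which the formula as given here does not.

```python
def aggregation(x, y, sentences):
    """PMI-style ratio p(x, y) / (p(x) * p(y)) using sentence-level probabilities."""
    total = len(sentences)
    count_x = sum(1 for s in sentences if x in s)
    count_y = sum(1 for s in sentences if y in s)
    count_xy = sum(1 for s in sentences if x in s and y in s)
    if count_x == 0 or count_y == 0:
        return 0.0  # the ratio is undefined when a part never occurs
    p_x, p_y, p_xy = count_x / total, count_y / total, count_xy / total
    return p_xy / (p_x * p_y)

sentences = ["玻尿酸填充", "玻尿酸注射", "填充效果", "注射效果"]
# Hypothetical split of "玻尿酸" into the two parts "玻尿" and "酸".
score = aggregation("玻尿", "酸", sentences)
print(score)
```

A ratio above 1 means the two parts occur together more often than independence would predict, which is the signal the aggregation threshold is meant to capture.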

In the embodiment of the invention, the degree of freedom represents the richness of the words adjacent to a word on its left and right in the original sentences, that is, how many different kinds of adjacent words it has. For example, in the sentence "in clinical medical cases, hyaluronic acid is used for filling fat", the left adjacent word of "hyaluronic acid" is "in" (中) and the right adjacent word is the passive marker (被). All the left and right adjacent words of "hyaluronic acid" across the sentences of the whole corpus are collected, and the occurrence frequency of each is calculated. The degree-of-freedom value of each word cut out by the Ngram algorithm is then calculated according to a preset degree-of-freedom calculation formula.

The preset degree-of-freedom calculation formula is as follows:

In the formula, p(x) represents the probability of occurrence of the word x.
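Because the degree of freedom is defined above as left-right adjacent entropy, one plausible reading is the Shannon entropy of the neighbor distribution, H = -Σ p(x) log p(x), computed separately for left and right neighbors. The sketch below follows that reading; the entropy form, the use of the minimum of the two sides, and the function names are all assumptions rather than the formula of this application.

```python
import math
from collections import Counter

def neighbor_entropy(neighbors):
    """Shannon entropy -sum(p * log(p)) of a list of neighboring characters."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def degree_of_freedom(word, sentences):
    """Assumed combination: the smaller of the left and right neighbor entropies."""
    left, right = [], []
    for s in sentences:
        start = s.find(word)
        while start != -1:
            if start > 0:
                left.append(s[start - 1])
            end = start + len(word)
            if end < len(s):
                right.append(s[end])
            start = s.find(word, start + 1)
    if not left or not right:
        return 0.0  # no neighbors observed on one side
    return min(neighbor_entropy(left), neighbor_entropy(right))

corpus = ["中玻尿酸被", "用玻尿酸被", "中玻尿酸可"]
print(degree_of_freedom("玻尿酸", corpus))
```

Taking the minimum penalizes spans that vary freely on one side only, which is typical of fragments of longer words; a sum or average would be an equally valid design choice under the same assumption.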

Further, as shown in fig. 3, the segmentation results of the Ngram algorithm are filtered by thresholds. Specifically, the words are filtered with threshold 1 and threshold 2 respectively (the two thresholds take different values), yielding first candidate words w1 that pass the aggregation filter and second candidate words w2 that pass the degree-of-freedom filter; finally, the intersection of w1 and w2 is taken as the candidate word set.
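The two-threshold filter and the intersection can be sketched as below. The function name `select_candidates`, the score dictionaries and the threshold values are hypothetical and chosen only to make the example runnable.

```python
def select_candidates(aggregation_scores, freedom_scores, threshold_1, threshold_2):
    """Keep words that pass both the aggregation filter and the freedom filter."""
    w1 = {w for w, score in aggregation_scores.items() if score > threshold_1}
    w2 = {w for w, score in freedom_scores.items() if score > threshold_2}
    return w1 & w2  # intersection of the two filtered sets

# Hypothetical scores for three Ngram spans.
aggregation_scores = {"玻尿酸": 2.0, "尿酸填": 0.4, "填充": 1.6}
freedom_scores = {"玻尿酸": 0.9, "尿酸填": 0.8, "填充": 0.1}
candidate_set = select_candidates(aggregation_scores, freedom_scores, 1.0, 0.5)
print(candidate_set)
```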

S206: performing word segmentation on the domain text corpus by using an open-source word segmentation tool, calculating TF-IDF (Term Frequency-Inverse Document Frequency) weight values of all words obtained through word segmentation, and determining a seed vocabulary set from all words obtained through word segmentation according to the TF-IDF weight values.

In the embodiment of the invention, the open-source word segmentation tool is used to perform word segmentation on the domain text corpus. It should be noted that the word segmentation using the open-source tool can be performed in parallel with the Ngram word segmentation. As shown in fig. 3, all sentences after corpus preprocessing are segmented by the open-source word segmentation tool jieba.

Optionally, determining a seed vocabulary set from each word obtained through the word segmentation processing according to the TF-IDF weight value, including: according to the TF-IDF weight value, performing descending processing on each word obtained by word segmentation processing; and extracting a preset number of words from each word after the descending processing as seed words to obtain the seed word set.

In the embodiment of the invention, TF-IDF weight values are calculated for the words segmented by the jieba tool; the words are sorted in descending order of TF-IDF value, and a subset of them (for example, but not limited to, about 500 words) is obtained by manual review and screening to serve as the seed vocabulary.
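A minimal sketch of the ranking step is shown below. It works on documents that are already tokenized (in practice the tokens might come from jieba, for example via `jieba.lcut`), and it assumes one particular scoring convention: relative term frequency times log(N / document frequency), summed over documents. The function names and the sample documents are hypothetical, and the manual screening described above is not modeled.

```python
import math
from collections import Counter

def tfidf_scores(tokenized_docs):
    """Sum of tf * idf over all documents for each word (assumed convention)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    scores = Counter()
    for tokens in tokenized_docs:
        if not tokens:
            continue
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            scores[word] += tf * idf
    return scores

def top_seed_words(tokenized_docs, k):
    """Sort words by descending score (ties alphabetically) and keep the first k."""
    scores = tfidf_scores(tokenized_docs)
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]

docs = [
    ["hyaluronic", "acid", "filling", "the"],
    ["laser", "the", "skin"],
    ["hyaluronic", "acid", "the", "the"],
]
print(top_seed_words(docs, 3))
```

Words that occur in every document, such as "the" here, receive a score of zero under this convention and fall to the bottom of the ranking.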

S208: respectively calculating the association values between each candidate word in the candidate word set and each seed word in the seed vocabulary set, and screening out the candidate words with association values larger than a preset threshold value from the candidate word set as the domain vocabulary.

In the embodiment of the present invention, based on an association rule algorithm (for example, but not limited to, the FP-Growth algorithm, which is one of the association rule algorithms), the association value between each candidate word in the candidate word set and each seed word in the seed vocabulary set is calculated, where the association value is the number of times the two words appear together; each candidate word in the candidate word set is thereby obtained together with its corresponding association value.

Further, as shown in fig. 3, the words with association values are filtered by threshold 3, and the filtering result is used as the domain vocabulary, i.e. the candidate words whose association value is greater than threshold 3 are screened from the candidate word set as the domain vocabulary.
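The association value described above can be sketched with a direct co-occurrence count. This is not an FP-Growth implementation; it is a plain pairwise count that yields the same kind of number (how often two words appear in the same sentence). The function names are hypothetical, and keeping a candidate when its association with any single seed word exceeds threshold 3 is an assumption about how the per-seed values are combined.

```python
def association_values(candidates, seeds, sentences):
    """Count, for each (candidate, seed) pair, the sentences containing both words."""
    values = {}
    for cand in candidates:
        for seed in seeds:
            if cand == seed:
                continue
            values[(cand, seed)] = sum(
                1 for s in sentences if cand in s and seed in s
            )
    return values

def filter_domain_words(candidates, seeds, sentences, threshold_3):
    """Keep candidates whose association with at least one seed exceeds threshold 3."""
    values = association_values(candidates, seeds, sentences)
    return {cand for (cand, _), value in values.items() if value > threshold_3}

sentences = [
    "hyaluronic acid filler injection",
    "filler injection clinic",
    "weather report today",
]
domain_words = filter_domain_words({"filler", "weather"}, {"injection"}, sentences, 1)
print(domain_words)
```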

Optionally, after the candidate words with the association value greater than the preset threshold are screened from the candidate word set as the domain vocabulary, the method further includes: adding the screened field vocabularies to the seed vocabulary set to update the seed vocabulary set; and respectively calculating the association value between each candidate word in the candidate word set and each seed word in the updated seed word set, and screening out the candidate words with the association value larger than a preset threshold value from the candidate word set as the field words.

In the embodiment of the present invention, as shown in fig. 3, the domain vocabulary obtained by filtering with threshold 3 is added to the manually curated seed vocabulary set; the association values between each candidate word in the candidate word set and each seed word in the updated seed vocabulary set are then calculated again and filtered with threshold 3. After multiple rounds of this cyclic calculation, a relatively complete domain vocabulary is finally obtained.
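The iterative expansion can be sketched as a loop that feeds each round's results back into the seed set. The stopping rule (stop when a round finds nothing new, or after `max_rounds` rounds) and the function names are assumptions, since the text only states that multiple rounds are performed.

```python
def cooccurrence(a, b, sentences):
    """Number of sentences that contain both words."""
    return sum(1 for s in sentences if a in s and b in s)

def expand_domain_words(candidates, seeds, sentences, threshold_3, max_rounds=10):
    """Repeatedly add candidates associated with the current seeds to the seed set."""
    seeds = set(seeds)
    domain_words = set()
    for _ in range(max_rounds):
        newly_found = {
            cand for cand in candidates
            if cand not in seeds
            and any(cooccurrence(cand, seed, sentences) > threshold_3 for seed in seeds)
        }
        if not newly_found:
            break  # assumed stopping rule: nothing new was discovered
        domain_words |= newly_found
        seeds |= newly_found
    return domain_words

sentences = [
    "injection filler", "injection filler",
    "filler needle", "filler needle",
    "weather today",
]
found = expand_domain_words({"filler", "needle", "weather"}, {"injection"}, sentences, 1)
print(sorted(found))
```

In this example "needle" never co-occurs with the original seed "injection"; it is found only in the second round, after "filler" has joined the seed set, which is the effect the cyclic calculation is meant to achieve.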

Therefore, in the method for mining domain vocabulary based on the association analysis algorithm provided by the invention, a domain text corpus from which domain vocabulary is to be extracted is first obtained. Word segmentation is then performed on the domain text corpus, the degree of aggregation and the degree of freedom of each resulting word are calculated, and a candidate word set is determined from these words according to the degree of aggregation and the degree of freedom. Next, the domain text corpus is segmented with an open-source word segmentation tool, the TF-IDF weight value of each resulting word is calculated, and a seed vocabulary set is determined from these words according to the TF-IDF weight values. Finally, the association value between each candidate word in the candidate word set and each seed word in the seed vocabulary set is calculated, and the candidate words whose association value is greater than a preset threshold are screened out of the candidate word set as domain vocabulary. In this way, a small seed vocabulary is compiled with the help of an open-source word segmentation tool, association values between the candidate words obtained through word segmentation and each seed word are calculated, and a certain amount of domain vocabulary can be mined accurately and efficiently by filtering on the association values. Supervised-learning mining algorithms require a large amount of manual labeling, are inefficient and do not transfer across domains; by comparison, this method adapts quickly to different domains, mines domain vocabulary simply and efficiently with only a small seed vocabulary, and has practical value for industrial deployment. The technical problems of low precision, low recall and inflexible use of domain vocabulary mining in the prior art are thereby solved.

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

Example 2

Fig. 4 shows an apparatus 400 for mining a domain vocabulary based on a correlation analysis algorithm according to the present embodiment, wherein the apparatus 400 corresponds to the method according to the first aspect of Embodiment 1. Referring to Fig. 4, the apparatus 400 includes: a corpus obtaining module 410, configured to obtain a domain text corpus of a domain vocabulary to be extracted; a candidate word set determining module 420, configured to perform word segmentation on the domain text corpus, calculate a degree of aggregation and a degree of freedom of each word obtained through the word segmentation, and determine a candidate word set from the words obtained through the word segmentation according to the degree of aggregation and the degree of freedom; a seed vocabulary set determining module 430, configured to perform word segmentation on the domain text corpus by using an open source word segmentation tool, calculate a TF-IDF weight value of each word obtained through the word segmentation, and determine a seed vocabulary set from the words obtained through the word segmentation according to the TF-IDF weight values; and a domain vocabulary determining module 440, configured to calculate association values between each candidate word in the candidate word set and each seed vocabulary in the seed vocabulary set, and screen out, from the candidate word set, the candidate words whose association values are greater than a preset threshold as the domain vocabulary.

Optionally, the corpus obtaining module 410 is specifically configured to: acquire text corpora in the domain from a database; and preprocess the text corpora in the domain to obtain the domain text corpus of the domain vocabulary to be extracted.

Optionally, the corpus obtaining module 410 is further specifically configured to: perform sentence-breaking processing on the text corpora in the domain, and record the document name of each sentence.
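A minimal sketch of this preprocessing step is given below. The delimiter set in `_SENTENCE_END`, the function name `split_sentences`, and the input shape (a mapping from document name to raw text) are assumptions made for illustration; this passage does not specify them.

```python
import re
from typing import Dict, List, Tuple

# Assumed sentence delimiters: common Chinese and Western end marks plus newlines.
_SENTENCE_END = re.compile(r"[。！？!?；;\n]+")


def split_sentences(documents: Dict[str, str]) -> List[Tuple[str, str]]:
    """Break each document into sentences and keep the source document name."""
    records: List[Tuple[str, str]] = []
    for doc_name, text in documents.items():
        for piece in _SENTENCE_END.split(text):
            sentence = piece.strip()
            if sentence:  # skip empty fragments left by trailing delimiters
                records.append((doc_name, sentence))
    return records
```

Keeping the document name next to each sentence allows later steps, such as document-frequency statistics, to know which document a sentence came from.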

Optionally, the corpus obtaining module 410 is further specifically configured to: determine a word segmentation length interval; and perform word segmentation processing on the left and right sentences in the domain text corpus by using a word segmentation algorithm with the word segmentation length interval.
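One plausible reading of segmentation with a length interval is a character-level sliding window that emits every substring whose length falls inside the interval. The sketch below assumes that reading; the function name `ngram_counts` and the default interval of 2 to 4 characters are hypothetical and are not taken from this document.

```python
from collections import Counter
from typing import Iterable


def ngram_counts(
    sentences: Iterable[str],
    min_len: int = 2,  # assumed lower bound of the length interval
    max_len: int = 4,  # assumed upper bound of the length interval
) -> Counter:
    """Count every substring whose length lies in [min_len, max_len]."""
    counts: Counter = Counter()
    for sentence in sentences:
        for n in range(min_len, max_len + 1):
            # Slide a window of width n across the sentence.
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    return counts
```

The resulting counts are the raw material from which statistics such as the degree of aggregation and the degree of freedom can be computed.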

Optionally, the candidate word set determining module 420 is specifically configured to: calculate the degree of aggregation of each word obtained through the word segmentation according to a preset degree-of-aggregation calculation formula; calculate the degree of freedom of each word obtained through the word segmentation according to a preset degree-of-freedom calculation formula; determine, from the words obtained through the word segmentation, first candidate words whose degree of aggregation is greater than a preset degree-of-aggregation threshold, and second candidate words whose degree of freedom is greater than a preset degree-of-freedom threshold; and take the intersection of the first candidate words and the second candidate words to obtain the candidate word set.
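The two-threshold intersection can be illustrated with the sketch below. The preset formulas are not reproduced in this passage, so the code substitutes definitions that are common in new-word discovery and should be read as assumptions: the degree of aggregation is taken as the minimum pointwise mutual information over all binary splits of a word, and the degree of freedom as the smaller of the left and right neighbor entropies. The function names and threshold values are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Iterable, Set


def _entropy(neighbors: Counter) -> float:
    """Shannon entropy (natural log) of a neighbor-character distribution."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in neighbors.values())


def candidate_word_set(
    sentences: Iterable[str],
    max_len: int = 3,
    aggregation_threshold: float = 1.0,  # hypothetical value
    freedom_threshold: float = 0.5,      # hypothetical value
) -> Set[str]:
    counts: Counter = Counter()
    left = defaultdict(Counter)   # characters seen immediately before a word
    right = defaultdict(Counter)  # characters seen immediately after a word
    for s in sentences:
        for n in range(1, max_len + 1):
            for i in range(len(s) - n + 1):
                w = s[i:i + n]
                counts[w] += 1
                if i > 0:
                    left[w][s[i - 1]] += 1
                if i + n < len(s):
                    right[w][s[i + n]] += 1

    # Simplification: every probability is normalized by the character count.
    total_chars = sum(c for w, c in counts.items() if len(w) == 1)

    def prob(w: str) -> float:
        return counts[w] / total_chars

    def aggregation(w: str) -> float:
        # Assumed formula: minimum PMI over all ways to split w in two.
        return min(
            math.log(prob(w) / (prob(w[:k]) * prob(w[k:])))
            for k in range(1, len(w))
        )

    def freedom(w: str) -> float:
        # Assumed formula: the smaller of left and right neighbor entropy.
        return min(_entropy(left[w]), _entropy(right[w]))

    words = [w for w in counts if len(w) >= 2]
    first = {w for w in words if aggregation(w) > aggregation_threshold}
    second = {w for w in words if freedom(w) > freedom_threshold}
    return first & second  # intersection of the two candidate groups
```

Taking the intersection keeps only strings that are both internally cohesive and used in varied contexts, which filters out fragments that pass just one of the two tests.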

Optionally, the seed vocabulary set determining module 430 is specifically configured to: sort the words obtained through the word segmentation in descending order of TF-IDF weight value; and extract a preset number of words from the sorted words as seed vocabularies to obtain the seed vocabulary set.
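The descending sort and top-N extraction might look like the sketch below. It assumes the documents have already been segmented by some open source tool, uses the plain `tf * log(N / df)` form of TF-IDF, and scores each word by its maximum weight across documents; the TF-IDF variant, the aggregation rule, and the name `select_seed_words` are assumptions, since this passage does not specify them.

```python
import math
from collections import Counter
from typing import Dict, List


def select_seed_words(tokenized_docs: Dict[str, List[str]], top_k: int) -> List[str]:
    """Rank words by TF-IDF weight in descending order and keep the first top_k."""
    n_docs = len(tokenized_docs)
    doc_freq: Counter = Counter()
    for tokens in tokenized_docs.values():
        doc_freq.update(set(tokens))  # count each word once per document

    scores: Dict[str, float] = {}
    for tokens in tokenized_docs.values():
        if not tokens:
            continue
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[word])  # assumed unsmoothed IDF
            # Assumption: a word's overall weight is its best per-document weight.
            scores[word] = max(scores.get(word, 0.0), tf * idf)

    # Descending by weight; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]
```

With this weighting, a word that occurs in every document receives a weight of zero, so very common words sink to the bottom of the ranking.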

Optionally, the apparatus 400 further comprises: a seed vocabulary updating module, configured to add the screened domain vocabulary to the seed vocabulary set so as to update the seed vocabulary set; and a domain vocabulary updating module, configured to calculate the association values between each candidate word in the candidate word set and each seed vocabulary in the updated seed vocabulary set, and screen out, from the candidate word set, the candidate words whose association values are greater than a preset threshold as the domain vocabulary.

Thus, according to this embodiment, a domain text corpus from which domain vocabulary is to be extracted is obtained first. Word segmentation is then performed on the domain text corpus, the degree of aggregation and the degree of freedom of each word obtained through the word segmentation are calculated, and a candidate word set is determined from those words according to the degree of aggregation and the degree of freedom. Next, word segmentation is performed on the domain text corpus by using an open source word segmentation tool, the TF-IDF weight value of each word obtained is calculated, and a seed vocabulary set is determined from those words according to the TF-IDF weight values. Finally, the association values between each candidate word in the candidate word set and each seed vocabulary in the seed vocabulary set are calculated, and the candidate words whose association values are greater than a preset threshold are screened out from the candidate word set as the domain vocabulary. In this way, a small number of seed vocabularies are sorted out with the help of an open source word segmentation tool, the association values between the candidate words obtained through word segmentation and each seed vocabulary are calculated, and a certain amount of domain vocabulary can be mined accurately and efficiently by filtering on the association value. Supervised-learning mining algorithms require a large amount of manual labeling, are inefficient, and do not transfer across domains. By comparison, this approach adapts quickly to different domains, can mine domain vocabulary simply and efficiently with only a small number of seed vocabularies, and has practical value for industrial deployment. This solves the prior-art technical problems of low precision, low recall, and inflexible use in domain vocabulary mining.

Example 3

Fig. 5 shows an apparatus 500 for mining a domain vocabulary based on a correlation analysis algorithm according to the present embodiment, wherein the apparatus 500 corresponds to the method according to the first aspect of Embodiment 1. Referring to Fig. 5, the apparatus 500 includes: a processor 510; and a memory 520 coupled to the processor 510 and configured to provide the processor 510 with instructions for performing the following processing steps: acquiring a domain text corpus of a domain vocabulary to be extracted; performing word segmentation on the domain text corpus, calculating the degree of aggregation and the degree of freedom of each word obtained through the word segmentation, and determining a candidate word set from the words obtained through the word segmentation according to the degree of aggregation and the degree of freedom; performing word segmentation on the domain text corpus by using an open source word segmentation tool, calculating a TF-IDF weight value of each word obtained through the word segmentation, and determining a seed vocabulary set from the words obtained through the word segmentation according to the TF-IDF weight values; and calculating the association values between each candidate word in the candidate word set and each seed vocabulary in the seed vocabulary set, and screening out, from the candidate word set, the candidate words whose association values are greater than a preset threshold as the domain vocabulary.

Optionally, the obtaining of the domain text corpus of the domain vocabulary to be extracted includes: acquiring text corpora in the field from a database; and preprocessing the text corpus in the field to obtain the field text corpus of the field vocabulary to be extracted.

Optionally, performing word segmentation processing on the domain text corpus, including: determining a word segmentation length interval; and performing word segmentation processing on left and right sentences in the field text corpus by using a word segmentation algorithm with the word segmentation length interval.

Optionally, after the candidate words whose association values are greater than the preset threshold are screened out from the candidate word set as the domain vocabulary, the memory 520 is further configured to provide the processor 510 with instructions for performing the following processing steps: adding the screened domain vocabulary to the seed vocabulary set to update the seed vocabulary set; and calculating the association values between each candidate word in the candidate word set and each seed vocabulary in the updated seed vocabulary set, and screening out, from the candidate word set, the candidate words whose association values are greater than a preset threshold as the domain vocabulary.

Thus, according to this embodiment, a domain text corpus from which domain vocabulary is to be extracted is obtained first. Word segmentation is then performed on the domain text corpus, the degree of aggregation and the degree of freedom of each word obtained through the word segmentation are calculated, and a candidate word set is determined from those words according to the degree of aggregation and the degree of freedom. Next, word segmentation is performed on the domain text corpus by using an open source word segmentation tool, the TF-IDF weight value of each word obtained is calculated, and a seed vocabulary set is determined from those words according to the TF-IDF weight values. Finally, the association values between each candidate word in the candidate word set and each seed vocabulary in the seed vocabulary set are calculated, and the candidate words whose association values are greater than a preset threshold are screened out from the candidate word set as the domain vocabulary. In this way, a small number of seed vocabularies are sorted out with the help of an open source word segmentation tool, the association values between the candidate words obtained through word segmentation and each seed vocabulary are calculated, and a certain amount of domain vocabulary can be mined accurately and efficiently by filtering on the association value. Supervised-learning mining algorithms require a large amount of manual labeling, are inefficient, and do not transfer across domains. By comparison, this approach adapts quickly to different domains, can mine domain vocabulary simply and efficiently with only a small number of seed vocabularies, and has practical value for industrial deployment. This solves the prior-art technical problems of low precision, low recall, and inflexible use in domain vocabulary mining.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and various other media capable of storing program code.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for mining a domain vocabulary based on a correlation analysis algorithm is characterized by comprising the following steps:

acquiring a domain text corpus of a domain vocabulary to be extracted;

performing word segmentation on the domain text corpus, calculating the degree of aggregation and the degree of freedom of each word obtained through the word segmentation, and determining a candidate word set from the words obtained through the word segmentation according to the degree of aggregation and the degree of freedom;

performing word segmentation on the domain text corpus by using an open source word segmentation tool, calculating a TF-IDF weight value of each word obtained through the word segmentation, and determining a seed vocabulary set from the words obtained through the word segmentation according to the TF-IDF weight values;

and calculating the association values between each candidate word in the candidate word set and each seed vocabulary in the seed vocabulary set, and screening out, from the candidate word set, the candidate words whose association values are greater than a preset threshold as the domain vocabulary.

2. The method according to claim 1, wherein obtaining a domain text corpus of a domain vocabulary to be extracted comprises:

acquiring text corpora in the field from a database;

and preprocessing the text corpora in the domain to obtain the domain text corpus of the domain vocabulary to be extracted.

3. The method of claim 2, wherein preprocessing the text corpora in the domain comprises: performing sentence-breaking processing on the text corpora in the domain, and recording the document name of each sentence.

4. The method according to claim 1, wherein performing word segmentation processing on the domain text corpus comprises:

determining a word segmentation length interval;

and performing word segmentation processing on the left and right sentences in the domain text corpus by using a word segmentation algorithm with the word segmentation length interval.

5. The method of claim 1, wherein calculating a degree of aggregation and a degree of freedom of each word resulting from the word segmentation process, and determining a set of candidate words from each word resulting from the word segmentation process based on the degree of aggregation and the degree of freedom comprises:

calculating the degree of aggregation of each word obtained by word segmentation according to a preset aggregation calculation formula; calculating the degree of freedom of each word obtained by word segmentation according to a preset degree of freedom calculation formula;

determining first candidate words with the degree of aggregation larger than a preset degree of aggregation threshold value from all words obtained through word segmentation processing, and determining second candidate words with the degree of freedom larger than a preset degree of freedom threshold value from all words obtained through word segmentation processing;

and taking the intersection of the first candidate word and the second candidate word to obtain the candidate word set.

6. The method of claim 1, wherein determining a seed vocabulary set from the words obtained through the word segmentation according to the TF-IDF weight values comprises:

sorting the words obtained through the word segmentation in descending order of TF-IDF weight value;

and extracting a preset number of words from the sorted words as seed vocabularies to obtain the seed vocabulary set.

7. The method of claim 1, wherein after the candidate words with the association value greater than the preset threshold value are screened from the candidate word set as the domain vocabulary, the method further comprises:

adding the screened domain vocabulary to the seed vocabulary set to update the seed vocabulary set;

and calculating the association values between each candidate word in the candidate word set and each seed vocabulary in the updated seed vocabulary set, and screening out, from the candidate word set, the candidate words whose association values are greater than a preset threshold as the domain vocabulary.

8. A storage medium comprising a stored program, wherein the method of any one of claims 1 to 7 is performed by a processor when the program is run.

9. An apparatus for mining a domain vocabulary based on a correlation analysis algorithm, comprising:

a corpus obtaining module, configured to obtain a domain text corpus of a domain vocabulary to be extracted;

a candidate word set determining module, configured to perform word segmentation on the domain text corpus, calculate the degree of aggregation and the degree of freedom of each word obtained through the word segmentation, and determine a candidate word set from the words obtained through the word segmentation according to the degree of aggregation and the degree of freedom;

a seed vocabulary set determining module, configured to perform word segmentation on the domain text corpus by using an open source word segmentation tool, calculate a TF-IDF (term frequency-inverse document frequency) weight value of each word obtained through the word segmentation, and determine a seed vocabulary set from the words obtained through the word segmentation according to the TF-IDF weight values;

and a domain vocabulary determining module, configured to calculate the association values between each candidate word in the candidate word set and each seed vocabulary in the seed vocabulary set, and screen out, from the candidate word set, the candidate words whose association values are greater than a preset threshold as the domain vocabulary.

10. An apparatus for mining a domain vocabulary based on a correlation analysis algorithm, comprising:

a processor; and

a memory coupled to the processor for providing instructions to the processor for processing the following processing steps:

acquiring a domain text corpus of a domain vocabulary to be extracted;