CN113705227A - Method, system, medium and device for constructing Chinese non-segmented word and word embedding model - Google Patents


Info

Publication number
CN113705227A
Authority
CN
China
Prior art keywords
word
constructing
embedding model
word embedding
candidate
Prior art date
Legal status
Granted
Application number
CN202010437000.3A
Other languages
Chinese (zh)
Other versions
CN113705227B (en)
Inventor
张一帆
王茂华
顾倩荣
黄永健
Current Assignee
Shanghai Advanced Research Institute of CAS
Original Assignee
Shanghai Advanced Research Institute of CAS
Priority date
Filing date
Publication date
Application filed by Shanghai Advanced Research Institute of CAS filed Critical Shanghai Advanced Research Institute of CAS
Priority to CN202010437000.3A
Publication of CN113705227A
Application granted
Publication of CN113705227B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/205: Parsing
    • G06F 40/216: Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method, system, medium, and device for constructing a Chinese segmentation-free word embedding model. The construction method comprises the following steps: counting candidate fragments in a corpus together with their word frequency information; determining the association strength of the candidate fragments from the word frequency information and generating a word embedding vocabulary according to the association strength; and constructing a positive sampling set and a negative sampling set from the vocabulary, and building the word embedding model from these sets. Addressing the problem of excessive noisy n-grams in the vocabularies of existing segmentation-free word embedding models, the invention takes Chinese corpora as the research object and improves a segmentation-free word embedding model, built on a negative-sampling skip-gram model, by means of an unsupervised association metric.

Description

Method, system, medium, and device for constructing a Chinese segmentation-free word embedding model
Technical Field
The invention belongs to the technical field of natural language processing, relates to a design method for word embedding models, and in particular relates to a method, system, medium, and device for constructing a Chinese segmentation-free word embedding model.
Background
Word embedding is a fundamental task in natural language processing and plays an important role in downstream tasks such as machine translation and part-of-speech tagging. Because Chinese text has no explicit separators between words, existing Chinese word embedding methods usually perform Chinese word segmentation first and use the segmented words as embedding targets. However, current Chinese word segmentation still suffers from many problems, which can seriously degrade the quality of Chinese word embeddings. Therefore, for languages like Chinese, segmentation-free word embedding models have been proposed to avoid the influence of segmentation errors, and they have been shown to outperform traditional word embedding methods.
Existing segmentation-free word embedding models mainly collect the Top-K n-gram fragments with the highest word frequency as training targets. However, considering word frequency alone leaves a large number of noisy n-gram fragments in the embedding vocabulary, which degrades the quality of the resulting word embeddings.
Therefore, how to design a segmentation-free word embedding model that reduces the influence of the large number of noisy n-gram fragments on the quality of the final word embeddings, and thereby improves the model, has become a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above shortcomings of the prior art, an object of the present invention is to provide a method, system, medium, and device for constructing a Chinese segmentation-free word embedding model, which solve the prior-art problem of being unable to reduce the influence of numerous noisy n-gram fragments on the quality of the final word embedding model, and which improve that quality.
In order to achieve the above and other related objects, one aspect of the present invention provides a method for constructing a Chinese segmentation-free word embedding model, comprising: counting candidate fragments in a corpus together with their word frequency information; determining the association strength of the candidate fragments from the word frequency information and generating a word embedding vocabulary according to the association strength; and constructing a positive sampling set and a negative sampling set from the vocabulary, and building the word embedding model from these sets.
In an embodiment of the present invention, the candidate fragments are Chinese language model fragments, and the step of counting the candidate fragments in the corpus and their word frequency information comprises: counting, in the corpus, the Chinese language model fragments corresponding to different fixed length values, together with their word frequency information.
In an embodiment of the present invention, the step of determining the association strength of the candidate fragments from the word frequency information and generating a word embedding vocabulary according to the association strength comprises: determining an unsupervised association metric for each candidate fragment from the word frequency information, wherein the unsupervised association metric characterizes the fragment's association strength; and sorting the association strengths in descending order and selecting the top K candidate fragments as the word embedding vocabulary.
In an embodiment of the present invention, the step of determining the unsupervised association metric of a candidate fragment from the word frequency information comprises: calculating mutual information values for the candidate fragment's splits and determining the split combination that minimizes the mutual information value; determining a first set and a second set according to that combination, and calculating a statistical relationship value between the combination and the first or second set; and taking the product of the word frequency information, the mutual information value, and the statistical relationship value as the unsupervised association metric.
In an embodiment of the present invention, the step of determining the first and second sets according to the split combination and calculating the statistical relationship value between the combination and the first or second set comprises: taking as numerator the larger of the ratio of the fragment's word frequency to the total word frequency of the first set and the ratio to that of the second set; selecting whichever of the first and second sets has the smaller total word frequency and taking as denominator the reciprocal of the number of its elements; calculating the resulting fraction as the relative importance of the combination within the first or second set; and determining the statistical relationship value from this relative importance value.
In an embodiment of the present invention, the association strength is calculated for the candidate fragments of each length; for each length, the association strengths are sorted in descending order and the top K candidate fragments are selected for the word embedding vocabulary, with different numbers of candidate fragments selected for different lengths.
In an embodiment of the present invention, the step of constructing positive and negative sampling sets from the vocabulary and building the word embedding model from them comprises: on the basis of a skip-gram model combined with negative sampling, constructing the word embedding model by using a parameter optimization method that maximizes the positive sampling probability and minimizes the negative sampling probability.
Another aspect of the invention provides a system for constructing a Chinese segmentation-free word embedding model, comprising: a fragment statistics module for counting candidate fragments in a corpus and their word frequency information; an association metric module for determining the association strength of the candidate fragments from the word frequency information and generating a word embedding vocabulary according to the association strength; and a model generation module for constructing positive and negative sampling sets from the vocabulary and building the word embedding model from these sets.
Still another aspect of the present invention provides a medium on which a computer program is stored; when executed by a processor, the program implements the above method for constructing the Chinese segmentation-free word embedding model.
A final aspect of the invention provides an apparatus comprising: a processor and a memory; the memory is used for storing computer programs, and the processor is used for executing the computer programs stored in the memory so as to enable the equipment to execute the construction method of the Chinese non-segmented word embedding model.
As described above, the method, system, medium, and device for constructing the Chinese segmentation-free word embedding model have the following advantages:
A new unsupervised association metric is proposed for screening strongly associated n-gram fragments. Combining this metric with a word embedding model yields a new segmentation-free Chinese word embedding model oriented to Chinese corpora. The word embeddings obtained by the invention show better performance in downstream tasks.
Drawings
Fig. 1 is a schematic flow chart of a method for constructing a Chinese segmentation-free word embedding model according to an embodiment of the present invention.
Fig. 2 is a flowchart of the association metric procedure of the construction method according to an embodiment of the present invention.
Fig. 3 is a flowchart of the association strength calculation of the construction method according to an embodiment of the present invention.
Fig. 4 is a diagram showing the effect of the construction method of the present invention compared with a basic dictionary.
Fig. 5 is a diagram showing the effect of the construction method of the present invention compared with a rich dictionary.
Fig. 6 is a schematic structural diagram of a system for constructing a Chinese segmentation-free word embedding model according to an embodiment of the present invention.
Fig. 7 is a schematic structural connection diagram of a device for constructing a Chinese segmentation-free word embedding model according to an embodiment of the present invention.
Description of the element reference numerals
6 Chinese non-segmented word and word embedding model construction system
61 fragment statistical module
62 correlation metric module
63 model generation module
7 device
71 processor
72 memory
73 communication interface
74 system bus
S11 to S13: steps
S121 to S122: steps
S121A to S121C: steps
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that, in the following embodiments, features in the embodiments may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
Addressing the problem of excessive noisy n-grams in the vocabularies of existing segmentation-free word embedding models, the invention provides a method for constructing a Chinese segmentation-free word embedding model that takes Chinese corpora as the research object and improves a negative-sampling skip-gram model with an unsupervised association metric.
The principle and implementation of the method, system, medium, and device for constructing a Chinese segmentation-free word embedding model according to this embodiment are described in detail below with reference to Figs. 1 to 7, so that those skilled in the art can understand them without creative work.
Please refer to Fig. 1, a schematic flow chart of a method for constructing a Chinese segmentation-free word embedding model according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
S11, counting the candidate fragments in the corpus and their corresponding word frequency information.
In this embodiment, the candidate fragments are Chinese language model fragments, for example n-gram fragments; the n-gram fragments corresponding to different fixed length values, together with their word frequency information, are counted over the corpus.
Specifically, a simple tokenizer is implemented via an n-gram model to obtain n-gram fragments. The model is based on the assumption that the occurrence of the n-th word depends only on the preceding n-1 words and on no others, and that the probability of a complete sentence is the product of the occurrence probabilities of its words. In practice, usually only the two words before and after a given word are considered, i.e. n = 2, using positions n-2, n-1, n+1, and n+2. With n = 3 the results improve; with n = 4 the amount of computation becomes large.
Specifically, the corpus is traversed, and all possible n-gram fragments of each fixed length, together with their word frequency information, are counted. The n-gram fragments of different lengths and their corresponding word frequencies are organized into a list, shown as Table 1. For example, Table 1 records that the fragment "一" ("one"), of length 1 Chinese character, has word frequency 529285.
TABLE 1: candidate fragments
[Table 1 is rendered as an image in the original.]
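The fragment-counting step S11 can be sketched in Python as follows; `count_ngram_fragments`, the toy corpus, and the length cap are illustrative assumptions, not part of the patent:

```python
from collections import Counter

def count_ngram_fragments(corpus, max_len=4):
    """Count all candidate n-gram fragments of each fixed length
    (1..max_len characters) and their corpus word frequencies."""
    counts = {n: Counter() for n in range(1, max_len + 1)}
    for sentence in corpus:
        for n in range(1, max_len + 1):
            for i in range(len(sentence) - n + 1):
                counts[n][sentence[i:i + n]] += 1
    return counts

corpus = ["上海研究院", "上海天气好"]
counts = count_ngram_fragments(corpus, max_len=2)
```

Per-length tables like `counts[1]` and `counts[2]` correspond to the rows of Table 1, with one frequency table per fragment length.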
S12, determining the association strength of the candidate fragments from the word frequency information and generating a word embedding vocabulary according to the association strength.
In this embodiment, the association strength is calculated for the candidate fragments of each length.
Furthermore, for each length, the association strengths are sorted in descending order and the top K candidate fragments are selected for the word embedding vocabulary; different numbers of candidate fragments may be selected for different lengths.
Please refer to Fig. 2, a flowchart of the association metric procedure according to an embodiment of the present invention. As shown in Fig. 2, S12 comprises:
S121, determining an unsupervised association metric for each candidate fragment from the word frequency information, the metric characterizing the fragment's association strength. The PATI (Position Association with Times Information) metric of the invention considers more statistical information and can therefore discover more strongly associated n-gram fragments.
Please refer to Fig. 3, a flowchart of the association strength calculation according to an embodiment of the present invention. As shown in Fig. 3, S121 comprises:
S121A, calculating mutual information values for the candidate fragment's splits and determining the split combination that minimizes the mutual information value.
Specifically, the mutual information value is the MP value. Each n-gram fragment of length s is written g = w_i w_{i+1} ... w_{i+s} (0 ≤ i ≤ N - s), and its left and right parts are a = w_i ... w_{k-1} and b = w_k ... w_{i+s} (i < k < i + s), i.e. g = concat(a, b). f_a, f_b, and f_g denote the word frequencies in the corpus of the strings a and b and of the n-gram fragment g, respectively.
For an n-gram fragment g = concat(a, b), the corresponding MP is defined by Equation (1). [Equation (1), which defines MP in terms of f_a, f_b, and f_g, is rendered as an image in the original.]
For an n-gram fragment g of fixed length s, there is always a particular left-right combination (a_m, b_m) that minimizes MP. The subsequent calculation of AT is based on this particular combination (a_m, b_m).
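As an illustration of S121A, the split search can be sketched as follows. The exact MP formula appears only as an image in the original, so this sketch assumes a simple PMI-style ratio f_g / (f_a * f_b) purely for illustration; `min_mp_split` and the toy frequencies are hypothetical:

```python
from collections import Counter

def min_mp_split(g, freq):
    """Return the split (a_m, b_m) of fragment g that minimizes an
    MP-style association score.  The ratio f_g / (f_a * f_b) below is
    an assumed stand-in for the patent's Equation (1)."""
    best = None
    for k in range(1, len(g)):           # every left-right split point
        a, b = g[:k], g[k:]
        mp = freq[g] / (freq[a] * freq[b])
        if best is None or mp < best[0]:
            best = (mp, a, b)
    return best[1], best[2]

freq = Counter({"abc": 3, "a": 10, "bc": 4, "ab": 5, "c": 20})
a_m, b_m = min_mp_split("abc", freq)
```

Whatever the exact MP formula, the search over split points k is the same: the minimizing pair (a_m, b_m) is what the later AT computation uses.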
S121B, determining a first set and a second set according to the split combination, and calculating the statistical relationship value between the combination and the first or second set.
In the present embodiment, S121B includes:
(1) Taking as numerator the larger of the ratio of the fragment's word frequency to the total word frequency of the first set and the ratio to that of the second set; selecting whichever of the first and second sets has the smaller total word frequency; and taking as denominator the reciprocal of the number of elements in that set.
For the particular combination (a_m, b_m) of an n-gram fragment g, there exist same-length fragment combinations of the form (a_m, b_h) and (a_j, b_m). The first set {a_m, *} and the second set {*, b_m} are defined as:

{a_m, *} = {(a_m, b_1), (a_m, b_2), ..., (a_m, b_h)}  (2)
{*, b_m} = {(a_1, b_m), (a_2, b_m), ..., (a_j, b_m)}  (3)

Let F_{a_m,*} and F_{*,b_m} denote the sums of the word frequencies of all n-gram fragments in {a_m, *} and {*, b_m}, respectively; they are defined as:

F_{a_m,*} = Σ_{(a_m, b) ∈ {a_m, *}} f_concat(a_m, b)  (4)
F_{*,b_m} = Σ_{(a, b_m) ∈ {*, b_m}} f_concat(a, b_m)  (5)
For an n-gram fragment g and its particular combination (a_m, b_m), the variable rate is the maximum of the ratio of f_g to F_{a_m,*} and the ratio of f_g to F_{*,b_m}:

rate = max(f_g / F_{a_m,*}, f_g / F_{*,b_m})  (6)

For the two sets {a_m, *} and {*, b_m} with total word frequencies F_{a_m,*} and F_{*,b_m}, let sizeof denote the number of n-gram elements in a set, and let S_min be whichever of the two sets has the smaller total word frequency. AC can then be defined as:

AC = 1 / sizeof(S_min)  (7)
(2) Taking the fraction formed by the above numerator and denominator as the relative importance of the combination within the first or second set.
Specifically, given the variables rate and AC, the times value of the n-gram fragment g is defined as:

times = rate / AC  (8)
(3) and determining the statistical relationship value according to the relative importance degree calculation value.
Specifically, for the particular combination (a_m, b_m) of an n-gram fragment g of length s and its unique times value, AT is calculated as:

AT = 1 + |log(times)|  (9)
S121C, taking the product of the word frequency information, the mutual information value, and the statistical relationship value as the unsupervised association metric.
Specifically, the PATI (Position Association with Times Information) formula is:

PATI = F × MP × AT  (10)

where F = f_g is the word frequency information, MP is the mutual information value, and AT is the statistical relationship value.
MP is a modified version of PMI: in computing the association strength it considers more marginal variables of each n-gram fragment g = concat(a, b), i.e. statistics of the left and right parts a and b, making it more sensitive to the local information of the n-gram.
AT further measures the association strength of an n-gram by using the statistical information of its particular combination (a_m, b_m) within the sets {a_m, *} and {*, b_m}. The variable times takes into account the total set frequencies F_{a_m,*} and F_{*,b_m}, the word frequency of (a_m, b_m), and the number of fragments adjacent to (a_m, b_m) on the left and right, in order to evaluate the relative importance of (a_m, b_m) within its set; a higher times value generally indicates that (a_m, b_m) is reasonable as a whole. In general, the times values of most reasonable n-grams with higher association strengths are much larger than those of unreasonable n-gram fragments.
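Putting Equations (6) to (10) together, the score for one fragment can be sketched as a toy computation; Equations (7) and (8) are reconstructed here from the prose description of S121B, and `pati` with its inputs is purely illustrative:

```python
import math

def pati(f_g, F_left, F_right, n_left, n_right, mp):
    """Toy PATI computation from the quantities of Equations (6)-(10).
    F_left/F_right are the total frequencies of {a_m,*} and {*,b_m};
    n_left/n_right their element counts; mp an already-computed MP."""
    rate = max(f_g / F_left, f_g / F_right)          # Eq. (6)
    # AC: reciprocal of the element count of the lower-frequency set, Eq. (7)
    ac = 1.0 / (n_left if F_left <= F_right else n_right)
    times = rate / ac                                # Eq. (8)
    at = 1.0 + abs(math.log(times))                  # Eq. (9)
    return f_g * mp * at                             # Eq. (10)

score = pati(f_g=50, F_left=200, F_right=500, n_left=8, n_right=20, mp=0.1)
```

The interplay is visible in the code: a fragment that dominates a small, low-frequency set gets a large times value, which Equation (9) turns into a larger AT multiplier.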
S122, sorting the association strengths in descending order and selecting the top K candidate fragments as the word embedding vocabulary.
Specifically, the proposed unsupervised association metric is used to calculate the association strength of the candidate n-gram fragments at each fragment length, and the Top-K n-gram fragments with the highest association strength are then selected as the vocabulary of the word embedding model. Selecting the K largest items from a collection is the classic Top-K problem, commonly solved with a heap.
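The heap-based Top-K selection of S122 can be sketched as follows; `top_k_fragments` and the toy PATI scores are illustrative:

```python
import heapq

def top_k_fragments(scores, k):
    """Select the K candidate fragments with the highest association
    strength using a heap (the classic Top-K approach).
    `scores` maps fragment -> association strength (e.g. PATI)."""
    return heapq.nlargest(k, scores, key=scores.get)

scores = {"自然": 9.2, "然语": 1.1, "语言": 8.7, "言处": 0.9, "处理": 7.5}
vocab = top_k_fragments(scores, k=3)
```

`heapq.nlargest` runs in O(n log k), which matters when the candidate table holds millions of fragments per length.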
S13, constructing a positive sampling set and a negative sampling set from the vocabulary, and building the word embedding model from these sets.
In this embodiment, the word embedding model is built with a skip-gram model combined with negative sampling, using maximum likelihood estimation to maximize the positive sampling probability and minimize the negative sampling probability. The method screens the word embedding vocabulary with the unsupervised association metric and reconstructs the positive and negative samples of the word embedding model, which reduces the influence of noisy n-gram fragments on the model and improves the performance of the embeddings in downstream tasks.
It should be noted that maximum likelihood estimation is only one embodiment of the parameter estimation and optimization of the present invention; other parameter estimation and optimization methods also fall within the scope of the invention.
Specifically, the PFNE model learns word embeddings on the basis of a negative-sampling skip-gram model, which reduces the amount of computation during gradient descent and accelerates training. The positive sampling set N_p of the model is the set of center-word/context pairs (w_t, w_c) generated by combining the vocabulary with the corpus. For the negative sampling set N_n, a sufficiently large unigram vocabulary is built, each n-gram fragment is indexed in the list, and negative samples are drawn at random according to the fragments' word frequencies in the vocabulary. The objective function of the PFNE model is defined as:

L = Σ_{(w_t, w_c) ∈ N_p} log σ(v'_{w_c} · v_{w_t}) + Σ_{(w_t, w_c) ∈ N_n} log σ(-v'_{w_c} · v_{w_t})  (11)

where v_{w_t} and v'_{w_c} are the embedding vectors of the center word w_t and its context w_c, respectively, and σ is the sigmoid function. Using maximum likelihood estimation, the model predicts the context from the center word, maximizing the probability of the positive samples while minimizing that of the negative samples, so that the word embedding model produced by the objective function is optimal. The objective is optimized with stochastic gradient descent over the positive and negative samples.
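The per-center-word contribution to an objective of the form of Equation (11) can be sketched as follows; the function name and the tiny 2-d vectors are illustrative, not the patent's implementation:

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def sgns_objective(v_center, v_pos, v_negs):
    """Skip-gram-with-negative-sampling contribution for one center word:
    log-sigmoid of the positive pair's dot product, plus log-sigmoids of
    the negated dot products for each negative sample."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    loss = math.log(sigmoid(dot(v_pos, v_center)))
    loss += sum(math.log(sigmoid(-dot(v_neg, v_center))) for v_neg in v_negs)
    return loss

v_center = [0.5, -0.2]
v_pos = [0.4, 0.1]
v_negs = [[-0.3, 0.6], [0.2, 0.9]]
obj = sgns_objective(v_center, v_pos, v_negs)
```

Gradient ascent on this quantity pushes the positive pair's vectors together and the negative pairs' vectors apart, which is exactly what maximizing Equation (11) asks for.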
Please refer to Figs. 4 and 5, which compare the construction method of the present invention against a basic dictionary and a rich dictionary, respectively. In Figs. 4 and 5, PFNE is the result of comparing the n-gram fragments screened with the PATI algorithm against the dictionary; Sembei is the result of comparing n-gram fragments screened by word frequency against the dictionary; and SGNS-PMI is the result of comparing n-gram fragments selected with PMI (Pointwise Mutual Information) against the dictionary. The vertical axis is the precision rate and the horizontal axis is the recall rate, expressed as:

precision = TP / (TP + FP)
recall = TP / (TP + FN)

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.
further, the higher the curve, the longer it is, the more rational n-gram fragments are screened. It can be seen in fig. 4 and 5 that the curve of the solid line PFNE using the pat algorithm is the highest and longest among the three curves, thereby illustrating that the method for constructing the chinese non-segmented word embedding model of the present invention can screen more reasonable n-gram fragments than the word embedding model (basic dictionary or rich dictionary) of the prior art.
The protection scope of the method for constructing the Chinese segmentation-free word embedding model is not limited to the execution order of the steps listed in this embodiment; all schemes obtained by adding, removing, or replacing steps of the prior art according to the principle of the invention are included in the protection scope.
This embodiment provides a computer storage medium on which a computer program is stored; when executed by a processor, the program implements the above method for constructing the Chinese segmentation-free word embedding model.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned computer-readable storage media comprise: various computer storage media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The system for constructing the Chinese segmentation-free word embedding model provided by this embodiment is described in detail below with reference to the drawings. It should be noted that the division into the following modules is only a logical division; in an actual implementation they may be wholly or partially integrated into one physical entity or physically separated. The modules may all be implemented as software invoked by a processing element, all as hardware, or partly as software invoked by a processing element and partly as hardware. For example, a module may be a separately established processing element, or may be integrated into a chip of the system described below; alternatively, a module may be stored in the memory of the system in the form of program code, and a processing element of the system may call and execute its function. The other modules are implemented similarly. All or some of the modules may be integrated together or implemented independently. The processing element referred to here may be an integrated circuit with signal processing capability. In implementation, the steps of the above method or the following modules may be completed by hardware integrated logic circuits in a processor element or by software instructions.
The following modules may be one or more integrated circuits configured to implement the above method, for example one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). When a module is implemented as program code invoked by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of invoking program code. These modules may also be integrated together and implemented as a System-on-a-Chip (SoC).
Please refer to fig. 6, which is a schematic structural diagram of a system for constructing a Chinese non-segmented word embedding model according to an embodiment of the present invention. As shown in fig. 6, the system 6 for constructing the Chinese non-segmented word embedding model includes: a segment statistics module 61, an association metric module 62, and a model generation module 63.
The segment statistics module 61 is configured to count the candidate segments in the corpus and the word frequency information corresponding to the candidate segments.
In this embodiment, the candidate segments are Chinese language-model (n-gram) segments, and the segment statistics module 61 is specifically configured to count, in the corpus, the Chinese n-gram segments corresponding to different fixed length values and their word frequency information.
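As an illustrative sketch of this counting step (not the patent's exact implementation), fixed-length n-gram segments and their frequencies can be collected by sliding a window over each raw, unsegmented sentence; the corpus strings and length values below are hypothetical examples:

```python
from collections import Counter

def count_ngrams(corpus, lengths=(2, 3)):
    """Slide a window of each fixed length over every raw (unsegmented)
    sentence and count how often each candidate segment occurs."""
    counts = {n: Counter() for n in lengths}
    for sentence in corpus:
        for n in lengths:
            for i in range(len(sentence) - n + 1):
                counts[n][sentence[i:i + n]] += 1
    return counts

# Hypothetical two-sentence corpus; real use would stream a large corpus.
corpus = ["自然语言处理很有趣", "语言模型处理文本"]
counts = count_ngrams(corpus)
```

Here `counts[2]["语言"]`, for example, is the bigram's frequency, i.e. the word frequency information attached to that candidate segment.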
The association metric module 62 is configured to determine the association strength of the candidate segments in combination with the word frequency information, and to generate a word-embedding vocabulary according to the association strength.
In this embodiment, the association metric module 62 is specifically configured to determine, in combination with the word frequency information, an unsupervised association metric for each candidate segment, where the unsupervised association metric characterizes the association strength of the candidate segment; the candidate segments are then sorted by association strength in descending order, and the top K candidate segments are selected as the word-embedding vocabulary.
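The ranking-and-selection step can be sketched as follows. Note that the patent defines its own unsupervised association metric (built from mutual information and a statistical relationship value); plain pointwise mutual information is used below purely as an illustrative stand-in, and all counts are hypothetical:

```python
import math

def top_k_segments(seg_counts, char_counts, total, k):
    """Rank candidate segments by an association score and keep the top k.
    Plain PMI stands in for the patent's unsupervised association metric."""
    def pmi(seg, freq):
        p_seg = freq / total            # probability of the whole segment
        p_indep = 1.0                   # probability under independence
        for ch in seg:
            p_indep *= char_counts[ch] / total
        return math.log(p_seg / p_indep)

    ranked = sorted(seg_counts.items(), key=lambda kv: -pmi(*kv))
    return [seg for seg, _ in ranked[:k]]

# Hypothetical counts: "语言" co-occurs tightly, "的人" is incidental.
vocab = top_k_segments({"语言": 4, "的人": 2},
                       {"语": 4, "言": 4, "的": 9, "人": 5},
                       total=20, k=1)
```

Segments whose characters nearly always appear together score high and enter the vocabulary; incidental character pairings score low and are discarded.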
The model generation module 63 is configured to construct a positive sampling set and a negative sampling set according to the vocabulary, and construct a word embedding model by combining the positive sampling set and the negative sampling set.
In this embodiment, the model generation module 63 is specifically configured to construct the word embedding model on the basis of a skip-gram model combined with negative sampling, using a parameter optimization method that maximizes the positive sampling probability and minimizes the negative sampling probability.
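A minimal sketch of the skip-gram-with-negative-sampling objective referred to above follows (a toy NumPy SGD loop with hypothetical hyperparameters; the patent does not specify these details):

```python
import numpy as np

def train_sgns(pairs, vocab_size, dim=16, neg=3, lr=0.05, epochs=200, seed=0):
    """Toy skip-gram with negative sampling: raise sigmoid(v_c . u_o) for
    observed (center, context) pairs and lower it for sampled negatives."""
    rng = np.random.default_rng(seed)
    V = rng.normal(scale=0.1, size=(vocab_size, dim))  # center-word vectors
    U = rng.normal(scale=0.1, size=(vocab_size, dim))  # context-word vectors
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    for _ in range(epochs):
        for c, o in pairs:
            # one positive sample plus `neg` negatives (re-drawn if they hit o)
            samples = [(o, 1.0)]
            while len(samples) < neg + 1:
                j = int(rng.integers(vocab_size))
                if j != o:
                    samples.append((j, 0.0))
            for j, label in samples:
                g = sigmoid(V[c] @ U[j]) - label   # d(loss)/d(score)
                U[j], V[c] = U[j] - lr * g * V[c], V[c] - lr * g * U[j]
    return V, U

# Hypothetical positive pairs over a 4-segment vocabulary.
V, U = train_sgns([(0, 1), (1, 0)], vocab_size=4)
score = float(V[0] @ U[1])   # pushed positive by the positive-sample updates
```

The gradient `sigmoid(score) - label` pushes observed pair scores up (maximizing the positive sampling probability) and sampled negative pair scores down, which is the optimization behavior the module describes.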
The above system for constructing the Chinese non-segmented word embedding model can implement the method for constructing the Chinese non-segmented word embedding model; however, the apparatus for implementing that method includes, but is not limited to, the structure of the system listed in this embodiment, and all structural variations and replacements of the prior art made according to the principles of the present invention are included in the protection scope of the present invention.
Please refer to fig. 7, which is a schematic structural connection diagram of a device for constructing a Chinese non-segmented word embedding model according to an embodiment of the present invention. As shown in fig. 7, the present embodiment provides a device 7 including: a processor 71, a memory 72, a communication interface 73, and/or a system bus 74. The memory 72 and the communication interface 73 are connected to the processor 71 through the system bus 74 and communicate with each other; the memory 72 is used for storing a computer program, the communication interface 73 is used for communicating with other devices, and the processor 71 is used for running the computer program to enable the device 7 to execute the steps of the method for constructing the Chinese non-segmented word embedding model.
The system bus 74 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus may be divided into an address bus, a data bus, a control bus, and so on. The communication interface 73 is used to enable communication between the database access device and other devices (such as a client, a read-write library, and a read-only library). The memory 72 may include a Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one disk memory.
The processor 71 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In summary, the method, system, medium and device for constructing a Chinese non-segmented word embedding model of the present invention provide a new unsupervised association metric for screening n-gram segments with strong internal association. This unsupervised association metric is combined with a word embedding model to construct a new segmentation-free Chinese word embedding model oriented to Chinese corpora. The word embedding model obtained by the present invention shows better performance in downstream tasks. The present invention effectively overcomes various defects in the prior art and has high industrial utilization value.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical principles of the present invention shall be covered by the claims of the present invention.

Claims (10)

1. A method for constructing a Chinese non-segmented word embedding model, characterized by comprising the following steps:
counting candidate fragments in a corpus and word frequency information corresponding to the candidate fragments;
determining the association strength of the candidate segments by combining the word frequency information, and generating a word embedded vocabulary list according to the association strength;
and constructing a positive sampling set and a negative sampling set according to the vocabulary, and constructing a word embedding model by combining the positive sampling set and the negative sampling set.
2. The method for constructing the Chinese non-segmented word embedding model according to claim 1, wherein the candidate segments are Chinese language-model (n-gram) segments, and the step of counting the candidate segments in the corpus and the word frequency information corresponding to the candidate segments comprises:
counting, in the corpus, the Chinese n-gram segments corresponding to different fixed length values and their word frequency information.
3. The method of claim 1, wherein the step of determining the association strength of the candidate segment in combination with the word frequency information and generating a vocabulary for word embedding according to the association strength comprises:
determining an unsupervised association metric of the candidate segment in combination with the word frequency information, wherein the unsupervised association metric characterizes the association strength of the candidate segment;
and sorting the candidate segments by association strength in descending order, and selecting the top K candidate segments as the word-embedding vocabulary.
4. The method for constructing the Chinese non-segmented word embedding model according to claim 3, wherein the step of determining the unsupervised association metric of the candidate segment in combination with the word frequency information comprises:
calculating mutual information values of the candidate segment, and determining the corresponding segment combination at which the mutual information value is minimum;
determining a first set and a second set according to the segment combination, and calculating a statistical relationship value between the segment combination and the first set or the second set;
and taking the product of the word frequency information, the mutual information value and the statistical relationship value as the unsupervised association metric.
5. The method according to claim 4, wherein the step of determining the first set and the second set according to the segment combination and calculating the statistical relationship value between the segment combination and the first set or the second set comprises:
taking, as a numerator, the maximum of the ratio of the word frequency information to the word frequency of the first set and the ratio of the word frequency information to the word frequency of the second set; selecting, from the first set and the second set, the set with the smaller word frequency, and taking the reciprocal of the number of elements in that set as a denominator;
calculating the fraction formed by the numerator and the denominator as the relative importance value of the segment combination in the first set or the second set;
and determining the statistical relationship value according to the relative importance value.
6. The method for constructing the Chinese non-segmented word embedding model according to claim 3, wherein:
the association strength is calculated for the candidate segments of each length;
and for candidate segments of different lengths, the candidate segments of each length are sorted by association strength in descending order, the top K candidate segments of each length are selected as word-embedding vocabulary entries, and different numbers of candidate segments may be selected for different lengths.
7. The method for constructing the Chinese non-segmented word embedding model according to claim 1, wherein the step of constructing a positive sampling set and a negative sampling set according to the vocabulary and constructing the word embedding model by combining the positive sampling set and the negative sampling set comprises:
and on the basis of combining a skip-gram model with negative sampling, adopting a parameter optimization method to maximize the positive sampling probability and minimize the negative sampling probability, and constructing the word embedding model.
8. A system for constructing a Chinese non-segmented word embedding model, characterized by comprising:
a segment statistics module, configured to count candidate segments in a corpus and word frequency information corresponding to the candidate segments;
an association metric module, configured to determine the association strength of the candidate segments in combination with the word frequency information, and to generate a word-embedding vocabulary according to the association strength; and
a model generation module, configured to construct a positive sampling set and a negative sampling set according to the vocabulary, and to construct a word embedding model by combining the positive sampling set and the negative sampling set.
9. A medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method for constructing the Chinese non-segmented word embedding model according to any one of claims 1 to 7.
10. An apparatus, comprising: a processor and a memory;
the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory to enable the apparatus to execute the method for constructing the Chinese non-segmented word embedding model according to any one of claims 1 to 7.
CN202010437000.3A 2020-05-21 2020-05-21 Method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model Active CN113705227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010437000.3A CN113705227B (en) 2020-05-21 2020-05-21 Method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model

Publications (2)

Publication Number Publication Date
CN113705227A true CN113705227A (en) 2021-11-26
CN113705227B CN113705227B (en) 2023-04-25

Family

ID=78645861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010437000.3A Active CN113705227B (en) 2020-05-21 2020-05-21 Method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model

Country Status (1)

Country Link
CN (1) CN113705227B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933183A (en) * 2015-07-03 2015-09-23 重庆邮电大学 Inquiring term rewriting method merging term vector model and naive Bayes
CN106095736A (en) * 2016-06-07 2016-11-09 华东师范大学 A kind of method of field neologisms extraction
CN107015963A (en) * 2017-03-22 2017-08-04 重庆邮电大学 Natural language semantic parsing system and method based on deep neural network
CN107273352A (en) * 2017-06-07 2017-10-20 北京理工大学 A kind of word insertion learning model and training method based on Zolu functions
CN107491444A (en) * 2017-08-18 2017-12-19 南京大学 Parallelization word alignment method based on bilingual word embedded technology
CN108959431A (en) * 2018-06-11 2018-12-07 中国科学院上海高等研究院 Label automatic generation method, system, computer readable storage medium and equipment
CN110390018A (en) * 2019-07-25 2019-10-29 哈尔滨工业大学 A kind of social networks comment generation method based on LSTM

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GEEWOOK KIM 等: "Segmentation-free compositional n-gram embedding" *
XIAOBIN WANG 等: "Unsupervised Learning Helps Supervised Neural Word Segmentation" *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant