CN113705227B - Method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model - Google Patents
Method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model
- Publication number: CN113705227B
- Application number: CN202010437000.3A
- Authority: CN (China)
- Legal status: Active
Classifications
- G—PHYSICS; G06—COMPUTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F40/00—Handling natural language data; G06F40/20—Natural language analysis; G06F40/279—Recognition of textual entities; G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G—PHYSICS; G06—COMPUTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F40/00—Handling natural language data; G06F40/20—Natural language analysis; G06F40/205—Parsing; G06F40/216—Parsing using statistical methods
Abstract
The invention provides a method, system, medium and device for constructing a Chinese word-segmentation-free word embedding model. The construction method comprises: counting the candidate fragments in a corpus together with the word frequency information corresponding to the candidate fragments; determining the association strength of the candidate fragments from the word frequency information and generating a word-embedding vocabulary according to the association strength; and constructing a positive sampling set and a negative sampling set according to the vocabulary, and constructing the word embedding model from the positive and negative sampling sets. To address the problem of excessive noisy n-grams in the vocabularies of existing segmentation-free word embedding models, the invention takes Chinese corpora as its object of study and, building on the negative-sampling skip-gram model, provides an unsupervised association metric with which the segmentation-free word embedding model is improved.
Description
Technical Field
The invention belongs to the technical field of natural language processing, relates to the design of word embedding models, and in particular relates to a method, system, medium and device for constructing a Chinese word embedding model without word segmentation.
Background
Word embedding is a basic task in natural language processing and plays an important role in downstream tasks such as machine translation and part-of-speech tagging. Because there are no explicit separators between words in Chinese text, existing Chinese word embedding methods generally perform Chinese word segmentation first and take the segmented words as the embedding targets. However, Chinese word segmentation still suffers from many problems, and these seriously affect the quality of Chinese word embeddings. Therefore, for languages like Chinese, segmentation-free word embedding models have been proposed to avoid the influence of segmentation errors and have been shown to outperform conventional word embedding methods.
Current segmentation-free word embedding models mainly take the Top-K most frequent n-gram fragments as training objects. However, considering word frequency alone leaves a large number of noisy n-gram fragments in the word-embedding vocabulary, which degrades the quality of the resulting embeddings.
Therefore, how to design a segmentation-free word embedding model that reduces the influence of the large number of noisy n-gram fragments on the quality of the finally generated embeddings, and thereby improves that quality, is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above drawbacks of the prior art, the present invention aims to provide a method, system, medium and device for constructing a Chinese word-segmentation-free word embedding model, so as to solve the problem that the prior art does not reduce the influence of the large number of noisy n-gram fragments on the quality of the finally generated word embedding model and therefore cannot improve that quality.
To achieve the above and other related objects, according to one aspect of the present invention there is provided a method for constructing a Chinese word-segmentation-free word embedding model, the method comprising: counting the candidate fragments in a corpus and the word frequency information corresponding to the candidate fragments; determining the association strength of the candidate fragments from the word frequency information, and generating a word-embedding vocabulary according to the association strength; and constructing a positive sampling set and a negative sampling set according to the vocabulary, and constructing the word embedding model from the positive and negative sampling sets.
In an embodiment of the present invention, the candidate fragments are Chinese language model fragments, and the step of counting the candidate fragments in the corpus and their word frequency information includes: counting, over the corpus, the Chinese language model fragments corresponding to different fixed length values and their word frequency information.
In one embodiment of the present invention, the step of determining the association strength of the candidate fragments from the word frequency information and generating the word-embedding vocabulary according to the association strength includes: determining, from the word frequency information, an unsupervised association metric of the candidate fragments, the metric characterizing their association strength; and arranging the association strengths in descending order and selecting the K candidate fragments with the highest association strength as the word-embedding vocabulary.
In one embodiment of the present invention, the step of determining the unsupervised association metric of the candidate fragments from the word frequency information includes: calculating the mutual information value of the candidate fragments and determining the fragment combination for which the mutual information value is minimal; determining a first set and a second set according to the fragment combination, and calculating a statistical relation value between the fragment combination and the first or second set; and taking the product of the word frequency information, the mutual information value and the statistical relation value as the unsupervised association metric.
In an embodiment of the present invention, the step of determining the first and second sets according to the fragment combination and calculating the statistical relation value between the fragment combination and the first or second set includes: taking as the numerator the larger of the ratio of the word frequency information to the total word frequency of the first set and the ratio of the word frequency information to the total word frequency of the second set; selecting whichever of the first and second sets has the smaller total word frequency, and taking the reciprocal of its number of elements as the denominator; taking the value of the fraction formed by the numerator and the denominator as the relative-importance value of the fragment combination within the first or second set; and determining the statistical relation value from the relative-importance value.
In one embodiment of the present invention, the association strength is calculated separately for each candidate fragment length; for each length, the association strengths are arranged in descending order and the K candidate fragments with the highest association strength are selected for the word-embedding vocabulary, so that different numbers of candidate fragments can be selected for different lengths.
In one embodiment of the present invention, the step of constructing the positive and negative sampling sets according to the vocabulary and constructing the word embedding model from them includes: based on a skip-gram model combined with negative sampling, adopting a parameter optimization method to maximize the positive-sampling probability and minimize the negative-sampling probability, thereby constructing the word embedding model.
The invention also provides a system for constructing a Chinese word-segmentation-free word embedding model, comprising: a fragment statistics module for counting the candidate fragments in a corpus and the word frequency information corresponding to them; an association metric module for determining the association strength of the candidate fragments from the word frequency information and generating a word-embedding vocabulary according to the association strength; and a model generation module for constructing a positive sampling set and a negative sampling set according to the vocabulary and constructing the word embedding model from them.
In yet another aspect, the present invention provides a medium having stored thereon a computer program which, when executed by a processor, implements the method for constructing a Chinese word-segmentation-free word embedding model.
In a final aspect the invention provides a device comprising a processor and a memory; the memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so that the device performs the method for constructing a Chinese word-segmentation-free word embedding model.
As described above, the method, system, medium and device for constructing a Chinese word-segmentation-free word embedding model have the following beneficial effects:
A new unsupervised association metric is presented for screening strongly associated n-gram fragments. This unsupervised association metric is combined with the word embedding model to construct a new segmentation-free word embedding model for Chinese corpora. Word embeddings obtained by this method show better performance in downstream tasks.
Drawings
FIG. 1 is a schematic flow chart of a method for constructing a word-segmentation-free Chinese word embedding model according to an embodiment of the invention.
FIG. 2 is a flowchart of the association-metric determination in the method for constructing a Chinese word-segmentation-free word embedding model according to an embodiment of the present invention.
FIG. 3 is a flowchart of the association-strength calculation in one embodiment of the method for constructing a Chinese word-segmentation-free word embedding model according to the present invention.
Fig. 4 shows an effect diagram of the method for constructing the Chinese word-segmentation-free word embedding model of the present invention compared against the basic dictionary.
Fig. 5 shows an effect diagram of the method for constructing the Chinese word-segmentation-free word embedding model of the present invention compared against the rich dictionary.
FIG. 6 is a schematic diagram of a system for constructing a Chinese word-segmentation-free word embedding model according to an embodiment of the present invention.
FIG. 7 is a schematic diagram showing the structural connection of the apparatus for constructing a Chinese word-segmentation-free word embedding model according to an embodiment of the present invention.
Description of element reference numerals
6. System for constructing Chinese word-segmentation-free word embedding model
61. Fragment statistics module
62. Correlation measurement module
63. Model generation module
7. Apparatus
71. Processor
72. Memory
73. Communication interface
74. System bus
S11 to S13 steps
S121 to S122 steps
S121A to S121C steps
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes embodiments of the invention with reference to specific examples. The invention may also be practiced or applied in other, different embodiments, and the details of this description may be modified or varied in various respects without departing from the spirit of the invention. It should be noted that, in the absence of conflict, the following embodiments and the features within them may be combined with each other.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the invention schematically: the drawings show only the components related to the invention rather than the number, shape and size of the components in an actual implementation, where the form, quantity and proportion of the components may change arbitrarily and the layout may be more complicated.
To address the problem of excessive noisy n-grams in the vocabularies of current segmentation-free word embedding models, the method of the invention for constructing a Chinese word-segmentation-free word embedding model takes Chinese corpora as its object of study and, building on the negative-sampling skip-gram model, provides an unsupervised association metric with which the segmentation-free word embedding model is improved.
The following will explain in detail the principles and implementation of a method, a system, a medium and a device for constructing a chinese word-segmentation-free embedding model according to this embodiment with reference to fig. 1 to 7, so that those skilled in the art can understand the method, the system, the medium and the device for constructing a chinese word-segmentation-free embedding model according to this embodiment without creative effort.
Referring to fig. 1, a schematic flow chart of a method for constructing a word-segmentation-free Chinese word embedding model according to an embodiment of the invention is shown. As shown in FIG. 1, the method for constructing the Chinese word-segmentation-free word embedding model specifically comprises the following steps:
s11, counting candidate segments in the corpus and word frequency information corresponding to the candidate segments.
In this embodiment, the candidate fragments are Chinese language model fragments, for example n-gram fragments; the n-gram fragments corresponding to different fixed length values and their word frequency information are counted over the corpus.
Specifically, a simple tokenizer is realized through an n-gram model to obtain the n-gram fragments. The model is based on the assumption that the occurrence of the n-th word depends only on the preceding n−1 words and on no other word, so that the probability of a whole sentence is the product of the occurrence probabilities of its words. In general only the probabilities involving the two words before and after a word are calculated, i.e., n is taken as 2 and the probabilities at positions n−2, n−1, n+1 and n+2 are computed. With n = 3 the results are better; with n = 4 the amount of computation becomes large.
Specifically, the corpus is collated and all possible n-gram fragments under each fixed length, together with their word frequency information, are counted. The n-gram fragments of different lengths and the corresponding word frequency information are organized into lists, forming Table 1. For example, for the fragment "one" of length 1 Chinese character in Table 1, the word frequency is 529285.
Table 1. Candidate fragment table
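As an illustrative sketch of the statistics of step S11 (all function and variable names here are assumptions of this sketch, not taken from the patent), counting every fixed-length character n-gram fragment in a corpus can be done as follows:

```python
from collections import Counter

def count_ngrams(corpus, max_len):
    """Count every contiguous character n-gram of length 1..max_len, per length."""
    counts = {n: Counter() for n in range(1, max_len + 1)}
    for line in corpus:
        line = line.strip()
        for n in range(1, max_len + 1):
            for i in range(len(line) - n + 1):
                counts[n][line[i:i + n]] += 1
    return counts

# Toy corpus of two lines; real usage would iterate over a corpus file.
corpus = ["上海自来水来自海上", "山东落花生花落东山"]
counts = count_ngrams(corpus, 3)
```

Keeping a separate counter per length mirrors the per-length lists of Table 1 and makes the later per-length Top-K selection straightforward.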
S12. Determine the association strength of the candidate fragments from the word frequency information, and generate the word-embedding vocabulary according to the association strength.
In this embodiment, the association strength is calculated separately for each candidate fragment length.
Further, for candidate fragments of each length, the association strengths are arranged in descending order and the K candidate fragments with the highest association strength are selected for the word-embedding vocabulary, so that different numbers of candidate fragments can be selected for different lengths.
Referring to fig. 2, a flowchart of the association-metric determination in the method for constructing a word-segmentation-free Chinese word embedding model according to an embodiment of the invention is shown. As shown in fig. 2, S12 includes:
S121. Determine an unsupervised association metric of the candidate fragments from the word frequency information, the metric characterizing the association strength of the candidate fragments. By taking more statistical information into account, the PATI (Pointwise Association with Times Information) metric of the present invention can find more strongly associated n-gram fragments.
Referring to fig. 3, a flowchart of the association-strength calculation in the method for constructing a word-segmentation-free Chinese word embedding model according to an embodiment of the invention is shown. As shown in fig. 3, S121 includes:
S121A. Calculate the mutual information value of the candidate fragments and determine the fragment combination for which the mutual information value is minimal.
Specifically, the mutual information value is the MP value. Each n-gram fragment of length s is written g = w_i w_{i+1} … w_{i+s} (0 ≤ i ≤ N − s), and the left and right parts of g are a = w_i … w_{k−1} and b = w_k … w_{i+s} (i < k ≤ i + s), i.e., g = concat(a, b). f_a, f_b and f_g denote the word frequencies in the corpus of the strings a, b and of the n-gram fragment g, respectively.
For an n-gram fragment g = concat(a, b), its corresponding MP is the minimum over all splits of the pointwise-mutual-information-style score (the original formula (1) is not legible in this text and is reconstructed from the surrounding definitions):
    MP(g) = min_{i < k ≤ i+s} log( p(g) / ( p(a) · p(b) ) )    (1)
where the probabilities are estimated from the frequencies f_g, f_a and f_b. For an n-gram fragment g of fixed length s there is always a particular left-right combination (a_m, b_m) that minimizes MP; the subsequent calculation of AT is based on this particular combination (a_m, b_m).
S121B, determining a first set and a second set according to the fragment combination, and calculating the statistical relation value between the fragment combination and the first set or the second set.
In the present embodiment, S121B includes:
(1) Take as the numerator the larger of the ratio of the word frequency f_g to the total word frequency of the first set and the ratio of f_g to the total word frequency of the second set; select whichever of the first and second sets has the smaller total word frequency, and take the reciprocal of its number of elements as the denominator.
For the particular split combination (a_m, b_m) of an n-gram fragment g, there are several n-gram fragments of the form (a_m, b_h) and (a_j, b_m). The first set {a_m, *} and the second set {*, b_m} are defined as:
    {a_m, *} = {(a_m, b_1), (a_m, b_2), …, (a_m, b_h)}    (2)
    {*, b_m} = {(a_1, b_m), (a_2, b_m), …, (a_j, b_m)}    (3)
Let f_{a_m,*} and f_{*,b_m} denote the sums of the word frequencies of all n-gram fragments in the sets {a_m, *} and {*, b_m}, respectively:
    f_{a_m,*} = Σ_{g′ ∈ {a_m,*}} f_{g′},    f_{*,b_m} = Σ_{g′ ∈ {*,b_m}} f_{g′}    (4), (5)
For the n-gram fragment g and its particular combination (a_m, b_m), the variable rate is the maximum of the ratio of f_g to f_{a_m,*} and the ratio of f_g to f_{*,b_m}:
    rate = max( f_g / f_{a_m,*}, f_g / f_{*,b_m} )    (6)
For the two sets {a_m, *} and {*, b_m} and their total word frequencies f_{a_m,*} and f_{*,b_m}, let sizeof(·) denote the number of n-gram elements in a set. AC is then the reciprocal of the size of the set with the smaller total word frequency (formula (7), reconstructed from the surrounding text):
    AC = 1 / sizeof(S_min),  where S_min is the one of {a_m, *} and {*, b_m} with the smaller total word frequency    (7)
(2) Take the value of the fraction formed by this numerator and denominator as the relative-importance value of the fragment combination within the first or second set.
Specifically, given the variables rate and AC, the times value of the n-gram fragment g is defined as:
    times = rate / AC    (8)
(3) Determine the statistical relation value from the relative-importance value.
Specifically, for the particular combination (a_m, b_m) of an n-gram fragment g of length s, with its unique times value, AT is calculated as:
    AT = 1 + |log(times)|    (9)
And S121C, taking the product of the word frequency information, the mutual information value and the statistical relation value as an unsupervised association measurement index.
Specifically, the formula of PATI (Pointwise Association with Times Information) is:
    PATI = f × MP × AT    (10)
where f = f_g is the word frequency information, MP is the mutual information value, and AT is the statistical relation value.
MP is an improved version of pointwise mutual information (PMI): when computing the association strength it takes greater account of the marginal parts of each n-gram fragment g = concat(a, b), i.e., the statistics of the left part a and the right part b, which makes MP more sensitive to the local information of the n-gram.
AT uses the statistical information of the particular combination (a_m, b_m) within the sets {a_m, *} and {*, b_m} to measure the association strength of the n-gram further. The variable times takes into account both the word frequency of (a_m, b_m) and the number of fragments adjacent to (a_m, b_m) on its left and right; the higher the relative importance (the times value) of (a_m, b_m) within its sets, the more reasonable (a_m, b_m) is as a whole. In general, the times values of reasonable n-grams with high association strength are much larger than those of unreasonable n-gram fragments.
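The computation in S121A-S121C can be sketched as follows. This is a sketch under stated assumptions: it follows the formulas as reconstructed above, the way the split sets and the smaller-frequency set are built is this sketch's reading of the text, and all names are illustrative rather than the patent's code.

```python
import math

def pati(ngram, freq):
    """Sketch of the PATI score (formulas (1)-(10) as reconstructed).

    freq maps every relevant string (n-grams and their sub-strings) to its corpus count.
    """
    total = sum(freq.values())
    # (1) MP: minimize the PMI-style score over all left/right splits.
    best = None
    for k in range(1, len(ngram)):
        a, b = ngram[:k], ngram[k:]
        mp = math.log(freq[ngram] * total / (freq[a] * freq[b]))
        if best is None or mp < best[0]:
            best = (mp, a, b)
    mp, a_m, b_m = best
    # (2)-(5): same-length fragments sharing the minimizing left part a_m or right part b_m.
    left_set = [g for g in freq if len(g) == len(ngram) and g.startswith(a_m)]
    right_set = [g for g in freq if len(g) == len(ngram) and g.endswith(b_m)]
    f_left = sum(freq[g] for g in left_set)
    f_right = sum(freq[g] for g in right_set)
    # (6) rate: larger of the two frequency ratios.
    rate = max(freq[ngram] / f_left, freq[ngram] / f_right)
    # (7) AC: reciprocal of the size of the lower-frequency set.
    smaller = left_set if f_left <= f_right else right_set
    ac = 1.0 / len(smaller)
    # (8)-(9) times and AT.
    times = rate / ac
    at = 1.0 + abs(math.log(times))
    # (10) PATI = f x MP x AT.
    return freq[ngram] * mp * at

# Toy frequency table (invented numbers, for illustration only).
freq = {"ab": 4, "ad": 1, "cb": 2, "a": 5, "b": 6, "c": 2, "d": 1}
score = pati("ab", freq)
```

On this toy table the length-2 fragment "ab" gets a positive score because it is frequent relative to both its marginal parts and its adjacency sets.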
S122. Arrange the association strengths in descending order and select the K candidate fragments with the highest association strength as the word-embedding vocabulary.
Specifically, the proposed unsupervised association metric is used to calculate the association strength of the candidate n-gram fragments at each fragment length, and the Top-K n-gram fragments by association strength are then selected as the vocabulary of the word embedding model. (The Top-K problem is that of finding the K largest, or K smallest, items in a large collection.)
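The per-length Top-K selection can be sketched with the standard library's heap utilities (the function name and the per-length K mapping are assumptions of this sketch):

```python
import heapq

def topk_vocab(scores_by_len, k_by_len):
    """Select the K highest-scoring fragments for each fragment length.

    scores_by_len: {length: {fragment: association strength}}
    k_by_len:      {length: K} -- different lengths may use different K values.
    """
    vocab = []
    for n, scores in scores_by_len.items():
        k = k_by_len.get(n, 0)
        # nlargest keeps the selection at O(M log K) per length.
        vocab.extend(heapq.nlargest(k, scores, key=scores.get))
    return vocab

scores = {2: {"中国": 9.1, "的一": 0.3, "我们": 7.5}}
vocab = topk_vocab(scores, {2: 2})
```

Using a per-length K reflects the earlier observation that different numbers of candidate fragments may be selected for different lengths.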
S13, constructing a positive sampling set and a negative sampling set according to the vocabulary, and constructing a word embedding model by combining the positive sampling set and the negative sampling set.
In this embodiment, based on the skip-gram model combined with negative sampling, maximum likelihood estimation is adopted to maximize the positive-sampling probability and minimize the negative-sampling probability, thereby constructing the word embedding model. The invention screens the word-embedding vocabulary with the unsupervised association metric and reconstructs the positive and negative samples of the word embedding model, which reduces the influence of noisy n-gram fragments on the model and improves the performance of the embeddings in downstream tasks.
It should be noted that maximum likelihood estimation is only one embodiment of parameter estimation and optimization in the present invention; other methods of parameter estimation and optimization also fall within the scope of the invention.
Specifically, PFNE learns word embeddings based on the negative-sampling skip-gram model, which reduces the computation of the model's gradient updates and speeds up training. The positive sample set N_p of the model consists of the "center word-context" pairs (w_t, w_c) generated by combining the vocabulary with the corpus; the negative sample set N_n is built from a sufficiently large unigram word list in which each n-gram fragment is indexed, negative samples being drawn at random in proportion to the fragments' word frequencies in the list. The objective function of the PFNE model is defined (reconstructed here as the standard negative-sampling skip-gram objective, consistent with the surrounding description) as:
    L = Σ_{(w_t, w_c) ∈ N_p} log σ(v′_{w_c} · v_{w_t}) + Σ_{(w_t, w_n) ∈ N_n} log σ(−v′_{w_n} · v_{w_t})
where v_{w_t} and v′_{w_c} are the vectors of the center word w_t and of its context w_c, respectively, and σ is the sigmoid function. The model uses maximum likelihood estimation to predict the context from the center word, maximizing the probability of positive samples and minimizing that of negative samples, so as to optimize the word embedding model generated by the objective function. The objective is optimized by stochastic gradient descent over the positive and negative samples.
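A toy sketch of one stochastic-gradient step on a negative-sampling skip-gram objective of this form follows. It uses plain-Python vectors; all names are illustrative, and this is a minimal sketch of the technique, not the patent's PFNE implementation.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgns_loss(v_center, v_context, v_negatives):
    """Negative log-likelihood of one positive pair plus its negative samples."""
    dot = sum(a * b for a, b in zip(v_center, v_context))
    loss = -math.log(sigmoid(dot))
    for v_n in v_negatives:
        dot_n = sum(a * b for a, b in zip(v_center, v_n))
        loss -= math.log(sigmoid(-dot_n))
    return loss

def sgns_step(v_center, v_context, v_negatives, lr=0.05):
    """One SGD step: raise sigma(v'_c . v_t) for the positive pair, lower it for negatives."""
    grad_center = [0.0] * len(v_center)
    for v_out, label in [(v_context, 1.0)] + [(v_n, 0.0) for v_n in v_negatives]:
        score = sigmoid(sum(a * b for a, b in zip(v_center, v_out)))
        g = score - label  # derivative of the loss w.r.t. the dot product
        for i in range(len(v_center)):
            grad_center[i] += g * v_out[i]
            v_out[i] -= lr * g * v_center[i]
    for i in range(len(v_center)):
        v_center[i] -= lr * grad_center[i]

random.seed(0)
dim = 8
v_t = [random.uniform(-0.5, 0.5) for _ in range(dim)]   # center-word vector
v_c = [random.uniform(-0.5, 0.5) for _ in range(dim)]   # context vector
negs = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(3)]
before = sgns_loss(v_t, v_c, negs)
for _ in range(50):
    sgns_step(v_t, v_c, negs)
after = sgns_loss(v_t, v_c, negs)
```

After the updates, the loss decreases: the positive pair's probability is pushed up while the negatives' probabilities are pushed down, which is the behavior the objective above describes.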
Referring to fig. 4 and fig. 5, effect diagrams are shown comparing the method for constructing a Chinese word-segmentation-free word embedding model of the present invention against the basic dictionary and against the rich dictionary, respectively. In fig. 4 and 5, PFNE denotes the result of comparing the n-gram fragments screened with the PATI algorithm against the dictionary; SEMBEI denotes the result of comparing the n-gram fragments screened by frequency (word frequency) against the dictionary; SGNS-PMI denotes the result of comparing the n-gram fragments screened by PMI (Pointwise Mutual Information) against the dictionary. The vertical axis is precision and the horizontal axis is recall. Writing S for the set of screened n-gram fragments and D for the dictionary, the standard expressions (the original formulas are not legible in this text) are precision = |S ∩ D| / |S| and recall = |S ∩ D| / |D|.
Further, the higher and the longer a curve, the more reasonable the screened n-gram fragments. As can be seen in fig. 4 and 5, the solid PFNE curve using the PATI algorithm is the highest and longest of the three, which shows that, compared with the prior-art word embedding models, the method of the invention screens more reasonable n-gram fragments against both the basic dictionary and the rich dictionary.
The protection scope of the method for constructing a Chinese word-segmentation-free word embedding model of the present invention is not limited to the execution order of the steps listed in this embodiment; all schemes realized in the prior art by adding, removing or replacing steps according to the principles of the invention are included in the protection scope of the invention.
The present embodiment provides a computer storage medium having a computer program stored thereon, which when executed by a processor implements the method for constructing a chinese word-segmentation-free embedding model.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be performed by hardware driven by a computer program. The aforementioned computer program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The aforementioned computer-readable storage medium includes various computer storage media capable of storing program code, such as ROM, RAM, magnetic disks or optical disks.
The following describes in detail, with reference to the drawings, the system for constructing a Chinese word-segmentation-free word embedding model provided in this embodiment. It should be understood that the division of the modules of the following system is merely a division of logical functions; in actual implementation the modules may be fully or partially integrated into one physical entity, or physically separated. The modules may all be realized as software invoked by a processing element, all as hardware, or partly as software invoked by a processing element and partly as hardware. For example, a module may be a separately established processing element, or may be integrated in a chip of the system described below; a module may also be stored in the memory of the following system in the form of program code, with its function invoked and executed by a processing element of the system. The implementation of the other modules is similar. All or part of the modules may be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal-processing capability. In implementation, each step of the above method, or each module below, may be completed by an integrated logic circuit of hardware in the processor element or by instructions in software form.
The modules below may be one or more integrated circuits configured to implement the above method, for example: one or more application specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), one or more digital signal processors (Digital Signal Processor, DSP for short), one or more field programmable gate arrays (Field Programmable Gate Array, FPGA for short), and the like. When a module is implemented in the form of a processing element calling program code, the processing element may be a general-purpose processor, such as a central processing unit (Central Processing Unit, CPU for short) or another processor capable of calling program code. These modules may also be integrated together and implemented in the form of a system-on-a-chip (System-on-a-Chip, SOC for short).
Referring to fig. 6, a schematic diagram of a system for constructing a Chinese word-segmentation-free word embedding model according to an embodiment of the present invention is shown. As shown in fig. 6, the system 6 for constructing the Chinese word-segmentation-free word embedding model includes: a segment statistics module 61, an association metric module 62, and a model generation module 63.
The segment statistics module 61 is configured to count the candidate segments in the corpus and the word frequency information corresponding to the candidate segments.
In this embodiment, the candidate segments are Chinese language model segments, and the segment statistics module 61 is specifically configured to count, in the corpus, the Chinese language model segments corresponding to different fixed length values and their word frequency information.
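As a concrete illustration, the fixed-length fragment counting can be sketched as follows. The corpus, the chosen lengths, and all names here are illustrative assumptions, not part of the patent:

```python
from collections import Counter

def count_ngram_fragments(corpus, lengths=(2, 3, 4)):
    """Count candidate n-gram fragments of each fixed length and their word frequencies.

    `corpus` is an iterable of sentences (strings); `lengths` lists the fixed
    n-gram lengths to extract (hypothetical defaults).
    """
    freq = Counter()
    for sentence in corpus:
        for n in lengths:
            for i in range(len(sentence) - n + 1):
                freq[sentence[i:i + n]] += 1
    return freq

corpus = ["自然语言处理模型", "语言模型训练"]
freq = count_ngram_fragments(corpus, lengths=(2,))
print(freq["语言"])  # the bigram "语言" occurs once in each sentence
```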
The association metric module 62 is configured to determine the association strength of the candidate segments in combination with the word frequency information, and to generate a word-embedding vocabulary according to the association strength.
In this embodiment, the association metric module 62 is specifically configured to determine an unsupervised association metric for each candidate segment in combination with the word frequency information, where the unsupervised association metric characterizes the association strength of the candidate segment; the association strengths are then sorted in descending order, and the K candidate segments with the highest association strength are selected as the word-embedding vocabulary.
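The descending sort and top-K selection amount to the following sketch (the scores and the helper name are hypothetical):

```python
def build_vocabulary(scores, k):
    """Rank candidate fragments by association strength (descending) and keep the top K."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [frag for frag, _ in ranked[:k]]

# toy association strengths for four candidate fragments
scores = {"语言": 5.2, "然语": 0.3, "模型": 4.8, "言处": 0.1}
vocab = build_vocabulary(scores, k=2)
print(vocab)  # ['语言', '模型']
```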
The model generation module 63 is configured to construct a positive sampling set and a negative sampling set according to the vocabulary, and to construct a word embedding model by combining the positive and negative sampling sets.
In this embodiment, the model generation module 63 is specifically configured to construct the word embedding model based on a skip-gram model combined with negative sampling, using parameter optimization to maximize the positive-sampling probability and minimize the negative-sampling probability.
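A minimal numeric sketch of one skip-gram-with-negative-sampling update, assuming plain SGD on the usual log-sigmoid objective; the matrix shapes, learning rate, and indices are illustrative, not the patent's parameters:

```python
import numpy as np

def sgns_step(center, context, negatives, W_in, W_out, lr=0.025):
    """One SGD step of skip-gram with negative sampling.

    Ascends log sigma(v_c . u_o) + sum log sigma(-v_c . u_neg): the positive
    pair's probability is pushed up, each negative pair's probability down.
    Indices select rows of the input/output embedding matrices W_in / W_out.
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    v = W_in[center]
    grad_v = np.zeros_like(v)
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = W_out[word]
        g = lr * (label - sigmoid(v @ u))  # gradient scale of the log-likelihood
        grad_v += g * u
        W_out[word] += g * v
    W_in[center] += grad_v

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(10, 8))
W_out = rng.normal(scale=0.1, size=(10, 8))
before = W_in[3] @ W_out[5]
for _ in range(50):
    sgns_step(3, 5, negatives=[1, 7], W_in=W_in, W_out=W_out)
after = W_in[3] @ W_out[5]
print(after > before)  # the positive pair's score has increased
```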
The above system for constructing the Chinese word-segmentation-free word embedding model can implement the method for constructing the Chinese word-segmentation-free word embedding model; however, the apparatus for implementing the method includes, but is not limited to, the structure of the system listed in this embodiment. All structural variations and substitutions of the prior art made according to the principles of the invention are included in the protection scope of the invention.
Referring to fig. 7, a schematic diagram of the structural connections of a device for constructing a Chinese word-segmentation-free word embedding model according to an embodiment of the invention is shown. As shown in fig. 7, this embodiment provides a device 7, the device 7 including: a processor 71, a memory 72, a communication interface 73, and a system bus 74. The memory 72 and the communication interface 73 are connected to the processor 71 via the system bus 74 and communicate with each other; the memory 72 is used for storing a computer program, the communication interface 73 is used for communicating with other devices, and the processor 71 is used for running the computer program so that the device 7 executes the steps of the method for constructing the Chinese word-segmentation-free word embedding model.
The system bus 74 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like, and may be divided into an address bus, a data bus, and a control bus. The communication interface 73 enables communication between the database access apparatus and other devices such as clients, read-write libraries and read-only libraries. The memory 72 may include random access memory (Random Access Memory, RAM for short) and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The processor 71 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field-programmable gate array (Field Programmable Gate Array, FPGA for short) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In summary, the method, system, medium and device for constructing the Chinese word-segmentation-free word embedding model provide a new unsupervised association metric for screening n-gram fragments with strong association. Combining this unsupervised association metric with a word embedding model yields a new word-segmentation-free Chinese word embedding model for Chinese corpora. The word embedding model obtained by this method shows better performance in downstream tasks. The invention effectively overcomes various defects in the prior art and has high industrial utilization value.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify or vary the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and variations completed by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.
Claims (7)
1. A method for constructing a Chinese word-segmentation-free word embedding model, characterized by comprising the following steps:
counting candidate segments in a corpus and word frequency information corresponding to the candidate segments;
determining the association strength of the candidate segments in combination with the word frequency information, and generating a word-embedding vocabulary according to the association strength;
constructing a positive sampling set and a negative sampling set according to the vocabulary, and constructing a word embedding model by combining the positive sampling set and the negative sampling set;
wherein determining the association strength of the candidate segments in combination with the word frequency information and generating the word-embedding vocabulary according to the association strength comprises: determining an unsupervised association metric for each candidate segment in combination with the word frequency information, wherein the unsupervised association metric characterizes the association strength of the candidate segment; sorting the association strengths in descending order, and selecting the K candidate fragments with the highest association strength as the word-embedding vocabulary;
wherein determining the unsupervised association metric of the candidate segment in combination with the word frequency information comprises:
a, calculating the mutual information value of the candidate fragment, and determining the fragment combination for which the mutual information value is minimal; the mutual information value is the MP value, defined as: MP = f_g / (f_a × f_b), where f_a, f_b and f_g respectively denote the word frequencies of the character strings a and b and of the n-gram fragment g in the corpus;
b, determining a first set and a second set according to the fragment combination, and calculating a statistical relation value between the fragment combination and the first set or the second set, comprising:
(1) Taking, as the numerator, the maximum of the ratio of the word frequency information to the total word frequency of the first set and the ratio of the word frequency information to the total word frequency of the second set; and taking, as the denominator, the reciprocal of the number of elements of whichever of the first and second sets has the smaller total word frequency;
for an n-gram fragment g with a specific combination (a_m, b_m) and a batch of n-gram fragments (a_m, b_h) and (a_j, b_m), the first set {a_m, *} and the second set {*, b_m} are defined as: {a_m, *} = {(a_m, b_1), (a_m, b_2), …, (a_m, b_h)} and {*, b_m} = {(a_1, b_m), (a_2, b_m), …, (a_j, b_m)}; let f_{a_m,*} and f_{*,b_m} respectively denote the sum of the word frequencies of all n-gram fragments in the sets {a_m, *} and {*, b_m}, defined as: f_{a_m,*} = f_{(a_m,b_1)} + f_{(a_m,b_2)} + … + f_{(a_m,b_h)} and f_{*,b_m} = f_{(a_1,b_m)} + f_{(a_2,b_m)} + … + f_{(a_j,b_m)};
for the n-gram fragment g and its specific combination (a_m, b_m), the variable rate denotes the maximum of the ratio of f_g to f_{a_m,*} and the ratio of f_g to f_{*,b_m}; rate is defined as: rate = max(f_g / f_{a_m,*}, f_g / f_{*,b_m});
for the two sets {a_m, *} and {*, b_m} and their corresponding total word frequencies f_{a_m,*} and f_{*,b_m}, let sizeof denote the number of n-gram elements in a set; AC is then the reciprocal of the element count of the set with the smaller total word frequency: AC = 1 / sizeof({a_m, *}) if f_{a_m,*} ≤ f_{*,b_m}, and AC = 1 / sizeof({*, b_m}) otherwise;
(2) Combining the calculated numerator and denominator into a calculated value of the relative importance of the fragment combination within the first set or the second set; with the variable rate as the numerator and the variable AC as the denominator, the relative-importance value times of the n-gram fragment g is defined as: times = rate / AC;
(3) Determining the statistical relation value AT from the relative-importance value times; AT is calculated as: AT = 1 + |log(times)|;
c, letting the word frequency information F equal f_g, and taking the product of the word frequency information F, the mutual information value MP and the statistical relation value AT as the unsupervised association metric; the unsupervised association metric PATI is given by: PATI = F × MP × AT.
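The computation of steps a–c can be sketched numerically as follows. The dictionary-based data layout, the toy frequencies, and the assumption that the two sets are precomputed for the MP-minimising split are simplifications for illustration only:

```python
import math

def pati(f_g, splits, row_freqs, col_freqs):
    """Sketch of the PATI score for a single n-gram fragment g.

    f_g       : word frequency of g in the corpus
    splits    : {(a, b): (f_a, f_b)} for every binary split of g into strings a, b
    row_freqs : word frequencies of the fragments in the first set {a_m, *}
    col_freqs : word frequencies of the fragments in the second set {*, b_m}
    """
    # a) mutual information value MP = f_g / (f_a * f_b), minimised over splits
    mp = min(f_g / (f_a * f_b) for f_a, f_b in splits.values())
    # b.1) rate: larger of f_g over each set's total word frequency
    f_row, f_col = sum(row_freqs), sum(col_freqs)
    rate = max(f_g / f_row, f_g / f_col)
    # b.2) AC: reciprocal of the element count of the lower-frequency set
    ac = 1.0 / (len(row_freqs) if f_row <= f_col else len(col_freqs))
    # b.3) relative importance and statistical relation value
    times = rate / ac
    at = 1.0 + abs(math.log(times))
    # c) PATI = F * MP * AT with F = f_g
    return f_g * mp * at

score = pati(f_g=30,
             splits={("语", "言模"): (100, 40), ("语言", "模"): (120, 50)},
             row_freqs=[30, 10, 5], col_freqs=[30, 20])
print(round(score, 4))  # 0.254 for this toy example
```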
2. The method for constructing a word-segmentation-free Chinese word embedding model according to claim 1, wherein the candidate segments are Chinese language model segments, and the step of counting the candidate segments in the corpus and word frequency information corresponding to the candidate segments comprises:
counting, in the corpus, the Chinese language model fragments corresponding to different fixed length values and their word frequency information.
3. The method for constructing a Chinese word-segmentation-free word embedding model according to claim 1, wherein determining the association strength of the candidate segments and generating the word-embedding vocabulary comprises:
calculating the association strength of the candidate fragments of each length;
sorting the candidate fragments of each length by association strength in descending order, and selecting, for each length, the K candidate fragments with the highest association strength as word-embedding vocabulary entries, so that different numbers of candidate fragments are selected for different lengths.
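Per-length selection as in claim 3 can be sketched as follows (the per-length K values and the scores are made-up illustrations):

```python
def vocabulary_by_length(scores, k_per_length):
    """Select the top-K fragments separately for each fragment length.

    scores       : {fragment: association strength}
    k_per_length : {length: K} — how many fragments to keep at each length
    """
    vocab = []
    for n, k in k_per_length.items():
        cands = [(f, s) for f, s in scores.items() if len(f) == n]
        cands.sort(key=lambda fs: fs[1], reverse=True)
        vocab.extend(f for f, _ in cands[:k])
    return vocab

scores = {"语言": 5.0, "模型": 4.0, "处理": 1.0, "语言模": 3.0, "言处理": 2.5}
selected = vocabulary_by_length(scores, {2: 2, 3: 1})
print(selected)  # ['语言', '模型', '语言模']
```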
4. The method for constructing a Chinese word-segmentation-free word embedding model according to claim 1, wherein the step of constructing a positive sampling set and a negative sampling set according to the vocabulary and constructing the word embedding model by combining the positive and negative sampling sets comprises:
constructing the word embedding model based on a skip-gram model combined with negative sampling, using parameter optimization to maximize the positive-sampling probability and minimize the negative-sampling probability.
5. A system for constructing a Chinese word-segmentation-free word embedding model, characterized by comprising:
the segment statistics module is used for counting candidate segments in the corpus and word frequency information corresponding to the candidate segments;
the association metric module, used for determining the association strength of the candidate fragments in combination with the word frequency information, and generating a word-embedding vocabulary according to the association strength;
the model generation module is used for constructing a positive sampling set and a negative sampling set according to the vocabulary, and constructing a word embedding model by combining the positive sampling set and the negative sampling set;
wherein determining the association strength of the candidate segments in combination with the word frequency information and generating the word-embedding vocabulary according to the association strength comprises: determining an unsupervised association metric for each candidate segment in combination with the word frequency information, wherein the unsupervised association metric characterizes the association strength of the candidate segment; sorting the association strengths in descending order, and selecting the K candidate fragments with the highest association strength as the word-embedding vocabulary;
wherein determining the unsupervised association metric of the candidate segment in combination with the word frequency information comprises:
a, calculating the mutual information value of the candidate fragment, and determining the fragment combination for which the mutual information value is minimal; the mutual information value is the MP value, defined as: MP = f_g / (f_a × f_b), where f_a, f_b and f_g respectively denote the word frequencies of the character strings a and b and of the n-gram fragment g in the corpus;
b, determining a first set and a second set according to the fragment combination, and calculating a statistical relation value between the fragment combination and the first set or the second set, comprising:
(1) Taking, as the numerator, the maximum of the ratio of the word frequency information to the total word frequency of the first set and the ratio of the word frequency information to the total word frequency of the second set; and taking, as the denominator, the reciprocal of the number of elements of whichever of the first and second sets has the smaller total word frequency;
for an n-gram fragment g with a specific combination (a_m, b_m) and a batch of n-gram fragments (a_m, b_h) and (a_j, b_m), the first set {a_m, *} and the second set {*, b_m} are defined as: {a_m, *} = {(a_m, b_1), (a_m, b_2), …, (a_m, b_h)} and {*, b_m} = {(a_1, b_m), (a_2, b_m), …, (a_j, b_m)}; let f_{a_m,*} and f_{*,b_m} respectively denote the sum of the word frequencies of all n-gram fragments in the sets {a_m, *} and {*, b_m}, defined as: f_{a_m,*} = f_{(a_m,b_1)} + f_{(a_m,b_2)} + … + f_{(a_m,b_h)} and f_{*,b_m} = f_{(a_1,b_m)} + f_{(a_2,b_m)} + … + f_{(a_j,b_m)};
for the n-gram fragment g and its specific combination (a_m, b_m), the variable rate denotes the maximum of the ratio of f_g to f_{a_m,*} and the ratio of f_g to f_{*,b_m}; rate is defined as: rate = max(f_g / f_{a_m,*}, f_g / f_{*,b_m});
for the two sets {a_m, *} and {*, b_m} and their corresponding total word frequencies f_{a_m,*} and f_{*,b_m}, let sizeof denote the number of n-gram elements in a set; AC is then the reciprocal of the element count of the set with the smaller total word frequency: AC = 1 / sizeof({a_m, *}) if f_{a_m,*} ≤ f_{*,b_m}, and AC = 1 / sizeof({*, b_m}) otherwise;
(2) Combining the calculated numerator and denominator into a calculated value of the relative importance of the fragment combination within the first set or the second set; with the variable rate as the numerator and the variable AC as the denominator, the relative-importance value times of the n-gram fragment g is defined as: times = rate / AC;
(3) Determining the statistical relation value AT from the relative-importance value times; AT is calculated as: AT = 1 + |log(times)|;
c, letting the word frequency information F equal f_g, and taking the product of the word frequency information F, the mutual information value MP and the statistical relation value AT as the unsupervised association metric; the unsupervised association metric PATI is given by: PATI = F × MP × AT.
6. A medium having stored thereon a computer program, characterized in that the program, when executed by a processor, implements the method for constructing a Chinese word-segmentation-free word embedding model according to any one of claims 1 to 4.
7. An apparatus, comprising: a processor and a memory;
wherein the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory, so that the apparatus executes the method for constructing a Chinese word-segmentation-free word embedding model according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010437000.3A CN113705227B (en) | 2020-05-21 | 2020-05-21 | Method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113705227A CN113705227A (en) | 2021-11-26 |
CN113705227B true CN113705227B (en) | 2023-04-25 |
Family
ID=78645861
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010437000.3A Active CN113705227B (en) | 2020-05-21 | 2020-05-21 | Method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113705227B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104933183A (en) * | 2015-07-03 | 2015-09-23 | 重庆邮电大学 | Inquiring term rewriting method merging term vector model and naive Bayes |
CN106095736A (en) * | 2016-06-07 | 2016-11-09 | 华东师范大学 | A kind of method of field neologisms extraction |
CN107015963A (en) * | 2017-03-22 | 2017-08-04 | 重庆邮电大学 | Natural language semantic parsing system and method based on deep neural network |
CN107273352A (en) * | 2017-06-07 | 2017-10-20 | 北京理工大学 | A kind of word insertion learning model and training method based on Zolu functions |
CN107491444A (en) * | 2017-08-18 | 2017-12-19 | 南京大学 | Parallelization word alignment method based on bilingual word embedded technology |
CN108959431A (en) * | 2018-06-11 | 2018-12-07 | 中国科学院上海高等研究院 | Label automatic generation method, system, computer readable storage medium and equipment |
CN110390018A (en) * | 2019-07-25 | 2019-10-29 | 哈尔滨工业大学 | A kind of social networks comment generation method based on LSTM |
Non-Patent Citations (2)
Title |
---|
Geewook Kim et al. "Segmentation-free compositional n-gram embedding." arXiv, 2018, pp. 1-9. *
Xiaobin Wang et al. "Unsupervised Learning Helps Supervised Neural Word Segmentation." AAAI-19, 2019, pp. 7200-7207. *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||