CN113705227B - Method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model - Google Patents


Info

Publication number
CN113705227B
CN113705227B (application CN202010437000.3A)
Authority
CN
China
Prior art keywords
word
word frequency
candidate
constructing
frequency information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010437000.3A
Other languages
Chinese (zh)
Other versions
CN113705227A (en)
Inventor
张一帆
王茂华
顾倩荣
黄永健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Advanced Research Institute of CAS
Original Assignee
Shanghai Advanced Research Institute of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Advanced Research Institute of CAS filed Critical Shanghai Advanced Research Institute of CAS
Priority to CN202010437000.3A priority Critical patent/CN113705227B/en
Publication of CN113705227A publication Critical patent/CN113705227A/en
Application granted granted Critical
Publication of CN113705227B publication Critical patent/CN113705227B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method, a system, a medium and a device for constructing a segmentation-free Chinese word embedding model. The construction method comprises the following steps: counting the candidate fragments in a corpus and the word frequency information corresponding to the candidate fragments; determining the association strength of the candidate fragments from the word frequency information, and generating the word-embedding vocabulary according to the association strength; and constructing a positive sampling set and a negative sampling set from the vocabulary, and building the word embedding model from the two sampling sets. Aiming at the problem that the vocabularies of existing segmentation-free word embedding models contain too many noise n-grams, the invention takes Chinese corpora as its research object and, building on the negative-sampling skip-gram model, proposes an unsupervised association metric for improving the segmentation-free word embedding model.

Description

Method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model
Technical Field
The invention belongs to the technical field of natural language processing and relates to a design method for word embedding models, in particular to a method, a system, a medium and a device for constructing a segmentation-free Chinese word embedding model.
Background
Word embedding is a basic task in the field of natural language processing and plays an important role in downstream tasks such as machine translation and part-of-speech tagging. Because there are no explicit separators between words in Chinese text, existing Chinese word embedding methods generally first perform Chinese word segmentation and take the segmented words as the embedding targets. However, Chinese word segmentation still has many problems, and these problems seriously affect the quality of Chinese word embeddings. Therefore, for languages like Chinese, segmentation-free word embedding models have been proposed to avoid the influence of segmentation errors, and they have been shown to outperform conventional word embedding methods.
Current segmentation-free word embedding models mainly take the n-gram fragments with the Top-K highest word frequencies as the training objects. However, considering word frequency alone leaves a large number of noise n-gram fragments in the word-embedding vocabulary, which degrades the quality of the resulting word embeddings.
Therefore, how to design a segmentation-free word embedding model that reduces the influence of the large number of noise n-gram fragments on the finally generated model, and thereby improves its quality, is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above drawbacks of the prior art, the present invention aims to provide a method, a system, a medium and a device for constructing a segmentation-free Chinese word embedding model, so as to solve the problem that the prior art does not reduce the influence of the large number of noise n-gram fragments on the finally generated word embedding model and thus cannot improve its quality.
To achieve the above and other related objects, according to one aspect of the present invention, there is provided a method for constructing a chinese word-segmentation-free word-embedded model, the method comprising: counting candidate segments in a corpus and word frequency information corresponding to the candidate segments; determining the association strength of the candidate segments by combining the word frequency information, and generating a word embedded vocabulary according to the association strength; and constructing a positive sampling set and a negative sampling set according to the vocabulary, and constructing a word embedding model by combining the positive sampling set and the negative sampling set.
In an embodiment of the present invention, the candidate fragments are Chinese language model fragments, and the step of counting the candidate fragments in the corpus and their corresponding word frequency information comprises: counting, in the corpus, the Chinese language model fragments corresponding to different fixed length values together with their word frequency information.
In one embodiment of the present invention, the step of determining the association strength of the candidate fragments from the word frequency information and generating the word-embedding vocabulary according to the association strength comprises: determining an unsupervised association metric of each candidate fragment from the word frequency information, wherein the unsupervised association metric characterizes the association strength of the candidate fragment; and arranging the association strengths in descending order and selecting the top-K candidate fragments as the word-embedding vocabulary.
In one embodiment of the present invention, the step of determining the unsupervised association metric of the candidate segment by combining the word frequency information includes: calculating the mutual information value of the candidate fragments, and determining the corresponding fragment combination when the mutual information value is minimum; determining a first set and a second set according to the fragment combination, and calculating a statistical relationship value between the fragment combination and the first set or the second set; and taking the product of the word frequency information, the mutual information value and the statistical relation value as an unsupervised association measurement index.
In an embodiment of the present invention, the step of determining the first set and the second set according to the fragment combination and calculating the statistical relationship value between the fragment combination and the first or second set comprises: taking as the numerator the maximum of the ratio of the fragment's word frequency to the total word frequency of the first set and the ratio of the fragment's word frequency to the total word frequency of the second set; selecting whichever of the first and second sets has the smaller total word frequency and taking the reciprocal of its number of elements as the denominator; taking the value of the fraction formed by this numerator and denominator as the relative importance of the fragment combination within the first or second set; and determining the statistical relationship value from this relative-importance value.
In one embodiment of the present invention, the association strength is calculated separately for each candidate-fragment length; for each length, the association strengths are arranged in descending order and the top-K candidate fragments are selected for the word-embedding vocabulary, so that different numbers of candidate fragments can be selected for different lengths.
In one embodiment of the present invention, the step of constructing a positive sampling set and a negative sampling set according to the vocabulary, and constructing a word embedding model by combining the positive sampling set and the negative sampling set includes: based on a skip-gram model combined with negative sampling, a parameter optimization method is adopted to maximize positive sampling probability and minimize negative sampling probability, and the word embedding model is constructed.
The invention also provides a construction system of the Chinese word-segmentation-free word embedding model, which comprises: the segment statistics module is used for counting candidate segments in the corpus and word frequency information corresponding to the candidate segments; the association measurement module is used for determining association strength of the candidate fragments by combining the word frequency information and generating a word embedded vocabulary according to the association strength; and the model generation module is used for constructing a positive sampling set and a negative sampling set according to the vocabulary, and constructing a word embedding model by combining the positive sampling set and the negative sampling set.
In yet another aspect, the present invention provides a medium having stored thereon a computer program which, when executed by a processor, implements the method for constructing a segmentation-free Chinese word embedding model.
In a final aspect the invention provides an apparatus comprising: a processor and a memory; the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory, so that the equipment executes the construction method of the Chinese word-segmentation-free word embedding model.
As described above, the method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model have the following beneficial effects:
A new unsupervised association metric is presented for screening n-gram fragments with strong association. This unsupervised association metric is combined with the word embedding model to construct a new segmentation-free Chinese word embedding model for Chinese corpora. The word embedding model obtained by this method shows better performance in downstream tasks.
Drawings
FIG. 1 is a schematic flow chart of a method for constructing a word-segmentation-free Chinese word embedding model according to an embodiment of the invention.
FIG. 2 is a flowchart showing the related metrics of the method for constructing the word-free Chinese word-embedding model according to an embodiment of the present invention.
FIG. 3 is a flowchart showing the correlation strength calculation in one embodiment of the method for constructing a Chinese word-segmentation-free word embedding model according to the present invention.
Fig. 4 shows an effect diagram of the method for constructing the Chinese word-segmentation-free word embedding model according to the present invention against the basic dictionary.
Fig. 5 shows an effect diagram of the method for constructing the chinese word-segmentation-free word embedding model according to the present invention against the rich dictionary.
FIG. 6 is a schematic diagram of a system for constructing a Chinese word-segmentation-free word embedding model according to an embodiment of the present invention.
FIG. 7 is a schematic diagram showing the structural connection of the apparatus for constructing a Chinese word-segmentation-free word embedding model according to an embodiment of the present invention.
Description of element reference numerals
6. System for constructing Chinese word-segmentation-free word embedding model
61. Fragment statistics module
62. Correlation measurement module
63. Model generation module
7. Apparatus
71. Processor
72. Memory
73. Communication interface
74. System bus
S11 to S13 steps
S121 to S122 steps
S121A to S121C steps
Detailed Description
Other advantages and effects of the present invention will become readily apparent to those skilled in the art from the following disclosure, which describes embodiments of the present invention with reference to specific examples. The invention may also be practiced or carried out in other, different embodiments, and the details of the present description may be modified or varied in various respects without departing from the spirit and scope of the present invention. It should be noted that, in the absence of conflict, the following embodiments and the features in the embodiments may be combined with each other.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the present invention by way of illustration, and only the components related to the present invention are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
The method for constructing a segmentation-free Chinese word embedding model according to the present invention addresses the problem that the vocabularies of current segmentation-free word embedding models contain too many noise n-grams: taking Chinese corpora as the research object and using the negative-sampling skip-gram model, it provides a way to improve the segmentation-free word embedding model with an unsupervised association metric.
The following explains in detail the principles and implementation of the method, system, medium and device for constructing a segmentation-free Chinese word embedding model according to this embodiment with reference to Figs. 1 to 7, so that those skilled in the art can understand them without creative effort.
Referring to fig. 1, a schematic flow chart of a method for constructing a word-segmentation-free Chinese word embedding model according to an embodiment of the invention is shown. As shown in FIG. 1, the method for constructing the Chinese word-segmentation-free word embedding model specifically comprises the following steps:
s11, counting candidate segments in the corpus and word frequency information corresponding to the candidate segments.
In this embodiment, the candidate segment is a chinese language model segment, for example, the candidate segment is an n-gram segment, and n-gram segments and word frequency information thereof corresponding to different fixed length values are counted in the corpus.
Specifically, a simple segmenter is implemented with an n-gram model to obtain the n-gram fragments. The model is based on the assumption that the occurrence of the n-th word depends only on the preceding n − 1 words and on no other word, so the probability of a whole sentence is the product of the occurrence probabilities of its words. In general only the dependence between a word and its immediate neighbour is computed, i.e. n is taken as 2; n = 3 models the text better, while at n = 4 the amount of computation becomes large.
Specifically, the corpus is preprocessed, and all possible n-gram fragments under each fixed length are counted together with their word frequency information. The n-gram fragments of different lengths and their corresponding word frequencies are organized into lists, forming Table 1. For example, the fragment "一" ("one"), one Chinese character in length, has a word frequency of 529285 in Table 1.
Table 1: Table of candidate fragments
(Table 1 is rendered as an image in the original.)
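As a sketch of this counting step, the following minimal Python snippet enumerates every candidate n-gram fragment of each fixed length and tallies its word frequency. The corpus sentences and `max_len` here are illustrative assumptions, not data from the patent:

```python
from collections import Counter

def count_ngram_fragments(corpus, max_len):
    """Count all candidate n-gram fragments of length 1..max_len.

    `corpus` is a list of sentences (strings of Chinese characters).
    Returns {length: Counter(fragment -> word frequency)}.
    """
    tables = {n: Counter() for n in range(1, max_len + 1)}
    for sentence in corpus:
        for n in range(1, max_len + 1):
            # slide a window of width n over the sentence
            for i in range(len(sentence) - n + 1):
                tables[n][sentence[i:i + n]] += 1
    return tables

# Two toy sentences stand in for the corpus.
tables = count_ngram_fragments(["上海研究院", "上海高等研究院"], max_len=2)
```

Each `tables[n]` then plays the role of one length-row of Table 1.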
And S12, determining the association strength of the candidate segments by combining the word frequency information, and generating a word embedded vocabulary according to the association strength.
In this embodiment, the correlation strength is calculated for each length of the candidate segment.
Further, for candidate fragments of different lengths, the association strengths under each length are arranged in descending order and the top-K candidate fragments are selected for the word-embedding vocabulary, so that different numbers of candidate fragments are selected for different lengths.
Referring to fig. 2, a flowchart of a method for constructing a word-segmentation-free Chinese word embedding model according to an embodiment of the invention is shown. As shown in fig. 2, S12 includes:
S121, determining the unsupervised association metric of each candidate fragment from the word frequency information, wherein the unsupervised association metric characterizes the association strength of the candidate fragment. PATI (Pointwise Association with Times Information), the unsupervised association metric of the present invention, can discover more strongly associated n-gram fragments by taking more statistical information into account.
Referring to fig. 3, a flowchart of a method for constructing a word-segmentation-free Chinese word embedding model according to an embodiment of the invention is shown. As shown in fig. 3, S121 includes:
S121A, calculating the mutual information value of the candidate fragments, and determining the corresponding fragment combination when the mutual information value is minimum.
Specifically, the mutual information value is the MP value. Each n-gram fragment of length s is written g = w_i w_{i+1} … w_{i+s} (0 ≤ i ≤ N − s), and the left and right parts of g are a = w_i … w_{k−1} and b = w_k … w_{i+s} (i < k < i + s) respectively, i.e. g = concat(a, b). f_a, f_b and f_g denote the word frequencies in the corpus of the strings a and b and of the n-gram fragment g.
For an n-gram fragment g = concat(a, b), its corresponding MP is defined by Equation (1) (rendered as an image in the original).
For an n-gram fragment g of fixed length s there is always a specific left/right combination (a_m, b_m) that minimizes MP. The subsequent calculation of AT is also based on this specific combination (a_m, b_m).
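The split enumeration behind this minimization can be sketched as follows. Since the exact MP formula appears only as an image in the original, the snippet substitutes a plain PMI-style score log(f_g / (f_a · f_b)) as a stand-in; only the structure of minimizing over every split point to find (a_m, b_m) follows the text, and the frequencies are illustrative:

```python
import math
from collections import Counter

def min_split(freq, g):
    """Enumerate every split g = concat(a, b) and return the split (a_m, b_m)
    that minimizes a PMI-style score.  NOTE: the patent's exact MP formula is
    rendered as an image, so log(f_g / (f_a * f_b)) is used here purely as a
    stand-in; only the minimization over split points matches the text."""
    best = None
    for k in range(1, len(g)):
        a, b = g[:k], g[k:]
        score = math.log(freq[g] / (freq[a] * freq[b]))
        if best is None or score < best[0]:
            best = (score, a, b)
    return best[1], best[2]

# Illustrative frequencies for the fragment "研究院" and its substrings.
freq = Counter({"研究院": 3, "研": 10, "究": 12, "究院": 4, "研究": 9, "院": 15})
a_m, b_m = min_split(freq, "研究院")
```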
S121B, determining a first set and a second set according to the fragment combination, and calculating the statistical relation value between the fragment combination and the first set or the second set.
In the present embodiment, S121B includes:
(1) Take as the numerator the maximum of the ratio of the fragment's word frequency to the total word frequency of the first set and the ratio of the fragment's word frequency to the total word frequency of the second set; select whichever of the first and second sets has the smaller total word frequency, and take the reciprocal of its number of elements as the denominator.
For the specific combination (a_m, b_m) of the n-gram fragment g there exist multiple n-gram fragments of the forms (a_m, b_h) and (a_j, b_m). The first set {a_m, *} and the second set {*, b_m} are then defined as follows:

{a_m, *} = {(a_m, b_1), (a_m, b_2), …, (a_m, b_h)}    Equation (2)

{*, b_m} = {(a_1, b_m), (a_2, b_m), …, (a_j, b_m)}    Equation (3)

Let F_{a_m} and F_{b_m} denote the sums of the word frequencies of all n-gram fragments in the sets {a_m, *} and {*, b_m} respectively:

F_{a_m} = Σ_{t=1..h} f_{(a_m, b_t)}    Equation (4)

F_{b_m} = Σ_{t=1..j} f_{(a_t, b_m)}    Equation (5)
For the n-gram fragment g and its specific combination (a_m, b_m), the variable rate is the maximum of the ratio of f_g to F_{a_m} and the ratio of f_g to F_{b_m}:

rate = max(f_g / F_{a_m}, f_g / F_{b_m})    Equation (6)
For the two sets {a_m, *} and {*, b_m} with their corresponding totals F_{a_m} and F_{b_m}, let sizeof(·) denote the number of n-gram elements in a set. AC is then the reciprocal of the size of whichever set has the smaller total word frequency:

AC = 1 / sizeof(S_min), where S_min is the set among {a_m, *} and {*, b_m} whose total word frequency is smaller    Equation (7)
(2) The value of the fraction formed by this numerator and denominator is taken as the relative importance of the fragment combination within the first or second set.
Specifically, given the variables rate and AC, the times value of the n-gram fragment g is defined as:

times = rate / AC    Equation (8)
(3) And determining the statistical relation value according to the relative importance degree calculated value.
Specifically, for the specific combination (a_m, b_m) of an n-gram fragment g of length s there is a unique variable times, and AT is calculated as:

AT = 1 + |log(times)|    Equation (9)
And S121C, taking the product of the word frequency information, the mutual information value and the statistical relation value as an unsupervised association measurement index.
Specifically, the formula of PATI (Pointwise Association with Times Information, the unsupervised association metric) is:

PATI = f × MP × AT    Equation (10)

where f = f_g is the word frequency information, MP is the mutual information value, and AT is the statistical relationship value.
MP is an improved version of pointwise mutual information (PMI): when calculating the association strength it additionally takes into account the marginal variables of each n-gram fragment g = concat(a, b), i.e. the statistics of the left part a and the right part b, which makes MP more sensitive to the local information of the n-gram.
AT uses the statistical information of the specific combination (a_m, b_m) within the sets {a_m, *} and {*, b_m} to further measure the association strength of the n-gram. The variable times takes into account F_{a_m} and F_{b_m}, the word frequency of (a_m, b_m), and the number of fragments adjacent to (a_m, b_m) on either side. The higher the relative importance expressed by the times value within the sets, the more reasonable (a_m, b_m) is as a whole. In general, the times values of reasonable n-grams with higher association strength are much larger than those of unreasonable n-gram fragments.
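Equations (6) through (10) can be sketched together as one small function. The MP value is passed in as a given number, since its defining equation is rendered as an image in the original, and the example sets and frequencies are illustrative assumptions:

```python
import math

def pati(f_g, mp, set_am, set_bm):
    """Compute PATI for one fragment g with minimizing split (a_m, b_m).

    set_am / set_bm map each fragment in {a_m,*} / {*,b_m} to its word
    frequency; `mp` is assumed given, because the MP formula itself appears
    only as an image in the original patent text."""
    F_am, F_bm = sum(set_am.values()), sum(set_bm.values())
    rate = max(f_g / F_am, f_g / F_bm)              # Equation (6)
    smaller = set_am if F_am <= F_bm else set_bm    # set with smaller total
    ac = 1.0 / len(smaller)                         # Equation (7)
    times = rate / ac                               # Equation (8)
    at = 1.0 + abs(math.log(times))                 # Equation (9)
    return f_g * mp * at                            # Equation (10)

# Illustrative sets for g = "研究院" split as ("研究", "院").
score = pati(f_g=3, mp=2.0,
             set_am={"研究院": 3, "研究所": 2},
             set_bm={"研究院": 3})
```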
S122, arranging the association strengths in descending order, and selecting the top-K candidate fragments as the word-embedding vocabulary.
Specifically, the proposed unsupervised association metric is used to calculate the association strength of the candidate n-gram fragments for each fragment length. The n-gram fragments with the Top-K highest association strengths are then selected as the vocabulary of the word embedding model. (The Top-K problem refers to finding the K largest, or K smallest, items in a large collection.)
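The per-length Top-K selection can be sketched with `heapq.nlargest`; the fragments and strength values below are illustrative:

```python
import heapq

def topk_per_length(strength, k):
    """Select the K fragments with the highest association strength for each
    fragment length, forming the word-embedding vocabulary.

    `strength` maps fragment -> association strength (e.g. its PATI value)."""
    by_len = {}
    for frag, s in strength.items():
        by_len.setdefault(len(frag), []).append((frag, s))
    vocab = []
    for n, items in sorted(by_len.items()):  # shortest fragments first
        vocab += [f for f, _ in heapq.nlargest(k, items, key=lambda x: x[1])]
    return vocab

vocab = topk_per_length({"研究": 9.0, "上海": 7.5, "海研": 0.2, "院": 3.0}, k=2)
```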
S13, constructing a positive sampling set and a negative sampling set according to the vocabulary, and constructing a word embedding model by combining the positive sampling set and the negative sampling set.
In this embodiment, based on the skip-gram model combined with negative sampling, maximum likelihood estimation is adopted to maximize the probability of positive samples and minimize the probability of negative samples, thereby constructing the word embedding model. The invention uses the unsupervised association metric to screen the word-embedding vocabulary and rebuilds the positive and negative samples of the word embedding model, which reduces the influence of noise n-gram fragments on the model and improves the performance of the word embeddings in downstream tasks.
It should be noted that, the maximum likelihood estimation is only one embodiment of the present invention for parameter estimation and optimization, and other methods for implementing parameter estimation and optimization are also included in the scope of the present invention.
Specifically, PFNE learns word embeddings based on the negative-sampling skip-gram model, which reduces the amount of computation in gradient descent and speeds up model training. The positive sample set N_p of the model is the set of center-word/context pairs (w_t, w_c) generated from the vocabulary and the corpus. For the negative sampling set N_n, a sufficiently large unigram word list is constructed, each n-gram fragment is indexed in the list, and negative samples are drawn at random according to the word frequency of the n-grams in the list. The objective function of the PFNE model (rendered as an image in the original) follows the standard skip-gram negative-sampling form:

L = Σ_{(w_t, w_c) ∈ N_p} log σ(z_{w_c} · z_{w_t}) + Σ_{(w_t, w_n) ∈ N_n} log σ(−z_{w_n} · z_{w_t})    Equation (11)

where z_{w_t} and z_{w_c} are the embedding vectors of the center word w_t and its context w_c. The model uses maximum likelihood estimation to predict the context from the center word, maximizing the probability of positive samples and minimizing the probability of negative samples so as to optimize the word embedding model generated by the objective function. The objective function is optimized by stochastic gradient descent over the positive and negative samples.
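A minimal sketch of this objective, assuming the standard skip-gram negative-sampling form (the equation in the original is rendered as an image) and using illustrative two-dimensional embeddings:

```python
import math

def sgns_objective(z, pos_pairs, neg_pairs):
    """Skip-gram negative-sampling objective (to be maximized): sum of
    log-sigmoid scores of positive center/context pairs, plus log-sigmoid of
    the negated scores of negative pairs.  `z` maps a token to its embedding
    vector.  This follows the standard SGNS form, assumed here because the
    exact equation in the patent appears only as an image."""
    sig = lambda x: 1.0 / (1.0 + math.exp(-x))
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    total = sum(math.log(sig(dot(z[c], z[t]))) for t, c in pos_pairs)
    total += sum(math.log(sig(-dot(z[n], z[t]))) for t, n in neg_pairs)
    return total

# Illustrative embeddings: "国" is a true context of "中", "噪" a negative sample.
z = {"中": [1.0, 0.0], "国": [1.0, 0.0], "噪": [-1.0, 0.0]}
obj = sgns_objective(z, pos_pairs=[("中", "国")], neg_pairs=[("中", "噪")])
```

In training, the embedding vectors in `z` would be updated by stochastic gradient ascent on this objective.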
Referring to fig. 4 and fig. 5, an effect diagram of the method for constructing a chinese word-free embedding model and the basic dictionary of the present invention and an effect diagram of the method for constructing a chinese word-free embedding model and the rich dictionary of the present invention are shown respectively. In fig. 4 and 5, PFNE represents the result of comparing the n-gram fragment screened using the PATI algorithm with a dictionary; sembei is the result of comparing the n-gram fragments screened out by using frequency (word frequency) with a dictionary; SGNS-PMI is the result of comparing the n-gram fragment screened by PMI (Pointwise Mutual Information, mutual information) with a dictionary. The vertical axis is precision and the horizontal axis is recall. The expressions of the precision and recall are as follows:
precision = |{screened n-grams} ∩ {dictionary}| / |{screened n-grams}|

recall = |{screened n-grams} ∩ {dictionary}| / |{dictionary}|
further, the higher the curve, the longer, indicating that more reasonable n-gram fragments are screened. As can be seen in fig. 4 and 5, the curve of the solid line PFNE using the PATI algorithm is highest and longest among the three curves, so that it is illustrated that the method for constructing the word-segmentation-free Chinese word embedding model according to the present invention can screen more reasonable n-gram segments compared with the word embedding model (basic dictionary or rich dictionary) of the prior art.
The protection scope of the method for constructing the Chinese word-segmentation-free word embedding model is not limited to the execution sequence of the steps listed in the embodiment, and all the schemes realized by the steps of increasing and decreasing and step replacement in the prior art according to the principles of the invention are included in the protection scope of the invention.
The present embodiment provides a computer storage medium having a computer program stored thereon which, when executed by a processor, implements the method for constructing a segmentation-free Chinese word embedding model.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the method embodiments described above may be performed by computer program related hardware. The aforementioned computer program may be stored in a computer readable storage medium. The program, when executed, performs steps including the method embodiments described above; and the aforementioned computer-readable storage medium includes: various computer storage media such as ROM, RAM, magnetic or optical disks may store program code.
The following describes in detail, with reference to the drawings, the system for constructing a segmentation-free Chinese word embedding model provided in this embodiment. It should be understood that the division into the following modules is merely a division of logical functions; in actual implementation the modules may be wholly or partly integrated into one physical entity or kept physically separate. The modules may be implemented entirely in software invoked by a processing element, entirely in hardware, or partly in software and partly in hardware. For example, a module may be a separately established processing element, or it may be integrated into a chip of the system described below; alternatively, a module may be stored in the memory of the system in the form of program code, and one of the system's processing elements may call and execute its function. The implementation of the other modules is similar. All or some of the modules can be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal-processing capability. In implementation, each step of the above method, or each module below, may be completed by an integrated logic circuit in hardware in a processor element or by instructions in software form.
The following modules may be one or more integrated circuits configured to implement the above methods, for example: one or more application specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), one or more digital signal processors (Digital Signal Processor, DSP for short), one or more field programmable gate arrays (Field Programmable Gate Array, FPGA for short), and the like. When a module is implemented in the form of a processing element calling program code, the processing element may be a general-purpose processor, such as a central processing unit (Central Processing Unit, CPU) or other processor that may call program code. These modules may be integrated together and implemented in the form of a System-on-a-chip (SOC) for short.
Referring to fig. 6, a schematic diagram of a system for constructing a word-segmentation-free Chinese word embedding model according to an embodiment of the present invention is shown. As shown in fig. 6, the system 6 for constructing the chinese word-segmentation-free word embedding model includes: a segment statistics module 61, an association metrics module 62, and a model generation module 63.
The segment statistics module 61 is configured to count the candidate segments in the corpus and the word frequency information corresponding to the candidate segments.
In this embodiment, the candidate segments are Chinese language model segments, and the segment statistics module 61 is specifically configured to count, in the corpus, the Chinese language model segments corresponding to different fixed length values together with their word frequency information.
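The counting step can be sketched as follows. This is an illustrative sketch only — the function name, fragment lengths, and the two-line corpus are invented for the example, not taken from the patent:

```python
# Hypothetical sketch: count fixed-length n-gram candidate segments and
# their word frequencies in a corpus, one Counter per fragment length.
from collections import Counter

def count_ngrams(corpus_lines, lengths=(2, 3)):
    counts = {n: Counter() for n in lengths}
    for line in corpus_lines:
        line = line.strip()
        for n in lengths:
            for i in range(len(line) - n + 1):
                counts[n][line[i:i + n]] += 1
    return counts

corpus = ["上海研究院", "研究院研究员"]
counts = count_ngrams(corpus)
# counts[3]["研究院"] == 2: the trigram occurs once in each line
```

Each fixed length gets its own frequency table, matching the per-length statistics described above.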
The association metric module 62 is configured to determine an association strength of the candidate segments in combination with the word frequency information, and generate a word embedded vocabulary according to the association strength.
In this embodiment, the association metric module 62 is specifically configured to determine an unsupervised association metric of the candidate segments in combination with the word frequency information, where the unsupervised association metric characterizes the association strength of the candidate segments; the association strengths are then arranged in descending order, and the top K candidate fragments are selected as the word-embedding vocabulary.
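The descending sort and top-K cut can be sketched as below; `scores` is a hypothetical mapping from candidate fragment to its association strength, with values invented for illustration:

```python
# Hypothetical sketch: rank candidate fragments by association strength
# (descending) and keep the top K as the word-embedding vocabulary.
def top_k_vocab(scores, k):
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [fragment for fragment, _ in ranked[:k]]

scores = {"研究院": 9.1, "海研": 0.4, "研究": 7.5}
vocab = top_k_vocab(scores, k=2)  # → ["研究院", "研究"]
```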
The model generating module 63 is configured to construct a positive sampling set and a negative sampling set according to the vocabulary, and construct a word embedding model by combining the positive sampling set and the negative sampling set.
In this embodiment, the model generating module 63 is specifically configured to construct the word embedding model based on a skip-gram model combined with negative sampling, using parameter optimization to maximize the positive sampling probability and minimize the negative sampling probability.
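As a rough sketch of the objective this module optimizes — the standard skip-gram-with-negative-sampling loss, not the patent's exact implementation — the quantity below decreases as the positive context score is pushed up and the sampled negative scores are pushed down:

```python
# Standard SGNS negative log-likelihood for one (word, context) pair plus
# sampled negatives; vectors are plain Python lists for simplicity.
import math

def sgns_loss(v_word, u_pos, u_negs):
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    # maximize log sigma(u_pos . v) for the positive pair ...
    loss = -math.log(sigmoid(dot(u_pos, v_word)))
    # ... and log sigma(-u_neg . v) for each sampled negative
    for u_neg in u_negs:
        loss -= math.log(sigmoid(-dot(u_neg, v_word)))
    return loss
```

Parameter optimization (e.g. SGD over the vectors) then minimizes this loss over all positive and negative samples.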
The system for constructing the Chinese word-segmentation-free word embedding model can implement the method for constructing the Chinese word-segmentation-free word embedding model, but the devices capable of implementing that method are not limited to the system structure listed in this embodiment; all structural variations and substitutions of the prior art made in accordance with the principles of the invention fall within the protection scope of the invention.
Referring to fig. 7, a schematic diagram of the structural connections of a device for constructing a Chinese word-segmentation-free word embedding model according to an embodiment of the invention is shown. As shown in fig. 7, the present embodiment provides a device 7, the device 7 comprising: a processor 71, a memory 72, a communication interface 73, and/or a system bus 74. The memory 72 and the communication interface 73 are connected to the processor 71 via the system bus 74 and communicate with each other; the memory 72 is used for storing a computer program, the communication interface 73 is used for communicating with other devices, and the processor 71 is used for running the computer program so that the device 7 executes the steps of the method for constructing the Chinese word-segmentation-free word embedding model.
The system bus 74 mentioned above may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, or the like. The system bus may be classified into an address bus, a data bus, a control bus, and the like. The communication interface 73 is used to enable communication between the database access apparatus and other devices such as clients, read-write libraries and read-only libraries. The memory 72 may include a random access memory (Random Access Memory, simply referred to as RAM) and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.
The processor 71 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field-programmable gate array (Field Programmable Gate Array, FPGA for short) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In summary, the method, system, medium and device for constructing a Chinese word-segmentation-free word embedding model provide a new unsupervised association metric for screening n-gram fragments with strong association. Combining this unsupervised association metric with a word embedding model yields a new word-segmentation-free Chinese word embedding model for Chinese corpora. The word embedding model obtained by the method shows better performance in downstream tasks. The invention thus effectively overcomes various defects in the prior art and has high industrial utilization value.
The above embodiments merely illustrate the principles of the present invention and its effects, and are not intended to limit the invention. Anyone skilled in the art may modify or vary the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and variations completed by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall still be covered by the claims of the present invention.

Claims (7)

1. A method for constructing a Chinese word-segmentation-free word embedding model, characterized by comprising the following steps:
counting candidate segments in a corpus and word frequency information corresponding to the candidate segments;
determining the association strength of the candidate segments by combining the word frequency information, and generating a word embedded vocabulary according to the association strength;
constructing a positive sampling set and a negative sampling set according to the vocabulary, and constructing a word embedding model by combining the positive sampling set and the negative sampling set;
wherein determining the association strength of the candidate segments in combination with the word frequency information and generating the word-embedding vocabulary according to the association strength comprises: determining an unsupervised association metric of the candidate segments in combination with the word frequency information, wherein the unsupervised association metric characterizes the association strength of the candidate segments; arranging the association strengths in descending order, and selecting the top K candidate fragments as the word-embedding vocabulary;
wherein determining the unsupervised association metric of the candidate segment in combination with the word frequency information comprises:
a, calculating the mutual information value of the candidate fragments, and determining the fragment combination for which the mutual information value is minimum; the mutual information value is the MP value, where, over the binary splits (a, b) of the n-gram fragment g, MP is defined as:

MP = min_(a,b) [ f_g / (f_a × f_b) ]

wherein f_a, f_b and f_g respectively denote the word frequencies of the character strings a and b and of the n-gram fragment g in the corpus;
b, determining a first set and a second set according to the fragment combination, and calculating a statistical relation value between the fragment combination and the first set or the second set, comprising:

(1) taking as the numerator the maximum of the ratio of the word frequency information to the total word frequency of the first set and the ratio of the word frequency information to the total word frequency of the second set; and selecting whichever of the first set and the second set has the smaller total word frequency, taking as the denominator the reciprocal of the number of elements in that set;

for the specific combination (a_m, b_m) of the n-gram fragment g, and a batch of n-gram fragments (a_m, b_h) and (a_j, b_m), the first set {a_m, *} and the second set {*, b_m} are defined as: {a_m, *} = {(a_m, b_1), (a_m, b_2), …, (a_m, b_h)} and {*, b_m} = {(a_1, b_m), (a_2, b_m), …, (a_j, b_m)}; let f_{a_m,*} and f_{*,b_m} respectively denote the sums of the word frequencies of all n-gram fragments in the sets {a_m, *} and {*, b_m}, defined respectively as:

f_{a_m,*} = Σ_{i=1..h} f_{(a_m, b_i)} and f_{*,b_m} = Σ_{i=1..j} f_{(a_i, b_m)};

for the n-gram fragment g and its specific combination (a_m, b_m), the variable rate denotes the maximum of the ratio of f_g to f_{a_m,*} and the ratio of f_g to f_{*,b_m}; rate is defined as:

rate = max( f_g / f_{a_m,*} , f_g / f_{*,b_m} );

for the two sets {a_m, *} and {*, b_m} and their corresponding total word frequencies f_{a_m,*} and f_{*,b_m}, let sizeof denote the number of n-gram elements in a set; AC is then defined as:

AC = 1 / sizeof({a_m, *}) if f_{a_m,*} ≤ f_{*,b_m}, and AC = 1 / sizeof({*, b_m}) otherwise;

(2) combining the numerator and the denominator into a calculated value of the relative importance of the fragment combination in the first set or the second set; with the variable rate as the numerator and the variable AC as the denominator, the relative importance value times of the n-gram fragment g is defined as:

times = rate / AC;

(3) determining the statistical relation value AT from the relative importance value times; the calculation formula of AT is: AT = 1 + |log(times)|;

c, letting the word frequency information F equal f_g, and taking the product of the word frequency information F, the mutual information value MP and the statistical relation value AT as the unsupervised association metric; the formula of the unsupervised association metric PATI is: PATI = F × MP × AT.
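Read as an algorithm, steps a-c of claim 1 can be sketched as follows. Note this is a hedged reconstruction from the claim text — in particular the exact form of MP is an assumption, and the function signature and input layout are invented for illustration, not the patented implementation:

```python
# Hedged sketch of the PATI metric from claim 1. Inputs are word
# frequencies; `splits` lists (f_a, f_b) for each binary split of g,
# and set_a_freqs / set_b_freqs hold the frequencies of the n-grams
# in {a_m, *} and {*, b_m} respectively.
import math

def pati(f_g, splits, set_a_freqs, set_b_freqs):
    # step a: MP taken as the minimum mutual-information value over splits
    mp = min(f_g / (f_a * f_b) for f_a, f_b in splits)
    # step b(1): rate (numerator) and AC (denominator)
    f_set_a, f_set_b = sum(set_a_freqs), sum(set_b_freqs)
    rate = max(f_g / f_set_a, f_g / f_set_b)
    smaller = set_a_freqs if f_set_a <= f_set_b else set_b_freqs
    ac = 1.0 / len(smaller)
    # steps b(2)-(3): times and AT
    times = rate / ac
    at = 1.0 + abs(math.log(times))
    # step c: F = f_g, PATI = F * MP * AT
    return f_g * mp * at
```

Candidates would then be ranked by this score and the top K kept as the vocabulary.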
2. The method for constructing a word-segmentation-free Chinese word embedding model according to claim 1, wherein the candidate segments are Chinese language model segments, and the step of counting the candidate segments in the corpus and word frequency information corresponding to the candidate segments comprises:
and counting the Chinese language model fragments and word frequency information thereof corresponding to different fixed length values in the corpus.
3. The method for constructing a Chinese word-segmentation-free word embedding model according to claim 1, wherein generating the word-embedding vocabulary according to the association strength comprises:
calculating the association strength of the candidate fragments of each length;
for candidate fragments of different lengths, arranging the association strengths in descending order within each length and selecting the top K candidate fragments as the word-embedding vocabulary, so that different numbers of candidate fragments are selected as the word-embedding vocabulary for different lengths.
4. The method for constructing a Chinese word-segmentation-free word embedding model according to claim 1, wherein the step of constructing a positive sampling set and a negative sampling set according to the vocabulary and constructing the word embedding model in combination with the positive sampling set and the negative sampling set comprises:
based on a skip-gram model combined with negative sampling, a parameter optimization method is adopted to maximize positive sampling probability and minimize negative sampling probability, and the word embedding model is constructed.
5. A system for constructing a Chinese word-segmentation-free word embedding model, characterized by comprising:
the segment statistics module is used for counting candidate segments in the corpus and word frequency information corresponding to the candidate segments;
the association measurement module is used for determining association strength of the candidate fragments by combining the word frequency information and generating a word embedded vocabulary according to the association strength;
the model generation module is used for constructing a positive sampling set and a negative sampling set according to the vocabulary, and constructing a word embedding model by combining the positive sampling set and the negative sampling set;
wherein determining the association strength of the candidate segments in combination with the word frequency information and generating the word-embedding vocabulary according to the association strength comprises: determining an unsupervised association metric of the candidate segments in combination with the word frequency information, wherein the unsupervised association metric characterizes the association strength of the candidate segments; arranging the association strengths in descending order, and selecting the top K candidate fragments as the word-embedding vocabulary;
wherein determining the unsupervised association metric of the candidate segment in combination with the word frequency information comprises:
a, calculating the mutual information value of the candidate fragments, and determining the fragment combination for which the mutual information value is minimum; the mutual information value is the MP value, where, over the binary splits (a, b) of the n-gram fragment g, MP is defined as:

MP = min_(a,b) [ f_g / (f_a × f_b) ]

wherein f_a, f_b and f_g respectively denote the word frequencies of the character strings a and b and of the n-gram fragment g in the corpus;
b, determining a first set and a second set according to the fragment combination, and calculating a statistical relation value between the fragment combination and the first set or the second set, comprising:

(1) taking as the numerator the maximum of the ratio of the word frequency information to the total word frequency of the first set and the ratio of the word frequency information to the total word frequency of the second set; and selecting whichever of the first set and the second set has the smaller total word frequency, taking as the denominator the reciprocal of the number of elements in that set;

for the specific combination (a_m, b_m) of the n-gram fragment g, and a batch of n-gram fragments (a_m, b_h) and (a_j, b_m), the first set {a_m, *} and the second set {*, b_m} are defined as: {a_m, *} = {(a_m, b_1), (a_m, b_2), …, (a_m, b_h)} and {*, b_m} = {(a_1, b_m), (a_2, b_m), …, (a_j, b_m)}; let f_{a_m,*} and f_{*,b_m} respectively denote the sums of the word frequencies of all n-gram fragments in the sets {a_m, *} and {*, b_m}, defined respectively as:

f_{a_m,*} = Σ_{i=1..h} f_{(a_m, b_i)} and f_{*,b_m} = Σ_{i=1..j} f_{(a_i, b_m)};

for the n-gram fragment g and its specific combination (a_m, b_m), the variable rate denotes the maximum of the ratio of f_g to f_{a_m,*} and the ratio of f_g to f_{*,b_m}; rate is defined as:

rate = max( f_g / f_{a_m,*} , f_g / f_{*,b_m} );

for the two sets {a_m, *} and {*, b_m} and their corresponding total word frequencies f_{a_m,*} and f_{*,b_m}, let sizeof denote the number of n-gram elements in a set; AC is then defined as:

AC = 1 / sizeof({a_m, *}) if f_{a_m,*} ≤ f_{*,b_m}, and AC = 1 / sizeof({*, b_m}) otherwise;

(2) combining the numerator and the denominator into a calculated value of the relative importance of the fragment combination in the first set or the second set; with the variable rate as the numerator and the variable AC as the denominator, the relative importance value times of the n-gram fragment g is defined as:

times = rate / AC;

(3) determining the statistical relation value AT from the relative importance value times; the calculation formula of AT is: AT = 1 + |log(times)|;

c, letting the word frequency information F equal f_g, and taking the product of the word frequency information F, the mutual information value MP and the statistical relation value AT as the unsupervised association metric; the formula of the unsupervised association metric PATI is: PATI = F × MP × AT.
6. A medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method for constructing a Chinese word-segmentation-free word embedding model according to any one of claims 1 to 4.
7. An apparatus, comprising: a processor and a memory;
the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory, so that the apparatus executes the method for constructing a Chinese word-segmentation-free word embedding model according to any one of claims 1 to 4.
CN202010437000.3A 2020-05-21 2020-05-21 Method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model Active CN113705227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010437000.3A CN113705227B (en) 2020-05-21 2020-05-21 Method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model


Publications (2)

Publication Number Publication Date
CN113705227A CN113705227A (en) 2021-11-26
CN113705227B true CN113705227B (en) 2023-04-25


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933183A (en) * 2015-07-03 2015-09-23 重庆邮电大学 Inquiring term rewriting method merging term vector model and naive Bayes
CN106095736A (en) * 2016-06-07 2016-11-09 华东师范大学 A kind of method of field neologisms extraction
CN107015963A (en) * 2017-03-22 2017-08-04 重庆邮电大学 Natural language semantic parsing system and method based on deep neural network
CN107273352A (en) * 2017-06-07 2017-10-20 北京理工大学 A kind of word insertion learning model and training method based on Zolu functions
CN107491444A (en) * 2017-08-18 2017-12-19 南京大学 Parallelization word alignment method based on bilingual word embedded technology
CN108959431A (en) * 2018-06-11 2018-12-07 中国科学院上海高等研究院 Label automatic generation method, system, computer readable storage medium and equipment
CN110390018A (en) * 2019-07-25 2019-10-29 哈尔滨工业大学 A kind of social networks comment generation method based on LSTM


Non-Patent Citations (2)

Geewook Kim et al., "Segmentation-free compositional n-gram embedding," arXiv, 2018, pp. 1-9. *
Xiaobin Wang et al., "Unsupervised Learning Helps Supervised Neural Word Segmentation," AAAI-19, 2019, pp. 7200-7207. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant