CN113705227A - Method, system, medium and device for constructing Chinese non-segmented word and word embedding model - Google Patents


Info

Publication number
CN113705227A
Authority
CN
China
Prior art keywords
word
constructing
embedding model
word embedding
candidate
Prior art date
Legal status
Granted
Application number
CN202010437000.3A
Other languages
Chinese (zh)
Other versions
CN113705227B (en)
Inventor
张一帆
王茂华
顾倩荣
黄永健
Current Assignee
Shanghai Advanced Research Institute of CAS
Original Assignee
Shanghai Advanced Research Institute of CAS
Priority date
Filing date
Publication date
Application filed by Shanghai Advanced Research Institute of CAS filed Critical Shanghai Advanced Research Institute of CAS
Priority to CN202010437000.3A
Publication of CN113705227A
Application granted
Publication of CN113705227B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/205: Parsing
    • G06F 40/216: Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method, system, medium, and device for constructing a Chinese segmentation-free word embedding model. The construction method comprises the following steps: counting candidate fragments in a corpus together with their word frequency information; determining the association strength of the candidate fragments from the word frequency information and generating a word embedding vocabulary according to the association strength; and constructing a positive sampling set and a negative sampling set from the vocabulary, and building the word embedding model from these sets. Addressing the problem of excessive noisy n-grams in the vocabularies of existing segmentation-free word embedding models, the invention takes Chinese corpora as the research object and improves a segmentation-free word embedding model, built on a negative-sampling skip-gram model, by means of an unsupervised association metric.

Description

Method, system, medium, and device for constructing a Chinese segmentation-free word embedding model
Technical Field
The invention belongs to the technical field of natural language processing, relates to a design method for word embedding models, and in particular relates to a method, system, medium, and device for constructing a Chinese segmentation-free word embedding model.
Background
Word embedding is a fundamental task in natural language processing and plays an important role in downstream tasks such as machine translation and part-of-speech tagging. Because Chinese text has no explicit separators between words, existing Chinese word embedding methods usually perform Chinese word segmentation first and use the segmented words as embedding targets. However, current Chinese word segmentation still suffers from many problems, which can seriously degrade the quality of Chinese word embeddings. Therefore, for languages like Chinese, segmentation-free word embedding models have been proposed to avoid the influence of segmentation errors, and they have been shown to outperform traditional word embedding methods.
Existing segmentation-free word embedding models mainly collect the Top-K n-gram fragments with the highest word frequency as training targets. However, considering word frequency alone leaves a large number of noisy n-gram fragments in the embedding vocabulary, which degrades the quality of the resulting word embeddings.
Therefore, how to design a segmentation-free word embedding model that reduces the influence of the large number of noisy n-gram fragments on the quality of the final word embeddings, and thereby improves the model, has become a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above shortcomings of the prior art, an object of the present invention is to provide a method, system, medium, and device for constructing a Chinese segmentation-free word embedding model, which solve the prior-art problem of being unable to reduce the influence of numerous noisy n-gram fragments on the quality of the final word embedding model, and which improve that quality.
In order to achieve the above and other related objects, one aspect of the present invention provides a method for constructing a Chinese segmentation-free word embedding model, comprising: counting candidate fragments in a corpus together with their word frequency information; determining the association strength of the candidate fragments from the word frequency information and generating a word embedding vocabulary according to the association strength; and constructing a positive sampling set and a negative sampling set from the vocabulary, and building the word embedding model from these sets.
In an embodiment of the present invention, the candidate fragments are Chinese language model fragments, and the step of counting the candidate fragments in the corpus and their word frequency information comprises: counting, in the corpus, the Chinese language model fragments corresponding to different fixed length values, together with their word frequency information.
In an embodiment of the present invention, the step of determining the association strength of the candidate fragments from the word frequency information and generating a word embedding vocabulary according to the association strength comprises: determining an unsupervised association metric for each candidate fragment from the word frequency information, wherein the unsupervised association metric characterizes the fragment's association strength; and sorting the association strengths in descending order and selecting the top K candidate fragments as the word embedding vocabulary.
In an embodiment of the present invention, the step of determining the unsupervised association metric of a candidate fragment from the word frequency information comprises: calculating mutual information values for the candidate fragment's splits and determining the split combination that minimizes the mutual information value; determining a first set and a second set according to that combination, and calculating a statistical relationship value between the combination and the first or second set; and taking the product of the word frequency information, the mutual information value, and the statistical relationship value as the unsupervised association metric.
In an embodiment of the present invention, the step of determining the first and second sets according to the split combination and calculating the statistical relationship value between the combination and the first or second set comprises: taking as numerator the larger of the ratio of the fragment's word frequency to the total word frequency of the first set and the ratio to that of the second set; selecting whichever of the first and second sets has the smaller total word frequency and taking as denominator the reciprocal of the number of its elements; calculating the resulting fraction as the relative importance of the combination within the first or second set; and determining the statistical relationship value from this relative importance value.
In an embodiment of the present invention, the association strength is calculated for the candidate fragments of each length; for each length, the association strengths are sorted in descending order and the top K candidate fragments are selected for the word embedding vocabulary, with different numbers of candidate fragments selected for different lengths.
In an embodiment of the present invention, the step of constructing positive and negative sampling sets from the vocabulary and building the word embedding model from them comprises: on the basis of a skip-gram model combined with negative sampling, constructing the word embedding model by using a parameter optimization method that maximizes the positive sampling probability and minimizes the negative sampling probability.
Another aspect of the invention provides a system for constructing a Chinese segmentation-free word embedding model, comprising: a fragment statistics module for counting candidate fragments in a corpus and their word frequency information; an association metric module for determining the association strength of the candidate fragments from the word frequency information and generating a word embedding vocabulary according to the association strength; and a model generation module for constructing positive and negative sampling sets from the vocabulary and building the word embedding model from these sets.
Still another aspect of the present invention provides a medium on which a computer program is stored; when executed by a processor, the program implements the above method for constructing the Chinese segmentation-free word embedding model.
A final aspect of the invention provides an apparatus comprising: a processor and a memory; the memory is used for storing computer programs, and the processor is used for executing the computer programs stored in the memory so as to enable the equipment to execute the construction method of the Chinese non-segmented word embedding model.
As described above, the method, system, medium, and device for constructing the Chinese segmentation-free word embedding model have the following advantages:
A new unsupervised association metric is proposed for screening strongly associated n-gram fragments. Combining this metric with a word embedding model yields a new segmentation-free Chinese word embedding model oriented to Chinese corpora. The word embeddings obtained by the invention show better performance in downstream tasks.
Drawings
Fig. 1 is a schematic flow chart of a method for constructing a Chinese segmentation-free word embedding model according to an embodiment of the present invention.
Fig. 2 is a flowchart of the association metric procedure of the construction method according to an embodiment of the present invention.
Fig. 3 is a flowchart of the association strength calculation of the construction method according to an embodiment of the present invention.
Fig. 4 is a diagram showing the effect of the construction method of the present invention compared with a basic dictionary.
Fig. 5 is a diagram showing the effect of the construction method of the present invention compared with a rich dictionary.
Fig. 6 is a schematic structural diagram of a system for constructing a Chinese segmentation-free word embedding model according to an embodiment of the present invention.
Fig. 7 is a schematic structural connection diagram of a device for constructing a Chinese segmentation-free word embedding model according to an embodiment of the present invention.
Description of the element reference numerals
6 Chinese non-segmented word and word embedding model construction system
61 fragment statistical module
62 correlation metric module
63 model generation module
7 device
71 processor
72 memory
73 communication interface
74 system bus
S11 to S13: steps
S121 to S122: steps
S121A to S121C: steps
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that, in the following embodiments, features in the embodiments may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
Addressing the problem of excessive noisy n-grams in the vocabularies of existing segmentation-free word embedding models, the invention provides a method for constructing a Chinese segmentation-free word embedding model that takes Chinese corpora as the research object and improves a negative-sampling skip-gram model with an unsupervised association metric.
The principle and implementation of the method, system, medium, and device for constructing a Chinese segmentation-free word embedding model according to this embodiment are described in detail below with reference to Figs. 1 to 7, so that those skilled in the art can understand them without creative work.
Please refer to Fig. 1, a schematic flow chart of a method for constructing a Chinese segmentation-free word embedding model according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
S11, counting the candidate fragments in the corpus and their corresponding word frequency information.
In this embodiment, the candidate fragments are Chinese language model fragments, for example n-gram fragments; the n-gram fragments corresponding to different fixed length values, together with their word frequency information, are counted over the corpus.
Specifically, a simple tokenizer is implemented via an n-gram model to obtain n-gram fragments. The model is based on the assumption that the occurrence of the n-th word depends only on the preceding n-1 words and on no others, and that the probability of a complete sentence is the product of the occurrence probabilities of its words. In practice, usually only the two words before and after a given word are considered, i.e. n = 2, using positions n-2, n-1, n+1, and n+2. With n = 3 the results improve; with n = 4 the amount of computation becomes large.
Specifically, the corpus is traversed, and all possible n-gram fragments of each fixed length, together with their word frequency information, are counted. The n-gram fragments of different lengths and their corresponding word frequencies are organized into a list, shown as Table 1. For example, Table 1 records that the fragment "一" ("one"), of length 1 Chinese character, has word frequency 529285.
TABLE 1: candidate fragments
[Table 1 is rendered as an image in the original.]
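The fragment-counting step S11 can be sketched in Python as follows; `count_ngram_fragments`, the toy corpus, and the length cap are illustrative assumptions, not part of the patent:

```python
from collections import Counter

def count_ngram_fragments(corpus, max_len=4):
    """Count all candidate n-gram fragments of each fixed length
    (1..max_len characters) and their corpus word frequencies."""
    counts = {n: Counter() for n in range(1, max_len + 1)}
    for sentence in corpus:
        for n in range(1, max_len + 1):
            for i in range(len(sentence) - n + 1):
                counts[n][sentence[i:i + n]] += 1
    return counts

corpus = ["上海研究院", "上海天气好"]
counts = count_ngram_fragments(corpus, max_len=2)
```

Per-length tables like `counts[1]` and `counts[2]` correspond to the rows of Table 1, with one frequency table per fragment length.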
S12, determining the association strength of the candidate fragments from the word frequency information and generating a word embedding vocabulary according to the association strength.
In this embodiment, the association strength is calculated for the candidate fragments of each length.
Furthermore, for each length, the association strengths are sorted in descending order and the top K candidate fragments are selected for the word embedding vocabulary; different numbers of candidate fragments may be selected for different lengths.
Please refer to Fig. 2, a flowchart of the association metric procedure according to an embodiment of the present invention. As shown in Fig. 2, S12 comprises:
S121, determining an unsupervised association metric for each candidate fragment from the word frequency information, the metric characterizing the fragment's association strength. The PATI (Position Association with Times Information) metric of the invention considers more statistical information and can therefore discover more strongly associated n-gram fragments.
Please refer to Fig. 3, a flowchart of the association strength calculation according to an embodiment of the present invention. As shown in Fig. 3, S121 comprises:
S121A, calculating mutual information values for the candidate fragment's splits and determining the split combination that minimizes the mutual information value.
Specifically, the mutual information value is the MP value. Each n-gram fragment of length s is written g = w_i w_{i+1} ... w_{i+s} (0 ≤ i ≤ N - s), and its left and right parts are a = w_i ... w_{k-1} and b = w_k ... w_{i+s} (i < k < i + s), i.e. g = concat(a, b). f_a, f_b, and f_g denote the word frequencies in the corpus of the strings a and b and of the n-gram fragment g, respectively.
For an n-gram fragment g = concat(a, b), the corresponding MP is defined by Equation (1). [Equation (1), which defines MP in terms of f_a, f_b, and f_g, is rendered as an image in the original.]
For an n-gram fragment g of fixed length s, there is always a particular left-right combination (a_m, b_m) that minimizes MP. The subsequent calculation of AT is based on this particular combination (a_m, b_m).
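As an illustration of S121A, the split search can be sketched as follows. The exact MP formula appears only as an image in the original, so this sketch assumes a simple PMI-style ratio f_g / (f_a * f_b) purely for illustration; `min_mp_split` and the toy frequencies are hypothetical:

```python
from collections import Counter

def min_mp_split(g, freq):
    """Return the split (a_m, b_m) of fragment g that minimizes an
    MP-style association score.  The ratio f_g / (f_a * f_b) below is
    an assumed stand-in for the patent's Equation (1)."""
    best = None
    for k in range(1, len(g)):           # every left-right split point
        a, b = g[:k], g[k:]
        mp = freq[g] / (freq[a] * freq[b])
        if best is None or mp < best[0]:
            best = (mp, a, b)
    return best[1], best[2]

freq = Counter({"abc": 3, "a": 10, "bc": 4, "ab": 5, "c": 20})
a_m, b_m = min_mp_split("abc", freq)
```

Whatever the exact MP formula, the search over split points k is the same: the minimizing pair (a_m, b_m) is what the later AT computation uses.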
S121B, determining a first set and a second set according to the split combination, and calculating the statistical relationship value between the combination and the first or second set.
In the present embodiment, S121B includes:
(1) Taking as numerator the larger of the ratio of the fragment's word frequency to the total word frequency of the first set and the ratio to that of the second set; selecting whichever of the first and second sets has the smaller total word frequency; and taking as denominator the reciprocal of the number of elements in that set.
For the particular combination (a_m, b_m) of an n-gram fragment g, there exist same-length fragment combinations of the form (a_m, b_h) and (a_j, b_m). The first set {a_m, *} and the second set {*, b_m} are defined as:

{a_m, *} = {(a_m, b_1), (a_m, b_2), ..., (a_m, b_h)}  (2)
{*, b_m} = {(a_1, b_m), (a_2, b_m), ..., (a_j, b_m)}  (3)

Let F_{a_m,*} and F_{*,b_m} denote the sums of the word frequencies of all n-gram fragments in {a_m, *} and {*, b_m}, respectively; they are defined as:

F_{a_m,*} = Σ_{(a_m, b) ∈ {a_m, *}} f_concat(a_m, b)  (4)
F_{*,b_m} = Σ_{(a, b_m) ∈ {*, b_m}} f_concat(a, b_m)  (5)
For an n-gram fragment g and its particular combination (a_m, b_m), the variable rate is the maximum of the ratio of f_g to F_{a_m,*} and the ratio of f_g to F_{*,b_m}:

rate = max(f_g / F_{a_m,*}, f_g / F_{*,b_m})  (6)

For the two sets {a_m, *} and {*, b_m} with total word frequencies F_{a_m,*} and F_{*,b_m}, let sizeof denote the number of n-gram elements in a set, and let S_min be whichever of the two sets has the smaller total word frequency. AC can then be defined as:

AC = 1 / sizeof(S_min)  (7)
(2) Taking the fraction formed by the above numerator and denominator as the relative importance of the combination within the first or second set.
Specifically, given the variables rate and AC, the times value of the n-gram fragment g is defined as:

times = rate / AC  (8)
(3) and determining the statistical relationship value according to the relative importance degree calculation value.
Specifically, for the particular combination (a_m, b_m) of an n-gram fragment g of length s and its unique times value, AT is calculated as:

AT = 1 + |log(times)|  (9)
S121C, taking the product of the word frequency information, the mutual information value, and the statistical relationship value as the unsupervised association metric.
Specifically, the PATI (Position Association with Times Information) formula is:

PATI = F × MP × AT  (10)

where F = f_g is the word frequency information, MP is the mutual information value, and AT is the statistical relationship value.
MP is a modified version of PMI: in computing the association strength it considers more marginal variables of each n-gram fragment g = concat(a, b), i.e. statistics of the left and right parts a and b, making it more sensitive to the local information of the n-gram.
AT further measures the association strength of an n-gram by using the statistical information of its particular combination (a_m, b_m) within the sets {a_m, *} and {*, b_m}. The variable times takes into account the total set frequencies F_{a_m,*} and F_{*,b_m}, the word frequency of (a_m, b_m), and the number of fragments adjacent to (a_m, b_m) on the left and right, in order to evaluate the relative importance of (a_m, b_m) within its set; a higher times value generally indicates that (a_m, b_m) is reasonable as a whole. In general, the times values of most reasonable n-grams with higher association strengths are much larger than those of unreasonable n-gram fragments.
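Putting Equations (6) to (10) together, the score for one fragment can be sketched as a toy computation; Equations (7) and (8) are reconstructed here from the prose description of S121B, and `pati` with its inputs is purely illustrative:

```python
import math

def pati(f_g, F_left, F_right, n_left, n_right, mp):
    """Toy PATI computation from the quantities of Equations (6)-(10).
    F_left/F_right are the total frequencies of {a_m,*} and {*,b_m};
    n_left/n_right their element counts; mp an already-computed MP."""
    rate = max(f_g / F_left, f_g / F_right)          # Eq. (6)
    # AC: reciprocal of the element count of the lower-frequency set, Eq. (7)
    ac = 1.0 / (n_left if F_left <= F_right else n_right)
    times = rate / ac                                # Eq. (8)
    at = 1.0 + abs(math.log(times))                  # Eq. (9)
    return f_g * mp * at                             # Eq. (10)

score = pati(f_g=50, F_left=200, F_right=500, n_left=8, n_right=20, mp=0.1)
```

The interplay is visible in the code: a fragment that dominates a small, low-frequency set gets a large times value, which Equation (9) turns into a larger AT multiplier.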
S122, sorting the association strengths in descending order and selecting the top K candidate fragments as the word embedding vocabulary.
Specifically, the proposed unsupervised association metric is used to calculate the association strength of the candidate n-gram fragments at each fragment length, and the Top-K n-gram fragments with the highest association strength are then selected as the vocabulary of the word embedding model. Selecting the K largest items from a collection is the classic Top-K problem, commonly solved with a heap.
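The heap-based Top-K selection of S122 can be sketched as follows; `top_k_fragments` and the toy PATI scores are illustrative:

```python
import heapq

def top_k_fragments(scores, k):
    """Select the K candidate fragments with the highest association
    strength using a heap (the classic Top-K approach).
    `scores` maps fragment -> association strength (e.g. PATI)."""
    return heapq.nlargest(k, scores, key=scores.get)

scores = {"自然": 9.2, "然语": 1.1, "语言": 8.7, "言处": 0.9, "处理": 7.5}
vocab = top_k_fragments(scores, k=3)
```

`heapq.nlargest` runs in O(n log k), which matters when the candidate table holds millions of fragments per length.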
S13, constructing a positive sampling set and a negative sampling set from the vocabulary, and building the word embedding model from these sets.
In this embodiment, the word embedding model is built with a skip-gram model combined with negative sampling, using maximum likelihood estimation to maximize the positive sampling probability and minimize the negative sampling probability. The method screens the word embedding vocabulary with the unsupervised association metric and reconstructs the positive and negative samples of the word embedding model, which reduces the influence of noisy n-gram fragments on the model and improves the performance of the embeddings in downstream tasks.
It should be noted that maximum likelihood estimation is only one embodiment of the parameter estimation and optimization of the present invention; other parameter estimation and optimization methods also fall within the scope of the invention.
Specifically, the PFNE model learns word embeddings on the basis of a negative-sampling skip-gram model, which reduces the amount of computation during gradient descent and accelerates training. The positive sampling set N_p of the model is the set of center-word/context pairs (w_t, w_c) generated by combining the vocabulary with the corpus. For the negative sampling set N_n, a sufficiently large unigram vocabulary is built, each n-gram fragment is indexed in the list, and negative samples are drawn at random according to the fragments' word frequencies in the vocabulary. The objective function of the PFNE model is defined as:

L = Σ_{(w_t, w_c) ∈ N_p} log σ(v'_{w_c} · v_{w_t}) + Σ_{(w_t, w_c) ∈ N_n} log σ(-v'_{w_c} · v_{w_t})  (11)

where v_{w_t} and v'_{w_c} are the embedding vectors of the center word w_t and its context w_c, respectively, and σ is the sigmoid function. Using maximum likelihood estimation, the model predicts the context from the center word, maximizing the probability of the positive samples while minimizing that of the negative samples, so that the word embedding model produced by the objective function is optimal. The objective is optimized with stochastic gradient descent over the positive and negative samples.
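The per-center-word contribution to an objective of the form of Equation (11) can be sketched as follows; the function name and the tiny 2-d vectors are illustrative, not the patent's implementation:

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def sgns_objective(v_center, v_pos, v_negs):
    """Skip-gram-with-negative-sampling contribution for one center word:
    log-sigmoid of the positive pair's dot product, plus log-sigmoids of
    the negated dot products for each negative sample."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    loss = math.log(sigmoid(dot(v_pos, v_center)))
    loss += sum(math.log(sigmoid(-dot(v_neg, v_center))) for v_neg in v_negs)
    return loss

v_center = [0.5, -0.2]
v_pos = [0.4, 0.1]
v_negs = [[-0.3, 0.6], [0.2, 0.9]]
obj = sgns_objective(v_center, v_pos, v_negs)
```

Gradient ascent on this quantity pushes the positive pair's vectors together and the negative pairs' vectors apart, which is exactly what maximizing Equation (11) asks for.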
Please refer to Figs. 4 and 5, which compare the construction method of the present invention against a basic dictionary and a rich dictionary, respectively. In Figs. 4 and 5, PFNE is the result of comparing the n-gram fragments screened with the PATI algorithm against the dictionary; Sembei is the result of comparing n-gram fragments screened by word frequency against the dictionary; and SGNS-PMI is the result of comparing n-gram fragments selected with PMI (Pointwise Mutual Information) against the dictionary. The vertical axis is the precision rate and the horizontal axis is the recall rate, expressed as:

precision = TP / (TP + FP)
recall = TP / (TP + FN)

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.
further, the higher the curve, the longer it is, the more rational n-gram fragments are screened. It can be seen in fig. 4 and 5 that the curve of the solid line PFNE using the pat algorithm is the highest and longest among the three curves, thereby illustrating that the method for constructing the chinese non-segmented word embedding model of the present invention can screen more reasonable n-gram fragments than the word embedding model (basic dictionary or rich dictionary) of the prior art.
The protection scope of the method for constructing the Chinese segmentation-free word embedding model is not limited to the execution order of the steps listed in this embodiment; all schemes obtained by adding, removing, or replacing steps of the prior art according to the principle of the invention are included in the protection scope.
This embodiment provides a computer storage medium on which a computer program is stored; when executed by a processor, the program implements the above method for constructing the Chinese segmentation-free word embedding model.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned computer-readable storage media comprise: various computer storage media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The system for constructing the Chinese segmentation-free word embedding model provided by this embodiment is described in detail below with reference to the drawings. It should be noted that the division into the following modules is only a logical division; in an actual implementation they may be wholly or partially integrated into one physical entity or physically separated. The modules may all be implemented as software invoked by a processing element, all as hardware, or partly as software invoked by a processing element and partly as hardware. For example, a module may be a separately established processing element, or may be integrated into a chip of the system described below; alternatively, a module may be stored in the memory of the system in the form of program code, and a processing element of the system may call and execute its function. The other modules are implemented similarly. All or some of the modules may be integrated together or implemented independently. The processing element referred to here may be an integrated circuit with signal processing capability. In implementation, the steps of the above method or the following modules may be completed by hardware integrated logic circuits in a processor element or by software instructions.
The following modules may be one or more integrated circuits configured to implement the above method, for example one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). When a module is implemented as program code invoked by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of invoking program code. These modules may also be integrated together and implemented as a System-on-a-Chip (SoC).
Please refer to fig. 6, which is a schematic structural diagram of a system for constructing a Chinese non-segmented word embedding model according to an embodiment of the present invention. As shown in fig. 6, the system 6 for constructing the Chinese non-segmented word embedding model includes: a segment statistics module 61, an association metric module 62, and a model generation module 63.
The segment statistics module 61 is configured to count the candidate segments in the corpus and the word frequency information corresponding to the candidate segments.
In this embodiment, the candidate segments are Chinese language-model (n-gram) segments, and the segment statistics module 61 is specifically configured to count, in the corpus, the Chinese n-gram segments corresponding to different fixed length values and their word frequency information.
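As an illustrative sketch of this counting step (not the patent's exact implementation), fixed-length n-gram segments and their frequencies can be collected by sliding a window over each raw, unsegmented sentence; the corpus strings and length values below are hypothetical examples:

```python
from collections import Counter

def count_ngrams(corpus, lengths=(2, 3)):
    """Slide a window of each fixed length over every raw (unsegmented)
    sentence and count how often each candidate segment occurs."""
    counts = {n: Counter() for n in lengths}
    for sentence in corpus:
        for n in lengths:
            for i in range(len(sentence) - n + 1):
                counts[n][sentence[i:i + n]] += 1
    return counts

# Hypothetical two-sentence corpus; real use would stream a large corpus.
corpus = ["自然语言处理很有趣", "语言模型处理文本"]
counts = count_ngrams(corpus)
```

Here `counts[2]["语言"]`, for example, is the bigram's frequency, i.e. the word frequency information attached to that candidate segment.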
The association metric module 62 is configured to determine the association strength of the candidate segments in combination with the word frequency information, and to generate a word-embedding vocabulary according to the association strength.
In this embodiment, the association metric module 62 is specifically configured to determine, in combination with the word frequency information, an unsupervised association metric for each candidate segment, where the unsupervised association metric characterizes the association strength of the candidate segment; the candidate segments are then sorted by association strength in descending order, and the top K candidate segments are selected as the word-embedding vocabulary.
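The ranking-and-selection step can be sketched as follows. Note that the patent defines its own unsupervised association metric (built from mutual information and a statistical relationship value); plain pointwise mutual information is used below purely as an illustrative stand-in, and all counts are hypothetical:

```python
import math

def top_k_segments(seg_counts, char_counts, total, k):
    """Rank candidate segments by an association score and keep the top k.
    Plain PMI stands in for the patent's unsupervised association metric."""
    def pmi(seg, freq):
        p_seg = freq / total            # probability of the whole segment
        p_indep = 1.0                   # probability under independence
        for ch in seg:
            p_indep *= char_counts[ch] / total
        return math.log(p_seg / p_indep)

    ranked = sorted(seg_counts.items(), key=lambda kv: -pmi(*kv))
    return [seg for seg, _ in ranked[:k]]

# Hypothetical counts: "语言" co-occurs tightly, "的人" is incidental.
vocab = top_k_segments({"语言": 4, "的人": 2},
                       {"语": 4, "言": 4, "的": 9, "人": 5},
                       total=20, k=1)
```

Segments whose characters nearly always appear together score high and enter the vocabulary; incidental character pairings score low and are discarded.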
The model generation module 63 is configured to construct a positive sampling set and a negative sampling set according to the vocabulary, and construct a word embedding model by combining the positive sampling set and the negative sampling set.
In this embodiment, the model generation module 63 is specifically configured to construct the word embedding model on the basis of a skip-gram model combined with negative sampling, using a parameter optimization method that maximizes the positive sampling probability and minimizes the negative sampling probability.
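A minimal sketch of the skip-gram-with-negative-sampling objective referred to above follows (a toy NumPy SGD loop with hypothetical hyperparameters; the patent does not specify these details):

```python
import numpy as np

def train_sgns(pairs, vocab_size, dim=16, neg=3, lr=0.05, epochs=200, seed=0):
    """Toy skip-gram with negative sampling: raise sigmoid(v_c . u_o) for
    observed (center, context) pairs and lower it for sampled negatives."""
    rng = np.random.default_rng(seed)
    V = rng.normal(scale=0.1, size=(vocab_size, dim))  # center-word vectors
    U = rng.normal(scale=0.1, size=(vocab_size, dim))  # context-word vectors
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    for _ in range(epochs):
        for c, o in pairs:
            # one positive sample plus `neg` negatives (re-drawn if they hit o)
            samples = [(o, 1.0)]
            while len(samples) < neg + 1:
                j = int(rng.integers(vocab_size))
                if j != o:
                    samples.append((j, 0.0))
            for j, label in samples:
                g = sigmoid(V[c] @ U[j]) - label   # d(loss)/d(score)
                U[j], V[c] = U[j] - lr * g * V[c], V[c] - lr * g * U[j]
    return V, U

# Hypothetical positive pairs over a 4-segment vocabulary.
V, U = train_sgns([(0, 1), (1, 0)], vocab_size=4)
score = float(V[0] @ U[1])   # pushed positive by the positive-sample updates
```

The gradient `sigmoid(score) - label` pushes observed pair scores up (maximizing the positive sampling probability) and sampled negative pair scores down, which is the optimization behavior the module describes.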
The above system for constructing the Chinese non-segmented word embedding model can implement the method for constructing the Chinese non-segmented word embedding model; however, the apparatus for implementing that method includes, but is not limited to, the structure of the system listed in this embodiment, and all structural variations and replacements of the prior art made according to the principles of the present invention are included in the protection scope of the present invention.
Please refer to fig. 7, which is a schematic structural connection diagram of a device for constructing a Chinese non-segmented word embedding model according to an embodiment of the present invention. As shown in fig. 7, the present embodiment provides a device 7 including: a processor 71, a memory 72, a communication interface 73, and/or a system bus 74. The memory 72 and the communication interface 73 are connected to the processor 71 through the system bus 74 and communicate with each other; the memory 72 is used for storing a computer program, the communication interface 73 is used for communicating with other devices, and the processor 71 is used for running the computer program to enable the device 7 to execute the steps of the method for constructing the Chinese non-segmented word embedding model.
The system bus 74 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus may be divided into an address bus, a data bus, a control bus, and so on. The communication interface 73 is used to enable communication between the database access device and other devices (such as a client, a read-write library, and a read-only library). The memory 72 may include a Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one disk memory.
The processor 71 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In summary, the method, system, medium and device for constructing a Chinese non-segmented word embedding model of the present invention provide a new unsupervised association metric for screening n-gram segments with strong internal association. This unsupervised association metric is combined with a word embedding model to construct a new segmentation-free Chinese word embedding model oriented to Chinese corpora. The word embedding model obtained by the present invention shows better performance in downstream tasks. The present invention effectively overcomes various defects in the prior art and has high industrial utilization value.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical principles of the present invention shall be covered by the claims of the present invention.

Claims (10)

1. A method for constructing a Chinese non-segmented word embedding model, characterized by comprising the following steps:
counting candidate fragments in a corpus and word frequency information corresponding to the candidate fragments;
determining the association strength of the candidate segments by combining the word frequency information, and generating a word embedded vocabulary list according to the association strength;
and constructing a positive sampling set and a negative sampling set according to the vocabulary, and constructing a word embedding model by combining the positive sampling set and the negative sampling set.
2. The method for constructing the Chinese non-segmented word embedding model according to claim 1, wherein the candidate segments are Chinese language-model (n-gram) segments, and the step of counting the candidate segments in the corpus and the word frequency information corresponding to the candidate segments comprises:
counting, in the corpus, the Chinese n-gram segments corresponding to different fixed length values and their word frequency information.
3. The method of claim 1, wherein the step of determining the association strength of the candidate segment in combination with the word frequency information and generating a vocabulary for word embedding according to the association strength comprises:
determining an unsupervised association metric of the candidate segment in combination with the word frequency information, wherein the unsupervised association metric characterizes the association strength of the candidate segment;
and sorting the candidate segments by association strength in descending order, and selecting the top K candidate segments as the word-embedding vocabulary.
4. The method for constructing the Chinese non-segmented word embedding model according to claim 3, wherein the step of determining the unsupervised association metric of the candidate segment in combination with the word frequency information comprises:
calculating mutual information values of the candidate segment, and determining the corresponding segment combination at which the mutual information value is minimum;
determining a first set and a second set according to the segment combination, and calculating a statistical relationship value between the segment combination and the first set or the second set;
and taking the product of the word frequency information, the mutual information value and the statistical relationship value as the unsupervised association metric.
5. The method according to claim 4, wherein the step of determining the first set and the second set according to the segment combination and calculating the statistical relationship value between the segment combination and the first set or the second set comprises:
taking, as a numerator, the maximum of the ratio of the word frequency information to the word frequency of the first set and the ratio of the word frequency information to the word frequency of the second set; selecting, from the first set and the second set, the set with the smaller word frequency, and taking the reciprocal of the number of elements in that set as a denominator;
calculating the fraction formed by the numerator and the denominator as the relative importance value of the segment combination in the first set or the second set;
and determining the statistical relationship value according to the relative importance value.
6. The method for constructing the Chinese non-segmented word embedding model according to claim 3, wherein:
the association strength is calculated for the candidate segments of each length;
and for candidate segments of different lengths, the candidate segments of each length are sorted by association strength in descending order, the top K candidate segments of each length are selected as word-embedding vocabulary entries, and different numbers of candidate segments may be selected for different lengths.
7. The method for constructing the Chinese non-segmented word embedding model according to claim 1, wherein the step of constructing a positive sampling set and a negative sampling set according to the vocabulary and constructing the word embedding model by combining the positive sampling set and the negative sampling set comprises:
and on the basis of combining a skip-gram model with negative sampling, adopting a parameter optimization method to maximize the positive sampling probability and minimize the negative sampling probability, and constructing the word embedding model.
8. A system for constructing a Chinese non-segmented word embedding model, characterized by comprising:
a segment statistics module, configured to count candidate segments in a corpus and word frequency information corresponding to the candidate segments;
an association metric module, configured to determine the association strength of the candidate segments in combination with the word frequency information, and to generate a word-embedding vocabulary according to the association strength; and
a model generation module, configured to construct a positive sampling set and a negative sampling set according to the vocabulary, and to construct a word embedding model by combining the positive sampling set and the negative sampling set.
9. A medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method for constructing the Chinese non-segmented word embedding model according to any one of claims 1 to 7.
10. An apparatus, comprising: a processor and a memory;
the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory to enable the apparatus to execute the method for constructing the Chinese non-segmented word embedding model according to any one of claims 1 to 7.
CN202010437000.3A 2020-05-21 2020-05-21 Method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model Active CN113705227B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010437000.3A CN113705227B (en) 2020-05-21 2020-05-21 Method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model

Publications (2)

Publication Number Publication Date
CN113705227A true CN113705227A (en) 2021-11-26
CN113705227B CN113705227B (en) 2023-04-25

Family

ID=78645861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010437000.3A Active CN113705227B (en) 2020-05-21 2020-05-21 Method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model

Country Status (1)

Country Link
CN (1) CN113705227B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933183A (en) * 2015-07-03 2015-09-23 重庆邮电大学 Inquiring term rewriting method merging term vector model and naive Bayes
CN106095736A (en) * 2016-06-07 2016-11-09 华东师范大学 A kind of method of field neologisms extraction
CN107015963A (en) * 2017-03-22 2017-08-04 重庆邮电大学 Natural language semantic parsing system and method based on deep neural network
CN107273352A (en) * 2017-06-07 2017-10-20 北京理工大学 A kind of word insertion learning model and training method based on Zolu functions
CN107491444A (en) * 2017-08-18 2017-12-19 南京大学 Parallelization word alignment method based on bilingual word embedded technology
CN108959431A (en) * 2018-06-11 2018-12-07 中国科学院上海高等研究院 Label automatic generation method, system, computer readable storage medium and equipment
CN110390018A (en) * 2019-07-25 2019-10-29 哈尔滨工业大学 A kind of social networks comment generation method based on LSTM

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GEEWOOK KIM 等: "Segmentation-free compositional n-gram embedding" *
XIAOBIN WANG 等: "Unsupervised Learning Helps Supervised Neural Word Segmentation" *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant