CN113076740A - Synonym mining method and device in government affair service field - Google Patents


Publication number
CN113076740A
Authority
CN
China
Prior art keywords
abbreviation
synonym
word
candidate
entries
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110350274.3A
Other languages
Chinese (zh)
Inventor
聂砂
刘海
贾国琛
罗奕康
崔震
戴菀庭
师文宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp
Priority to CN202110350274.3A
Publication of CN113076740A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/247: Thesauruses; Synonyms
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention relates to the technical field of artificial intelligence, and provides a synonym mining method and a synonym mining device in the field of government affair service, wherein the method comprises the following steps: generating a new word candidate word set; determining a multivariate statistical index value of each new word candidate word in the new word candidate word set, and screening out the new word candidate words with the multivariate statistical index value lower than a preset threshold value to obtain the new word set in the government affair service field; recording the new word set into a word library in the field of government affairs service, and constructing an abbreviation synonym set; inputting the abbreviation synonym set into a pre-constructed double-tower model to obtain the similarity between an original entry and a plurality of abbreviation synonym candidate entries; and obtaining the abbreviation synonyms corresponding to the original entry according to the similarity between the original entry and the multiple candidate abbreviation synonym entries, and recording the abbreviation synonyms into a synonym word library in the field of government affairs services. The new word candidate word set is filtered by utilizing the multivariate statistical indexes, so that the accuracy rate of new word discovery is improved; by constructing the double-tower model, the abbreviation synonym can be effectively identified, and the efficiency of synonym mining is improved.

Description

Synonym mining method and device in government affair service field
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a synonym mining method and device in the field of government affair services.
Background
Synonyms are often used in natural language processing tasks such as information retrieval, text classification, and the like. Specifically, before performing processing tasks such as information search and text classification, it is necessary to acquire synonyms and recognize synonyms. For example, in an application scenario of information retrieval, a plurality of words belonging to synonyms may be classified into the same category, and when there is a keyword of a synonym in an input text, the synonym may be used to replace the original keyword for searching, thereby increasing the recall rate of searching.
However, with the continuous development of the government affairs service field in recent years, a large number of new words and proper nouns have emerged, which makes word segmentation inaccurate and synonym mining difficult. Because Chinese semantics are highly varied, synonym mining is inherently hard, and most existing synonym libraries in the government affairs service field are filled in manually, so synonym mining efficiency is low. In addition, there is currently no unified standard for defining new words in the government affairs service field; the field is broad and covers sub-fields such as humanities, economy and society, so the composition rules of new words are difficult to master and differ from one sub-field to another; and because text in the government affairs service field is sparse, low-frequency new words are difficult to recognize and the accuracy of new word discovery is low.
Disclosure of Invention
The embodiment of the invention provides a synonym mining method in the field of government affairs service, which is used for improving the efficiency of synonym mining and the accuracy of new word discovery, and comprises the following steps:
generating a new word candidate word set according to a language database in the government affair service field;
determining a multivariate statistical index value of each new word candidate word in the new word candidate word set, and screening out the new word candidate words with the multivariate statistical index value lower than a preset threshold value to obtain the new word set in the government affair service field;
recording a new word set in the government affair service field into a word bank in the government affair service field, and constructing an abbreviation synonym set according to the word bank in the government affair service field; the abbreviation synonym set comprises an original entry and a plurality of abbreviation synonym candidate entries corresponding to the original entry;
inputting the abbreviation synonym set into a pre-constructed double-tower model to obtain the similarity between the original entry and the multiple abbreviation synonym candidate entries; the double-tower model is obtained by training according to the word bank in the government affairs service field and is used for determining the similarity between the entry in the abbreviation synonym set and the candidate entry of the abbreviation synonym;
and obtaining the abbreviation synonyms corresponding to the original entry according to the similarity between the original entry and the plurality of abbreviation synonym candidate entries, and recording the abbreviation synonyms into a synonym word bank in the field of government affairs services.
In the embodiment of the present invention, generating a new word candidate word set according to a corpus in the field of government affairs services includes:
splicing any two adjacent words in a language database in the government affair service field by using a BiGram method to serve as candidate new words to obtain a plurality of candidate new words;
and filtering the candidate new words containing the non-word-forming elements in the plurality of candidate new words to generate a new word candidate word set.
In specific implementation, the multivariate statistical indicator comprises:
word frequency, degree of adjacent variation, information entropy, and pointwise mutual information.
In a specific embodiment of the present invention, the boundary of the preset threshold is determined according to the relative shift between the distribution of the multivariate statistical index over the government affairs service field dictionary and its distribution over the new word candidate words.
In the specific embodiment of the invention, the construction of the synonym set of the abbreviations according to the word bank in the field of government affairs services comprises the following steps:
selecting a word from a word bank in the field of government affairs service as an original entry;
randomly deleting some of the sub-words in the original entry to form a plurality of possible abbreviation synonym entries;
calculating the edit distance between the original entry and each possible abbreviation synonym entry, and taking the possible abbreviation synonym entries whose edit distance meets a preset value as the abbreviation synonym candidate entries corresponding to the original entry;
and generating an abbreviation synonym set according to the original entry and the abbreviation synonym candidate entry corresponding to the original entry.
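The candidate construction described above can be illustrated with a minimal Python sketch. It rests on several assumptions that the source does not state: deletions are enumerated exhaustively instead of being sampled at random (so that the output is reproducible), "meeting a preset value" is read as "edit distance not greater than `max_distance`", and the names `edit_distance`, `abbreviation_candidates`, `max_distance` and `min_length` are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def abbreviation_candidates(entry, max_distance, min_length=2):
    """Delete characters from `entry` and keep results within `max_distance` edits.

    Assumption: all deletions are enumerated (the source deletes at random),
    and the preset value is treated as an upper bound on the edit distance.
    """
    candidates = set()
    for keep in range(min_length, len(entry)):
        for idx in combinations(range(len(entry)), keep):
            candidate = "".join(entry[i] for i in idx)
            if edit_distance(entry, candidate) <= max_distance:
                candidates.add(candidate)
    return sorted(candidates)


# Hypothetical original entry ("resident identity card").
abbreviation_set = {"居民身份证": abbreviation_candidates("居民身份证", max_distance=2)}
```

Because every candidate here is produced by deletion only, its edit distance to the original entry equals the number of deleted characters; the explicit distance check is kept so that the sketch still applies if candidates are produced in some other way.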
In a specific implementation process, the pre-construction process of the double-tower model comprises the following steps:
constructing an abbreviation synonym sample set according to the government affair service field word library; sample data in the abbreviation synonym sample set comprises original entries, abbreviation entries and synonym marks; the synonym mark is used for representing whether the original entry and the abbreviation entry are synonyms or not;
respectively inputting the original entries and the abbreviation entries in the abbreviation synonym sample set into a Bert model to obtain embedding word vectors corresponding to the original entries and embedding word vectors corresponding to the abbreviation entries;
taking the embedding word vectors corresponding to the original entries and the embedding word vectors corresponding to the abbreviation entries as the word vector representations of a CNN-DSSM, and constructing a double-tower model;
and training the constructed double-tower model by using the abbreviation synonym sample set to obtain the trained double-tower model.
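The scoring idea behind a double-tower model, in which each entry is encoded by its own tower and the two outputs are compared, can be sketched in a few lines of Python. This is only a toy illustration: the `encode` function below is a placeholder that counts character unigrams and bigrams, and it is not the Bert embedding or the CNN-DSSM network described above; the use of cosine similarity and the names `encode`, `cosine` and `two_tower_similarity` are likewise assumptions.

```python
import math
from collections import Counter


def encode(entry):
    """Placeholder tower: character unigram and bigram counts (NOT Bert or CNN-DSSM)."""
    features = Counter(entry)
    features.update(entry[i:i + 2] for i in range(len(entry) - 1))
    return features


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def two_tower_similarity(original, abbreviation):
    """Encode each entry with its own tower, then compare the two outputs."""
    return cosine(encode(original), encode(abbreviation))


# Hypothetical entries: a plausible abbreviation and an unrelated word.
close = two_tower_similarity("居民身份证", "身份证")
far = two_tower_similarity("居民身份证", "营业执照")
```

In a trained model the two towers would be learned from the labelled sample set, so that true abbreviations score high even when they share few surface characters with the original entry.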
Specifically, the sample set of synonyms of abbreviations includes:
a positive sample subset and a negative sample subset;
the abbreviation entries of the positive sample subset are obtained by randomly adding to and deleting from the original entries, and the edit distance between the abbreviation entries of the positive sample subset and the original entries meets a preset value; the synonym marks of the positive sample subset are set to synonym;
the synonym marks of the negative sample subset are set to non-synonym.
An embodiment of the present invention further provides a synonym mining method in the field of government affairs services which, on the basis of the synonym mining method provided above, further includes:
sequencing a plurality of abbreviation synonym candidate entries in the abbreviation synonym set by utilizing a BM25 algorithm to obtain a first sequencing result of the abbreviation synonym candidate entries;
obtaining a second ordering result of the multiple abbreviation synonym candidate entries according to the similarity between the original entry and the multiple abbreviation synonym candidate entries;
combining the first ordering results of the multiple abbreviation synonym candidate entries and the second ordering results of the multiple abbreviation synonym candidate entries to determine the ordering results of the multiple abbreviation synonym candidate entries;
and determining the first candidate entry of the abbreviation synonyms in the sequencing result as the abbreviation synonyms corresponding to the original entry.
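The two-stage ranking above can be sketched as follows. The source does not say how BM25 is applied to entries or how the two rankings are combined, so this sketch makes explicit assumptions: single characters are used as BM25 terms, each candidate entry is treated as a document, the two rankings are merged by summing rank positions, and ties are broken by the similarity ranking. The similarity values are made-up numbers standing in for the double-tower output, and all function names are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each candidate in `docs`, using characters as terms."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(ch for d in docs for ch in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for ch in set(query):
            if ch not in tf:
                continue
            idf = math.log((n - df[ch] + 0.5) / (df[ch] + 0.5) + 1)
            score += idf * tf[ch] * (k1 + 1) / (
                tf[ch] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores


def ranks(scores):
    """Rank position of each score, where 1 is the highest score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    position = [0] * len(scores)
    for rank, i in enumerate(order, 1):
        position[i] = rank
    return position


def combine_rankings(candidates, bm25, similarity):
    """Assumed fusion rule: sort by the sum of both ranks, ties by similarity rank."""
    r1, r2 = ranks(bm25), ranks(similarity)
    order = sorted(range(len(candidates)), key=lambda i: (r1[i] + r2[i], r2[i]))
    return [candidates[i] for i in order]


query = "居民身份证"                       # hypothetical original entry
candidates = ["身份证", "居民身份", "民证"]  # hypothetical candidate entries
bm25 = bm25_scores(query, candidates)       # first ranking signal
similarity = [0.92, 0.80, 0.35]             # made-up double-tower similarities
final = combine_rankings(candidates, bm25, similarity)
best = final[0]                             # taken as the abbreviation synonym
```

Summing rank positions is only one possible fusion rule; a weighted sum of normalized scores would fit the description equally well.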
The embodiment of the invention also provides a synonym mining device in the field of government affairs service, which is used for improving the efficiency of synonym mining and the accuracy rate of new word discovery, and comprises the following components:
the new word candidate set generating module is used for generating a new word candidate set according to the language database in the government affair service field;
the new word generation module is used for determining a multivariate statistical index value of each new word candidate word in the new word candidate word set, and screening out the new word candidate words with the multivariate statistical index values lower than a preset threshold value to obtain a new word set in the government affair service field;
the abbreviation synonym set construction module is used for recording the new word set in the government affair service field into a word bank in the government affair service field, and constructing an abbreviation synonym set according to the word bank in the government affair service field; the abbreviation synonym set comprises an original entry and a plurality of abbreviation synonym candidate entries corresponding to the original entry;
the similarity determining module is used for inputting the abbreviation synonym set into a pre-constructed double-tower model to obtain the similarity between the original entry and the multiple abbreviation synonym candidate entries; the double-tower model is obtained by training according to the word bank in the government affairs service field and is used for determining the similarity between the entry in the abbreviation synonym set and the candidate entry of the abbreviation synonym;
and the abbreviation synonym determining module is used for obtaining the abbreviation synonyms corresponding to the original entries according to the similarity between the original entries and the multiple abbreviation synonym candidate entries and recording the abbreviation synonyms into a synonym thesaurus in the field of government affairs services.
In a specific embodiment, the new word candidate set generating module includes:
the candidate new word generating unit is used for splicing any two adjacent words in the corpus of the government affair service field by using a BiGram method to serve as candidate new words to obtain a plurality of candidate new words;
and the non-word forming filtering unit is used for filtering the candidate new words containing the non-word forming elements in the plurality of candidate new words to generate a new word candidate word set.
In an embodiment of the present invention, the multivariate statistical indicator includes:
word frequency, degree of adjacent variation, information entropy, and pointwise mutual information.
The boundary of the preset threshold is determined according to the relative shift between the distribution of the multivariate statistical index over the government affairs service field dictionary and its distribution over the new word candidate words.
In a specific embodiment, the abbreviation synonym set building module is specifically configured to:
selecting a word from a word bank in the field of government affairs service as an original entry;
randomly deleting some of the sub-words in the original entry to form a plurality of possible abbreviation synonym entries;
calculating the edit distance between the original entry and each possible abbreviation synonym entry, and taking the possible abbreviation synonym entries whose edit distance meets a preset value as the abbreviation synonym candidate entries corresponding to the original entry;
and generating an abbreviation synonym set according to the original entry and the abbreviation synonym candidate entry corresponding to the original entry.
In the specific embodiment of the present invention, the method further includes:
a double tower model pre-construction module for:
constructing an abbreviation synonym sample set according to the government affair service field word library; sample data in the abbreviation synonym sample set comprises original entries, abbreviation entries and synonym marks; the synonym mark is used for representing whether the original entry and the abbreviation entry are synonyms or not;
respectively inputting the original entries and the abbreviation entries in the abbreviation synonym sample set into a Bert model to obtain embedding word vectors corresponding to the original entries and embedding word vectors corresponding to the abbreviation entries;
taking the embedding word vectors corresponding to the original entries and the embedding word vectors corresponding to the abbreviation entries as the word vector representations of a CNN-DSSM, and constructing a double-tower model;
and training the constructed double-tower model by using the abbreviation synonym sample set to obtain the trained double-tower model.
Specifically, the sample set of synonyms of abbreviations includes:
a positive sample subset and a negative sample subset;
the abbreviation entries of the positive sample subset are obtained by randomly adding to and deleting from the original entries, and the edit distance between the abbreviation entries of the positive sample subset and the original entries meets a preset value; the synonym marks of the positive sample subset are set to synonym;
the synonym marks of the negative sample subset are set to non-synonym.
In a specific embodiment of the present invention, a synonym mining device in the field of government affairs services is further provided, and on the basis of the above device, the synonym mining device further includes: a rank adjustment module to:
sequencing a plurality of abbreviation synonym candidate entries in the abbreviation synonym set by utilizing a BM25 algorithm to obtain a first sequencing result of the abbreviation synonym candidate entries;
obtaining a second ordering result of the multiple abbreviation synonym candidate entries according to the similarity between the original entry and the multiple abbreviation synonym candidate entries;
combining the first ordering results of the multiple abbreviation synonym candidate entries and the second ordering results of the multiple abbreviation synonym candidate entries to determine the ordering results of the multiple abbreviation synonym candidate entries;
and determining the first candidate entry of the abbreviation synonyms in the sequencing result as the abbreviation synonyms corresponding to the original entry.
The embodiment of the invention also provides computer equipment which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor realizes the synonym mining method in the government affair service field when executing the computer program.
An embodiment of the present invention also provides a computer-readable storage medium storing a computer program for executing the method for mining synonyms in the field of government affairs services.
In the embodiment of the invention, a new word candidate word set is generated according to a corpus in the government affair service field; determining a multivariate statistical index value of each new word candidate word in the new word candidate word set, and screening out the new word candidate words with the multivariate statistical index value lower than a preset threshold value to obtain the new word set in the government affair service field; recording a new word set in the government affair service field into a word bank in the government affair service field, and constructing an abbreviation synonym set according to the word bank in the government affair service field; the abbreviation synonym set comprises an original entry and a plurality of abbreviation synonym candidate entries corresponding to the original entry; inputting the abbreviation synonym set into a pre-constructed double-tower model to obtain the similarity between an original entry and a plurality of abbreviation synonym candidate entries; the double-tower model is obtained by training according to a word bank in the field of government affairs services and is used for determining the similarity between the entry in the abbreviation synonym set and the candidate entry of the abbreviation synonym; and obtaining the abbreviation synonyms corresponding to the original entry according to the similarity between the original entry and the multiple candidate abbreviation synonym entries, and recording the abbreviation synonyms into a synonym word library in the field of government affairs services. 
Filtering the new word candidate word set by using the multivariate statistical indexes, and screening out the new word candidate words with multivariate statistical index values lower than a preset threshold value, wherein the accuracy of new word discovery is improved by the multivariate statistical indexes; by constructing the double-tower model, the abbreviation synonyms in the field of government affairs service can be effectively identified, and compared with the prior art of manually filling the synonym thesaurus, the efficiency of synonym mining is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic diagram of a synonym mining method in the field of government affairs services according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of an implementation method of step 101 in the embodiment of the present invention.
Fig. 3 is a schematic diagram of a method for implementing step 103 in the embodiment of the present invention.
FIG. 4 is a schematic diagram of a pre-construction process of a two-tower model in an embodiment of the present invention.
Fig. 5 is a diagram illustrating a synonym mining method for the field of government services according to an embodiment of the present invention.
Fig. 6 is a flow chart illustrating a new word discovery process according to an embodiment of the present invention.
FIG. 7 is a schematic diagram of a two tower model training process in accordance with one embodiment of the present invention.
FIG. 8 is a diagram illustrating a candidate mining process in accordance with an embodiment of the present invention.
Fig. 9 is a schematic diagram of a synonym mining device in the field of government affairs services according to an embodiment of the present invention.
Fig. 10 is a schematic structural diagram of a new word candidate set generating module 901 in the embodiment of the present invention.
Fig. 11 is a schematic diagram of a synonym mining device in the field of government services according to an embodiment of the present invention.
FIG. 12 is a schematic diagram of an electronic device for synonym mining for the field of government services in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a synonym mining method in the field of government affairs service, which is used for improving the efficiency of synonym mining and the accuracy rate of new word discovery, and as shown in figure 1, the method comprises the following steps:
step 101: generating a new word candidate word set according to a language database in the government affair service field;
step 102: determining a multivariate statistical index value of each new word candidate word in the new word candidate word set, and screening out the new word candidate words with the multivariate statistical index value lower than a preset threshold value to obtain the new word set in the government affair service field;
step 103: recording a new word set in the government affair service field into a word bank in the government affair service field, and constructing an abbreviation synonym set according to the word bank in the government affair service field; the abbreviation synonym set comprises an original entry and a plurality of abbreviation synonym candidate entries corresponding to the original entry;
step 104: inputting the abbreviation synonym set into a pre-constructed double-tower model to obtain the similarity between an original entry and a plurality of abbreviation synonym candidate entries; the double-tower model is obtained by training according to a word bank in the field of government affairs services and is used for determining the similarity between the entry in the abbreviation synonym set and the candidate entry of the abbreviation synonym;
step 105: and obtaining the abbreviation synonyms corresponding to the original entry according to the similarity between the original entry and the multiple candidate abbreviation synonym entries, and recording the abbreviation synonyms into a synonym word library in the field of government affairs services.
As can be known from the process shown in fig. 1, in the embodiment of the present invention, a new word candidate word set is generated according to a corpus of a government affairs service field; determining a multivariate statistical index value of each new word candidate word in the new word candidate word set, and screening out the new word candidate words with the multivariate statistical index value lower than a preset threshold value to obtain the new word set in the government affair service field; recording a new word set in the government affair service field into a word bank in the government affair service field, and constructing an abbreviation synonym set according to the word bank in the government affair service field; the abbreviation synonym set comprises an original entry and a plurality of abbreviation synonym candidate entries corresponding to the original entry; inputting the abbreviation synonym set into a pre-constructed double-tower model to obtain the similarity between an original entry and a plurality of abbreviation synonym candidate entries; the double-tower model is obtained by training according to a word bank in the field of government affairs services and is used for determining the similarity between the entry in the abbreviation synonym set and the candidate entry of the abbreviation synonym; and obtaining the abbreviation synonyms corresponding to the original entry according to the similarity between the original entry and the multiple candidate abbreviation synonym entries, and recording the abbreviation synonyms into a synonym word library in the field of government affairs services. 
Filtering the new word candidate word set by using the multivariate statistical indexes, and screening out the new word candidate words with multivariate statistical index values lower than a preset threshold value, wherein the accuracy of new word discovery is improved by the multivariate statistical indexes; by constructing the double-tower model, the abbreviation synonyms in the field of government affairs service can be effectively identified, and compared with the prior art of manually filling the synonym thesaurus, the efficiency of synonym mining is improved.
In specific implementation, a new word candidate word set is generated according to a corpus of the government affairs service field. Specifically, according to the method shown in fig. 2, generating a new word candidate set includes:
step 201: splicing any two adjacent words in a language database in the government affair service field by using a BiGram method to serve as candidate new words to obtain a plurality of candidate new words;
step 202: and filtering the candidate new words containing the non-word-forming elements in the plurality of candidate new words to generate a new word candidate word set.
The BiGram method is a common bigram word-formation method in NLP (natural language processing), in which every two adjacent characters from the beginning to the end of a sentence are combined into a word. Based on the BiGram method, any two adjacent words in each sentence of the corpus in the government affairs service field can be spliced to obtain a plurality of candidate new words. After the candidate new words are obtained, the candidate new words containing non-word-forming elements are filtered out to generate the new word candidate word set. In particular embodiments, commonly used stop words, for example "and", "or" and "after", are regarded as non-word-forming elements; if a candidate new word contains a non-word-forming element, it is considered unable to form a word and is filtered out. Further, the candidate new words can be filtered according to the characteristics of words in the government affairs service field. For example, combinations of the forms noun + noun, noun + suffix, prefix + noun and verb + noun have a relatively high word-formation rate, so candidate new words conforming to these forms are retained and added to the new word candidate word set, for example: marriage registration, taxi driver's license, etc.
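Steps 201 and 202 can be illustrated with a short Python sketch. It assumes that the corpus has already been segmented into tokens, that a candidate is dropped when either of its two tokens equals a stop word, and that the stop-word list shown is a small hypothetical stand-in for the non-word-forming elements; the part-of-speech filtering mentioned above is not covered.

```python
from collections import Counter

# Hypothetical stop-word list standing in for the non-word-forming elements.
STOP_WORDS = {"和", "或", "的", "后"}


def bigram_candidates(sentences, stop_words=STOP_WORDS):
    """Splice every pair of adjacent tokens and count the resulting candidates."""
    counts = Counter()
    for tokens in sentences:
        for left, right in zip(tokens, tokens[1:]):
            # Drop candidates in which either token is a non-word-forming element.
            if left in stop_words or right in stop_words:
                continue
            counts[left + right] += 1
    return counts


# Two hypothetical pre-segmented sentences.
sentences = [
    ["办理", "婚姻", "登记", "的", "流程"],
    ["婚姻", "登记", "和", "户口", "迁移"],
]
candidates = bigram_candidates(sentences)
```

Returning counts rather than a plain set keeps the word frequency of each candidate available for the later statistical filtering.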
After the new word candidate word set is generated, the new word candidate words in it need to be further filtered. In specific implementation, a multivariate statistical index value of each new word candidate word in the new word candidate word set is determined, and the new word candidate words with the multivariate statistical index value lower than a preset threshold value are screened out to obtain the new word set in the government affairs service field. In an embodiment, the multivariate statistical indicator comprises: word frequency, degree of adjacent variation, information entropy, and pointwise mutual information.
(1) Word frequency: the frequency with which a new word candidate word appears in the corpus of the government affairs service field; it mainly reflects how likely the new word candidate word is to appear on its own. This index filters out most new word candidate words and effectively improves filtering efficiency.
(2) Degree of adjacent variation: used to measure how likely a new word candidate word is to appear in different contexts. If a new word candidate word can adapt to many different contexts, it is more likely to form a word. For a word string $w = \{w_1, \dots, w_n\}$ containing n characters, let $V_L(w)$ denote the number of different characters adjacent to the left of the word string and $V_R(w)$ denote the number of different characters adjacent to the right of the word string. The degree of adjacent variation is defined as:
$$VG(w) = \log V(w)$$
wherein $V(w) = \min(V_L(w), V_R(w))$.
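A minimal Python sketch of the degree of adjacent variation, assuming the corpus is a list of unsegmented sentence strings and that the natural logarithm is used (the source does not fix the base). The function name and the choice to return `None` when one side has no neighbor are hypothetical.

```python
import math


def adjacency_variety(candidate, corpus):
    """Return log(min(V_L, V_R)) for `candidate`, or None if a side has no neighbor."""
    left, right = set(), set()
    for sentence in corpus:
        start = sentence.find(candidate)
        while start != -1:
            end = start + len(candidate)
            if start > 0:
                left.add(sentence[start - 1])   # distinct left-adjacent character
            if end < len(sentence):
                right.add(sentence[end])        # distinct right-adjacent character
            start = sentence.find(candidate, start + 1)
    v = min(len(left), len(right))
    return math.log(v) if v > 0 else None


# Hypothetical corpus sentences containing the candidate "婚姻登记".
corpus = ["办理婚姻登记手续", "预约婚姻登记服务", "婚姻登记处"]
vg = adjacency_variety("婚姻登记", corpus)
```

Here the candidate has two distinct left neighbors and three distinct right neighbors, so the value is log 2.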
(3) Information entropy: used to measure the uncertainty of a random variable; the larger the information entropy of the random variable, the greater the uncertainty. Here the information entropy measures the uncertainty of the characters adjacent to the word string of a new word candidate word on either side. The larger the information entropy before and after the new word candidate word, the more likely it is to form a word on its own; otherwise, it is more likely to combine with the preceding and following text to form a word.
$$H_L(w) = -\sum_{i=1}^{n} P(w_i) \log P(w_i)$$
$$H_R(w) = -\sum_{j=1}^{m} P(w_j) \log P(w_j)$$
Wherein w represents the word string of the new word candidate word;
$H_L(w)$ represents the information entropy of the characters adjacent before the new word candidate word;
$P(w_i)$ represents the occurrence probability of the i-th character adjacent before the new word candidate word;
n represents the total number of characters adjacent before the word string of the new word candidate word;
$H_R(w)$ represents the information entropy of the characters adjacent after the new word candidate word;
$P(w_j)$ represents the occurrence probability of the j-th character adjacent after the new word candidate word;
m represents the total number of characters adjacent after the word string of the new word candidate word.
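The two entropies can be estimated from neighbour counts, as in the following sketch. It assumes the standard Shannon entropy with natural logarithm and empirical probabilities taken from a list of plain strings; the helper names are hypothetical.

```python
import math
from collections import Counter

def neighbour_counts(candidate, sentences):
    """Count the characters seen immediately left and right of each occurrence."""
    left, right = Counter(), Counter()
    for text in sentences:
        start = text.find(candidate)
        while start != -1:
            end = start + len(candidate)
            if start > 0:
                left[text[start - 1]] += 1
            if end < len(text):
                right[text[end]] += 1
            start = text.find(candidate, start + 1)
    return left, right

def entropy(counts):
    """Shannon entropy (natural log) of the empirical distribution in counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0  # assumption: no neighbours observed means no uncertainty
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def boundary_entropies(candidate, sentences):
    """Return (H_L, H_R) for the candidate string."""
    left, right = neighbour_counts(candidate, sentences)
    return entropy(left), entropy(right)
```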
(4) Pointwise mutual information: mutual information represents the reduction in the uncertainty of a random variable Y once the random variable X is known. Because each random variable takes a single specific value in new word discovery, the invention uses pointwise mutual information as the index for measuring this characteristic:
$$PMI(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$$
wherein p(x) and p(y) represent the probabilities of x and y appearing in the document, respectively;
p(x, y) represents the probability that the new word formed by combining x and y appears in the document. This index quantifies the degree of cohesion of a character combination: the larger the value, the more likely x and y are to combine into a new word and the higher the degree of cohesion; conversely, the smaller the value, the more likely a phrase boundary exists between x and y and the lower the degree of cohesion.
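As a small worked sketch, pointwise mutual information can be computed from raw counts. The function name, the use of a single normalising total for all three probabilities, and the natural logarithm are assumptions for illustration.

```python
import math

def pointwise_mutual_information(count_xy, count_x, count_y, total):
    """log(p(x, y) / (p(x) * p(y))) with probabilities estimated as count / total."""
    if count_xy == 0:
        return float("-inf")  # assumption: an unseen combination has no cohesion
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))
```

With `count_xy=10`, `count_x=20`, `count_y=50`, and `total=1000`, the ratio inside the logarithm is 0.01 / (0.02 × 0.05) = 10, so the pair occurs together ten times more often than independence would predict.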
The word frequency, degree of adjacency change, information entropy, and pointwise mutual information of each new word candidate word are counted respectively and compared with the preset threshold, and the new word candidate words that fall outside the preset threshold range are screened out to obtain the new word set of the government affairs service field. In a specific embodiment, the boundary of the preset threshold is determined, for each statistical index, according to the relative offset between the distribution of the index over the government affairs service field dictionary and its distribution over the new word candidate words. Specifically, the boundary [l, r] of the preset threshold is determined by a boundary scanning method. If, for a statistical index, the distribution of the government affairs service field dictionary is shifted to the right relative to that of the new word candidate words, l is the minimum value of the index over all new word candidate words, and r is a quantile of the index in the government affairs service field dictionary. If the distribution of the government affairs service field dictionary is shifted to the left relative to that of the new word candidate words, r is the maximum value of the index over all new word candidate words, and l is a quantile of the index in the government affairs service field dictionary.
And after a new word set of the government affair service field is obtained, the new word set of the government affair service field is recorded into a word bank of the government affair service field, and an abbreviation synonym set is constructed according to the word bank of the government affair service field. The abbreviation synonym set comprises an original entry and a plurality of abbreviation synonym candidate entries corresponding to the original entry. The specific implementation process, as shown in fig. 3, includes:
step 301: selecting a word from a word bank in the field of government affairs service as an original entry;
step 302: randomly deleting partial sub-words in the original entry to form a plurality of possible entries of the abbreviation synonyms;
step 303: calculating the editing distance between the original entry and each possible synonym entry, and taking the possible abbreviation synonym entries with the editing distance meeting a preset value as candidate abbreviation synonym entries corresponding to the original entry;
step 304: and generating an abbreviation synonym set according to the original entry and the abbreviation synonym candidate entry corresponding to the original entry.
Because the range of synonyms is too broad and the rules of Chinese synonyms cannot be generalized, synonym recognition in the embodiment of the invention mainly focuses on the recognition of abbreviations, such as "social security" and its shortened form. Part of the sub-words in the original entry are randomly deleted to form a plurality of possible abbreviation synonym entries; for example, if the original entry is "social security", the possible abbreviation synonym entries formed include: society, security, social security, and the like. The edit distance between the original entry and each of these possible entries is then calculated. The edit distance (Minimum Edit Distance, MED) is an index used to measure the similarity between two sequences. Informally, the edit distance between two words is the minimum number of single-character editing operations required to convert one word into the other.
Exactly three single-character editing operations are defined: insertion, deletion, and replacement.
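The edit distance of two strings a and b is denoted $\mathrm{lev}_{a,b}(|a|, |b|)$, where |a| and |b| are the lengths of a and b, respectively. The edit distance can be described mathematically as:
$$\mathrm{lev}_{a,b}(i,j)=\begin{cases}\max(i,j) & \text{if } \min(i,j)=0,\\ \min\begin{cases}\mathrm{lev}_{a,b}(i-1,j)+1\\ \mathrm{lev}_{a,b}(i,j-1)+1\\ \mathrm{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}\end{cases} & \text{otherwise,}\end{cases}$$
where $1_{(a_i\neq b_j)}$ equals 1 when the i-th character of a differs from the j-th character of b, and 0 otherwise.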
in the embodiment of the present invention, the preset value is set to 2, that is, if the edit distance between the original entry and one possible synonym entry is 2, the possible abbreviation synonym entry is used as a candidate abbreviation synonym entry corresponding to the original entry.
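Steps 302 and 303 can be sketched as follows, under stated assumptions. The edit distance is the standard dynamic-programming Levenshtein distance. Candidate generation here enumerates every deletion exhaustively instead of sampling deletions at random as the text describes, which keeps the example deterministic. The function names and the `preset` parameter are illustrative.

```python
from itertools import combinations

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    replacements needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace, or keep on a match
        prev = curr
    return prev[-1]

def abbreviation_candidates(entry, preset=2):
    """Delete characters from the entry in every possible way and keep the
    results whose edit distance to the entry equals the preset value."""
    found = set()
    for n_kept in range(1, len(entry)):
        for kept in combinations(range(len(entry)), n_kept):
            candidate = "".join(entry[i] for i in kept)
            if levenshtein(entry, candidate) == preset:
                found.add(candidate)
    return sorted(found)
```

Because a pure deletion of k characters always has edit distance k, a preset value of 2 keeps exactly the candidates that drop two characters from the original entry.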
And after constructing an abbreviation synonym set, inputting the abbreviation synonym set into a pre-constructed double-tower model to obtain the similarity between the original entry and a plurality of abbreviation synonym candidate entries. The two-tower model is obtained according to the training of a word bank in the field of government affairs services and is used for determining the similarity between the entry in the abbreviation synonym set and the candidate entry of the abbreviation synonym. In a specific embodiment, the pre-construction process of the double tower model, as shown in fig. 4, includes:
step 401: constructing an abbreviation synonym sample set according to a word bank in the field of government affairs service; sample data in the abbreviation synonym sample set comprises original entries, abbreviation entries and synonym marks; the synonym mark is used for representing whether the original entry and the abbreviation entry are synonyms or not;
step 402: respectively inputting the original entries and the abbreviation entries in the abbreviation synonym sample set into a BERT model to obtain embedding word vectors corresponding to the original entries and embedding word vectors corresponding to the abbreviation entries;
step 403: taking the embedding word vector corresponding to the original entry and the embedding word vector corresponding to the abbreviation entry as word vector expressions of the CNN-DSSM, and constructing a double-tower model;
step 404: and training the constructed double-tower model by using the abbreviation synonym sample set to obtain the trained double-tower model.
In the specific implementation process, the abbreviation synonym sample set comprises a positive sample subset and a negative sample subset. The abbreviation entries of the positive sample subset are obtained by randomly adding and deleting sub-words of the original entries, the edit distance between each abbreviation entry of the positive sample subset and its original entry conforms to the preset value, and the synonym marks of the positive sample subset are marked as synonymous; the synonym marks of the negative sample subset are marked as non-synonymous. In a specific embodiment, each sample in the abbreviation synonym sample set has the form <word 1, word 2, synonym flag>, where the first column, word 1, is the original entry, the second column, word 2, is the constructed abbreviation entry, and the third column, the synonym flag, indicates whether the two are synonymous: the flag is 1 if they are synonymous and 0 otherwise. The sample set construction process is divided into 3 steps:
(1) Constructing positive samples. The positive samples comprise two parts: the first part is the original entries in a public abbreviation word library; the second part is the possible entries formed by randomly adding and deleting some sub-words of an original entry, from which the entries whose edit distance from the original entry is 2 are selected. The synonym flag is marked as 1, and the process is repeated n times;
(2) Constructing negative samples. Sub-words of other entries in the word bank, other than the original entry, are randomly extracted, possible entries are constructed from them in the same manner as in step (1), the synonym flag is marked as 0, and the process is repeated n' times;
(3) Combining the positive samples and the negative samples to form the sample set.
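The three construction steps can be sketched as below. This is a simplified illustration under several assumptions: only deletions are used (the text also mentions additions), the public abbreviation word library is omitted, and the names `delete_chars`, `build_samples`, `n_pos`, and `n_neg` are hypothetical. A negative sample could still happen to resemble a valid abbreviation when entries share characters, which a real pipeline would need to check.

```python
import random

def delete_chars(entry, k, rng):
    """Drop k randomly chosen characters; a pure deletion of k characters
    has edit distance k from the original entry."""
    keep = sorted(rng.sample(range(len(entry)), len(entry) - k))
    return "".join(entry[i] for i in keep)

def build_samples(lexicon, n_pos=2, n_neg=2, distance=2, seed=0):
    """Return a list of (word 1, word 2, synonym flag) triples."""
    rng = random.Random(seed)
    usable = [e for e in lexicon if len(e) > distance]
    samples = []
    for entry in usable:
        # Step (1): positive samples built from the entry itself, flag 1.
        for _ in range(n_pos):
            samples.append((entry, delete_chars(entry, distance, rng), 1))
        # Step (2): negative samples built from other entries, flag 0.
        others = [e for e in usable if e != entry]
        for _ in range(n_neg):
            if others:
                other = rng.choice(others)
                samples.append((entry, delete_chars(other, distance, rng), 0))
    # Step (3): positives and negatives are returned together as one set.
    return samples
```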
The original entries and the abbreviation entries in the abbreviation synonym sample set are input into the BERT model through two separate channels to obtain an embedding word vector corresponding to each original entry and an embedding word vector corresponding to each abbreviation entry. BERT essentially learns a good feature representation for words by running a self-supervised learning method on massive corpora, where self-supervised learning refers to supervised learning run on data without manual labels. In a specific NLP task, the feature representation of BERT can be used directly as the word embedding features of the task. BERT therefore provides a model for transfer learning to other tasks, which can be fine-tuned or frozen according to the task and then used as a feature extractor. The most notable characteristic of BERT is that it abandons the traditional RNN and CNN and, through the attention mechanism, reduces the distance between two words at any positions to 1, which effectively solves the troublesome long-term dependence problem in NLP. Embedding is a way to convert discrete variables into continuous vector representations; in a neural network, embedding is very useful because it not only reduces the spatial dimension of a discrete variable but also represents the variable meaningfully.
After the embedding word vector corresponding to the original entry and the embedding word vector corresponding to the abbreviation entry are acquired, they are taken as the word vector expressions of the CNN-DSSM, and a double-tower model is constructed. The double-tower model (DSSM, Deep Structured Semantic Model) is characterized in that the input layer has two input channels; word vector expressions are obtained through an embedding layer, matching is performed at a matching layer, and the semantic similarity of the two inputs can be converted into a posterior probability for output. In the specific embodiment of the invention, a BERT + CNN + DSSM process is adopted: the original entry and the abbreviation entry are respectively input into the BERT model to obtain embedding word vectors; the embedding word vectors are input into a Text-CNN, and key information is extracted by a max pooling layer; after nonlinear transformation through a multilayer neural network, the cosine similarity between the two resulting vectors is computed. When the label is 1 the cosine similarity should be larger, and otherwise smaller; the loss function adopts AM-Softmax. The model can thus predict the semantic similarity of two entries and also yields a low-dimensional semantic vector expression of an entry; computing the similarity between these learned vectors is more reasonable than computing cosine similarity directly on the raw word vectors.
The CNN-DSSM is also called CLSM (Convolutional Latent Semantic Model). It differs from the DSSM mainly in the input layer and the presentation layer, and is used to remedy the DSSM's loss of context information. Text-CNN applies a convolutional neural network (CNN) to the text classification task and extracts key information in a sentence by using a plurality of kernels of different sizes, so that local relevance can be better captured.
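The presentation and matching layers can be illustrated with a toy forward pass in plain Python. This is only a sketch under strong assumptions: the token embeddings are taken as given (in the described process they would come from BERT), each convolution kernel is a hand-written list of weight rows, the multilayer nonlinear transformation and the AM-Softmax training loss are omitted, and all function names are hypothetical.

```python
import math

def conv_max_pool(embeddings, kernel, bias=0.0):
    """One Text-CNN filter: slide the kernel over the token embeddings,
    apply tanh, and keep the largest activation (max-over-time pooling)."""
    width, dim = len(kernel), len(kernel[0])
    # Assumption: sequences shorter than the kernel are padded with zero vectors.
    padded = list(embeddings) + [[0.0] * dim] * max(0, width - len(embeddings))
    activations = []
    for start in range(len(padded) - width + 1):
        total = bias
        for offset in range(width):
            total += sum(w * x for w, x in zip(kernel[offset], padded[start + offset]))
        activations.append(math.tanh(total))
    return max(activations)

def tower(embeddings, kernels):
    """One tower: a feature vector with one pooled value per kernel."""
    return [conv_max_pool(embeddings, kernel) for kernel in kernels]

def cosine(u, v):
    """Cosine similarity used by the matching layer."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

A matching score for an entry pair would then be `cosine(tower(emb_original, kernels), tower(emb_abbreviation, kernels))`, where sharing `kernels` between the two towers is itself a design assumption.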
And after the similarity between the original entry and the multiple candidate abbreviated synonym entries is obtained, obtaining the abbreviated synonym corresponding to the original entry according to the similarity between the original entry and the multiple candidate abbreviated synonym entries, and inputting the abbreviated synonym into a synonym word bank in the field of government affairs services. Specifically, the abbreviation synonym candidate entry with the highest similarity ranking is used as the abbreviation synonym corresponding to the original entry.
Since the basic principle of CNN-DSSM is to make the vectors of synonymous names as close as possible and the vectors of non-synonymous names as far apart as possible, if the positive samples are too few relative to the total sample size, a bias error due to point estimation is likely to occur, i.e., non-synonymous names are not pushed apart but instead become closer. It has been observed that most non-synonymous names misclassified as synonymous share very few characters with the original name, while the true synonym has usually also been judged close to the name, for example "social insurance" and "social news". Therefore, BM25 is used to perform semantic similarity ranking based on the principle of word co-occurrence, so as to neutralize the bias introduced by the point estimation of CNN-DSSM. Accordingly, in an embodiment of the present invention, a method for mining synonyms in the field of government affairs services is further provided; as shown in fig. 5, on the basis of fig. 1, the method further includes:
step 501: sequencing a plurality of abbreviation synonym candidate entries in the abbreviation synonym set by utilizing a BM25 algorithm to obtain a first sequencing result of the abbreviation synonym candidate entries;
step 502: obtaining a second ordering result of the multiple abbreviation synonym candidate entries according to the similarity between the original entry and the multiple abbreviation synonym candidate entries;
step 503: combining the first ordering results of the multiple abbreviation synonym candidate entries and the second ordering results of the multiple abbreviation synonym candidate entries to determine the ordering results of the multiple abbreviation synonym candidate entries;
step 504: and determining the first candidate entry of the abbreviation synonyms in the ordering result as the abbreviation synonyms corresponding to the original entry.
BM25 is an algorithm for evaluating the relevance between search terms and documents, proposed on the basis of a probabilistic retrieval model. The BM25 algorithm can be briefly described as follows: given a query and a batch of documents D, the relevance score between the query and each document d is calculated. The query is segmented to obtain words q_i, and the score of each word is composed of 3 parts: the relevance between the word q_i and the document d; the relevance between the word q_i and the query; and the weight of each word. The scores of all the words are summed to obtain the relevance score between the query and the document d.
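A minimal sketch of BM25 scoring follows. It uses a commonly seen Okapi BM25 form with the smoothed inverse document frequency as the word weight; the parameters `k1` and `b` and their default values are conventional choices and are not taken from the text. The query-side relevance term is omitted, which is a simplification that is usually harmless when each query word occurs once.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, documents, k1=1.5, b=0.75):
    """Score every tokenized document against the tokenized query."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # number of documents containing each token
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Word weight: smoothed inverse document frequency (always positive).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Word-document relevance with length normalisation.
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores
```

For abbreviation candidates, the original entry would play the role of the query and each candidate entry the role of a document, tokenized for example into characters; that mapping is an assumption about how the ranking in step 501 is set up.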
When the step 503 is implemented, determining the ordering result of the plurality of abbreviation synonym candidate entries according to the following formula:
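$$rank_{new} = \frac{1}{rank_{cnn\text{-}dssm}} + \frac{1}{rank_{BM25}}$$
wherein $rank_{new}$ represents the score used for reordering; $rank_{cnn\text{-}dssm}$ represents the rank of the candidate entry in the second ordering result, which is obtained from the similarity output by the double-tower model; $rank_{BM25}$ represents the rank of the candidate entry in the first ordering result, which is obtained by the BM25 algorithm.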
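The combination in step 503 can be sketched as a sum of reciprocal ranks. The sketch assumes that each rank is a 1-based position in its ordering, that both orderings contain the same candidates, and that a larger combined value means a better candidate; the function name is illustrative.

```python
def fuse_rankings(ranking_cnn_dssm, ranking_bm25):
    """Reorder candidates by 1/rank_cnn_dssm + 1/rank_bm25, best first."""
    pos_a = {c: i for i, c in enumerate(ranking_cnn_dssm, start=1)}
    pos_b = {c: i for i, c in enumerate(ranking_bm25, start=1)}
    fused = {c: 1.0 / pos_a[c] + 1.0 / pos_b[c] for c in pos_a}
    return sorted(fused, key=lambda c: fused[c], reverse=True)
```

For example, with orderings `["x", "y", "z"]` and `["y", "z", "x"]`, candidate `y` scores 1/2 + 1 = 1.5, `x` scores 1 + 1/3, and `z` scores 1/3 + 1/2, so `y` would be taken as the abbreviation synonym in step 504.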
A specific example is given below to illustrate how embodiments of the present invention perform synonym mining in the field of government services.
The example is divided into two parts of new word discovery and synonym identification:
The main difficulties of new word discovery in the current government affairs service field are as follows: there is currently no unified standard for the definition of a new word; the government affairs service field covers a wide range of areas, including a plurality of sub-fields such as humanities, economy, and society, so the composition rules of new words are difficult to master and differ from sub-field to sub-field; and because text in the government affairs service field is sparse, among other reasons, low-frequency new words are difficult to identify.
To address these problems, this example adopts multivariate statistical indexes for filtering so as to improve the accuracy of new word recognition. The new word recognition process of this example is shown in fig. 6 and includes two parts, namely candidate word set generation and statistical index filtering:
Candidate words are generated by the BiGram method: two adjacent words are spliced to serve as a candidate new word. Common stop words are regarded as non-word-forming elements. The candidate words are preliminarily filtered by rules, including a single-word combination rule and a multi-word combination rule, to generate the new word candidate word set.
(1) The single-word combination rule is used for filtering out candidate words that contain non-word-forming elements; the dictionary of non-word-forming single words includes, for example: and, or, after, and the like.
(2) The multi-word combination rule filters mainly according to the characteristics of words in the government affairs service field, in which new words of the forms noun + noun, noun + suffix, prefix + noun, and verb + noun have a higher word formation rate, for example: marital registration, taxi driver's license, etc.
The candidate words are then filtered by multivariate statistical indexes so that the distribution of the candidate word indexes is close to that of the word bank of the government affairs service field. The multivariate statistical indexes comprise word frequency, degree of adjacency change, information entropy, pointwise mutual information, and the like.
And after filtering according to the multivariate statistical indexes, manually screening the obtained new word candidate set, and inputting the new word candidate set into a word bank in the field of government affairs service.
Because the range of synonyms is too broad and the rules of Chinese synonyms cannot be generalized, synonym recognition in this example focuses on the recognition of abbreviations, such as "social security" and its shortened form. The abbreviation recognition adopts supervised learning with a double-tower model, so that errors caused by directly matching on word vector similarity are avoided. The double-tower model training process is shown in fig. 7:
First, an abbreviation sample library is constructed. Entry 1 and entry 2 are input into the BERT model through two separate channels to obtain the corresponding embedding word vectors; the embedding word vectors are respectively input into the presentation layer (CNN) of the double-tower model, the outputs are then input into the matching layer of the double-tower model, and the cosine similarity is calculated, thereby obtaining the matching score of entry 1 and entry 2.
After the model training is completed, the specific processing flow is shown in fig. 8, and includes:
(1) generating candidate words by requiring that the edit distance meets the preset value, and screening candidate entries with similar keywords through the TF-IDF (term frequency-inverse document frequency) algorithm;
(2) inputting the candidate entries and the original entries into a double-tower model together to score the similarity of the candidate entries and the original entries;
(3) sorting the candidate entry set through a BM25 algorithm;
(4) the sorting results of CNN-DSSM and BM25 are combined by the following formula:
$$rank_{new} = \frac{1}{rank_{cnn\text{-}dssm}} + \frac{1}{rank_{BM25}}$$
(5) After the final sorting result is obtained, taking the 1 st as an abbreviation synonym;
(6) and generating a new abbreviation word bank, manually screening, and inputting into a synonym word bank in the field of government affairs service.
According to this specific implementation, the candidate new word set is filtered by multivariate statistical indexes, which take into account the richness of the entries and of the words before and after them as well as the degree of cohesion among their sub-words, improving the accuracy of new word discovery. By training the double-tower model with the BERT + CNN + DSSM method to recognize abbreviations, errors caused by unsupervised learning and errors caused by matching directly on word vectors are effectively avoided; by introducing the BM25 algorithm, the bias error caused by the point estimation of CNN-DSSM is neutralized and the ordering is adjusted, so that the most accurate abbreviation pairs are obtained.
The implementation of the above specific application is only an example, and the rest of the embodiments are not described in detail.
Based on the same inventive concept, embodiments of the present invention further provide a synonym mining device in the field of government affairs services, and because the principle of the problem solved by the synonym mining device in the field of government affairs services is similar to that of the synonym mining method in the field of government affairs services, the implementation of the synonym mining device in the field of government affairs services can refer to the implementation of the synonym mining method in the field of government affairs services, and the repeated parts are not repeated, and the specific structure is shown in fig. 9:
a new word candidate set generating module 901, configured to generate a new word candidate set according to a corpus in the government affairs service field;
a new word generation module 902, configured to determine a multivariate statistical index value of each new word candidate word in the new word candidate word set, and screen out new word candidate words with multivariate statistical index values lower than a preset threshold value to obtain a new word set in the government affairs service field;
an abbreviation synonym set construction module 903, configured to record a new set of words in the government affairs service field into a word bank in the government affairs service field, and construct an abbreviation synonym set according to the word bank in the government affairs service field; the abbreviation synonym set comprises an original entry and a plurality of abbreviation synonym candidate entries corresponding to the original entry;
a similarity determining module 904, configured to input the abbreviation synonym set into a pre-constructed two-tower model, so as to obtain similarities between an original entry and multiple abbreviation synonym candidate entries; the double-tower model is obtained by training according to a word bank in the field of government affairs services and is used for determining the similarity between the entry in the abbreviation synonym set and the candidate entry of the abbreviation synonym;
and the abbreviation synonym determining module 905 is used for obtaining the abbreviation synonyms corresponding to the original entries according to the similarity between the original entries and the multiple candidate entries of the abbreviation synonyms, and inputting the abbreviation synonyms into a synonym thesaurus in the field of government affairs services.
In a specific embodiment, the structure of the new word candidate set generating module 901, as shown in fig. 10, includes:
a candidate new word generation unit 1001 configured to splice any two adjacent words in a corpus in the government service field by using a BiGram method to obtain a plurality of candidate new words;
the non-word forming filtering unit 1002 is configured to filter candidate new words including non-word forming elements in the multiple candidate new words, and generate a new word candidate set.
In an embodiment, the multivariate statistical indexes comprise: word frequency, degree of adjacency change, information entropy, and pointwise mutual information. The boundary of the preset threshold is determined according to the relative offset between the distributions of the multivariate statistical indexes over the government affairs service field dictionary and over the new word candidate words.
In specific implementation, the abbreviation synonym set building module 903 is specifically configured to:
selecting a word from a word bank in the field of government affairs service as an original entry;
randomly deleting partial sub-words in the original entry to form a plurality of possible entries of the abbreviation synonyms;
calculating the editing distance between the original entry and each possible synonym entry, and taking the possible abbreviation synonym entries with the editing distance meeting a preset value as candidate abbreviation synonym entries corresponding to the original entry;
and generating an abbreviation synonym set according to the original entry and the abbreviation synonym candidate entry corresponding to the original entry.
In an embodiment of the present invention, a synonym mining device in the field of government affairs services is further provided, where on the basis of fig. 9, the synonym mining device further includes: a double tower model pre-construction module for:
constructing an abbreviation synonym sample set according to a word bank in the field of government affairs service; sample data in the abbreviation synonym sample set comprises original entries, abbreviation entries and synonym marks; the synonym mark is used for representing whether the original entry and the abbreviation entry are synonyms or not;
respectively inputting the original entries and the abbreviation entries in the abbreviation synonym sample set into a BERT model to obtain embedding word vectors corresponding to the original entries and embedding word vectors corresponding to the abbreviation entries;
taking the embedding word vector corresponding to the original entry and the embedding word vector corresponding to the abbreviation entry as word vector expressions of the CNN-DSSM, and constructing a double-tower model;
and training the constructed double-tower model by using the abbreviation synonym sample set to obtain the trained double-tower model.
Specifically, a sample set of abbreviation synonyms, comprising:
a positive sample subset and a negative sample subset; the method comprises the steps that the abbreviation entries of a positive sample subset are obtained by randomly increasing and deleting the original entries, and the editing distance between the abbreviation entries of the positive sample subset and the original entries accords with a preset value; and synonyms for the positive sample subset are marked as synonyms; synonyms for the negative examples subset are marked as non-synonymous.
In order to avoid bias errors caused by point estimation of CNN-DSSM, a synonym mining device in the field of government affairs services is further provided in the specific embodiment of the present invention, as shown in fig. 11, on the basis of fig. 10, the synonym mining device further includes:
a sorting adjustment module 1101 configured to:
sequencing a plurality of abbreviation synonym candidate entries in the abbreviation synonym set by utilizing a BM25 algorithm to obtain a first sequencing result of the abbreviation synonym candidate entries;
obtaining a second ordering result of the multiple abbreviation synonym candidate entries according to the similarity between the original entry and the multiple abbreviation synonym candidate entries;
combining the first ordering results of the multiple abbreviation synonym candidate entries and the second ordering results of the multiple abbreviation synonym candidate entries to determine the ordering results of the multiple abbreviation synonym candidate entries;
and determining the first candidate entry of the abbreviation synonyms in the ordering result as the abbreviation synonyms corresponding to the original entry.
Fig. 12 is a schematic block diagram of a system configuration of an electronic device 1200 according to an embodiment of the present application. As shown in fig. 12, the electronic device 1200 may include a central processing unit 1201 and a memory 1202; the memory 1202 is coupled to the central processing unit 1201. It is noted that fig. 12 is exemplary; other types of structures may also be used in addition to or in place of this structure to implement telecommunications or other functions.
In one embodiment, synonym mining functions for the domain of government services may be integrated into the central processor 1201. The central processing unit 1201 may be configured to control as follows:
generating a new word candidate word set according to a language database in the government affair service field;
determining a multivariate statistical index value of each new word candidate word in the new word candidate word set, and screening out the new word candidate words with the multivariate statistical index value lower than a preset threshold value to obtain the new word set in the government affair service field;
recording a new word set in the government affair service field into a word bank in the government affair service field, and constructing an abbreviation synonym set according to the word bank in the government affair service field; the abbreviation synonym set comprises an original entry and a plurality of abbreviation synonym candidate entries corresponding to the original entry;
inputting the abbreviation synonym set into a pre-constructed double-tower model to obtain the similarity between an original entry and a plurality of abbreviation synonym candidate entries; the double-tower model is obtained by training according to a word bank in the field of government affairs services and is used for determining the similarity between the entry in the abbreviation synonym set and the candidate entry of the abbreviation synonym;
and obtaining the abbreviation synonyms corresponding to the original entry according to the similarity between the original entry and the multiple candidate abbreviation synonym entries, and recording the abbreviation synonyms into a synonym word library in the field of government affairs services.
As can be seen from the above description, the electronic device provided in the embodiment of the present application generates a new word candidate set according to the corpus of the government affairs service field; determining a multivariate statistical index value of each new word candidate word in the new word candidate word set, and screening out the new word candidate words with the multivariate statistical index value lower than a preset threshold value to obtain the new word set in the government affair service field; recording a new word set in the government affair service field into a word bank in the government affair service field, and constructing an abbreviation synonym set according to the word bank in the government affair service field; the abbreviation synonym set comprises an original entry and a plurality of abbreviation synonym candidate entries corresponding to the original entry; inputting the abbreviation synonym set into a pre-constructed double-tower model to obtain the similarity between an original entry and a plurality of abbreviation synonym candidate entries; the double-tower model is obtained by training according to a word bank in the field of government affairs services and is used for determining the similarity between the entry in the abbreviation synonym set and the candidate entry of the abbreviation synonym; and obtaining the abbreviation synonyms corresponding to the original entry according to the similarity between the original entry and the multiple candidate abbreviation synonym entries, and recording the abbreviation synonyms into a synonym word library in the field of government affairs services. 
Filtering the new word candidate word set by using the multivariate statistical indexes, and screening out the new word candidate words with multivariate statistical index values lower than a preset threshold value, wherein the accuracy of new word discovery is improved by the multivariate statistical indexes; by constructing the double-tower model, the abbreviation synonyms in the field of government affairs service can be effectively identified, and compared with the prior art of manually filling the synonym thesaurus, the efficiency of synonym mining is improved.
In another embodiment, the synonym mining device in the government affairs service field may be configured separately from the central processing unit 1201. For example, the device may be configured as a chip connected to the central processing unit 1201, and the synonym mining function in the government affairs service field is realized under the control of the central processing unit 1201.
As shown in fig. 12, the electronic device 1200 may further include: a communication module 1203, an input unit 1204, an audio processor 1205, a display 1206, a power supply 1207. It is noted that the electronic device 1200 also does not necessarily include all of the components shown in FIG. 12; furthermore, the electronic device 1200 may also comprise components not shown in fig. 12, which may be referred to in the prior art.
As shown in fig. 12, a central processing unit 1201, also sometimes referred to as a controller or operational control, may include a microprocessor or other processor device and/or logic device, the central processing unit 1201 receiving input and controlling operation of the various components of the electronic device 1200.
The memory 1202 may be, for example, one or more of a buffer, a flash memory, a hard drive, a removable medium, a volatile memory, a non-volatile memory, or another suitable device. The memory 1202 may store the information relating to the failure, and may also store a program for processing that information. The central processing unit 1201 can execute the program stored in the memory 1202 to implement information storage, processing, and the like.
The input unit 1204 provides input to the central processor 1201. The input unit 1204 is, for example, a key or a touch input device. The power source 1207 is used to supply power to the electronic apparatus 1200. The display 1206 is used to display an object to be displayed, such as an image and characters. The display may be, for example, an LCD display, but is not limited thereto.
The memory 1202 may be a solid-state memory such as a read-only memory (ROM), a random access memory (RAM), a SIM card, or the like. It may also be a memory that retains information even when powered off, can be selectively erased, and can be written with more data; an example of such a memory is sometimes called an EPROM. The memory 1202 may also be some other type of device. The memory 1202 includes a buffer memory 1221 (sometimes referred to as a buffer). The memory 1202 may include an application/function storage 1222 for storing application programs and function programs, or a flow for the central processing unit 1201 to execute the operations of the electronic device 1200.
The memory 1202 may also include a data store 1223, the data store 1223 for storing data, such as contacts, digital data, pictures, sounds, and/or any other data used by the electronic device. Driver storage 1224 of memory 1202 may include various drivers for the electronic device for communication functions and/or for performing other functions of the electronic device (e.g., messaging applications, address book applications, etc.).
The communication module 1203 is a transmitter/receiver 1203 that transmits and receives signals via an antenna 1208. A communication module (transmitter/receiver) 1203 is coupled to the central processor 1201 to provide an input signal and receive an output signal, which may be the same as the case of a conventional mobile communication terminal.
Based on different communication technologies, a plurality of communication modules 1203, such as a cellular network module, a bluetooth module, and/or a wireless local area network module, may be disposed in the same electronic device. The communication module (transmitter/receiver) 1203 is also coupled to the speaker 1209 and the microphone 1210 via the audio processor 1205 to provide audio output via the speaker 1209 and receive audio input from the microphone 1210 for typical telecommunications functions. The audio processor 1205 may include any suitable buffers, decoders, amplifiers and so forth. In addition, an audio processor 1205 is also coupled to the central processor 1201 to enable recording of sounds locally through the microphone 1210 and to enable playing of locally stored sounds through the speaker 1209.
An embodiment of the present invention further provides a computer-readable storage medium capable of implementing all the steps of the synonym mining method for the government affairs service field in the above embodiment. The computer-readable storage medium stores a computer program which, when executed by a processor, implements all the steps of that method; for example, the processor implements the following steps when executing the computer program:
generating a new word candidate word set according to a language database in the government affair service field;
determining a multivariate statistical index value of each new word candidate word in the new word candidate word set, and screening out the new word candidate words with the multivariate statistical index value lower than a preset threshold value to obtain the new word set in the government affair service field;
recording a new word set in the government affair service field into a word bank in the government affair service field, and constructing an abbreviation synonym set according to the word bank in the government affair service field; the abbreviation synonym set comprises an original entry and a plurality of abbreviation synonym candidate entries corresponding to the original entry;
inputting the abbreviation synonym set into a pre-constructed double-tower model to obtain the similarity between an original entry and a plurality of abbreviation synonym candidate entries; the double-tower model is obtained by training according to a word bank in the field of government affairs services and is used for determining the similarity between the entry in the abbreviation synonym set and the candidate entry of the abbreviation synonym;
and obtaining the abbreviation synonyms corresponding to the original entry according to the similarity between the original entry and the multiple candidate abbreviation synonym entries, and recording the abbreviation synonyms into a synonym word library in the field of government affairs services.
As can be seen from the above description, the computer-readable storage medium provided in the embodiment of the present invention generates a new word candidate set according to the corpus of the government affairs service field; determining a multivariate statistical index value of each new word candidate word in the new word candidate word set, and screening out the new word candidate words with the multivariate statistical index value lower than a preset threshold value to obtain the new word set in the government affair service field; recording a new word set in the government affair service field into a word bank in the government affair service field, and constructing an abbreviation synonym set according to the word bank in the government affair service field; the abbreviation synonym set comprises an original entry and a plurality of abbreviation synonym candidate entries corresponding to the original entry; inputting the abbreviation synonym set into a pre-constructed double-tower model to obtain the similarity between an original entry and a plurality of abbreviation synonym candidate entries; the double-tower model is obtained by training according to a word bank in the field of government affairs services and is used for determining the similarity between the entry in the abbreviation synonym set and the candidate entry of the abbreviation synonym; and obtaining the abbreviation synonyms corresponding to the original entry according to the similarity between the original entry and the multiple candidate abbreviation synonym entries, and recording the abbreviation synonyms into a synonym word library in the field of government affairs services. 
Filtering the new word candidate word set by using the multivariate statistical indexes, and screening out the new word candidate words with multivariate statistical index values lower than a preset threshold value, wherein the accuracy of new word discovery is improved by the multivariate statistical indexes; by constructing the double-tower model, the abbreviation synonyms in the field of government affairs service can be effectively identified, and compared with the prior art of manually filling the synonym thesaurus, the efficiency of synonym mining is improved.
In summary, the method and the device for mining synonyms in the field of government affairs services provided by the embodiment of the invention have the following advantages:
generating a new word candidate word set according to a language database of the government affair service field; determining a multivariate statistical index value of each new word candidate word in the new word candidate word set, and screening out the new word candidate words with the multivariate statistical index value lower than a preset threshold value to obtain the new word set in the government affair service field; recording a new word set in the government affair service field into a word bank in the government affair service field, and constructing an abbreviation synonym set according to the word bank in the government affair service field; the abbreviation synonym set comprises an original entry and a plurality of abbreviation synonym candidate entries corresponding to the original entry; inputting the abbreviation synonym set into a pre-constructed double-tower model to obtain the similarity between an original entry and a plurality of abbreviation synonym candidate entries; the double-tower model is obtained by training according to a word bank in the field of government affairs services and is used for determining the similarity between the entry in the abbreviation synonym set and the candidate entry of the abbreviation synonym; and obtaining the abbreviation synonyms corresponding to the original entry according to the similarity between the original entry and the multiple candidate abbreviation synonym entries, and recording the abbreviation synonyms into a synonym word library in the field of government affairs services. 
Filtering the new word candidate word set by using the multivariate statistical indexes, and screening out the new word candidate words with multivariate statistical index values lower than a preset threshold value, wherein the accuracy of new word discovery is improved by the multivariate statistical indexes; by constructing the double-tower model, the abbreviation synonyms in the field of government affairs service can be effectively identified, and compared with the prior art of manually filling the synonym thesaurus, the efficiency of synonym mining is improved.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (18)

1. A synonym mining method in the field of government affairs services is characterized by comprising the following steps:
generating a new word candidate word set according to a corpus in the government affair service field;
determining a multivariate statistical index value of each new word candidate word in the new word candidate word set, and screening out the new word candidate words with the multivariate statistical index value lower than a preset threshold value to obtain the new word set in the government affair service field;
recording a new word set in the government affair service field into a word bank in the government affair service field, and constructing an abbreviation synonym set according to the word bank in the government affair service field; the abbreviation synonym set comprises an original entry and a plurality of abbreviation synonym candidate entries corresponding to the original entry;
inputting the abbreviation synonym set into a pre-constructed double-tower model to obtain the similarity between the original entry and the multiple abbreviation synonym candidate entries; the double-tower model is obtained by training according to the word bank in the government affairs service field and is used for determining the similarity between the entry in the abbreviation synonym set and the candidate entry of the abbreviation synonym;
and obtaining the abbreviation synonyms corresponding to the original entry according to the similarity between the original entry and the plurality of abbreviation synonym candidate entries, and recording the abbreviation synonyms into a synonym word bank in the field of government affairs services.
2. The synonym mining method in the government affairs service field according to claim 1, wherein generating a new word candidate word set according to the corpus in the government affairs service field comprises:
splicing any two adjacent words in the corpus in the government affair service field by using a BiGram method to serve as candidate new words, obtaining a plurality of candidate new words;
and filtering out the candidate new words containing non-word-forming elements from the plurality of candidate new words to generate the new word candidate word set.
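The two steps of claim 2 can be illustrated with a minimal sketch, assuming the corpus has already been segmented into tokens. The function name `new_word_candidates`, the regular expression for non-word-forming elements and the small function-word list are all hypothetical; the source does not say which elements count as non-word-forming.

```python
import re

# Assumption: punctuation, digits and underscores count as non-word-forming
# elements, as do a few function words. The source does not list them.
NON_WORD_FORMING = re.compile(r"[\W\d_]")
FUNCTION_WORDS = {"的", "了", "和", "在"}


def new_word_candidates(tokens):
    """Splice each pair of adjacent tokens and drop unusable candidates."""
    seen = {}  # dict keeps first-seen order while removing duplicates
    for left, right in zip(tokens, tokens[1:]):
        if left in FUNCTION_WORDS or right in FUNCTION_WORDS:
            continue
        candidate = left + right
        if NON_WORD_FORMING.search(candidate):
            continue
        seen.setdefault(candidate, None)
    return list(seen)


if __name__ == "__main__":
    segmented = ["办理", "居住", "证", "的", "流程", "，", "2021"]
    print(new_word_candidates(segmented))
```

Splicing only adjacent tokens keeps the candidate set linear in the corpus size; longer candidates would need a further pass over the spliced results.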
3. The synonym mining method in the government affairs service field according to claim 1, wherein the multivariate statistical index comprises:
word frequency, degree of adjacency change, entropy, and pointwise mutual information.
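As a rough illustration of how such indexes might be computed for bigram candidates, the sketch below makes several interpretive assumptions: "degree of adjacency change" is read as the number of distinct neighbouring tokens, "entropy" as the entropy of the neighbour distribution, and screening is read as dropping every candidate that falls below any threshold. The names `candidate_statistics` and `screen` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def _entropy(counter):
    """Shannon entropy (natural log) of a neighbour-count distribution."""
    total = sum(counter.values())
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def candidate_statistics(tokens):
    """Return {(a, b): {"freq", "variety", "entropy", "pmi"}} per adjacent pair."""
    unigram = Counter(tokens)
    pairs = list(zip(tokens, tokens[1:]))
    bigram = Counter(pairs)
    left, right = defaultdict(Counter), defaultdict(Counter)
    for i, pair in enumerate(pairs):
        if i > 0:
            left[pair][tokens[i - 1]] += 1   # token just before the pair
        if i + 2 < len(tokens):
            right[pair][tokens[i + 2]] += 1  # token just after the pair
    stats = {}
    for (a, b), freq in bigram.items():
        p_pair = freq / len(pairs)
        p_a, p_b = unigram[a] / len(tokens), unigram[b] / len(tokens)
        stats[(a, b)] = {
            "freq": freq,
            # Assumption: take the weaker side, so both boundaries must be free.
            "variety": min(len(left[(a, b)]), len(right[(a, b)])),
            "entropy": min(_entropy(left[(a, b)]), _entropy(right[(a, b)])),
            "pmi": math.log(p_pair / (p_a * p_b)),
        }
    return stats


def screen(stats, thresholds):
    """Keep candidates whose every listed index reaches its threshold."""
    return {
        cand: s for cand, s in stats.items()
        if all(s[name] >= limit for name, limit in thresholds.items())
    }
```

For example, `screen(candidate_statistics(tokens), {"freq": 2, "pmi": 1.0})` would keep only pairs seen at least twice whose parts co-occur more often than chance; the threshold values here are placeholders, not values from the source.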
4. The synonym mining method in the government affairs service field according to claim 1, wherein the boundary of the preset threshold is determined according to the relative shift between the distribution of the multivariate statistical index over the dictionary of the government affairs service field and its distribution over the new word candidate words.
5. The synonym mining method in the government affairs service field according to claim 1, wherein constructing an abbreviation synonym set according to the word bank in the government affairs service field comprises:
selecting a word from the word bank in the government affairs service field as an original entry;
randomly deleting some of the sub-words in the original entry to form a plurality of possible abbreviation synonym entries;
calculating the edit distance between the original entry and each possible abbreviation synonym entry, and taking the possible abbreviation synonym entries whose edit distance meets a preset value as abbreviation synonym candidate entries corresponding to the original entry;
and generating the abbreviation synonym set according to the original entry and the abbreviation synonym candidate entries corresponding to the original entry.
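The candidate construction in claim 5 can be sketched as follows, treating each character as a sub-word and reading "meets a preset value" as "does not exceed `max_distance`". Both readings are assumptions, and the names `edit_distance`, `abbreviation_candidates`, `trials` and `seed` are hypothetical.

```python
import random


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def abbreviation_candidates(entry, max_distance, trials=200, seed=0):
    """Randomly delete characters and keep results within max_distance."""
    if len(entry) < 2:
        return []
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    candidates = set()
    for _ in range(trials):
        n_delete = rng.randint(1, len(entry) - 1)
        keep = sorted(rng.sample(range(len(entry)), len(entry) - n_delete))
        candidate = "".join(entry[i] for i in keep)
        if edit_distance(entry, candidate) <= max_distance:
            candidates.add(candidate)
    return sorted(candidates)


if __name__ == "__main__":
    print(abbreviation_candidates("中国建设银行", max_distance=4))
```

Because every candidate is produced by deletions only, its edit distance to the original equals the number of deleted characters, so the distance check here acts mainly as a cap on how aggressively an entry may be shortened.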
6. The synonym mining method in the government affairs service field according to claim 1, wherein the pre-construction process of the double-tower model comprises:
constructing an abbreviation synonym sample set according to the word bank in the government affairs service field; sample data in the abbreviation synonym sample set comprises original entries, abbreviation entries and synonym marks; the synonym mark is used for representing whether the original entry and the abbreviation entry are synonyms or not;
respectively inputting the original entries and the abbreviation entries in the abbreviation synonym sample set into a BERT model to obtain embedding word vectors corresponding to the original entries and embedding word vectors corresponding to the abbreviation entries;
taking the embedding word vectors corresponding to the original entries and the embedding word vectors corresponding to the abbreviation entries as the word vector representations of the CNN-DSSM, and constructing the double-tower model;
and training the constructed double-tower model by using the abbreviation synonym sample set to obtain the trained double-tower model.
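The forward pass of such a double-tower model can be sketched in pure Python as below. This is only a structural illustration under strong assumptions: `toy_embedding` is a deterministic stand-in for BERT vectors, the convolution weights are random and untrained, training on the sample set is omitted, and the sizes `DIM`, `FILTERS` and `WINDOW` are arbitrary. None of these names or values come from the source.

```python
import math
import random

DIM = 8       # toy embedding size; a real BERT vector is much larger
FILTERS = 4   # number of convolution filters per tower
WINDOW = 2    # convolution window, in characters


def toy_embedding(ch):
    """Stand-in for a BERT character vector (deterministic pseudo-random)."""
    rng = random.Random(ord(ch))
    return [rng.uniform(-1, 1) for _ in range(DIM)]


def make_tower(seed):
    """Random, untrained convolution weights for one tower."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(WINDOW * DIM)]
            for _ in range(FILTERS)]


def tower(text, weights):
    """Convolution over character windows, tanh, then max pooling."""
    vectors = [toy_embedding(ch) for ch in text]
    while len(vectors) < WINDOW:  # pad very short inputs with zero vectors
        vectors.append([0.0] * DIM)
    pooled = [-math.inf] * FILTERS
    for start in range(len(vectors) - WINDOW + 1):
        window = [x for vec in vectors[start:start + WINDOW] for x in vec]
        for f, w in enumerate(weights):
            activation = math.tanh(sum(wi * xi for wi, xi in zip(w, window)))
            pooled[f] = max(pooled[f], activation)
    return pooled


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity(original, abbreviation, w_original, w_abbreviation):
    """Cosine similarity between the outputs of the two towers."""
    return cosine(tower(original, w_original),
                  tower(abbreviation, w_abbreviation))
```

Each tower encodes its input independently, so in a trained model the vectors of original entries could be computed once and reused across all of their candidates.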
7. The synonym mining method in the government affairs service field according to claim 6, wherein the abbreviation synonym sample set comprises:
a positive sample subset and a negative sample subset;
the abbreviation entries of the positive sample subset are obtained by randomly adding to and deleting from the original entries, and the edit distance between the abbreviation entries of the positive sample subset and the original entries conforms to a preset value; the synonym marks of the positive sample subset are marked as synonyms;
the synonym marks of the negative sample subset are marked as non-synonyms.
8. The synonym mining method in the government affairs service field according to claim 1, further comprising:
ranking the plurality of abbreviation synonym candidate entries in the abbreviation synonym set by using a BM25 algorithm to obtain a first ranking result of the plurality of abbreviation synonym candidate entries;
obtaining a second ranking result of the plurality of abbreviation synonym candidate entries according to the similarity between the original entry and the plurality of abbreviation synonym candidate entries;
combining the first ranking result and the second ranking result of the plurality of abbreviation synonym candidate entries to determine a final ranking result of the plurality of abbreviation synonym candidate entries;
and determining the top-ranked abbreviation synonym candidate entry in the final ranking result as the abbreviation synonym corresponding to the original entry.
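A minimal sketch of the re-ranking in claim 8 is given below. It scores each candidate with Okapi BM25 at character level, using the original entry as the query, and combines the two rankings by summing rank positions. The character-level tokenisation, the parameters `k1` and `b`, and the rank-sum rule are all assumptions, since the source does not state how the two ranking results are combined.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each candidate (doc) against the original entry."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(ch for d in docs for ch in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for ch in set(query):
            if tf[ch] == 0:
                continue
            idf = math.log((n_docs - doc_freq[ch] + 0.5)
                           / (doc_freq[ch] + 0.5) + 1)
            score += idf * tf[ch] * (k1 + 1) / (
                tf[ch] + k1 * (1 - b + b * len(d) / avg_len))
        scores.append(score)
    return scores


def ranks(scores):
    """Rank position of each item (0 is best, higher score first)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    position = [0] * len(scores)
    for pos, i in enumerate(order):
        position[i] = pos
    return position


def fuse(original, candidates, similarities):
    """Order candidates by the sum of BM25 rank and similarity rank."""
    first = ranks(bm25_scores(original, candidates))
    second = ranks(similarities)
    order = sorted(range(len(candidates)),
                   key=lambda i: (first[i] + second[i], second[i]))
    return [candidates[i] for i in order]


if __name__ == "__main__":
    ranked = fuse("中国建设银行", ["建行", "建设银行", "工行"], [0.8, 0.9, 0.1])
    print(ranked[0])  # top-ranked candidate is taken as the abbreviation synonym
```

Summing rank positions avoids having to put BM25 scores and model similarities on a common scale; ties are broken here in favour of the model similarity.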
9. A synonym mining device in the field of government affairs services, characterized by comprising:
the new word candidate set generating module is used for generating a new word candidate set according to the language database in the government affair service field;
the new word generation module is used for determining a multivariate statistical index value of each new word candidate word in the new word candidate word set, and screening out the new word candidate words with the multivariate statistical index values lower than a preset threshold value to obtain a new word set in the government affair service field;
the abbreviation synonym set construction module is used for recording the new word set in the government affair service field into a word bank in the government affair service field, and constructing an abbreviation synonym set according to the word bank in the government affair service field; the abbreviation synonym set comprises an original entry and a plurality of abbreviation synonym candidate entries corresponding to the original entry;
the similarity determining module is used for inputting the abbreviation synonym set into a pre-constructed double-tower model to obtain the similarity between the original entry and the multiple abbreviation synonym candidate entries; the double-tower model is obtained by training according to the word bank in the government affairs service field and is used for determining the similarity between the entry in the abbreviation synonym set and the candidate entry of the abbreviation synonym;
and the abbreviation synonym determining module is used for obtaining the abbreviation synonyms corresponding to the original entries according to the similarity between the original entries and the multiple abbreviation synonym candidate entries and recording the abbreviation synonyms into a synonym thesaurus in the field of government affairs services.
10. The synonym mining device for the government service field according to claim 9, wherein the new word candidate set generating module includes:
the candidate new word generating unit is used for splicing any two adjacent words in the corpus of the government affair service field by using a BiGram method to serve as candidate new words to obtain a plurality of candidate new words;
and the non-word forming filtering unit is used for filtering the candidate new words containing the non-word forming elements in the plurality of candidate new words to generate a new word candidate word set.
11. The synonym mining device for the government service field according to claim 9, wherein the multivariate statistical index comprises:
word frequency, degree of adjacency change, entropy, and pointwise mutual information.
12. The synonym mining device for the government service field according to claim 9, wherein the boundary of the preset threshold is determined according to the relative shift between the distribution of the multivariate statistical index over the dictionary of the government service field and its distribution over the new word candidate words.
13. The synonym mining device for the government service field according to claim 9, wherein the abbreviation synonym set construction module is specifically configured to:
selecting a word from a word bank in the field of government affairs service as an original entry;
randomly deleting some of the sub-words in the original entry to form a plurality of possible abbreviation synonym entries;
calculating the edit distance between the original entry and each possible abbreviation synonym entry, and taking the possible abbreviation synonym entries whose edit distance meets a preset value as abbreviation synonym candidate entries corresponding to the original entry;
and generating an abbreviation synonym set according to the original entry and the abbreviation synonym candidate entry corresponding to the original entry.
14. The synonym mining device for the government service field according to claim 9, further comprising: a double tower model pre-construction module for:
constructing an abbreviation synonym sample set according to the word bank in the government affairs service field; sample data in the abbreviation synonym sample set comprises original entries, abbreviation entries and synonym marks; the synonym mark is used for representing whether the original entry and the abbreviation entry are synonyms or not;
respectively inputting the original entries and the abbreviation entries in the abbreviation synonym sample set into a BERT model to obtain embedding word vectors corresponding to the original entries and embedding word vectors corresponding to the abbreviation entries;
taking the embedding word vectors corresponding to the original entries and the embedding word vectors corresponding to the abbreviation entries as the word vector representations of the CNN-DSSM, and constructing the double-tower model;
and training the constructed double-tower model by using the abbreviation synonym sample set to obtain the trained double-tower model.
15. The synonym mining device for the government service field according to claim 14, wherein the abbreviation synonym sample set comprises:
a positive sample subset and a negative sample subset;
the abbreviation entries of the positive sample subset are obtained by randomly adding to and deleting from the original entries, and the edit distance between the abbreviation entries of the positive sample subset and the original entries conforms to a preset value; the synonym marks of the positive sample subset are marked as synonyms;
the synonym marks of the negative sample subset are marked as non-synonyms.
16. The synonym mining device for the government service field according to claim 9, further comprising: a rank adjustment module for:
ranking the plurality of abbreviation synonym candidate entries in the abbreviation synonym set by using a BM25 algorithm to obtain a first ranking result of the plurality of abbreviation synonym candidate entries;
obtaining a second ranking result of the plurality of abbreviation synonym candidate entries according to the similarity between the original entry and the plurality of abbreviation synonym candidate entries;
combining the first ranking result and the second ranking result of the plurality of abbreviation synonym candidate entries to determine a final ranking result of the plurality of abbreviation synonym candidate entries;
and determining the top-ranked abbreviation synonym candidate entry in the final ranking result as the abbreviation synonym corresponding to the original entry.
17. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 8 when executing the computer program.
18. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the method of any one of claims 1 to 8.
CN202110350274.3A 2021-03-31 2021-03-31 Synonym mining method and device in government affair service field Pending CN113076740A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110350274.3A CN113076740A (en) 2021-03-31 2021-03-31 Synonym mining method and device in government affair service field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110350274.3A CN113076740A (en) 2021-03-31 2021-03-31 Synonym mining method and device in government affair service field

Publications (1)

Publication Number Publication Date
CN113076740A true CN113076740A (en) 2021-07-06

Family

ID=76614185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110350274.3A Pending CN113076740A (en) 2021-03-31 2021-03-31 Synonym mining method and device in government affair service field

Country Status (1)

Country Link
CN (1) CN113076740A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114861638A (en) * 2022-06-10 2022-08-05 安徽工程大学 Chinese synonym expansion method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination