CN116562268A - Method and device for generating synonymous sentence library, electronic equipment and storage medium


Info

Publication number: CN116562268A
Application number: CN202310371130.5A
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN116562268B (granted publication)
Inventor: name withheld at the inventor's request
Current and original assignee: Moore Threads Technology Co Ltd
Prior art keywords: corpus, target format, sentence, sentences, synonym
Filing: application filed by Moore Threads Technology Co Ltd, with priority to CN202310371130.5A
Legal status: Active (granted)


Classifications

    • G06F 40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/247: Thesauruses; Synonyms
    • G06F 40/30: Semantic analysis
    • G06F 16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 16/335: Filtering based on additional data, e.g. user or group profiles
    • G06F 16/35: Clustering; Classification
    • G06F 2216/03: Data mining (indexing scheme relating to additional aspects of information retrieval)

    (All classifications fall under G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING.)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure relates to a method and a device for generating a synonymous sentence library, an electronic device and a storage medium, and relates to the field of computer technology. The method includes: obtaining a corpus database comprising a first corpus, a second corpus and a third corpus; generating synonymous sentences according to at least one of the first corpus, the second corpus and the third corpus; and generating a synonymous sentence library according to the synonymous sentences. The first corpus, the second corpus and the third corpus are corpora with different contents. According to the embodiments of the disclosure, synonymous sentences are determined from different corpora and from the relations between the corpora and synonymous sentences, and a synonymous sentence library is then generated from them, so that the library is constructed simply, accurately and quickly.

Description

Method and device for generating synonymous sentence library, electronic equipment and storage medium
Technical Field
The disclosure relates to the field of computer technology, and in particular, to a method and device for generating a synonymous sentence library, an electronic device and a storage medium.
Background
The field of natural language processing (Natural Language Processing, NLP) is an interdisciplinary field of computer science, artificial intelligence and information engineering, involving knowledge of statistics, linguistics, etc. Its goal is to let a computer "understand" natural language in order to perform tasks such as text detection, text recognition, text classification, language translation and question answering.
Synonymous sentence mining is widely used in the field of natural language processing, for example in information retrieval, entity information recognition and knowledge question answering. In knowledge question answering, for instance, after receiving a question input by a user, a computer generally obtains synonymous sentences of the question, and then searches a database for a matching answer according to the question and its synonymous sentences.
In the related art, the methods for mining synonymous sentences to obtain a synonymous sentence library include: using open-source data (e.g., public data from semantic similarity competitions), manually composing synonymous sentences (or manually labeling them), text generation, grammar-based generation, web crawlers, etc. However, the amount of open-source data is small and cannot meet task requirements (such as model training); manual composition (or manual labeling) is costly and inefficient; text generation methods perform poorly and their results are uncontrollable; grammar-based generation requires manually written templates; and for web crawlers, every large website has an anti-crawler mechanism, making crawling difficult. Therefore, in the related art, the mining of synonymous sentences is complicated and inefficient, wastes a great deal of manpower, and may yield an insufficient amount of mined data.
Disclosure of Invention
The present disclosure proposes a technical solution for generating a synonym library.
According to an aspect of the present disclosure, there is provided a method for generating a synonymous sentence library, including: obtaining a corpus database, wherein the corpus database includes a first corpus, a second corpus and a third corpus, the first corpus, the second corpus and the third corpus being corpora with different contents; generating synonymous sentences according to at least one of the first corpus, the second corpus and the third corpus; and generating a synonymous sentence library according to the synonymous sentences.
In one possible implementation manner, the first corpus is composed of synonym pairs, the second corpus is composed of dictionaries, and the third corpus is composed of translated documents, and the synonymous sentences generated according to at least one of the first corpus, the second corpus and the third corpus comprise at least one of the following: synonymous sentences generated according to the paraphrases of at least one synonym pair of the first corpus in any single version of dictionary in the second corpus; synonymous sentences generated according to the paraphrases of the same word in different versions of dictionaries in the second corpus; and synonymous sentences generated according to different versions of translated text of the same paragraph in the third corpus.
In one possible implementation manner, the generating synonymous sentences according to at least one of the first corpus, the second corpus and the third corpus includes: performing first data cleansing on the first corpus, the second corpus and the third corpus respectively, to obtain a first corpus in a first target format, a second corpus in a second target format and a third corpus in a third target format; generating paraphrase pairs and/or translation sentence pairs according to at least one of the first corpus in the first target format, the second corpus in the second target format and the third corpus in the third target format; and performing second data cleansing on the paraphrase pairs and/or the translation sentence pairs to obtain synonymous sentences after the second data cleansing.
In one possible implementation manner, performing the first data cleansing on the second corpus to obtain a second corpus in a second target format, including: removing words and sentences containing preset identifiers from each vocabulary entry in each dictionary in the second corpus; performing format setting on the second corpus from which the words and sentences containing the preset identification are removed, and obtaining a second corpus in an initial format; acquiring the matching degree of the words in the second corpus in the initial format and each sentence in the paraphrasing corresponding to the words according to a preset matching model; and removing other sentences except the sentences with the maximum matching degree from the paraphrases corresponding to the words to obtain a second corpus in a second target format, wherein the second target format is that each word has a corresponding paraphrase.
In one possible implementation manner, performing first data cleansing on the first corpus to obtain a first corpus in a first target format includes: removing words and sentences containing preset identifiers in the first corpus; and performing format setting on the first corpus from which the words and sentences containing the preset identifiers are removed, to obtain a first corpus in the first target format, wherein in the first target format each word corresponds to one word.
In one possible implementation manner, performing the first data cleansing on the third corpus to obtain a third corpus in a third target format includes: removing words and sentences containing preset identifiers in the third corpus; and performing format setting on the third corpus from which the words and sentences containing the preset identifiers are removed, to obtain a third corpus in the third target format, wherein in the third target format each paragraph corresponds to one paragraph.
In one possible implementation manner, paraphrasing pairs and/or translation sentence pairs are generated according to at least one of the first corpus in a first target format, the second corpus in a second target format and the third corpus in a third target format, and the paraphrasing pairs and/or translation sentence pairs are generated in a manner including at least one of the following: determining a paraphrase pair corresponding to each synonym pair in the first corpus in the first target format according to the second corpus in the second target format; according to the paraphrases in the dictionaries of different versions in the second corpus of the second target format, determining paraphrase pairs formed by different paraphrases of the dictionaries of different versions corresponding to the same word; and determining translation sentence pairs formed by translations of different versions corresponding to the same sentence according to the translation texts of different versions in the third corpus of the third target format.
In one possible implementation manner, determining, according to the translated text of different versions in the third corpus in the third target format, a pair of translated sentences formed by translations of different versions corresponding to the same sentence includes: obtaining translation texts of different versions of the same paragraph in a third corpus of the third target format, wherein the translation texts of different versions comprise a first translation text and a second translation text; and forming one or more pairs of translation sentence pairs by the first translation text and the second translation text according to the sentence sequence.
In one possible implementation, performing a second data cleansing on the paraphrase pair to obtain a synonym after the second data cleansing, including: splitting the paraphrasing into a plurality of sentences in the case that any one of the paraphrasing pairs is a plurality of sentences having a plurality of meanings; according to a preset matching model, obtaining the matching degree of each sentence of one paraphrase in the paraphrasing pair and each sentence of the other paraphrasing pair; and removing other sentences except the sentence with the largest matching degree from the paraphrasing pair to obtain the synonym after the second data cleaning.
In one possible implementation manner, performing second data cleansing on the translation sentence pair to obtain a synonym after the second data cleansing, including: obtaining the matching degree of the translation sentence pairs according to a preset matching model; and clearing the translation sentence pairs with the matching degree smaller than a preset threshold value to obtain synonyms after the second data are cleaned.
In one possible implementation, the preset matching model is a trained matching model, and the method further includes: inputting training data in a training data set into a matching model in batches to obtain feature vectors corresponding to each corpus in the training data of each batch, wherein the training data of each batch comprises N pairs of synonymous corpora, each pair of synonymous corpora comprises 2 corpora which are synonymous with each other, and N is an integer larger than 1; respectively traversing each feature vector in N pairs of feature vectors, taking the feature vector of the synonymous corpus corresponding to the current feature vector as a positive sample, and taking 2N-2 feature vectors corresponding to the rest 2N-2 corpora as negative samples; inputting the current feature vector, 1 positive sample and 2N-2 negative samples into a loss function to obtain contrast loss, wherein the loss function is used for calculating the contrast between the positive sample similarity and the negative sample similarity, the positive sample similarity is the similarity between the current feature vector and the positive sample, and the negative sample similarity is the average similarity between the current feature vector and 2N-2 negative samples; and training the matching model according to the comparison loss to obtain a trained matching model.
In one possible implementation, the method further includes: writing a script language program; executing the script language program to generate a synonym library; wherein the scripting language program is for: obtaining a corpus database, wherein the corpus database comprises: the first corpus, the second corpus and the third corpus are corpora with different contents; generating synonymous sentences according to at least one of the first corpus, the second corpus and the third corpus; and generating a synonym library according to the synonyms.
According to an aspect of the present disclosure, there is provided a generation apparatus of a synonym library, including: the acquisition module is used for acquiring a corpus database, and the corpus database comprises: the first corpus, the second corpus and the third corpus are corpora with different contents; the first generation module is used for generating synonymous sentences according to at least one of the first corpus, the second corpus and the third corpus; and the second generation module is used for generating a synonym library according to the synonym.
In one possible implementation manner, the first corpus consists of synonym pairs, the second corpus consists of a dictionary, and the first generation module is used for at least one of the following: generating synonyms according to paraphrases of at least one synonym in the first corpus in any version of dictionary in the second corpus; synonyms generated according to the paraphrases of the same word in dictionaries of different versions in the second corpus; and generating synonymous sentences according to different versions of translation texts in the third corpus of the same paragraph.
In one possible implementation manner, the first generating module is configured to: respectively cleaning the first data of the first corpus, the second corpus and the third corpus to obtain the first corpus in a first target format, the second corpus in a second target format and the third corpus in a third target format; generating paraphrasing pairs and/or translation sentence pairs according to at least one of the first corpus in a first target format, the second corpus in a second target format and the third corpus in a third target format; and performing second data cleaning on the paraphrase pair and/or the translation sentence pair to obtain a synonym after second data cleaning.
In one possible implementation manner, performing the first data cleansing on the second corpus to obtain a second corpus in a second target format, including: removing words and sentences containing preset identifiers from each vocabulary entry in each dictionary in the second corpus; performing format setting on the second corpus from which the words and sentences containing the preset identification are removed, and obtaining a second corpus in an initial format; acquiring the matching degree of the words in the second corpus in the initial format and each sentence in the paraphrasing corresponding to the words according to a preset matching model; and removing other sentences except the sentences with the maximum matching degree from the paraphrases corresponding to the words to obtain a second corpus in a second target format, wherein the second target format is that each word has a corresponding paraphrase.
In one possible implementation manner, performing first data cleansing on the first corpus to obtain a first corpus in a first target format includes: removing words and sentences containing preset identifiers in the first corpus; and performing format setting on the first corpus from which the words and sentences containing the preset identifiers are removed, to obtain a first corpus in the first target format, wherein in the first target format each word corresponds to one word.
In one possible implementation manner, performing the first data cleansing on the third corpus to obtain a third corpus in a third target format, including: removing words and sentences containing preset identifiers in the third corpus; and performing format setting on the third corpus from which the words and sentences containing the preset identification are removed to obtain a third corpus in a third target format, wherein the third target format corresponds to one paragraph for each paragraph.
In one possible implementation manner, paraphrasing pairs and/or translation sentence pairs are generated according to at least one of the first corpus in a first target format, the second corpus in a second target format and the third corpus in a third target format, and the paraphrasing pairs and/or translation sentence pairs are generated in a manner including at least one of the following: determining a paraphrase pair corresponding to each synonym pair in the first corpus in the first target format according to the second corpus in the second target format; according to the paraphrases in the dictionaries of different versions in the second corpus of the second target format, determining paraphrase pairs formed by different paraphrases of the dictionaries of different versions corresponding to the same word; and determining translation sentence pairs formed by translations of different versions corresponding to the same sentence according to the translation texts of different versions in the third corpus of the third target format.
In one possible implementation manner, determining, according to the translated text of different versions in the third corpus in the third target format, a pair of translated sentences formed by translations of different versions corresponding to the same sentence includes: obtaining translation texts of different versions of the same paragraph in a third corpus of the third target format, wherein the translation texts of different versions comprise a first translation text and a second translation text; and forming one or more pairs of translation sentence pairs by the first translation text and the second translation text according to the sentence sequence.
In one possible implementation, performing a second data cleansing on the paraphrase pair to obtain a synonym after the second data cleansing, including: splitting the paraphrasing into a plurality of sentences in the case that any one of the paraphrasing pairs is a plurality of sentences having a plurality of meanings; according to a preset matching model, obtaining the matching degree of each sentence of one paraphrase in the paraphrasing pair and each sentence of the other paraphrasing pair; and removing other sentences except the sentence with the largest matching degree from the paraphrasing pair to obtain the synonym after the second data cleaning.
In one possible implementation manner, performing second data cleansing on the translation sentence pair to obtain a synonym after the second data cleansing, including: obtaining the matching degree of the translation sentence pairs according to a preset matching model; and clearing the translation sentence pairs with the matching degree smaller than a preset threshold value to obtain synonyms after the second data are cleaned.
In one possible implementation manner, the preset matching model is a trained matching model, and the first generating module is further configured to: inputting training data in a training data set into a matching model in batches to obtain feature vectors corresponding to each corpus in the training data of each batch, wherein the training data of each batch comprises N pairs of synonymous corpora, each pair of synonymous corpora comprises 2 corpora which are synonymous with each other, and N is an integer larger than 1; respectively traversing each feature vector in N pairs of feature vectors, taking the feature vector of the synonymous corpus corresponding to the current feature vector as a positive sample, and taking 2N-2 feature vectors corresponding to the rest 2N-2 corpora as negative samples; inputting the current feature vector, 1 positive sample and 2N-2 negative samples into a loss function to obtain contrast loss, wherein the loss function is used for calculating the contrast between the positive sample similarity and the negative sample similarity, the positive sample similarity is the similarity between the current feature vector and the positive sample, and the negative sample similarity is the average similarity between the current feature vector and 2N-2 negative samples; and training the matching model according to the comparison loss to obtain a trained matching model.
In one possible implementation manner, the apparatus further includes a writing module, configured to: writing a script language program; executing the script language program to generate a synonym library; wherein the scripting language program is for: obtaining a corpus database, wherein the corpus database comprises: the first corpus, the second corpus and the third corpus are corpora with different contents; generating synonymous sentences according to at least one of the first corpus, the second corpus and the third corpus; and generating a synonym library according to the synonyms.
According to an aspect of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the above method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
According to an aspect of the present disclosure, there is provided a computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when executed in a processor of an electronic device, performs the above method.
In the embodiment of the disclosure, a corpus database including a first corpus, a second corpus and a third corpus is obtained, synonymous sentences are generated according to at least one of the first corpus, the second corpus and the third corpus, and a synonymous sentence library is generated according to the synonymous sentences. The first corpus, the second corpus and the third corpus are corpora with different contents. In this way, synonymous sentences can be determined from different corpora and from the relations between the corpora and synonymous sentences, and a synonymous sentence library can then be generated from them, so that the library is constructed simply, accurately and quickly.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.
Fig. 1 shows a flowchart of a method of generating a library of synonyms, according to an embodiment of the present disclosure.
Fig. 2 shows a schematic diagram of a first corpus according to an embodiment of the disclosure.
Fig. 3 shows a schematic diagram of a dictionary in accordance with an embodiment of the present disclosure.
Fig. 4 shows a schematic diagram of a second corpus in an initial format according to an embodiment of the disclosure.
Fig. 5 shows a schematic diagram of a second corpus in another initial format according to an embodiment of the disclosure.
Fig. 6 shows a schematic diagram of an initial cosine matrix according to an embodiment of the present disclosure.
Fig. 7 shows a schematic diagram of a cosine matrix according to an embodiment of the present disclosure.
Fig. 8 shows a schematic diagram of a tag matrix according to an embodiment of the present disclosure.
Fig. 9 shows a schematic diagram of a library of synonyms, according to an embodiment of the present disclosure.
Fig. 10 shows a block diagram of a generation apparatus of a synonym library according to an embodiment of the present disclosure.
Fig. 11 shows a block diagram of an electronic device, according to an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
Fig. 1 shows a flowchart of a method for generating a synonym library according to an embodiment of the present disclosure, as shown in fig. 1, where the method for generating a synonym library includes: in step S11, a corpus database is obtained, where the corpus database includes a first corpus, a second corpus, and a third corpus, and the first corpus, the second corpus, and the third corpus are corpora with different contents.
In step S12, synonymous sentences are generated according to at least one of the first corpus, the second corpus, and the third corpus.
In step S13, a synonym library is generated from the synonyms.
By the method, synonymous sentences can be determined from different corpora and from the relations between the corpora and synonymous sentences, and a synonymous sentence library can then be generated from them, so that the library is constructed simply, accurately and quickly.
In one possible implementation manner, the method for generating the synonym library may be executed by an electronic device such as a terminal device or a server, where the electronic device may be a device capable of automatically performing numerical calculation and/or information processing according to a preset or stored instruction. Processing components of an electronic device may include, but are not limited to: a central processor (Central Processing Unit, CPU), a graphics processor (Graphic Processing Unit, GPU), a General-purpose graphics processor (General-Purpose Computing on Graphics Processing Units, GPGPU), a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a tensor processor (Tensor Processing Unit, TPU), a field programmable gate array (Field Programmable Gate Array, FPGA), or other programmable logic device. The processor in the electronic device for executing the method may be a completely new design, or may be obtained by modifying a processing unit of an existing processor, which is not limited in this disclosure.
In one possible implementation, synonymous sentences refer to sentences that differ in wording but have the same or similar semantics, and that can be substituted for each other. For example, "Lao Zhang criticized Xiao Wang", "Lao Zhang gave Xiao Wang a criticism" and "Xiao Wang was criticized by Lao Zhang" are three synonymous sentences.
In one possible implementation, the synonymous sentence library may be a collection of a large number of synonymous sentences that is stored long-term in the terminal device (or server), organized, shareable, and uniformly managed. The synonymous sentence library generated by the embodiments of the disclosure can be applied in the field of natural language processing, to improve the accuracy of tasks such as knowledge point filtering, text classification and semantic clustering. It can also be applied in the search field: after receiving a sentence input by a user, a search engine can first acquire synonymous sentences of that sentence, and then search for matching content according to the sentence and its synonymous sentences. It can also be applied in the field of machine learning, as a training set for a neural network. The synonymous sentence library of the embodiments of the present disclosure may be applied to various scenarios, which the present disclosure does not limit.
In one possible implementation, in step S11, the data in the corpus database may originate from data published by a third party, from published data collected by oneself (e.g., crawled from the internet), or from locally stored data. The obtained corpus database can be a corpus database oriented to a certain field, such as the medical field or the biological field; alternatively, it may be oriented to one or more general fields, such as news corpus data (including a large number of normative expressions) or social corpus data (including a large number of non-normative expressions). It should be appreciated that the data in the corpus database is data whose use the third party has authorized; provided this condition is met, the embodiments of the present disclosure do not limit the source of the data in the corpus database or the domain to which it is oriented.
In a possible implementation manner, in step S11, the obtained corpus database may include a first corpus, a second corpus, and a third corpus, where the first corpus, the second corpus, and the third corpus are corpora with different contents. For example, the obtained corpus database may include: at least one of a first corpus composed of synonym pairs, a second corpus composed of dictionaries, and a third corpus composed of translated documents.
The first corpus is a set of synonym pairs, and it can be obtained by collecting open-source synonym data (e.g., the Harbin Institute of Technology synonym word forest). Fig. 2 shows a schematic diagram of a first corpus according to embodiments of the present disclosure. As shown in Fig. 2, synonym pairs consisting of word 1 and word 2 may be arranged in a "word 1 \t word 2" format, where "\t" represents the Tab key. The present disclosure presents Fig. 2 by way of example only, and does not limit the number and presentation of synonym pairs that the first corpus can include.
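As an illustration of this storage format, the following minimal Python sketch (not part of the patent; the file path and helper name are assumptions) parses a first corpus stored as one tab-separated synonym pair per line:

```python
from typing import List, Tuple

def load_synonym_pairs(path: str) -> List[Tuple[str, str]]:
    """Parse a first corpus stored as one 'word1\\tword2' pair per line."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2 and all(parts):  # skip malformed lines
                pairs.append((parts[0], parts[1]))
    return pairs
```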
The second corpus is a collection of various dictionaries. A dictionary is a reference book that collects words and phrases, arranges them in a certain order, and explains them for readers' inspection and reference. The dictionaries in the second corpus may include a modern Chinese dictionary, a foreign-language dictionary, a Chinese dictionary, a discipline (encyclopedia) dictionary, a personal-name dictionary, a place-name dictionary, etc.; the disclosure does not limit the number and types of dictionaries included in the second corpus.
The third corpus is a set of translated documents, and it can be obtained by collecting public translated documents or published books. The translated documents may include translations of classical Chinese texts, translations of foreign-language documents (e.g., English documents), translations of poetry, etc.; the disclosure does not limit the amount and type of translated documents in the third corpus.
In step S12, synonymous sentences may be generated according to at least one of the first corpus, the second corpus and the third corpus in the corpus database. The synonymous sentences include at least one of the following: synonymous sentences generated according to the paraphrases of at least one synonym pair of the first corpus in any single version of dictionary in the second corpus; synonymous sentences generated according to the paraphrases of the same word in different versions of dictionaries in the second corpus; and synonymous sentences generated according to different versions of translated text of the same paragraph in the third corpus.
In one possible implementation, in step S12, it is considered that the dictionary paraphrases of a synonym pair (two words that are synonyms of each other) have the following characteristic: the sentences are different but semantically identical. Synonymous sentence mining can therefore be performed according to the paraphrases of at least one synonym pair in any single version of dictionary in the second corpus, and the dictionary paraphrases that are synonymous with each other form a pair of synonymous sentences.
For example, "fishermen" and "fishermen" are synonym pairs in a first corpus, and the second corpus may include different versions of dictionaries, and the definitions of "fishermen" and "fishermen" in a certain dictionary, namely:
Fishermen: people who are in the industry of fishing.
Fishing: men in the field of fishing.
A pair of synonyms can be made for "people in the fishing industry" and "men in the fishing industry".
In one possible implementation, in step S12, it is considered that the dictionary definitions of the same word in different versions are, with high probability, sentences that are different but semantically identical. Synonymous sentences are therefore mined according to the paraphrases of the same word in different versions of dictionaries in the second corpus, and the paraphrases of the same word in different versions form a pair of synonymous sentences.
For example, the definition of the idiom "carving the boat to seek the sword" in one dictionary: a metaphor for being limited by precedent, not changing one's view or method as the situation changes. The paraphrase of "carving the boat to seek the sword" in another dictionary: a metaphor for sticking rigidly to a method, not knowing how to be flexible.

"A metaphor for being limited by precedent, not changing one's view or method as the situation changes" and "a metaphor for sticking rigidly to a method, not knowing how to be flexible" may be constituted as a pair of synonymous sentences.
In one possible implementation, in step S12, it is considered that when the same text (or foreign-language document) is translated by different publishers (or different translators), the translated results necessarily differ in wording while the semantics remain the same. Synonymous sentences can therefore be generated from different versions of translated text of the same paragraph in the third corpus.
For example, selected from the first paragraph of "Records of the Grand Historian: Annals of Gaozu" (the eighth volume): High ancestor, Pei Fengyi middle yang lining people, surname Liu, Chinese character 'quan'. The father is taigong and the mother is Liu. First Liu tasted to be bright and dream and mind. When the thunder and lightning is dark and meditation is too great, the dragon is seen on the medicine. It is known that high progenitor is produced.
One version of the translation of the "Records of the Grand Historian": The gaozu is a person in the middle yang of Fengyi county, with Liu, a word season. His father was too public and mother was Liu. Liu had been resting on the glossy shore and in dream to blend with the mind before the high ancestor was born. At that time, the thunder was flashing, faint, and the person just before she was seen with the dragon on her. Shortly, liu had a pregnant and high ancestral life.
Another version of the translation: The gaozu is a person in the middle yang of Fengyi county, liu of last name, and a word season. His father called taigong and mother called Liu. Early Liu had been resting on the large shoreside, meeting the water in dreams. When the lightning is ringing, the sky is dark, and the dragon is found to lie on Liu when the person looks before the person gets over. She had a pregnant and had developed high ancestor.
It can be seen that, for the same sentence, the translated sentences of different versions are not identical, but the semantics are the same, so synonymous sentences can be generated from the paired translated texts: the corresponding sentences of the two translations form pairs in order. For example, the two renderings of the first sentence ("The gaozu is a person in the middle yang of Fengyi county, with Liu, a word season" and "The gaozu is a person in the middle yang of Fengyi county, liu of last name, and a word season") form a pair of synonymous sentences, the two renderings of the second sentence form another pair, and so on for each sentence of the paragraph. These sentence pairs mined from the translations of the classical text may be determined as synonymous sentences.
For another example, for the first paragraph of the foreign work "The Old Man and the Sea", one translated version: He was an old man and a person drawn a boat to fish in the ocean current of the gulf of mexico. While he had not caught a fish for eighty-four days. There is a boy and he together in the first forty days, but there is no fish caught in forty days, the boy's parent tells him that the old person is indeed the most unlucky person, so the child hears the parent's instruction to go to another boat to catch three good fish in the first week. The child sees that the elderly, driving back on the empty boat every day, feel hard, and he always removes his hook wire or fishhook and harpoon from his roll, and the sail on the mast from his roll. Folding and nailing the sail by using a flour bag: when rolled up, looks like a permanently failed flag.
Another translated version: he is an elderly person fishing alone on a boat in the gulf stream, who has been forty-four days ago, and a fish has not been arrested. During the first forty days, a male child is present with him. However, after forty days, a parent of a child has not caught a fish, he says that the old is now quite thoroughly "pouring blood mould", that is, pouring mould to the extreme, and the child hears their instruction, goes to another boat, and catches three good fish with the first worship. The child sees that the boat is always empty when the old person comes back every day, and is uncomfortable to wear, and he always goes off the shore, helps the old person to take the rolled fishing rope, or the fishhook and the harpoon, and also has a sail wound on the mast. The sail is patched by a flour bag sheet, and after being folded, the sail looks like a flag with a surface marked with permanent failure.
It can be seen that, for the same passage of foreign text, the translated sentences of different versions are not identical, but the semantics are the same, so synonymous sentences can be generated from the paired translated texts. The specific pairing is similar to that of the translated classical text above and is not described in detail here.
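A minimal sketch of this pairing idea is given below. It assumes sentence-by-sentence alignment of two translations of the same paragraph and a caller-supplied match_fn standing in for the preset matching model described later in this disclosure; the function names and the example threshold are illustrative, not fixed by the patent:

```python
import re

def split_sentences(text: str) -> list:
    """Split on Chinese/Western end-of-sentence punctuation."""
    return [s.strip() for s in re.split(r"(?<=[。！？.!?])", text) if s.strip()]

def translation_sentence_pairs(first_version: str, second_version: str,
                               match_fn, threshold: float = 0.5) -> list:
    """Pair the sentences of two translations of the same paragraph in order,
    keeping only pairs whose matching degree reaches the threshold.
    zip() stops at the shorter translation if the sentence counts differ."""
    pairs = zip(split_sentences(first_version), split_sentences(second_version))
    return [(a, b) for a, b in pairs if match_fn(a, b) >= threshold]
```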
In one possible implementation manner, in the case that hardware resources are relatively sufficient, in order to improve the mining efficiency of synonymous sentences, the synonymous sentences may be generated in a parallel processing manner. For example, multiple threads may be provided in the processor of the electronic device, such as thread A, thread B, and thread C, where threads A to C may execute concurrently and share the execution resources of the processor: thread A is configured to generate synonymous sentences from the paraphrases of at least one synonym pair of the first corpus in any single version of dictionary in the second corpus, thread B is configured to generate synonymous sentences from the paraphrases of the same word in different versions of dictionaries in the second corpus, and thread C is configured to generate synonymous sentences from different versions of translated text of the same paragraph in the third corpus.
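One plausible arrangement of this parallelism, sketched in Python; the three mining routines are passed in as parameters because their names and signatures (mine_pair_paraphrases, mine_cross_dictionary, mine_translation_pairs) are assumptions for illustration, not defined by the patent:

```python
from concurrent.futures import ThreadPoolExecutor

def mine_all(first_corpus, second_corpus, third_corpus,
             mine_pair_paraphrases, mine_cross_dictionary, mine_translation_pairs):
    """Run the three mining manners concurrently and merge their results."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [
            pool.submit(mine_pair_paraphrases, first_corpus, second_corpus),  # thread A
            pool.submit(mine_cross_dictionary, second_corpus),                # thread B
            pool.submit(mine_translation_pairs, third_corpus),                # thread C
        ]
        results = []
        for future in futures:
            results.extend(future.result())
    return results
```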
Alternatively, when hardware resources are constrained, one mining manner may be adopted first to generate the corresponding synonymous sentences, and the other mining manners may then be adopted to further expand the amount of data. For example, synonymous sentences may first be generated from the paraphrases of at least one synonym pair of the first corpus in any single version of dictionary in the second corpus, then from the paraphrases of the same word in different versions of dictionaries in the second corpus, and then from different versions of translated text of the same paragraph in the third corpus. The present disclosure does not limit the order in which synonymous sentences are generated.
After the synonymous sentences are generated in step S12, a synonymous sentence library is generated from the synonymous sentences in step S13.
In one possible implementation manner, in order to obtain a more comprehensive data set, the generation manners of step S12 may be adopted to traverse each piece of data in the corpus database of step S11, and all synonymous sentences generated in step S12 may be aggregated to obtain the synonymous sentence library.
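One possible form of this aggregation step, as a sketch; the deduplication policy (treating a pair as unordered) is an assumption, not specified by the disclosure:

```python
def build_library(sentence_pairs) -> set:
    """Aggregate mined sentence pairs into a deduplicated library.
    frozenset makes (a, b) and (b, a) the same library entry."""
    library = set()
    for a, b in sentence_pairs:
        if a != b:
            library.add(frozenset((a, b)))
    return library
```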
Through steps S11 to S13, a synonymous sentence library can be generated from synonymous sentences produced by the dictionary paraphrases of synonym pairs, by the paraphrases of the same word in different versions of dictionaries, and by different versions of translated text of the same paragraph. Compared with the related art, in which the process of determining synonymous sentences is complicated and labor costs are high, the method can simply, accurately and efficiently acquire a large number of synonymous sentences without manual mining, and form a synonymous sentence library from them.
In one possible implementation, in order to obtain synonymous sentences more accurately and more efficiently, step S12 may include: performing first data cleansing on the first corpus, the second corpus and the third corpus respectively, to obtain a first corpus in a first target format, a second corpus in a second target format and a third corpus in a third target format; generating paraphrase pairs and/or translation sentence pairs according to at least one of the first corpus in the first target format, the second corpus in the second target format and the third corpus in the third target format; and performing second data cleansing on the paraphrase pairs and/or the translation sentence pairs to obtain synonymous sentences after the second data cleansing.
In the following, the generation manner of each of the three types of synonymous sentences is described by example.
In a possible implementation manner, in step S12, synonymous sentences may be generated according to at least one synonym pair in the first corpus and its paraphrases in any version of dictionary in the second corpus, which may include steps A1 to A3:
in step A1, the first corpus and the second corpus may be cleaned to obtain the first corpus in the first target format and the second corpus in the second target format, respectively.
Performing first data cleansing on the synonym pairs in the first corpus obtains a first corpus in the first target format, wherein in the first target format each word corresponds to one word. The first data cleansing may be used to clean text, pinyin and notation not related to the words themselves, such as pinyin and part-of-speech identifiers like "ā qn (name)", and to sort entries that are not one word to one word into the first target format. Thus, the first corpus may include a plurality of pieces of data, and each piece of data may be arranged in the "word \t word" format, where "\t" represents the Tab key.
Performing first data cleansing on the dictionaries in the second corpus obtains a second corpus in the second target format, wherein in the second target format each word has a corresponding paraphrase. The first data cleansing may be used to clean words and symbols in the dictionaries that are not related to the paraphrases, such as "ā qn <name>" and "-hook", while the words and definitions in the dictionaries are retained, obtaining the second corpus in the second target format. Thus, the second corpus may include a plurality of pieces of data, and each piece of data may be arranged in the "word \t paraphrase" format, where "\t" represents the Tab key.
In step A2, according to the second corpus in the second target format, a paraphrase pair corresponding to each synonym pair in the first corpus in the first target format is determined.
In step A3, second data cleansing is performed on the paraphrase pairs to obtain synonymous sentences after the second data cleansing. The second data cleansing is used to clean paraphrases that consist of a plurality of sentences with a plurality of meanings, so as to obtain more accurate synonymous sentences. For example, there is a synonym pair containing "settle accounts" in the first corpus, and one of its corresponding paraphrases is: "To count and calculate accounts. After suffering a loss or a failure, to dispute with someone again." Here, "settle accounts" is an ambiguous word; the sentence "After suffering a loss or a failure, to dispute with someone again." in its paraphrase clearly does not match the other paraphrase, "To settle the accounts within a period of time.", so the second data cleansing removes that sentence from the paraphrase pair, and a pair of synonymous sentences is obtained, namely: "To count and calculate accounts." and "To settle the accounts within a period of time."
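A sketch of this second data cleansing for paraphrase pairs, assuming a caller-supplied match_fn standing in for the preset matching model and a simple punctuation-based sentence splitter (both assumptions for illustration):

```python
import re

def split_sentences(text: str) -> list:
    return [s.strip() for s in re.split(r"(?<=[。！？.!?])", text) if s.strip()]

def clean_paraphrase_pair(paraphrase_a: str, paraphrase_b: str, match_fn):
    """When either paraphrase has several sentences (several senses), keep
    only the best-matching sentence pair and drop the other sentences."""
    sents_a, sents_b = split_sentences(paraphrase_a), split_sentences(paraphrase_b)
    if not sents_a or not sents_b:
        return None
    return max(((a, b) for a in sents_a for b in sents_b),
               key=lambda pair: match_fn(*pair))
```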
In this way, synonymous sentences can be generated using the dictionary paraphrases of the synonym pairs, and performing multiple rounds of data cleansing while obtaining the paraphrase pairs yields more accurate synonymous sentences.
In one possible implementation manner, performing first data cleaning on the first corpus to obtain a first corpus in a first target format, including: removing words and sentences containing preset identifiers in the first corpus; and performing format setting on the first corpus from which the words and sentences containing the preset identification are removed to obtain a first corpus in a first target format, wherein the first target format corresponds to one word for each word.
For example, assume that one entry in the first corpus is "happy synonyms: cheerful, joyful". The words and sentences containing preset identifiers, i.e., "synonyms:", can be removed to obtain "happy cheerful joyful". Then format setting can be performed to obtain "happy \t cheerful" and "happy \t joyful" in the first target format.
By the method, the first corpus in the first target format can be obtained, and the subsequent efficient and accurate generation of synonyms is facilitated.
In one possible implementation, step A1 may include: removing words and sentences containing preset identifiers from each vocabulary entry in each dictionary in the second corpus; performing format setting on the second corpus from which the words and sentences containing the preset identification are removed, and obtaining a second corpus in an initial format; acquiring the matching degree of the words in the second corpus in the initial format and each sentence in the paraphrasing corresponding to the words according to a preset matching model; and removing other sentences except the sentences with the maximum matching degree from the paraphrasing corresponding to the words to obtain a second corpus in a second target format. The matching model can be obtained through training of a first training data set, wherein the first training data set comprises single words in a dictionary and single sentence paraphrasing corresponding to the single words. By the method, the second corpus in the second target format can be obtained, and the subsequent efficient and accurate generation of synonyms is facilitated.
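As an illustration of the retention step just described, the following minimal sketch keeps, for each word, only the definition sentence with the largest matching degree; match_fn again stands in for the preset matching model and is an assumption:

```python
import re

def split_sentences(text: str) -> list:
    return [s.strip() for s in re.split(r"(?<=[。！？.!?])", text) if s.strip()]

def clean_definition(word: str, definition: str, match_fn) -> str:
    """Keep only the definition sentence whose matching degree with the
    headword is largest; provenance sentences etc. are thereby removed."""
    sentences = split_sentences(definition)
    return max(sentences, key=lambda s: match_fn(word, s)) if sentences else ""
```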
Fig. 3 is a schematic diagram of a dictionary according to an embodiment of the present disclosure. As shown in Fig. 3, the words and sentences containing preset identifiers in the dictionaries of the second corpus (for example, the boxed areas in Fig. 3) may be removed to obtain a second corpus in the initial format.
In an example, the preset identifier may include a pinyin identifier, a part-of-speech identifier, a skip direction identifier, a use case identifier, and the like, and in practical application, the content of the preset identifier may be adjusted, which is not limited in this disclosure.
As shown in Fig. 3, the dictionary may contain pinyin identifiers and part-of-speech identifiers, such as "āilián <action>" and "āimíng <action>", which can be cleaned using regular expressions. Regular expression cleaning performs regular matching on the preset identifiers, matches the unnecessary characters, and replaces them with spaces or other content.
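A sketch of such regular expression cleaning; the patterns below are assumptions for illustration and would need to be adapted to each dictionary's actual markup:

```python
import re

# Illustrative patterns only; real rules depend on the dictionary's markup.
POS_TAG = re.compile(r"[<〈][^>〉]{1,8}[>〉]")  # part-of-speech tags such as "<action>"
PINYIN = re.compile(r"[a-zü]*[āáǎàōóǒòēéěèīíǐìūúǔùǖǘǚǜ][a-züāáǎàōóǒòēéěèīíǐìūúǔùǖǘǚǜ]*")

def strip_identifiers(entry: str) -> str:
    """Replace identifier matches with spaces, then collapse whitespace."""
    entry = POS_TAG.sub(" ", entry)
    entry = PINYIN.sub(" ", entry)
    return re.sub(r"\s+", " ", entry).strip()
```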
There will be "see below", "same page", etc. in the dictionary, such skip-pointing identifiers, which can be deleted and copied to the word interpretation to which they point.
As illustrated in Fig. 3, there may also be use-case identifiers (example sentences illustrating usage), which can likewise be removed by the regular expression cleaning method.
In an example, program code (e.g., a scripting language program) may be written with predefined cleaning rules to clean the words and sentences containing preset identifiers in the dictionary.
Then, the second corpus from which the words and sentences containing preset identifiers have been removed may be formatted into the initial "word \t paraphrase" format, where "\t" represents the Tab key, resulting in a second corpus in the initial format. Fig. 4 shows a schematic diagram of a second corpus in the initial format according to an embodiment of the disclosure.
However, in the second corpus in the initial format, the paraphrases corresponding to each word may exist in multiple sentences, and sentences representing the source of the word may exist in the multiple sentences. FIG. 5 is a schematic diagram of a second corpus in another initial format, as shown in FIG. 5, with some initial format second corpus resulting from idiom dictionary clean-up transformations, in which paraphrasing would include sentences identifying the idiom origin.
For example, the paraphrase corresponding to the idiom "freezing and starving" is: "refers to the dilemma of having no clothes and no food. From chapter one hundred of 'Dream of the Red Mansions' by Cao Xueqin of the Qing dynasty." Here, the citation from "Dream of the Red Mansions" is a sentence representing the provenance of the idiom.
Such entries can be cleaned with the preset matching model: the matching degree of "freezing and starving" with "refers to the dilemma of having no clothes and no food" and with the provenance sentence is computed separately. Since the matching degree with "refers to the dilemma of having no clothes and no food" is the largest, that sentence is kept in the definition, and the provenance sentence from "Dream of the Red Mansions" is removed. The resulting entry, "freezing and starving: refers to the dilemma of having no clothes and no food", conforms to the second target format.
In an example, the matching model may be a neural network model trained on a first training data set, where the first training data set includes univocal words in a dictionary and the single-sentence paraphrases corresponding to those univocal words; the matching model may be used to determine the degree of matching between each word in the second corpus in the initial format and each sentence in its corresponding paraphrase.
The dictionary data in the second corpus is authoritative, and its use of punctuation is highly standardized. In the dictionary, punctuation marks such as the period, the question mark and the exclamation mark delimit sentence breaks, and each resulting sentence is a complete sentence. Each piece of paraphrase data is one such sentence.
In the dictionary, entries whose paraphrase contains only one sentence of text, i.e., the paraphrases of univocal words, may be mined first. Each univocal word and its single-sentence paraphrase may form a pair of positive samples, serving as training data for the matching model.
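A minimal sketch of mining such positive samples follows, assuming the initial-format corpus is a UTF-8 text file of "word\tparaphrase" lines and that sentences end with Chinese sentence-final punctuation; both assumptions go beyond the text.

```python
import re

SENTENCE_END = re.compile(r"[。？！]")  # period, question mark, exclamation mark

def mine_univocal_pairs(path: str) -> list[tuple[str, str]]:
    """Collect (word, paraphrase) positive pairs whose paraphrase is one sentence."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, _, paraphrase = line.rstrip("\n").partition("\t")
            sentences = [s for s in SENTENCE_END.split(paraphrase) if s.strip()]
            if word and len(sentences) == 1:   # a univocal word: one complete sentence
                pairs.append((word, paraphrase.strip()))
    return pairs
```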
The matching model may be an open-source matching model, such as the SimBERT model (a model integrating retrieval and generation, built on the open-source BERT model and the UniLM idea). It should be appreciated that the matching model may include at least one of convolutional layers, deconvolution layers and fully connected layers, and may be a convolutional neural network (Convolutional Neural Network, CNN), a recurrent neural network (Recurrent Neural Network, RNN), a deep neural network (Deep Neural Networks, DNN), etc.; the present disclosure does not limit the network structure of the matching model.
In order to improve the training effect of the matching model, an open-source matching model that has already been pre-trained can be downloaded and, based on the first training data set, continuously fine-tuned (finetuning) with a contrastive learning method, thereby obtaining the trained matching model.
Fine-tuning (finetuning) reuses the network structure and network parameters of the known matching model, modifies its output layer, and fine-tunes the network parameters of the several layers before the last layer, which can improve the training efficiency of the matching model.
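For illustration, a minimal fine-tuning setup of this kind might look as follows, assuming a Hugging Face BERT-style encoder; the checkpoint name, the number of unfrozen layers and the learning rate are placeholders, not the configuration used by the disclosure.

```python
import torch
from transformers import AutoModel

# Load an already pre-trained, open-source encoder; "bert-base-chinese" is a
# stand-in here, not the checkpoint named by this disclosure.
model = AutoModel.from_pretrained("bert-base-chinese")

for param in model.parameters():          # freeze everything first
    param.requires_grad = False
for layer in model.encoder.layer[-2:]:    # then unfreeze the last few layers
    for param in layer.parameters():
        param.requires_grad = True

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=2e-5)
```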
In the training process, the training data in the first training data set can be input into the matching model in batches (batch), and the matching model is trained to obtain a trained matching model.
The training data in the first training data set is input into the matching model in batches to obtain the feature vectors corresponding to each corpus in each batch of training data, wherein each batch of training data comprises N pairs of synonymous corpora, each pair of synonymous corpora comprises 2 corpora that are synonymous with each other, and N is an integer greater than 1. Each feature vector in the N pairs of feature vectors is traversed in turn: the feature vector of the synonymous corpus corresponding to the current feature vector is taken as the positive sample, and the 2N-2 feature vectors corresponding to the remaining 2N-2 corpora are taken as negative samples. The current feature vector, the 1 positive sample and the 2N-2 negative samples are input into a loss function to obtain the contrast loss, where the loss function is used to calculate the contrast between the positive-sample similarity and the negative-sample similarity; the positive-sample similarity is the similarity between the current feature vector and the positive sample, and the negative-sample similarity is the average similarity between the current feature vector and the 2N-2 negative samples. The matching model is then trained according to the contrast loss to obtain the trained matching model.
For example, as described above, the first training data set includes the univocal words in the dictionary and the univocal paraphrases corresponding to those univocal words. Assuming that each batch of training data includes 4 pairs (N=4) of synonymous corpora, the training data for any batch can be expressed as: [univocal word 1, univocal paraphrase 1, univocal word 2, univocal paraphrase 2, univocal word 3, univocal paraphrase 3, univocal word 4, univocal paraphrase 4], wherein univocal word 1 and univocal paraphrase 1 are a pair of mutually synonymous corpora, univocal word 2 and univocal paraphrase 2 are a pair of mutually synonymous corpora, univocal word 3 and univocal paraphrase 3 are a pair of mutually synonymous corpora, and univocal word 4 and univocal paraphrase 4 are a pair of mutually synonymous corpora.
Inputting the training data [univocal word 1, univocal paraphrase 1, univocal word 2, univocal paraphrase 2, univocal word 3, univocal paraphrase 3, univocal word 4, univocal paraphrase 4] of any batch into the matching model yields the feature vectors corresponding to each corpus in that batch of training data, namely [a, a1, b, b1, c, c1, d, d1], wherein a represents the feature vector of univocal word 1, a1 represents the feature vector of univocal paraphrase 1, b represents the feature vector of univocal word 2, b1 represents the feature vector of univocal paraphrase 2, c represents the feature vector of univocal word 3, c1 represents the feature vector of univocal paraphrase 3, d represents the feature vector of univocal word 4, and d1 represents the feature vector of univocal paraphrase 4.
The 8 feature vectors [a, a1, b, b1, c, c1, d, d1] may be traversed in turn, where i denotes the index of the feature vector currently traversed. In the case where i=1, the 1st feature vector a is currently traversed; the feature vector a1 of the synonymous corpus corresponding to the current feature vector a may be used as the positive sample, and the remaining 6 feature vectors b, b1, c, c1, d, d1 may be used as negative samples. The current feature vector a, the positive sample a1 and the negative samples b, b1, c, c1, d, d1 are input into the loss function to obtain the contrast loss corresponding to the feature vector a.
In the case where i=2, the 2nd feature vector a1 is currently traversed; the feature vector a of the synonymous corpus corresponding to the current feature vector a1 may be used as the positive sample, and the remaining 6 feature vectors b, b1, c, c1, d, d1 may be used as negative samples. The current feature vector a1, the positive sample a and the negative samples b, b1, c, c1, d, d1 are input into the loss function to obtain the contrast loss corresponding to the feature vector a1.
In the case where i=3, the 3rd feature vector b is currently traversed; the feature vector b1 of the synonymous corpus corresponding to the current feature vector b may be used as the positive sample, and the remaining 6 feature vectors a, a1, c, c1, d, d1 may be used as negative samples. The current feature vector b, the positive sample b1 and the negative samples a, a1, c, c1, d, d1 are input into the loss function to obtain the contrast loss corresponding to the feature vector b.
In the case where i=4, the 4th feature vector b1 is currently traversed; the feature vector b of the synonymous corpus corresponding to the current feature vector b1 may be used as the positive sample, and the remaining 6 feature vectors a, a1, c, c1, d, d1 may be used as negative samples. The current feature vector b1, the positive sample b and the negative samples a, a1, c, c1, d, d1 are input into the loss function to obtain the contrast loss corresponding to the feature vector b1.
In the case where i=5, the 5th feature vector c is currently traversed; the feature vector c1 of the synonymous corpus corresponding to the current feature vector c may be used as the positive sample, and the remaining 6 feature vectors a, a1, b, b1, d, d1 may be used as negative samples. The current feature vector c, the positive sample c1 and the negative samples a, a1, b, b1, d, d1 are input into the loss function to obtain the contrast loss corresponding to the feature vector c.
In the case where i=6, the 6th feature vector c1 is currently traversed; the feature vector c of the synonymous corpus corresponding to the current feature vector c1 may be used as the positive sample, and the remaining 6 feature vectors a, a1, b, b1, d, d1 may be used as negative samples. The current feature vector c1, the positive sample c and the negative samples a, a1, b, b1, d, d1 are input into the loss function to obtain the contrast loss corresponding to the feature vector c1.
In the case where i=7, the 7th feature vector d is currently traversed; the feature vector d1 of the synonymous corpus corresponding to the current feature vector d may be used as the positive sample, and the remaining 6 feature vectors a, a1, b, b1, c, c1 may be used as negative samples. The current feature vector d, the positive sample d1 and the negative samples a, a1, b, b1, c, c1 are input into the loss function to obtain the contrast loss corresponding to the feature vector d.
In the case where i=8, the 8th feature vector d1 is currently traversed; the feature vector d of the synonymous corpus corresponding to the current feature vector d1 may be used as the positive sample, and the remaining 6 feature vectors a, a1, b, b1, c, c1 may be used as negative samples. The current feature vector d1, the positive sample d and the negative samples a, a1, b, b1, c, c1 are input into the loss function to obtain the contrast loss corresponding to the feature vector d1.
Then, according to the contrast loss corresponding to each feature vector in each batch of training data, the network parameters of the matching model can be adjusted until they converge, thereby obtaining the trained matching model.
In an example, the loss function is used to calculate a contrast between a positive sample similarity, which is a similarity of a current feature vector to a positive sample, and a negative sample similarity, which is an average similarity of the current feature vector to a plurality of negative samples. For example, the loss function may be expressed as:
$$L_i = -\log \frac{\exp\big(S(z_i, z_i^{+})/\tau\big)}{\exp\big(S(z_i, z_i^{+})/\tau\big) + \sum_{j=1}^{K} \exp\big(S(z_i, z_j)/\tau\big)} \qquad (1)$$
In formula (1), L_i represents the contrast loss of the current i-th corpus in each batch of training data; z_i represents the feature vector of the current i-th corpus as predicted by the matching model; z_i^+ represents the positive sample of the current i-th corpus; z_j represents the j-th of the K negative samples, where K represents the number of negative samples; S represents the cosine similarity function, so that S(z_i, z_i^+) represents the cosine similarity between the current feature vector and the positive sample and S(z_i, z_j) represents the cosine similarity between the current feature vector and a negative sample; and τ represents a hyper-parameter controlling how sharply the matching model distinguishes negative samples. If the value of τ is larger, the distribution of S becomes smoother and the resulting contrast loss differentiates the negative samples less, so the matching model learns without emphasis. If the value of τ is smaller, the matching model focuses more on the negative samples closest to the positive samples; but those negative samples may well be potential positive samples, which can make the matching model difficult to converge or cause it to generalize poorly.
As can be seen from the loss function in formula (1), the numerator represents the similarity between positive examples, and the denominator represents the similarity of the anchor to both the positive and the negative examples; hence the larger the similarity within the same category and the smaller the similarity across different categories, the smaller the loss.
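A direct numpy sketch of formula (1) for a single anchor, with the temperature value as an illustrative assumption, might read:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """S in formula (1): cosine similarity of two feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def contrast_loss(anchor, positive, negatives, tau=0.05):
    """L_i = -log( exp(S+/tau) / (exp(S+/tau) + sum_j exp(S_j/tau)) )."""
    pos = np.exp(cosine(anchor, positive) / tau)
    neg = sum(np.exp(cosine(anchor, n) / tau) for n in negatives)
    return float(-np.log(pos / (pos + neg)))
```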
By the method, training efficiency and accuracy can be improved, and a more accurate matching model is obtained.
In practical applications, in order to improve the operation efficiency, a cosine matrix and a label matrix may be first constructed, where the cosine matrix is used to calculate the similarity between samples, and the label matrix is used to mark positive and negative samples, for example, positive samples may be marked with 1 and negative samples may be marked with 0. Then, the contrast loss corresponding to each batch of training data can be determined according to the cosine matrix, the label matrix and the loss function shown in the formula (1).
In an example, when the training data [univocal word 1, univocal paraphrase 1, univocal word 2, univocal paraphrase 2, univocal word 3, univocal paraphrase 3, univocal word 4, univocal paraphrase 4] of any batch is input into the matching model, the feature vectors corresponding to each corpus in that batch of training data, namely [a, a1, b, b1, c, c1, d, d1], are obtained, wherein a represents the feature vector of univocal word 1, a1 represents the feature vector of univocal paraphrase 1, b represents the feature vector of univocal word 2, b1 represents the feature vector of univocal paraphrase 2, c represents the feature vector of univocal word 3, c1 represents the feature vector of univocal paraphrase 3, d represents the feature vector of univocal word 4, and d1 represents the feature vector of univocal paraphrase 4.
For each of the 8 feature vectors, its cosine similarity with each of the other 7 feature vectors can be computed and recorded in an initial cosine matrix: the cosine similarities of a with a1, b, b1, c, c1, d and d1 are recorded, then those of a1 with a, b, b1, c, c1, d and d1, and so on for b, b1, c, c1, d and d1. Fig. 6 shows a schematic diagram of an initial cosine matrix of an embodiment of the present disclosure; it should be understood that the values shown in Fig. 6 are for illustration only, and the present disclosure is not limited in this regard.
Considering that, in the initial cosine matrix, the cosine similarity of a vector with itself is meaningless (for example, the cosine similarity between a and a), the diagonal values of the initial cosine matrix may be reduced to a small value (for example, 0.01) to obtain the cosine matrix; Fig. 7 is a schematic diagram of the cosine matrix according to an embodiment of the disclosure. It should be understood that the present disclosure does not limit the diagonal value of the cosine matrix; 0.01 in Fig. 7 is only an illustration, and the value may also be 0.001, 0.0001, etc.
In parallel, in order to identify whether a cosine similarity in the cosine matrix is a positive-sample similarity or a negative-sample similarity, a label matrix may be constructed; Fig. 8 shows a schematic diagram of a label matrix of an embodiment of the present disclosure. As shown in Fig. 8, the matrix elements determined by a and a1, b and b1, c and c1, and d and d1 may be set to 1, and the remaining matrix elements set to 0, so that the cosine similarities at the positions marked 1 are positive-sample similarities and those at the positions marked 0 are negative-sample similarities.
Then, the contrast loss corresponding to each batch of training data can be determined according to the cosine matrix, the label matrix and the loss function shown in the formula (1). Therefore, the positive sample similarity and the negative sample similarity can be uniformly calculated through the cosine matrix, and the contrast loss can be obtained when the positive sample similarity and the negative sample similarity are input into the loss function, so that the contrast loss can be more efficiently determined.
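A vectorized sketch of this cosine-matrix and label-matrix computation follows; the temperature, the diagonal value and the row layout are illustrative assumptions, and the damped diagonal is simply left inside the negative term since the small value contributes negligibly.

```python
import numpy as np

def batch_contrast_loss(Z: np.ndarray, tau: float = 0.05) -> float:
    """Z stacks the 2N feature vectors [a, a1, b, b1, ...] row by row, so rows
    (2k, 2k+1) are a pair of mutually synonymous corpora."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    cos = Zn @ Zn.T                        # the initial cosine matrix
    np.fill_diagonal(cos, 0.01)            # damp the meaningless self-similarity
    n = Z.shape[0]
    labels = np.zeros((n, n))
    for k in range(0, n, 2):               # mark positive-sample positions with 1
        labels[k, k + 1] = labels[k + 1, k] = 1
    e = np.exp(cos / tau)
    pos = (e * labels).sum(axis=1)         # positive-sample similarity per anchor
    neg = (e * (1 - labels)).sum(axis=1)   # negatives; the damped diagonal adds little
    return float(np.mean(-np.log(pos / (pos + neg))))
```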
After the trained matching model is obtained, the second corpus in the initial format can be cleaned entry by entry: for any piece of data, the trained matching model computes the matching degree between the word and each sentence of its paraphrase, the sentence with the largest matching degree is kept as the paraphrase data of the current word, and the other sentences are cleaned away.
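For illustration, assuming a `match_degree(word, sentence)` scorer backed by the trained matching model (the name is hypothetical) and Chinese sentence-final punctuation, this entry-by-entry cleaning might be sketched as:

```python
import re

def clean_paraphrase(word: str, paraphrase: str, match_degree) -> str:
    """Keep only the paraphrase sentence that best matches the word."""
    sentences = [s.strip() for s in re.split(r"(?<=[。？！])", paraphrase) if s.strip()]
    if len(sentences) <= 1:
        return paraphrase.strip()          # already a single-sentence paraphrase
    return max(sentences, key=lambda s: match_degree(word, s))
```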
For example, a semi-supervised training method may be adopted: after a certain amount of data has been cleaned with the matching model (for example, once a second corpus set in the second target format reaches a preset size), that data may be used to further fine-tune (finetuning) the matching model; the second corpus set is then cleaned again, and the matching model is fine-tuned again with the newly cleaned second corpus set. After a few such cycles, the data of the second corpus set in the second target format stabilizes, and the cleaning effect reaches its best.
By this method, the first data cleaning performed on the dictionary in the second corpus yields the second corpus in the second target format more accurately, which facilitates the subsequent efficient and accurate determination of synonymous sentences.
In step A1, a first corpus in a first target format and a second corpus in a second target format are obtained, and in step A2, corresponding paraphrasing pairs of each synonym pair in the first corpus in the first target format can be determined according to the second corpus in the second target format;
Illustratively, assume that the synonym pair in any row of the first corpus in the first target format is: "word A word B". "Word A" and "word B" may be retrieved from the second corpus in the second target format, and the "paraphrase C" corresponding to "word A" and the "paraphrase D" corresponding to "word B" may be determined, with "paraphrase C" and "paraphrase D" forming a paraphrase pair in the format "paraphrase C paraphrase D".
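A minimal sketch of this lookup, assuming the second corpus in the second target format has been loaded into a dict mapping each word to its paraphrase, might read:

```python
def build_paraphrase_pairs(synonym_pairs, paraphrase_of):
    """Look up both words of each 'word A word B' pair in the second-target-format
    corpus (here a dict mapping word -> paraphrase) to form paraphrase pairs."""
    pairs = []
    for word_a, word_b in synonym_pairs:
        if word_a in paraphrase_of and word_b in paraphrase_of:
            pairs.append((paraphrase_of[word_a], paraphrase_of[word_b]))
    return pairs
```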
After the paraphrase pair corresponding to each synonym pair in the first corpus in the first target format is determined in step A2, second data cleaning may be performed on the paraphrase pairs in step A3 to obtain the synonymous sentences after the second data cleaning.
In one possible implementation, step A3 may include: splitting the paraphrasing into a plurality of sentences in the case that any one of the paraphrasing pairs is a plurality of sentences having a plurality of meanings; according to a preset matching model, obtaining the matching degree of each sentence of one paraphrase in the paraphrasing pair and each sentence of the other paraphrasing pair; and removing other sentences except the sentence with the largest matching degree from the paraphrasing pair to obtain the synonym after the second data cleaning.
The preset matching model may directly output the matching degree of the paraphrasing pair, or the preset matching model may also be a feature extraction model, which may be used to determine feature vectors corresponding to each paraphrasing pair, and may determine the matching degree of the paraphrasing pair by calculating the similarity between the feature vectors, which is not limited in this disclosure.
The second data cleaning can resolve the case of polysemous words. For example, "anger" and "fire" are a pair of synonyms, but "fire" is a polysemous word, and the paraphrase pair obtained through steps A1-A2 is: "Sound and color behavior showing roughness due to anger. (1) The light and flame emitted by an object when burned. (2) Firearms and ammunition. (3) Internal heat (fire-qi). (4) Red in appearance. (5) A metaphor for urgency. (6) Anger. (7) A metaphor for flying into a rage. (8) Prosperous, flourishing. (9) A surname."
Here, definition (6) of "fire" and the preceding sentence form a synonymous pair, and the other definitions can be removed by the second data cleaning.
For example, the definition of "fire" can be broken down into 9 sentences, and according to a preset matching model, the "angered and rough sound and color behavior" is shown "with the light and flame emitted when the" (1) object burns. "match, get the matching degree 1; matching the sound and color behaviors which are rough due to anger with the gun ammunition (2) to obtain the matching degree of 2; matching the sound and color behaviors which are rough and violent due to anger with the fire and gas (3) to obtain matching degree 3; matching the sound and color behaviors which are rough and violent due to anger with the red-looking shape of the (4) to obtain the matching degree of 4; matching the sound and color behavior which shows rough and violent due to anger with the metaphor of the (5) to obtain the matching degree of 5; matching the sound and color behaviors which are rough and violent due to anger with the anger (6) to obtain the matching degree of 6; matching the sound and color behavior which shows rough and violent due to anger with the metaphor (7) to obtain the matching degree 7; the AND (8) is vigorous for the sound and color behaviors which are rough due to anger; carrying out matching on the Xinglong to obtain a matching degree 8; matching the "sound and color behavior showing rough and violent due to anger" with the "(9) last name" to obtain the matching degree 9.
Among matching degrees 1-9, matching degree 6 is the largest, so the paraphrase corresponding to matching degree 6 can be retained and the other paraphrases deleted, yielding the synonymous sentence pair "sound and color behavior showing roughness due to anger" and "anger".
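This sentence-by-sentence matching over a paraphrase pair might be sketched as follows, again assuming the hypothetical `match_degree` scorer and Chinese sentence-final punctuation:

```python
import re

def clean_paraphrase_pair(para_a: str, para_b: str, match_degree):
    """Split both paraphrases into sentences and keep the pair of sentences with
    the largest matching degree, discarding the other senses."""
    def split(p):
        return [s.strip() for s in re.split(r"(?<=[。？！])", p) if s.strip()]
    candidates = [(a, b) for a in split(para_a) for b in split(para_b)]
    return max(candidates, key=lambda pair: match_degree(pair[0], pair[1]))
```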
In an example, the matching model may be trained on a second training data set, where the second training data set includes paraphrase pairs formed by the paraphrases of univocal words in dictionaries of different versions; the matching model may be a neural network model used to determine the degree of matching between each sentence of one paraphrase in a paraphrase pair and each sentence of the other paraphrase.
For example, the paraphrase pairs obtained in step A2 may be screened to keep the pairs composed of two single sentences, that is, the pairs composed of different dictionary versions' paraphrases of univocal words; these pairs are used as the second training data set.
The network structure and the training method of the matching model may refer to the matching model above, and will not be described herein.
Once the trained matching model is obtained, it can be used to obtain the matching degree between each sentence of one paraphrase in a paraphrase pair and each sentence of the other paraphrase, and to remove from the paraphrase pair the sentences other than the sentence with the largest matching degree.
Thus, by performing the second data cleaning on the paraphrase pairs, the extra senses in paraphrases consisting of multiple sentences with multiple meanings (the paraphrases of polysemous words) can be removed, resulting in more accurate synonymous sentences.
In one possible implementation manner, in step S12, generating synonymous sentences according to the paraphrases of the same word in dictionaries of different versions in the second corpus may include steps B1 to B3: in step B1, first data cleaning is performed on the second corpus to obtain a second corpus in a second target format, where, in the second target format, each word has a corresponding paraphrase.
In step B2, according to the paraphrases in the dictionary of different versions in the second corpus in the second target format, the paraphrase pairs formed by the different paraphrases of the dictionary of different versions corresponding to the same word are determined.
In step B3, the paraphrase pair is subjected to second data cleaning to obtain a synonym.
In one possible implementation, step B1 may include: removing words and sentences containing preset identifiers from each vocabulary entry in each dictionary in the second corpus; performing format setting on the second corpus from which the words and sentences containing the preset identification are removed, and obtaining a second corpus in an initial format; according to a preset matching model, the matching degree of the words in the second corpus in the initial format and each sentence in the paraphrasing corresponding to the words is obtained, wherein the matching model can be obtained through training of a first training data set, and the first training data set comprises single words in a dictionary and single sentence paraphrasing corresponding to the single words; and removing other sentences except the sentences with the maximum matching degree from the paraphrasing corresponding to the words to obtain a second corpus in a second target format.
It should be understood that the process of step B1 is the same as the process of determining the second corpus in the second target format in step A1 above; reference may be made to step A1, which is not repeated here. Further, where the second corpus in the second target format has already been generated, step B1 may be omitted and the second corpus in the second target format from step A1 may be directly reused.
In step B1, a second corpus in a second target format is obtained, and in step B2, paraphrasing pairs formed by different paraphrases of dictionaries in different versions corresponding to the same word may be determined according to the paraphrases in the dictionaries in different versions in the second corpus in the second target format.
Illustratively, assume that for a certain "word 1", dictionary 1 of the second corpus in the second target format contains "word 1\tparaphrase 1" and dictionary 2 contains "word 1\tparaphrase 2"; that is, "word 1" is interpreted as "paraphrase 1" in dictionary 1 and as "paraphrase 2" in dictionary 2, so the paraphrase pair "paraphrase 1 paraphrase 2" is formed.
The paraphrase pair determined in step B2, which consists of different paraphrases of the same word in dictionaries of different versions, for example "paraphrase 1 paraphrase 2", may be subjected to the second data cleaning in step B3 to obtain a synonymous sentence.
In one possible implementation, step B3 may include: splitting the paraphrasing into a plurality of sentences in the case that any one of the paraphrasing pairs is a plurality of sentences having a plurality of meanings; according to a preset matching model, the matching degree of each sentence of one paraphrase in the paraphrase pair and each sentence in the other paraphrase is obtained, wherein the matching model can be obtained through training of a second training data set, and the second training data set comprises the paraphrase pair formed by paraphrases of different versions of single-meaning words; and removing other sentences except the sentence with the largest matching degree from the paraphrasing pair. It should be understood that the procedure of step B3 is the same as that of step A3 above, and reference may be made to step A3 above, which is not repeated here.
In this way, synonymous sentences can be generated by exploiting the different ways in which the same word is interpreted in dictionaries of different versions, and performing multiple rounds of data cleaning during generation yields more accurate synonymous sentences.
It should be appreciated that, after the synonymous sentences are obtained, leading words such as "refers to", "originally refers to" and "is a metaphor for" at the beginning of the sentences may also be cleaned away in order to further increase the accuracy of the synonymous sentences.
In a possible implementation manner, in step S12, generating synonymous sentences according to the translation texts of different versions of the same paragraph in the third corpus may include steps C1 to C3: in step C1, performing first data cleaning on the third corpus to obtain a third corpus in a third target format, where, in the third target format, each paragraph corresponds to a paragraph of another version.
In step C2, according to the translated text of different versions in the third corpus of the third target format, a pair of translated sentences formed by translations of different versions corresponding to the same sentence is determined.
In step C3, the translation sentence pair is subjected to second data cleaning, and synonyms after the second data cleaning are obtained.
In one possible implementation manner, step C1, performing the first data cleaning on the third corpus to obtain a third corpus in the third target format, includes: removing the words and sentences containing preset identifiers from the third corpus; and formatting the third corpus from which those words and sentences have been removed to obtain the third corpus in the third target format, where, in the third target format, each paragraph corresponds to a paragraph, for example in the format "paragraph\tparagraph", where "\t" denotes the Tab character.
For example, assume that translation text 1 and translation text 2 of the same foreign-language original exist in the third corpus. Words and sentences containing preset identifiers, such as pinyin identifiers, part-of-speech identifiers, skip-direction identifiers, use-case identifiers and provenance identifiers, can be removed, obtaining translation text 1 {first segment A; second segment B; third segment C} and translation text 2 {first segment a; second segment b; third segment c}, neither of which contains words or sentences with preset identifiers. Then translation text 1 and translation text 2 can be formatted to obtain the third corpus in the third target format, namely: "first segment A\tfirst segment a", "second segment B\tsecond segment b", "third segment C\tthird segment c".
By the method, the third corpus in the third target format can be obtained, and the subsequent efficient and accurate generation of synonyms is facilitated.
In a possible implementation manner, different versions of translation texts of the same paragraph in a third corpus in the third target format are obtained, wherein the different versions of translation texts comprise a first translation text and a second translation text; and forming one or more pairs of translation sentence pairs by the first translation text and the second translation text according to the sentence sequence.
Illustratively, assume that a certain english paragraph is { english sentence 1; english sentence 2; english sentence 3, which paragraph has a translated version in the third corpus in the third target format, namely: a first translation text { Chinese sentence A; chinese sentence B; chinese sentence C, the paragraph also has another translated version in the third corpus in the third target format, namely: a second translation text { Chinese sentence D; chinese sentence E; chinese sentence F }.
In step C1, the first translation text {Chinese sentence A; Chinese sentence B; Chinese sentence C} and the second translation text {Chinese sentence D; Chinese sentence E; Chinese sentence F} are acquired. In step C2, the first translation text and the second translation text form multiple pairs of translation sentences according to sentence order, namely "Chinese sentence A\tChinese sentence D", "Chinese sentence B\tChinese sentence E", "Chinese sentence C\tChinese sentence F".
In step C3, second data cleaning is performed on each pair of translation sentences to obtain synonymous sentences. In an example, step C3 may include: obtaining the matching degree of the translation sentence pairs according to a preset matching model; and removing the translation sentence pairs whose matching degree is smaller than a preset threshold, to obtain the synonymous sentences after the second data cleaning. The preset threshold can be set according to experience and the actual application scenario; this disclosure does not limit its specific value.
For example, the translation sentence pair "chinese sentence a is input into a matching model, and the matching model is used to perform matching processing on the" chinese sentence a "and the" chinese sentence D "to obtain the matching degree 1. If the matching degree 1 is greater than or equal to a preset threshold value, determining the Chinese sentence A and the Chinese sentence D as synonymous sentences, and storing the synonymous sentences in a format of the Chinese sentence A; if the matching degree 1 is smaller than the preset threshold value, the Chinese sentence A and the Chinese sentence D are not synonymous sentences, and the Chinese sentence A and the Chinese sentence D are deleted.
The translation sentence pair "Chinese sentence B and Chinese sentence E" can be input into a matching model, and the matching model is utilized to perform matching treatment on the "Chinese sentence B" and the "Chinese sentence E" so as to obtain the matching degree 1. If the matching degree 1 is greater than or equal to a preset threshold value, determining the Chinese sentence B and the Chinese sentence E as synonymous sentences, and storing the synonymous sentences in a format of the Chinese sentence B and the Chinese sentence E; if the matching degree 1 is smaller than the preset threshold value, the Chinese sentence B and the Chinese sentence E are not synonymous sentences, and the Chinese sentence B and the Chinese sentence E are deleted.
The translation sentence pair "Chinese sentence C and Chinese sentence F" can be input into a matching model, and the matching model is utilized to perform matching treatment on the "Chinese sentence C" and the "Chinese sentence F" so as to obtain the matching degree 1. If the matching degree 1 is greater than or equal to a preset threshold value, determining the Chinese sentence C and the Chinese sentence F as synonymous sentences, and storing the synonymous sentences in a format of the Chinese sentence C; if the matching degree 1 is smaller than the preset threshold value, the Chinese sentence C and the Chinese sentence F are not synonymous, and the Chinese sentence C and the Chinese sentence F are deleted.
The matching model can be obtained by training on the second training data set and is used to determine the matching degree of the translation sentence pairs, where the second training data set includes paraphrase pairs formed by the paraphrases of univocal words in dictionaries of different versions. It should be understood that the matching model used in step C3 may be the same as the matching model in step A3 above; reference may be made to step A3, which is not repeated here.
In this way, synonymous sentences can be generated using translation texts of different versions of the same paragraph (such as a dialect text and a foreign-language text), which is beneficial to further expanding the data volume of synonymous sentences. Moreover, in the process of generating the synonymous sentences, performing the second data cleaning on the translation sentence pairs yields more accurate synonymous sentences.
In one possible implementation, in order to improve efficiency, instead of training matching models separately on the first training data set and the second training data set, the two may be integrated into one training set, which then includes not only the univocal words in the dictionary and their corresponding single-sentence paraphrases, but also the paraphrase pairs formed by different versions' paraphrases of univocal words. It should be appreciated that the specific training method of the matching model may be referred to above and is not repeated here.
The matching model obtained by training on the integrated training set (comprising the first training data set and the second training data set) can be used to obtain the matching degree between a word in the second corpus in the initial format and each sentence of its corresponding paraphrase, the matching degree between each sentence of one paraphrase in a paraphrase pair and each sentence of the other paraphrase, and the matching degree of a translation sentence pair. Thus, the trained matching model is applicable to multiple matching scenarios (word-sentence matching and sentence-sentence matching), which improves data cleaning efficiency.
In step S2, a synonym is generated according to at least one of the first corpus in the first target format, the second corpus in the second target format, and the third corpus in the third target format, and in step S3, a synonym library may be generated according to the synonym.
Considering the authority of dictionaries, the synonymous sentences generated according to the paraphrases of the same word in dictionaries of different versions in the second corpus in the second target format are the most accurate. Therefore, if the usage scenario of the synonym library is insensitive to the amount of data in the library, the synonym library may be determined from the synonymous sentences generated from the second corpus in the second target format alone. If the usage scenario requires massive data, all data in the corpus database can be traversed, and the synonymous sentences generated based on the first corpus in the first target format, the second corpus in the second target format and the third corpus in the third target format together form the synonym library, further increasing its data volume.
In one possible implementation, a scripting language program is written; executing the script language program to generate a synonym library; wherein the scripting language program is for: obtaining a corpus database, wherein the corpus database comprises: the first corpus, the second corpus and the third corpus are corpora with different contents; generating synonymous sentences according to at least one of the first corpus, the second corpus and the third corpus; and generating a synonym library according to the synonyms. For example, a script program may be written according to the steps S1 to S3 described above, and deployed to a server, and the script program is run to generate a synonym library. Fig. 9 shows a schematic diagram of a library of synonyms, according to an embodiment of the present disclosure.
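As a sketch only, the top level of such a script might simply chain the generators for steps A, B and C; the generator callables below are placeholders for the corpus-specific logic described above, not functions defined by the disclosure.

```python
def generate_synonym_library(generators, corpora):
    """Run each synonym generator over the corpus database and merge the results
    into a deduplicated synonym library (steps S1-S3 as a single script)."""
    library = set()
    for generate in generators:        # e.g. the A-, B- and C-pipelines above
        library.update(generate(corpora))
    return sorted(library)
```

Deployed on a server, a script of this shape can be re-run whenever the corpus database is updated, regenerating the synonym library without manual intervention.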
By the method, human resources are saved greatly, and the synonymous sentence library can be automatically generated by running the script program.
In summary, the method for generating a synonym library according to the embodiments of the present disclosure may generate the synonym library from synonymous sentences generated from the paraphrases of synonym pairs in a dictionary, from the paraphrases of the same word in dictionaries of different versions, and from translation texts of different versions of the same paragraph. Compared with the prior art, in which the process of determining synonymous sentences is complicated and labor costs are high, the method can acquire a large number of synonymous sentences simply, accurately and efficiently without manually mining them, and form these synonymous sentences into a synonym library.
It will be appreciated that the above-mentioned method embodiments of the present disclosure may be combined with each other to form combined embodiments without departing from the principles and logic; owing to space limitations, details are not repeated in this disclosure. It will be appreciated by those skilled in the art that, in the above methods of the embodiments, the specific order of execution of the steps should be determined by their functions and possible inherent logic.
In addition, the present disclosure further provides an apparatus for generating a synonym library, an electronic device, a computer-readable storage medium, and a program, each of which can be used to implement any of the methods for generating a synonym library provided in the present disclosure; for the corresponding technical solutions and descriptions, reference may be made to the method parts, which are not repeated here.
Fig. 10 shows a block diagram of an apparatus for generating a library of synonyms, according to an embodiment of the present disclosure, as shown in fig. 10, the apparatus comprising: the obtaining module 61 is configured to obtain a corpus database, where the corpus database includes: the first corpus, the second corpus and the third corpus are corpora with different contents; a first generating module 62, configured to generate synonymous sentences according to at least one of the first corpus, the second corpus, and the third corpus; and the second generating module 63 is configured to generate a synonym library according to the synonym.
In one possible implementation, the first corpus consists of synonym pairs, the second corpus consists of dictionaries, and the first generating module 62 is configured to perform at least one of: generating synonymous sentences according to the paraphrases of at least one synonym pair in the first corpus in any version of dictionary in the second corpus; generating synonymous sentences according to the paraphrases of the same word in dictionaries of different versions in the second corpus; and generating synonymous sentences according to translation texts of different versions of the same paragraph in the third corpus.
In one possible implementation, the first generating module 62 is configured to: respectively cleaning the first data of the first corpus, the second corpus and the third corpus to obtain the first corpus in a first target format, the second corpus in a second target format and the third corpus in a third target format; generating paraphrasing pairs and/or translation sentence pairs according to at least one of the first corpus in a first target format, the second corpus in a second target format and the third corpus in a third target format; and performing second data cleaning on the paraphrase pair and/or the translation sentence pair to obtain a synonym after second data cleaning.
In one possible implementation manner, performing the first data cleansing on the second corpus to obtain a second corpus in a second target format, including: removing words and sentences containing preset identifiers from each vocabulary entry in each dictionary in the second corpus; performing format setting on the second corpus from which the words and sentences containing the preset identification are removed, and obtaining a second corpus in an initial format; acquiring the matching degree of the words in the second corpus in the initial format and each sentence in the paraphrasing corresponding to the words according to a preset matching model; and removing other sentences except the sentences with the maximum matching degree from the paraphrases corresponding to the words to obtain a second corpus in a second target format, wherein the second target format is that each word has a corresponding paraphrase.
In one possible implementation manner, performing first data cleaning on the first corpus to obtain a first corpus in a first target format includes: removing words and sentences containing preset identifiers from the first corpus; and formatting the first corpus from which those words and sentences have been removed to obtain the first corpus in the first target format, where, in the first target format, each word corresponds to a word.
In one possible implementation manner, performing the first data cleansing on the third corpus to obtain a third corpus in a third target format, including: removing words and sentences containing preset identifiers in the third corpus; and performing format setting on the third corpus from which the words and sentences containing the preset identification are removed to obtain a third corpus in a third target format, wherein the third target format corresponds to one paragraph for each paragraph.
In one possible implementation manner, paraphrasing pairs and/or translation sentence pairs are generated according to at least one of the first corpus in a first target format, the second corpus in a second target format and the third corpus in a third target format, and the paraphrasing pairs and/or translation sentence pairs are generated in a manner including at least one of the following: determining a paraphrase pair corresponding to each synonym pair in the first corpus in the first target format according to the second corpus in the second target format; according to the paraphrases in the dictionaries of different versions in the second corpus of the second target format, determining paraphrase pairs formed by different paraphrases of the dictionaries of different versions corresponding to the same word; and determining translation sentence pairs formed by translations of different versions corresponding to the same sentence according to the translation texts of different versions in the third corpus of the third target format.
In one possible implementation manner, determining, according to the translated text of different versions in the third corpus in the third target format, a pair of translated sentences formed by translations of different versions corresponding to the same sentence includes: obtaining translation texts of different versions of the same paragraph in a third corpus of the third target format, wherein the translation texts of different versions comprise a first translation text and a second translation text; and forming one or more pairs of translation sentence pairs by the first translation text and the second translation text according to the sentence sequence.
In one possible implementation, performing a second data cleansing on the paraphrase pair to obtain a synonym after the second data cleansing, including: splitting the paraphrasing into a plurality of sentences in the case that any one of the paraphrasing pairs is a plurality of sentences having a plurality of meanings; according to a preset matching model, obtaining the matching degree of each sentence of one paraphrase in the paraphrasing pair and each sentence of the other paraphrasing pair; and removing other sentences except the sentence with the largest matching degree from the paraphrasing pair to obtain the synonym after the second data cleaning.
In one possible implementation manner, performing second data cleansing on the translation sentence pair to obtain a synonym after the second data cleansing, including: obtaining the matching degree of the translation sentence pairs according to a preset matching model; and clearing the translation sentence pairs with the matching degree smaller than a preset threshold value to obtain synonyms after the second data are cleaned.
In one possible implementation, the preset matching model is a trained matching model, and the first generating module 62 is further configured to: inputting training data in a training data set into a matching model in batches to obtain feature vectors corresponding to each corpus in the training data of each batch, wherein the training data of each batch comprises N pairs of synonymous corpora, each pair of synonymous corpora comprises 2 corpora which are synonymous with each other, and N is an integer larger than 1; respectively traversing each feature vector in N pairs of feature vectors, taking the feature vector of the synonymous corpus corresponding to the current feature vector as a positive sample, and taking 2N-2 feature vectors corresponding to the rest 2N-2 corpora as negative samples; inputting the current feature vector, 1 positive sample and 2N-2 negative samples into a loss function to obtain contrast loss, wherein the loss function is used for calculating the contrast between the positive sample similarity and the negative sample similarity, the positive sample similarity is the similarity between the current feature vector and the positive sample, and the negative sample similarity is the average similarity between the current feature vector and 2N-2 negative samples; and training the matching model according to the comparison loss to obtain a trained matching model.
In one possible implementation manner, the apparatus further includes a writing module, configured to: writing a script language program; executing the script language program to generate a synonym library; wherein the scripting language program is for: obtaining a corpus database, wherein the corpus database comprises: the first corpus, the second corpus and the third corpus are corpora with different contents; generating synonymous sentences according to at least one of the first corpus, the second corpus and the third corpus; and generating a synonym library according to the synonyms.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.
The embodiment of the disclosure also provides an electronic device, which comprises: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the above method.
Embodiments of the present disclosure also provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, performs the above method.
The electronic device may be provided as a terminal, server or other form of device.
Fig. 11 shows a block diagram of an electronic device 1900 according to an embodiment of the disclosure. For example, electronic device 1900 may be provided as a server or terminal device. Referring to FIG. 11, electronic device 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the methods described above.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate an operating system stored in memory 1932, such as the Microsoft server operating system (Windows Server™), the graphical-user-interface-based operating system developed by Apple Inc. (Mac OS X™), the multi-user multi-process computer operating system (Unix™), the free and open-source Unix-like operating system (Linux™), the open-source Unix-like operating system (FreeBSD™), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of electronic device 1900 to perform the methods described above.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disk read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or an in-groove protrusion structure having instructions stored thereon, and any suitable combination of the foregoing. Computer readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing the operations of the present disclosure can be assembly instructions, instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, c++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of computer readable program instructions, which can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be implemented in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK).
The foregoing embodiments are each described with emphasis on their differences from the other embodiments; for parts that are the same as or similar to those of another embodiment, reference may be made to the description of that embodiment, and the details are not repeated here for brevity.
It will be appreciated by those skilled in the art that, in the methods of the specific embodiments described above, the order in which the steps are written does not imply a strict order of execution; the actual order of execution should be determined by the function of each step and any inherent logic among the steps.
If the technical solution of the present application involves personal information, a product applying the technical solution clearly informs the individual of the personal information processing rules and obtains the individual's independent consent before processing the personal information. If the technical solution involves sensitive personal information, a product applying the technical solution obtains the individual's separate consent before processing the sensitive personal information and also satisfies the requirement of "explicit consent". For example, a clear and conspicuous sign is placed at a personal information collection device, such as a camera, to inform individuals that they are entering a personal information collection range and that personal information will be collected; if an individual voluntarily enters the collection range, the individual is deemed to have consented to the collection of his or her personal information. Alternatively, on a device that processes personal information, where obvious identification/information is used to communicate the personal information processing rules, personal authorization is obtained by means such as pop-up messages or by asking individuals to upload their personal information themselves. The personal information processing rules may include information such as the personal information processor, the purpose of processing, the processing method, and the types of personal information processed.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or to limit the disclosure to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the embodiments described. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (16)

1. A method for generating a synonymous sentence library, the method comprising:
obtaining a corpus database, wherein the corpus database comprises a first corpus, a second corpus, and a third corpus, the first corpus, the second corpus, and the third corpus being corpora with different contents;
generating synonymous sentences according to at least one of the first corpus, the second corpus and the third corpus;
and generating a synonymous sentence library according to the synonymous sentences.
2. The method of claim 1, wherein the first corpus consists of synonym pairs, the second corpus consists of dictionaries, and the third corpus consists of translated documents, and
wherein generating synonymous sentences according to at least one of the first corpus, the second corpus, and the third corpus comprises at least one of: generating synonymous sentences according to paraphrases, in any one version of dictionary in the second corpus, of at least one synonym pair in the first corpus; generating synonymous sentences according to paraphrases of the same word in dictionaries of different versions in the second corpus; and generating synonymous sentences according to translation texts of different versions of the same paragraph in the third corpus.
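By way of illustration only, the three generation modes recited in claim 2 can be pictured with the following Python sketch; all of the words, paraphrases, and translations below are invented toy data, not material from any actual corpus:

    # Toy data for the three sources of synonymous sentences (illustrative only).
    synonym_pairs = [("happy", "glad")]            # first corpus: synonym pairs
    dictionaries = {                               # second corpus: two dictionary versions
        "v1": {"happy": "Feeling or showing pleasure.",
               "glad": "Feeling pleasure or contentment."},
        "v2": {"happy": "Experiencing joy."},
    }
    translations = {                               # third corpus: two translations of one paragraph
        "para_1": {"a": "It was a bright, cold day in April.",
                   "b": "The April day was bright and cold."},
    }

    # Mode 1: paraphrases of both words of a synonym pair in one dictionary version.
    pair_1 = (dictionaries["v1"]["happy"], dictionaries["v1"]["glad"])
    # Mode 2: paraphrases of the same word across dictionary versions.
    pair_2 = (dictionaries["v1"]["happy"], dictionaries["v2"]["happy"])
    # Mode 3: different translation versions of the same paragraph.
    pair_3 = (translations["para_1"]["a"], translations["para_1"]["b"])

Each of pair_1, pair_2, and pair_3 is a candidate synonymous sentence pair in the sense of claim 2.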
3. The method of claim 1, wherein the generating synonymous sentences from at least one of the first corpus, the second corpus, and the third corpus comprises:
performing first data cleansing on the first corpus, the second corpus, and the third corpus respectively, to obtain a first corpus in a first target format, a second corpus in a second target format, and a third corpus in a third target format;
generating paraphrase pairs and/or translation sentence pairs according to at least one of the first corpus in the first target format, the second corpus in the second target format, and the third corpus in the third target format;
and performing second data cleansing on the paraphrase pairs and/or the translation sentence pairs to obtain synonymous sentences after the second data cleansing.
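A minimal sketch of the two-stage pipeline of claim 3 follows; every helper function named here is an assumption standing in for the steps elaborated in claims 4 to 10, not an implementation disclosed by the application:

    def build_synonymous_sentences(corpus_1, corpus_2, corpus_3):
        # First data cleansing: bring each corpus into its target format.
        c1 = first_cleansing_synonym_pairs(corpus_1)   # word -> synonymous word
        c2 = first_cleansing_dictionaries(corpus_2)    # version -> {word: paraphrase}
        c3 = first_cleansing_translations(corpus_3)    # paragraph -> aligned translations
        # Generate candidate pairs from the cleansed corpora.
        paraphrase_pairs = generate_paraphrase_pairs(c1, c2)
        translation_pairs = generate_translation_pairs(c3)
        # Second data cleansing: keep only well-matched pairs.
        return second_cleansing(paraphrase_pairs + translation_pairs)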
4. The method of claim 3, wherein performing the first data cleansing on the second corpus to obtain the second corpus in the second target format comprises:
removing words and sentences containing preset identifiers from each vocabulary entry in each dictionary in the second corpus;
performing format setting on the second corpus from which the words and sentences containing the preset identifiers have been removed, to obtain a second corpus in an initial format;
obtaining, according to a preset matching model, a matching degree between each word in the second corpus in the initial format and each sentence in the paraphrase corresponding to that word;
and removing, from the paraphrase corresponding to each word, the sentences other than the sentence with the highest matching degree, to obtain a second corpus in a second target format, wherein the second target format is that each word corresponds to one paraphrase.
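One way the cleansing of claim 4 could be realized is sketched below; the preset identifiers, the sentence splitter, and the match_model.score interface are all assumptions made for this sketch:

    import re

    PRESET_IDENTIFIERS = re.compile(r"<[^>]*>|\[.*?\]")  # assumed markers to strip

    def clean_dictionary(dictionary, match_model):
        """Reduce each entry to the single paraphrase sentence best matching its word."""
        cleaned = {}
        for word, paraphrase in dictionary.items():
            paraphrase = PRESET_IDENTIFIERS.sub("", paraphrase)
            sentences = [s.strip() for s in re.split(r"[.;]", paraphrase) if s.strip()]
            if not sentences:
                continue
            # Keep only the sentence the preset matching model scores highest.
            cleaned[word] = max(sentences, key=lambda s: match_model.score(word, s))
        return cleaned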
5. The method of claim 3, wherein performing a first data cleansing on the first corpus to obtain a first corpus in a first target format comprises:
removing words and sentences containing preset identifiers from the first corpus;
and performing format setting on the first corpus from which the words and sentences containing the preset identifiers have been removed, to obtain a first corpus in a first target format, wherein the first target format is that each word corresponds to one word.
6. The method of claim 3, wherein performing the first data cleansing on the third corpus to obtain the third corpus in the third target format comprises:
removing words and sentences containing preset identifiers from the third corpus;
and performing format setting on the third corpus from which the words and sentences containing the preset identifiers have been removed, to obtain a third corpus in a third target format, wherein the third target format is that each paragraph corresponds to one paragraph.
7. The method of claim 3, wherein paraphrase pairs and/or translation sentence pairs are generated according to at least one of the first corpus in the first target format, the second corpus in the second target format, and the third corpus in the third target format, the paraphrase pairs and/or translation sentence pairs being generated in a manner comprising at least one of:
determining, according to the second corpus in the second target format, a paraphrase pair corresponding to each synonym pair in the first corpus in the first target format;
determining, according to the paraphrases in the dictionaries of different versions in the second corpus in the second target format, paraphrase pairs formed by different paraphrases of the same word in the dictionaries of different versions;
and determining, according to the translation texts of different versions in the third corpus in the third target format, translation sentence pairs formed by translations of different versions corresponding to the same sentence.
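Under assumed data shapes (c1 maps a word to its synonymous word, c2 maps a dictionary version to a word-to-paraphrase mapping, and c3 maps a paragraph identifier to a version-to-sentence-list mapping), the three modes of claim 7 might be sketched as:

    def pairs_from_synonyms(c1, c2):
        # Mode 1: paraphrase pairs for each synonym pair, within one dictionary version.
        for word, synonym in c1.items():
            for entries in c2.values():
                if word in entries and synonym in entries:
                    yield entries[word], entries[synonym]

    def pairs_across_versions(c2):
        # Mode 2: paraphrase pairs for the same word across dictionary versions.
        versions = list(c2.values())
        for i in range(len(versions)):
            for j in range(i + 1, len(versions)):
                for word in versions[i].keys() & versions[j].keys():
                    yield versions[i][word], versions[j][word]

    def pairs_from_translations(c3):
        # Mode 3: translation sentence pairs from two versions of the same paragraph.
        for versions in c3.values():
            first, second = list(versions.values())[:2]
            for a, b in zip(first, second):
                yield a, b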
8. The method of claim 7, wherein determining, according to the translation texts of different versions in the third corpus in the third target format, translation sentence pairs formed by translations of different versions corresponding to the same sentence comprises:
obtaining translation texts of different versions of the same paragraph in a third corpus of the third target format, wherein the translation texts of different versions comprise a first translation text and a second translation text;
and forming one or more pairs of translation sentence pairs by the first translation text and the second translation text according to the sentence sequence.
9. The method of claim 3, wherein performing second data cleansing on the paraphrase pairs to obtain synonymous sentences after the second data cleansing comprises:
splitting a paraphrase into a plurality of sentences in the case that either paraphrase of a paraphrase pair consists of a plurality of sentences having a plurality of meanings;
obtaining, according to a preset matching model, a matching degree between each sentence of one paraphrase in the paraphrase pair and each sentence of the other paraphrase;
and removing, from the paraphrase pair, the sentences other than the sentence with the highest matching degree, to obtain synonymous sentences after the second data cleansing.
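The second cleansing of a paraphrase pair in claim 9 could be sketched as follows, reusing the assumed split_sentences and match_model.score helpers from the earlier sketches:

    def clean_paraphrase_pair(paraphrase_a, paraphrase_b, match_model):
        # Split multi-sense paraphrases into candidate sentences, score every
        # cross combination, and keep only the highest-scoring sentence pair.
        best_pair = max(
            ((a, b) for a in split_sentences(paraphrase_a)
                    for b in split_sentences(paraphrase_b)),
            key=lambda pair: match_model.score(*pair))
        return best_pair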
10. The method of claim 3, wherein performing second data cleansing on the translation sentence pairs to obtain synonymous sentences after the second data cleansing comprises:
obtaining the matching degree of the translation sentence pairs according to a preset matching model;
and removing the translation sentence pairs whose matching degree is smaller than a preset threshold, to obtain synonymous sentences after the second data cleansing.
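Claim 10 reduces to a threshold filter over the matching model's scores; a minimal sketch, where the value 0.8 is an assumption (the claim only requires "a preset threshold"):

    def filter_translation_pairs(pairs, match_model, threshold=0.8):
        # Keep only the translation sentence pairs scored at or above the threshold.
        return [(a, b) for a, b in pairs if match_model.score(a, b) >= threshold]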
11. The method according to any one of claims 4, 9, and 10, wherein the preset matching model is a trained matching model, the method further comprising:
inputting training data in a training data set into a matching model in batches to obtain a feature vector corresponding to each corpus in each batch of training data, wherein each batch of training data comprises N pairs of synonymous corpora, each pair of synonymous corpora comprises 2 corpora that are synonymous with each other, and N is an integer greater than 1;
traversing each feature vector among the N pairs of feature vectors; for the current feature vector, taking the feature vector of the corpus synonymous with the corpus of the current feature vector as a positive sample, and taking the 2N-2 feature vectors corresponding to the remaining 2N-2 corpora as negative samples;
inputting the current feature vector, the 1 positive sample, and the 2N-2 negative samples into a loss function to obtain a contrastive loss, wherein the loss function is used to calculate the contrast between a positive sample similarity and a negative sample similarity, the positive sample similarity being the similarity between the current feature vector and the positive sample, and the negative sample similarity being the average similarity between the current feature vector and the 2N-2 negative samples;
and training the matching model according to the contrastive loss to obtain the trained matching model.
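Claim 11 describes an in-batch contrastive objective. The claim does not fix the exact functional form of the loss, so the margin formulation below is one assumed instantiation of "the contrast between the positive sample similarity and the negative sample similarity", sketched in PyTorch:

    import torch
    import torch.nn.functional as F

    def contrastive_loss(features, margin=0.5):
        # features: a (2N, d) tensor in which rows 2i and 2i+1 embed a synonymous pair.
        feats = F.normalize(features, dim=1)
        sim = feats @ feats.T                          # pairwise cosine similarities
        n2 = feats.size(0)
        idx = torch.arange(n2)
        pos_idx = idx ^ 1                              # partner index: 0<->1, 2<->3, ...
        pos_sim = sim[idx, pos_idx]                    # similarity to the positive sample
        mask = torch.ones_like(sim, dtype=torch.bool)
        mask[idx, idx] = False                         # drop self-similarity
        mask[idx, pos_idx] = False                     # drop the positive
        neg_sim = (sim * mask).sum(dim=1) / (n2 - 2)   # mean over the 2N-2 negatives
        # Push each positive similarity above the average negative similarity.
        return F.relu(margin - pos_sim + neg_sim).mean()

For a batch of N = 2 synonym pairs, features has four rows, and each row is contrasted against its partner (the positive sample) and the two rows of the other pair (the negative samples); the margin value is likewise an assumption.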
12. The method according to any one of claims 1-10, further comprising:
writing a scripting language program;
executing the scripting language program to generate a synonymous sentence library;
wherein the scripting language program is for: obtaining a corpus database, wherein the corpus database comprises a first corpus, a second corpus, and a third corpus that are corpora with different contents; generating synonymous sentences according to at least one of the first corpus, the second corpus, and the third corpus; and generating a synonymous sentence library according to the synonymous sentences.
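A skeletal example of such a scripting-language program; every helper and file name here is hypothetical:

    if __name__ == "__main__":
        corpus_db = load_corpus_database("corpus_db/")  # hypothetical loader
        sentences = generate_synonymous_sentences(
            corpus_db["first"], corpus_db["second"], corpus_db["third"])
        save_library(sentences, "synonymous_sentence_library.jsonl")  # hypothetical writer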
13. An apparatus for generating a synonymous sentence library, the apparatus comprising:
an acquisition module, configured to acquire a corpus database, wherein the corpus database comprises a first corpus, a second corpus, and a third corpus that are corpora with different contents;
a first generation module, configured to generate synonymous sentences according to at least one of the first corpus, the second corpus, and the third corpus;
and a second generation module, configured to generate a synonymous sentence library according to the synonymous sentences.
14. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the instructions stored in the memory to perform the method of any of claims 1 to 12.
15. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 12.
16. A computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, wherein when the computer readable code runs in a processor of an electronic device, the processor performs the method of any one of claims 1 to 12.
CN202310371130.5A 2023-04-07 2023-04-07 Method and device for generating synonymous sentence library, electronic equipment and storage medium Active CN116562268B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310371130.5A CN116562268B (en) 2023-04-07 2023-04-07 Method and device for generating synonymous sentence library, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310371130.5A CN116562268B (en) 2023-04-07 2023-04-07 Method and device for generating synonymous sentence library, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116562268A true CN116562268A (en) 2023-08-08
CN116562268B CN116562268B (en) 2024-01-23

Family

ID=87499089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310371130.5A Active CN116562268B (en) 2023-04-07 2023-04-07 Method and device for generating synonymous sentence library, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116562268B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050125215A1 (en) * 2003-12-05 2005-06-09 Microsoft Corporation Synonymous collocation extraction using translation information
CN104239286A (en) * 2013-06-24 2014-12-24 阿里巴巴集团控股有限公司 Method and device for mining synonymous phrases and method and device for searching related contents
CN110532547A (en) * 2019-07-31 2019-12-03 厦门快商通科技股份有限公司 Building of corpus method, apparatus, electronic equipment and medium
CN111597800A (en) * 2019-02-19 2020-08-28 百度在线网络技术(北京)有限公司 Method, device, equipment and storage medium for obtaining synonyms
CN111859926A (en) * 2020-07-28 2020-10-30 中国平安人寿保险股份有限公司 Synonym sentence pair generation method and device, computer equipment and storage medium
CN115587590A (en) * 2022-10-13 2023-01-10 北京金山数字娱乐科技有限公司 Training corpus construction method, translation model training method and translation method


Also Published As

Publication number Publication date
CN116562268B (en) 2024-01-23

Similar Documents

Publication Publication Date Title
Chaturvedi et al. Bayesian network based extreme learning machine for subjectivity detection
EP3835996A1 (en) Method, apparatus, electronic device and storage medium for processing a semantic representation model
CN111274394B (en) Method, device and equipment for extracting entity relationship and storage medium
US10606946B2 (en) Learning word embedding using morphological knowledge
Dos Santos et al. Deep convolutional neural networks for sentiment analysis of short texts
CN112906392B (en) Text enhancement method, text classification method and related device
CN108628834B (en) Word expression learning method based on syntactic dependency relationship
Zhao et al. ZYJ123@ DravidianLangTech-EACL2021: Offensive language identification based on XLM-RoBERTa with DPCNN
CN112148862B (en) Method and device for identifying problem intention, storage medium and electronic equipment
CN116187282A (en) Training method of text review model, text review method and device
CN111858894A (en) Semantic missing recognition method and device, electronic equipment and storage medium
Kokane et al. Supervised word sense disambiguation with recurrent neural network model
Malik et al. NLP techniques, tools, and algorithms for data science
Zhao et al. Ia-icgcn: Integrating prior knowledge via intra-event association and inter-event causality for chinese causal event extraction
CN112446217A (en) Emotion analysis method and device and electronic equipment
CN116562268B (en) Method and device for generating synonymous sentence library, electronic equipment and storage medium
Mulki et al. Empirical evaluation of leveraging named entities for Arabic sentiment analysis
CN111666405A (en) Method and device for recognizing text implication relation
Ha et al. Supervised attention for answer selection in community question answering
Üveges Comprehensibility and Automation: Plain Language in the Era of Digitalization
Rutkowski et al. Estimating senses with sets of lexically related words for Polish word sense disambiguation
Babii et al. FastText-based methods for emotion identification in Russian internet discourse
Chen et al. Learning word embeddings from intrinsic and extrinsic views
Batyrshin et al. Advances in Computational Intelligence
Swaminathan Token-level identification of multiword expressions using pre-trained multilingual language models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant