CN112559711A - Synonymous text prompting method and device and electronic equipment


Info

Publication number
CN112559711A
CN112559711A
Authority
CN
China
Prior art keywords
candidate
word
input text
text
words
Prior art date
Legal status (an assumption, not a legal conclusion)
Pending
Application number
CN202011539680.6A
Other languages
Chinese (zh)
Inventor
任帅
王博弘
张振
蒋宏飞
宋旸
王瑞阳
王阳
赵慧娟
Current Assignee (the listed assignees may be inaccurate)
Zuoyebang Education Technology Beijing Co Ltd
Original Assignee
Zuoyebang Education Technology Beijing Co Ltd
Priority date (an assumption, not a legal conclusion)
Filing date
Publication date
Application filed by Zuoyebang Education Technology Beijing Co Ltd filed Critical Zuoyebang Education Technology Beijing Co Ltd
Priority to CN202011539680.6A priority Critical patent/CN112559711A/en
Publication of CN112559711A publication Critical patent/CN112559711A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3322 Query formulation using system suggestions
    • G06F 16/3331 Query processing
    • G06F 16/3332 Query translation
    • G06F 16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F 16/3338 Query expansion

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the field of computer technology and provides a synonymous text prompting method, a synonymous text prompting device, and electronic equipment. The method comprises the following steps: segmenting an input text into word units; determining a target word unit from the word units according to the segmentation condition of the input text; obtaining candidate words corresponding to the target word unit through a preset model to form a candidate set; ranking the candidate words in the candidate set to obtain a comprehensive ranking candidate set corresponding to the target word unit; and prompting the synonymous text of the input text according to the segmentation condition of the input text and the comprehensive ranking candidate set. The method and device improve the synonymous text recognition rate while improving the user experience: the user can select the target synonymous text from the top-ranked synonymous texts that are prompted.

Description

Synonymous text prompting method and device and electronic equipment
Technical Field
The invention belongs to the field of computer technology and in particular relates to synonymous text recognition by a computer; it provides a synonymous text prompting method, a synonymous text prompting device, electronic equipment, and a computer-readable medium.
Background
With the development of computer and internet technology, synonym information plays an indispensable role in application fields such as quality inspection systems, web search, question-and-answer systems, and knowledge graph construction. For example, a quality inspection platform looks for text that is synonymous with a keyword or phrase entered by the user, a search engine looks for text that is semantically identical or similar to the text entered by the user, and a question-and-answer platform looks for a set of questions synonymous with a new question posed by the user.
In the prior art, identifying synonymous text requires auxiliary processing of the text with tools such as word segmentation, part-of-speech analysis, and sentence template extraction to obtain core words, and whether two core words are synonymous is then determined by the edit distance between them. The edit distance is the minimum number of single-character edits needed to change one string into another, and it measures the difference between the two strings. When the edit distance between two words is smaller than or equal to a preset value, the two words are determined to be synonyms; when it is larger than the preset value, they are determined to be non-synonyms. In reality, however, non-synonym pairs with a small edit distance exist, so the recognition accuracy of synonymous text is low.
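For illustration, the edit distance described above can be sketched in a few lines of Python; the word pairs used are illustrative English stand-ins:

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b
    (the Wagner-Fischer dynamic-programming formulation)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

# Synonyms can be far apart by edit distance ...
print(edit_distance("big", "large"))  # 4
# ... while non-synonyms can be one edit apart, which is the
# weakness of the prior-art approach described above.
print(edit_distance("cat", "car"))    # 1
```

The second pair shows the flaw exactly: "cat" and "car" are not synonyms, yet their edit distance is minimal.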
Disclosure of Invention
Technical problem to be solved
The method and device aim to solve the technical problem in the prior art that the accuracy of synonymous text recognition is low.
(II) technical scheme
In order to solve the above technical problem, a first aspect of the present invention provides a synonymous text prompting method. Here, synonymous text refers to text having the same or a similar meaning as the input text; it includes text with the same meaning as the input text and/or text with a similar meaning, and a text may be a single word or a text composed of at least two words. The method comprises the following steps:
dividing an input text into word units;
determining a target word unit from the word units according to the segmentation condition of the input text, wherein the segmentation condition of the input text includes two cases: the input text is segmented into only one word unit, or the input text is segmented into at least two word units;
obtaining candidate words corresponding to the target word unit through a preset model to form a candidate set, wherein the candidate words are synonyms or near-synonyms of the target word unit;
ranking the candidate words in the candidate set to obtain a comprehensive ranking candidate set corresponding to the target word unit;
and prompting the synonymous text of the input text according to the segmentation condition of the input text and the comprehensive ranking candidate set.
According to a preferred embodiment of the present invention, the obtaining, through a preset model, of candidate words corresponding to the target word unit to form a candidate set includes:
acquiring different corpora as training sets to train a plurality of word2vec models;
and obtaining candidate words corresponding to the target word unit through each trained word2vec model to form a candidate set.
According to a preferred embodiment of the present invention, the obtaining candidate words corresponding to the target word unit by using each trained word2vec model to form a candidate set includes:
inputting the target word unit into a trained word2vec model to obtain a word vector of the target word unit output by the word2vec model;
acquiring candidate word vectors whose similarity score with respect to the word vector of the target word unit is smaller than a threshold value;
and converting the candidate word vectors into corresponding candidate words to form a candidate set.
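A minimal sketch of this retrieval step, using a hand-made toy embedding table in place of a trained word2vec model. All words, vector values, and the threshold are illustrative assumptions; following the text's later examples, the "similarity" score here is a Euclidean distance, so smaller means more similar:

```python
import math

# Toy word vectors standing in for a trained word2vec model's output;
# the words and vector values are invented for illustration.
EMBEDDINGS = {
    "exam":   [0.9, 0.1, 0.0],
    "test":   [0.8, 0.2, 0.1],
    "quiz":   [0.7, 0.3, 0.1],
    "banana": [0.0, 0.9, 0.8],
}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def candidate_set(target, threshold):
    """Return words whose distance-based similarity score with respect
    to the target's word vector is smaller than the threshold (smaller
    distance means more similar), ordered from closest to farthest."""
    tv = EMBEDDINGS[target]
    scored = [(euclidean(tv, v), w)
              for w, v in EMBEDDINGS.items() if w != target]
    return [w for d, w in sorted(scored) if d < threshold]

print(candidate_set("exam", threshold=0.5))  # ['test', 'quiz']
```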
According to a preferred embodiment of the present invention, the sorting the candidate words in the candidate set includes:
acquiring a candidate set output by each word2vec model;
determining the total weight of the candidate words according to the preset weight of the candidate words and the arrangement positions of the candidate words in each candidate set;
and sorting the candidate words in the candidate set according to the total weight of the candidate words.
According to a preferred embodiment of the present invention, when the input text is segmented into only one word unit, that word unit is used as the target word unit.
According to a preferred embodiment of the present invention, the prompting the synonymous text of the input text according to the segmentation condition of the input text and the comprehensive ranking candidate set includes:
selecting target words that meet preset conditions from the comprehensive ranking candidate set according to a preset word length, a preset part of speech, and a preset word frequency;
and prompting the target words according to their order in the comprehensive ranking candidate set.
According to a preferred embodiment of the present invention, when the input text is segmented into at least two word units, one of the at least two word units is selected as the target word unit according to part of speech and/or word frequency.
According to a preferred embodiment of the present invention, the prompting the synonymous text of the input text according to the segmentation condition of the input text and the comprehensive ranking candidate set includes:
acquiring the top-N candidate words ranked by total weight from the comprehensive ranking candidate set;
combining the other word units segmented from the input text with each of the top-N candidate words to obtain N candidate texts;
ranking the candidate texts according to their historical occurrence frequency;
and prompting the candidate texts according to their ranking.
According to a preferred embodiment of the present invention, if the historical occurrence frequency of a candidate text is zero, the candidate texts are ranked according to the similarity between the word vector corresponding to the candidate word in each candidate text and the word vector corresponding to the target word unit.
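The steps above, together with the zero-frequency fallback, can be sketched as follows. All data is hypothetical: the segmented units, candidate words, historical frequencies, and similarity scores are invented for illustration, and using the similarity score as a tie-breaker within zero-frequency texts is one way to read the fallback rule:

```python
# Hypothetical data: the input "exam paper download" is segmented into
# ["exam paper", "download"], "exam paper" is the target word unit, and
# the top-N candidate words, historical frequencies, and similarity
# scores below are all invented for illustration.
top_n_candidates = ["test paper", "exam sheet", "question paper"]
history_freq = {"test paper download": 42, "question paper download": 7}
similarity = {"test paper": 0.95, "exam sheet": 0.90, "question paper": 0.85}

def rank_candidate_texts(other_units):
    # Splice each top-N candidate word back into the input text in place
    # of the target word unit (assumed here to be the first unit).
    texts = [(cand, " ".join([cand] + other_units))
             for cand in top_n_candidates]
    # Rank by historical occurrence frequency; when the frequency is
    # zero, the candidate word's similarity score breaks the tie, which
    # is how the zero-frequency fallback is read here.
    texts.sort(key=lambda t: (history_freq.get(t[1], 0), similarity[t[0]]),
               reverse=True)
    return [text for _, text in texts]

print(rank_candidate_texts(["download"]))
# ['test paper download', 'question paper download', 'exam sheet download']
```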
A second aspect of the present invention provides a synonymous text prompting device for prompting synonymous text having the same or a similar meaning as an input text, the device including:
a segmentation module, configured to segment the input text into word units;
a determining module, configured to determine a target word unit from the word units according to the segmentation condition of the input text, where the segmentation condition of the input text includes: the input text is segmented into only one word unit, or the input text is segmented into at least two word units;
an acquisition module, configured to acquire candidate words corresponding to the target word unit through a preset model to form a candidate set, where the candidate words are synonyms or near-synonyms of the target word unit;
a ranking module, configured to rank the candidate words in the candidate set to obtain a comprehensive ranking candidate set corresponding to the target word unit;
and a prompting module, configured to prompt the synonymous text of the input text according to the segmentation condition of the input text and the comprehensive ranking candidate set.
A third aspect of the present invention provides an electronic device including a processor and a memory, the memory storing a computer-executable program that, when executed by the processor, performs the above method.
A fourth aspect of the present invention provides a computer-readable medium storing a computer-executable program that, when executed, implements the above method.
(III) advantageous effects
The invention obtains the candidate set of the target word unit through a preset model and then ranks the candidate words in the candidate set, so that the candidate words are ordered according to the accuracy of synonym recognition between each candidate word and the target word unit, which improves the accuracy of synonym recognition for the target word unit. Finally, the synonymous text of the input text is determined and prompted according to the segmentation condition of the input text and the order of the candidate words corresponding to the target word unit, ensuring that synonymous texts are prompted in order of synonym recognition accuracy. The invention thus improves the synonymous text recognition rate while improving the user experience: the user can select the target synonymous text from the top-ranked synonymous texts that are prompted.
The invention trains a plurality of word2vec models by acquiring different corpora as training sets, and obtains candidate words corresponding to the target word unit through each trained word2vec model to form a candidate set. Thus, a target word unit has a plurality of candidate sets, and the candidate words in each candidate set may be the same or different. Multiple word2vec models trained on different corpora ensure the comprehensiveness of the candidate words.
The invention adopts a weighted ranking approach: the candidate set output by each word2vec model is acquired; the total weight of each candidate word is determined according to its preset weight and its positions in the candidate sets; and the candidate words are ranked according to their total weights. In this way, candidate words output by different word2vec models are ranked according to the accuracy of synonym recognition between each candidate word and the target word unit, which improves the accuracy of synonym recognition for the target word unit.
Drawings
FIG. 1 is a schematic flow chart of a synonymous text prompting method according to the present invention;
FIG. 2 is a schematic diagram illustrating a candidate set formed by candidate words corresponding to a target word unit obtained by a preset model according to the present invention;
FIG. 3 is a schematic structural diagram of a synonymous text prompt device according to the present invention;
FIG. 4 is a schematic structural diagram of an electronic device of one embodiment of the invention;
fig. 5 is a schematic diagram of a computer-readable recording medium of an embodiment of the present invention.
Detailed Description
In describing particular embodiments, specific details of structures, properties, effects, or other features are set forth in order to provide a thorough understanding of the embodiments by one skilled in the art. However, this does not exclude that one skilled in the art may implement the invention in a particular case without one or more of the above structures, properties, effects, or other features.
The flow charts in the drawings are only exemplary and do not imply that all of the contents, operations, and steps shown must be included in the scheme of the invention, nor that they must be executed in the order shown. For example, some operations/steps in the flow charts may be divided, and some may be combined or partially combined, and the execution order shown may be changed according to the actual situation without departing from the gist of the invention.
The block diagrams in the figures generally represent functional entities and do not necessarily correspond to physically separate entities; these functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different network and/or processing unit devices and/or microcontroller devices.
The same reference numerals denote the same or similar elements, components, or parts throughout the drawings, so repeated descriptions of them may be omitted below. Although the terms first, second, third, etc. may be used herein to describe various elements, components, or parts, these elements, components, or parts should not be limited by these terms; the terms are used only to distinguish one from another. For example, a first device may also be referred to as a second device without departing from the spirit of the present invention. Furthermore, the term "and/or" is intended to include all combinations of any one or more of the listed items.
To solve the above technical problem, the invention provides a synonymous text prompting method, where synonymous text refers to text having the same or a similar meaning as the input text. The method segments the input text into word units; determines a target word unit from the word units according to the segmentation condition of the input text; obtains candidate words corresponding to the target word unit through a preset model to form a candidate set; ranks the candidate words in the candidate set to obtain a comprehensive ranking candidate set corresponding to the target word unit; and prompts the synonymous text of the input text according to the segmentation condition of the input text and the comprehensive ranking candidate set. The segmentation condition of the input text includes two cases: the input text is segmented into only one word unit, or the input text is segmented into at least two word units; the candidate words are synonyms or near-synonyms of the target word unit. Because the candidate set is obtained through the preset model and the candidate words are then ranked, the candidate words are ordered according to the accuracy of synonym recognition between each candidate word and the target word unit, which improves the accuracy of synonym recognition for the target word unit, and the synonymous text is determined and prompted according to the segmentation condition and the order of the candidate words, ensuring that synonymous texts are prompted in order of synonym recognition accuracy. The invention thus improves the synonymous text recognition rate while improving the user experience: the user can select the target synonymous text from the top-ranked synonymous texts that are prompted.
The invention trains a plurality of word2vec models by acquiring different corpora as training sets, and obtains candidate words corresponding to the target word unit through each trained word2vec model to form a candidate set. Thus, a target word unit has a plurality of candidate sets, and the candidate words in each candidate set may be the same or different. Multiple word2vec models trained on different corpora ensure the comprehensiveness of the candidate words.
When each word2vec model obtains a candidate set for a target word unit, the target word unit is first input into the trained word2vec model to obtain the word vector of the target word unit output by the model; candidate word vectors whose similarity score with respect to that word vector is smaller than a threshold value are then acquired; and the candidate word vectors are converted into the corresponding candidate words to form a candidate set.
To rank the candidate words in the candidate sets, a weighted ranking approach is adopted: the candidate set output by each word2vec model is acquired; the total weight of each candidate word is determined according to its preset weight and its positions in the candidate sets; and the candidate words are ranked according to their total weights. In this way, candidate words output by different word2vec models are ranked according to the accuracy of synonym recognition between each candidate word and the target word unit, which improves the accuracy of synonym recognition for the target word unit.
In one segmentation case, the input text is segmented into only one word unit, and that word unit is used as the target word unit. Target words meeting the conditions are selected from the comprehensive ranking candidate set according to a preset word length, a preset part of speech, and a preset word frequency, and are then prompted according to their order in the comprehensive ranking candidate set.
In the other segmentation case, the input text is segmented into at least two word units, and one of them is selected as the target word unit according to part of speech and/or word frequency. The top-N candidate words ranked by total weight are acquired from the comprehensive ranking candidate set; the other word units segmented from the input text are combined with each of these candidate words to obtain N candidate texts; the candidate texts are ranked according to their historical occurrence frequency; and finally the candidate texts are prompted according to their ranking.
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
Fig. 1 is a schematic flow chart of a method for prompting a synonymous text according to the present invention, where the synonymous text refers to a text having the same or similar meaning as an input text, and as shown in fig. 1, the method includes the following steps:
s1, dividing the input text into word units;
The input text may be a single word, such as a keyword entered by the user, or a phrase consisting of several words. A word unit may specifically be a Chinese word, an English word, etc. The invention may adopt a word segmentation tool (such as the HanLP tokenizer or the Ansj tokenizer) to segment the input text into word units.
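As a rough illustration of what such a tool does (real tokenizers like HanLP and Ansj are far more sophisticated), a forward-maximum-matching segmenter over a toy dictionary; the dictionary entries are illustrative:

```python
# A forward-maximum-matching segmenter: at each position, take the
# longest dictionary word that matches, falling back to a single
# character. The dictionary entries are invented for illustration.
DICTIONARY = {"试卷", "下载", "期末", "期末试卷"}

def segment(text, max_len=4):
    units, i = [], 0
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + L]
            if L == 1 or piece in DICTIONARY:
                units.append(piece)
                i += L
                break
    return units

print(segment("期末试卷下载"))  # ['期末试卷', '下载']
```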
S2, determining a target word unit from the word units according to the segmentation condition of the input text,
Specifically, if the input text is a single word in step S1, the segmentation condition of the input text is: the input text is segmented into only one word unit, and that word unit is taken as the target word unit.
If the input text is a phrase in step S1, the segmentation condition of the input text is: the input text is segmented into at least two word units, and one of them is selected as the target word unit according to part of speech and/or word frequency. Word frequency refers to how often a word unit appears in a predetermined corpus or document. When part of speech and word frequency are used together to determine the target word unit, their priorities are preset; for example, if part of speech has a higher priority than word frequency, the target word unit is selected according to part of speech, and word frequency is used only when all word units have the same part of speech. Taking selection by part of speech as an example, the parts of speech may include nouns, adjectives, verbs, and words other than these three, and the invention may preset a priority for each part of speech, for example with the priorities of nouns, adjectives, verbs, and other words increasing in that order. The word unit with the highest priority is then selected as the target word unit. For example, if the user inputs "no time", it is segmented into the two word units "no" and "time"; since the part-of-speech priority of "no" is higher than that of "time", "no" is the target word unit.
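This selection rule can be sketched as follows. The part-of-speech tags and frequencies are invented for illustration, the numeric priorities simply encode the order described above, and preferring the more frequent unit when parts of speech tie is an assumption:

```python
# Numeric priorities encoding the order described above: nouns lowest,
# then adjectives, verbs, and other words highest. The tags and
# frequencies attached to each word unit are invented for illustration.
POS_PRIORITY = {"noun": 1, "adjective": 2, "verb": 3, "other": 4}

def pick_target(units):
    """units: list of (word, pos, frequency) triples.
    Part of speech takes priority; word frequency breaks ties."""
    return max(units, key=lambda u: (POS_PRIORITY[u[1]], u[2]))[0]

# The "no time" example: "no" outranks the noun "time" by part of speech.
print(pick_target([("no", "other", 120), ("time", "noun", 300)]))  # no
# With equal parts of speech, the higher-frequency unit is assumed to win.
print(pick_target([("red", "adjective", 10), ("fast", "adjective", 50)]))  # fast
```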
S3, obtaining candidate words corresponding to the target word unit through a preset model to form a candidate set,
The candidate words are synonyms or near-synonyms of the target word unit. The invention trains a plurality of word2vec models by acquiring different corpora as training sets, and obtains candidate words corresponding to the target word unit through each trained word2vec model to form a candidate set. Thus, a target word unit has a plurality of candidate sets, and the candidate words in each candidate set may be the same or different. Multiple word2vec models trained on different corpora ensure the comprehensiveness of the candidate words.
When each word2vec model obtains a candidate set for the target word unit, the target word unit is first input into the trained word2vec model to obtain the word vector of the target word unit output by the model; candidate word vectors whose similarity score with respect to that word vector is smaller than a threshold value are then acquired; and the candidate word vectors are converted into the corresponding candidate words to form a candidate set. Here, word2vec is a group of models used to generate word vectors. After training is complete, a word2vec model can map each word to a vector, and these vectors can represent word-to-word relationships. In the invention, the similarity between word vectors can be calculated by, for example, the Euclidean distance, cosine distance, edit distance, or Hamming distance between them.
For example, as shown in fig. 2, take the case of obtaining candidate sets for a target word unit with three word2vec models. First, the three models are trained with different corpora, for example: a first word2vec model is trained with an open-source corpus, a second word2vec model is trained with a first in-house corpus (such as a juvenile-subject corpus), and a third word2vec model is trained with a second in-house corpus (such as a high-school-subject corpus). If the text input by the user is "test paper", then "test paper" is the target word unit and is input into each trained word2vec model. Each word2vec model outputs the word vector of "test paper"; the Euclidean distances between this word vector and other word vectors are calculated to obtain their similarity scores; the other word vectors whose similarity score is smaller than the threshold value are taken as candidate word vectors; and the candidate word vectors are converted into candidate words to form a candidate word set. For example, the candidate word set of "test paper" obtained by the first word2vec model is: paper, examination questions, test paper, true questions, examination, interim, early, end, and exercise questions. The candidate word set obtained by the second word2vec model is: examination questions, examination papers, examinees, examination paper reading, examination questions, composition, examination, pen test, question setting, and examination room. The candidate word set obtained by the third word2vec model is: a small section, a paper, an end-of-term test paper, an interim test paper, a monthly exam paper, a test question, and a test coupon.
S4, ranking the candidate words in the candidate set to obtain a comprehensive ranking candidate set corresponding to the target word unit;
To rank the candidate words, a weighted ranking approach is adopted: the candidate set output by each word2vec model is acquired; the total weight of each candidate word is determined according to its preset weight and its positions in the candidate sets; and the candidate words are ranked according to their total weights. In this way, candidate words output by different word2vec models are ranked according to the accuracy of synonym recognition between each candidate word and the target word unit, which improves the accuracy of synonym recognition for the target word unit. Within each candidate set, the candidate words may be arranged according to the similarity between each candidate word's vector and the target word unit's vector.
Specifically, the total weight Pi of candidate word i can be obtained by the following formula:
Pi = pi × [rank(Ai) + rank(Bi) + … + rank(Ni)]
where pi is the preset weight of candidate word i, rank(Ai) is the position of candidate word i in the candidate set output by the word2vec model numbered A, rank(Bi) is its position in the candidate set output by the word2vec model numbered B, and rank(Ni) is its position in the candidate set output by the word2vec model numbered N.
Taking the candidate word "test question" in fig. 2 as an example, assume that the preset weight of "test question" is 0.9 and that it is ranked second by the first word2vec model, first by the second word2vec model, and eighth by the third word2vec model. Its total weight is then: 0.9 × (2 + 1 + 8) = 9.9.
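The weighting formula and the worked example above can be reproduced directly. The function below is a minimal sketch; the preset weight and rank positions are the ones from the "test question" example.

```python
def total_weight(preset_weight, ranks):
    """Pi = pi * (rank(Ai) + rank(Bi) + ... + rank(Ni)):
    multiply the candidate word's preset weight by the sum of its
    positions in the candidate sets output by the word2vec models."""
    return preset_weight * sum(ranks)


# "test question": preset weight 0.9; ranked 2nd, 1st and 8th
# in the three models' candidate sets.
print(total_weight(0.9, [2, 1, 8]))  # 0.9 * (2 + 1 + 8)
```

Whether a lower or higher total weight ranks first depends on the rank convention chosen; the patent leaves this to the implementation.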
And S5, prompting the synonymous text of the input text according to the segmentation condition of the input text and the comprehensive sorting candidate set.
Specifically, if the input text is a single word in step S1, the segmentation condition of the input text is that the input text is segmented into only one word unit, and this step includes:
S51, filtering out target words that meet the conditions from the comprehensive sorting candidate set according to a preset word length, a preset part of speech and a preset word frequency;
For example, the preset word length may be set to: greater than 1 byte and less than 4 bytes. The preset part of speech may be set to be the same as the part of speech of the target word unit. The preset word frequency refers to the frequency with which the candidate word occurs in a predetermined corpus or predetermined document, and may be set according to the training samples of the word2vec models.
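A minimal sketch of the S51 filter follows, under stated assumptions: the part-of-speech tags and corpus frequencies are hypothetical lookup tables introduced for illustration, and character count stands in for the patent's byte-length condition.

```python
def filter_candidates(candidates, pos_of, freq_of, target_pos,
                      min_len=2, max_len=10, min_freq=5):
    """Keep the candidate words whose length, part of speech and corpus
    frequency all satisfy the preset conditions; the ranking order of
    the input list is preserved."""
    return [w for w in candidates
            if min_len <= len(w) <= max_len
            and pos_of.get(w) == target_pos
            and freq_of.get(w, 0) >= min_freq]


# Hypothetical POS tags and corpus frequencies for illustration.
pos_of = {"exam": "n", "quiz": "n", "try": "v", "examinations": "n"}
freq_of = {"exam": 50, "quiz": 2, "try": 30, "examinations": 40}

# "quiz" fails the frequency test, "try" the POS test,
# "examinations" the length test; only "exam" survives.
print(filter_candidates(["exam", "quiz", "try", "examinations"],
                        pos_of, freq_of, target_pos="n"))
```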
And S52, prompting the target words according to the sequence of the target words in the comprehensive sorting candidate set.
Specifically, the target words may be sequentially displayed from front to back according to the order of the target words in the comprehensive ranking candidate set.
If the input text is a phrase in step S1, the segmentation condition of the input text is: the input text is segmented into at least two word units, and the step includes:
S501, acquiring the top-N candidate words by total weight from the comprehensive ranking candidate set;
where N can be set according to the user's needs.
S502, combining the other word units segmented from the input text with each of the top-N candidate words to obtain N candidate texts;
Taking the input text "what time" as an example, it is segmented into "what" and "time", where "what" is the target word unit and "time" is the other word unit segmented from the input text. If the top-ranked candidate words for "what" by total weight in the comprehensive ranking candidate set are: any, what, other, which, each, where, various and everything, then the candidate texts obtained are: any time, what time, other times, which times, each time, where time, various times and all times.
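Step S502's merging can be sketched as a simple substitution: replace the target word unit with each top-N candidate word while keeping the other word units in place. The word lists below are illustrative only.

```python
def candidate_texts(input_words, target_index, top_candidates):
    """Build one candidate text per top-N candidate word by swapping
    the target word unit for the candidate and rejoining the units."""
    texts = []
    for cand in top_candidates:
        words = list(input_words)       # copy the segmented input text
        words[target_index] = cand      # substitute the target word unit
        texts.append(" ".join(words))
    return texts


# "what time" segmented into ["what", "time"], target unit at index 0.
print(candidate_texts(["what", "time"], 0, ["any", "which", "every"]))
```

(For Chinese text the units would be joined without spaces; the space-join here is just for the English rendering of the example.)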
S503, sorting the candidate texts according to the historical occurrence frequency of the candidate texts;
The historical occurrence frequency refers to the frequency with which a candidate text appears in historical query documents. If the historical occurrence frequency of several candidate texts is zero, those candidate texts are ranked according to the similarity between the word vector of their candidate word and the word vector of the target word unit.
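A hedged sketch of the S503 ordering: sort by historical occurrence frequency in descending order, falling back to word-vector similarity for texts whose frequency is zero. The frequency and similarity tables are hypothetical stand-ins for the query log and the word2vec similarities.

```python
def rank_candidate_texts(texts, hist_freq, similarity):
    """Sort candidate texts by historical query frequency (descending);
    texts never seen before (frequency 0) are ordered among themselves
    by the similarity between their candidate word's vector and the
    target word unit's vector (descending)."""
    return sorted(texts, key=lambda t: (-hist_freq.get(t, 0),
                                        -similarity.get(t, 0.0)))


texts = ["which time", "any time", "every time", "what times"]
hist_freq = {"any time": 120, "what times": 7}   # unseen texts default to 0
similarity = {"which time": 0.8, "every time": 0.6}
print(rank_candidate_texts(texts, hist_freq, similarity))
```

Note the similarity key also breaks ties between texts with equal nonzero frequency; the patent only specifies the zero-frequency fallback, so this is a design choice of the sketch.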
S504, presenting the candidate texts according to the sequence of the candidate texts.
Fig. 3 is a schematic diagram of a framework of a synonymous text presentation device according to the present invention, where the synonymous text refers to a text having the same or similar meaning as an input text, and as shown in fig. 3, the device includes:
a segmentation unit 31 configured to segment an input text into word units;
a determining module 32, configured to determine a target word unit from the word units according to a segmentation condition of the input text, where the segmentation condition of the input text includes: the input text is segmented into only one word unit, and the input text is segmented into at least two word units;
the obtaining module 33 is configured to obtain candidate words corresponding to the target word unit through a preset model to form a candidate set, where the candidate words are synonyms or near-synonyms of the target word unit;
the sorting module 34 is configured to sort the candidate words in the candidate set to obtain a comprehensive sorting candidate set corresponding to the target word unit;
and the prompting module 35 is configured to prompt the synonymous text of the input text according to the segmentation condition of the input text and the comprehensive sorting candidate set.
In a specific embodiment, the obtaining module 33 includes:
the first acquisition module is used for acquiring different corpora as a training set to train a plurality of word2vec models;
and the second acquisition module is used for acquiring candidate words corresponding to the target word unit through each trained word2vec model to form a candidate set.
Further, the second obtaining module includes:
the input module is used for inputting the target word unit into a trained word2vec model to obtain a word vector of the target word unit output by the word2vec model;
the sub-acquisition module is used for acquiring candidate word vectors whose similarity to the word vector of the target word unit is smaller than a threshold value;
and the conversion module is used for converting the candidate word vectors into corresponding candidate words to form a candidate set.
The sorting module 34 includes:
a third obtaining module, configured to obtain a candidate set output by each word2vec model;
the sub-determination module is used for determining the total weight of the candidate words according to the preset weight of the candidate words and the arrangement positions of the candidate words in each candidate set;
and the sub-ordering module is used for ordering the candidate words in the candidate set according to the total weight of the candidate words.
In one example, the input text is segmented into only one word unit, and the only one word unit is used as a target word unit. The prompt module 35 includes:
the filtering module is used for filtering out target words meeting conditions from the comprehensive sorting candidate set according to preset word length, preset word property and preset word frequency;
and the first prompting module is used for prompting the target words according to the sequence of the target words in the comprehensive sorting candidate set.
In another example, the input text is segmented into at least two word units, and one of the at least two word units is selected as a target word unit according to the part of speech and/or the word frequency. The prompt module 35 includes:
a fourth obtaining module, configured to obtain the top-N candidate words by total weight from the comprehensive ranking candidate set;
the merging module is used for combining the other word units segmented from the input text with each of the top-N candidate words to obtain N candidate texts;
the first sequencing module is used for sequencing the candidate texts according to the historical occurrence frequency of the candidate texts; if the historical occurrence frequency of the candidate texts is zero, the first ordering module orders the candidate texts according to the similarity between word vectors corresponding to candidate words in the candidate texts and word vectors corresponding to the target word unit.
And the second prompting module is used for prompting the candidate texts according to the sequence of the candidate texts.
Those skilled in the art will appreciate that the modules in the above apparatus embodiment may be distributed in the apparatus as described, or may, with corresponding changes, be located in one or more apparatuses different from the above embodiment. The modules of the above embodiment may be combined into one module, or further split into a plurality of sub-modules.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, where the electronic device includes a processor and a memory, where the memory is used to store a computer-executable program, and when the computer program is executed by the processor, the processor executes a synonymous text prompting method.
As shown in fig. 4, the electronic device is in the form of a general-purpose computing device. There may be one or more processors, and they may work together. Distributed processing is not excluded: the processors may be distributed over different physical devices. The electronic device of the present invention is not limited to a single entity and may be a combination of a plurality of physical devices.
The memory stores a computer executable program, typically machine readable code. The computer readable program may be executed by the processor to enable an electronic device to perform the method of the invention, or at least some of the steps of the method.
The memory may include volatile memory, such as Random Access Memory (RAM) and/or cache memory, and may also be non-volatile memory, such as read-only memory (ROM).
Optionally, in this embodiment, the electronic device further includes an I/O interface for exchanging data between the electronic device and an external device. The electronic device may also include a bus representing one or more of several types of bus structures, including a memory-unit bus or memory-unit controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
It should be understood that the electronic device shown in fig. 4 is only one example of the present invention, and elements or components not shown in the above example may be further included in the electronic device of the present invention. For example, some electronic devices further include a display unit such as a display screen, and some electronic devices further include a human-computer interaction element such as a button, a keyboard, and the like. Electronic devices are considered to be covered by the present invention as long as the electronic devices are capable of executing a computer-readable program in a memory to implement the method of the present invention or at least a part of the steps of the method.
Fig. 5 is a schematic diagram of a computer-readable recording medium of an embodiment of the present invention. As shown in fig. 5, the computer-readable recording medium stores a computer-executable program which, when executed, implements the synonymous text prompting method of the present invention described above. The computer-readable medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A readable signal medium may be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
From the above description of the embodiments, those skilled in the art will readily appreciate that the present invention can be implemented by hardware capable of executing a specific computer program, such as the system of the present invention and the electronic processing units, servers, clients, mobile phones, control units, processors, etc. included in the system; it can also be implemented by a device including at least a part of the above system or components. The invention can also be implemented by computer software executing the method of the invention, for example by control software executed by a microprocessor, an electronic control unit, a client or a server of the prompting device. The computer software for executing the method of the present invention is not limited to execution by one specific hardware entity and may also be implemented in a distributed manner by unspecified hardware entities. The software product may be stored in a computer-readable storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) or stored in a distributed manner on a network, as long as it enables an electronic device to execute the method according to the present invention.
While the foregoing embodiments have described the objects, technical solutions and advantages of the present invention in further detail, it should be understood that the present invention is not inherently related to any particular computer, virtual machine or electronic device, and various general-purpose machines may be used to implement it. The invention is not limited to the specific embodiments described; all modifications, changes and equivalents that come within the spirit and scope of the invention are intended to be covered.

Claims (10)

1. A method for prompting synonymous texts is characterized by comprising the following steps:
dividing an input text into word units;
determining a target word unit from the word units according to the segmentation condition of the input text, wherein the segmentation condition of the input text comprises the following steps: the input text is segmented into only one word unit, and the input text is segmented into at least two word units;
obtaining candidate words corresponding to the target word unit through a preset model to form a candidate set, wherein the candidate words are synonyms or near-synonyms of the target word unit;
sorting the candidate words in the candidate set to obtain a comprehensive sorting candidate set corresponding to the target word unit;
and prompting the synonymous text of the input text according to the segmentation condition of the input text and the comprehensive sorting candidate set.
2. The method for prompting synonymous text according to claim 1, wherein the obtaining of the candidate word composition candidate set corresponding to the target word unit through a preset model comprises:
acquiring different corpora as a training set to train a plurality of word2vec models;
and obtaining candidate words corresponding to the target word unit through each trained word2vec model to form a candidate set.
3. The method for prompting synonymous text according to claim 1 or 2, wherein the obtaining of the candidate words corresponding to the target word unit by the trained word2vec model to form a candidate set comprises:
inputting the target word unit into a trained word2vec model to obtain a word vector of the target word unit output by the word2vec model;
acquiring candidate word vectors whose similarity to the word vector of the target word unit is smaller than a threshold value;
and converting the candidate word vectors into corresponding candidate words to form a candidate set.
4. The method for hinting synonymous text according to any one of claims 1-3, wherein the ranking the candidate words in the candidate set comprises:
acquiring a candidate set output by each word2vec model;
determining the total weight of the candidate words according to the preset weight of the candidate words and the arrangement positions of the candidate words in each candidate set;
sorting the candidate words in the candidate set according to the total weight of the candidate words; optionally, the input text is segmented into only one word unit, and the only one word unit is used as a target word unit.
5. The method for prompting synonymous text according to any one of claims 1-4, wherein the prompting of the synonymous text of the input text according to the segmentation condition of the input text and the comprehensive ranking candidate set comprises:
filtering out target words meeting conditions from the comprehensive ordering candidate set according to preset word length, preset part of speech and preset word frequency;
and prompting the target words according to the sequence of the target words in the comprehensive ranking candidate set.
6. The method according to claim 4, wherein the input text is segmented into at least two word units, and one of the at least two word units is selected as the target word unit according to part of speech and/or word frequency.
7. The method for prompting synonymous text according to claim 6, wherein the prompting of the synonymous text of the input text according to the segmentation condition of the input text and the comprehensive ranking candidate set comprises:
acquiring the top-N candidate words by total weight from the comprehensive sorting candidate set;
merging the other word units segmented from the input text with each of the top-N candidate words to obtain N candidate texts;
sorting the candidate texts according to the historical occurrence frequency of the candidate texts;
and optionally prompting the candidate texts according to the ordering of the candidate texts, and if the historical occurrence frequency of the candidate texts is zero, ordering the candidate texts according to the similarity between the word vector corresponding to the candidate word in the candidate texts and the word vector corresponding to the target word unit.
8. A synonymous text presentation device characterized in that the synonymous text refers to a text having the same or similar meaning as an input text, the device comprising:
the segmentation unit is used for segmenting the input text into word units;
a determining module, configured to determine a target word unit from the word units according to a segmentation condition of the input text, where the segmentation condition of the input text includes: the input text is segmented into only one word unit, and the input text is segmented into at least two word units;
the acquisition module is used for acquiring candidate words corresponding to the target word unit through a preset model to form a candidate set, wherein the candidate words are synonyms or near-synonyms of the target word unit;
the sorting module is used for sorting the candidate words in the candidate set to obtain a comprehensive sorting candidate set corresponding to the target word unit;
and the prompt module is used for prompting the synonymous text of the input text according to the segmentation condition of the input text and the comprehensive sorting candidate set.
9. An electronic device comprising a processor and a memory, the memory for storing a computer-executable program, characterized in that:
the computer program, when executed by the processor, performs the method of any one of claims 1-7.
10. A computer-readable medium storing a computer-executable program, wherein the computer-executable program, when executed, implements the method of any of claims 1-7.
CN202011539680.6A 2020-12-23 2020-12-23 Synonymous text prompting method and device and electronic equipment Pending CN112559711A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011539680.6A CN112559711A (en) 2020-12-23 2020-12-23 Synonymous text prompting method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN112559711A true CN112559711A (en) 2021-03-26

Family

ID=75032297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011539680.6A Pending CN112559711A (en) 2020-12-23 2020-12-23 Synonymous text prompting method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112559711A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113971216A (en) * 2021-10-22 2022-01-25 北京百度网讯科技有限公司 Data processing method and device, electronic equipment and memory
CN114416213A (en) * 2022-03-29 2022-04-29 北京沃丰时代数据科技有限公司 Word vector file loading method and device and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294396A (en) * 2015-05-20 2017-01-04 北京大学 Keyword expansion method and keyword expansion system
CN107357776A (en) * 2017-06-16 2017-11-17 北京奇艺世纪科技有限公司 A kind of related term method for digging and device
CN107451126A (en) * 2017-08-21 2017-12-08 广州多益网络股份有限公司 A kind of near synonym screening technique and system
CN108628821A (en) * 2017-03-21 2018-10-09 腾讯科技(深圳)有限公司 A kind of vocabulary mining method and device
CN109388803A (en) * 2018-10-12 2019-02-26 北京搜狐新动力信息技术有限公司 Chinese word cutting method and system
CN110276010A (en) * 2019-06-24 2019-09-24 腾讯科技(深圳)有限公司 A kind of weight model training method and relevant apparatus
CN111274353A (en) * 2020-01-14 2020-06-12 百度在线网络技术(北京)有限公司 Text word segmentation method, device, equipment and medium
KR20200123544A (en) * 2019-04-22 2020-10-30 넷마블 주식회사 Mehtod for extracting synonyms

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
_KEVIN_DUAN: "Training a synonym model with Word2Vec" (Word2Vec训练同义词模型), Retrieved from the Internet <URL:www.bing.com> *
ZHANG Le: "Application and implementation of word-vector semantic expansion technology in a library intelligent consultation system", Library and Information Service (图书情报工作), vol. 64, no. 18, 20 September 2020 (2020-09-20), pages 126-136 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination