CN110413737B - Synonym determination method, synonym determination device, server and readable storage medium - Google Patents

Info

Publication number
CN110413737B
Authority
CN
China
Prior art keywords
search text
text sequence
attention
synonym
click
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910699704.5A
Other languages
Chinese (zh)
Other versions
CN110413737A (en
Inventor
康战辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910699704.5A priority Critical patent/CN110413737B/en
Publication of CN110413737A publication Critical patent/CN110413737A/en
Application granted granted Critical
Publication of CN110413737B publication Critical patent/CN110413737B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention discloses a method, a device, a server and a readable storage medium for determining synonyms. The method can be applied to machine learning technology in the field of artificial intelligence and comprises the following steps: acquiring a co-click search text sequence set comprising a plurality of co-click search text sequences, wherein each search text sequence in each co-click search text sequence has a related search result; determining the attention distribution probability of each field in each search text sequence based on an attention model; training a synonym discrimination model according to the attention distribution probability of each field to obtain a synonym discrimination model introducing an attention mechanism; and inputting the co-click search text sequence to be discriminated into the synonym discrimination model introducing the attention mechanism to determine the synonym pairs in it. This embodiment improves the accuracy of determining synonym pairs and expands their semantic range.

Description

Synonym determination method, synonym determination device, server and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for determining synonyms, a server, and a readable storage medium.
Background
At present, classical synonym dictionary construction techniques fall into at least two categories. In the first, linguists compile the dictionary manually, drawing on the definitions in a modern Chinese dictionary. In the second, the set of co-click search text sequences collected by a modern search engine is aligned automatically by computer to obtain potential candidate synonym pairs, which are then filtered manually with the help of various statistical language features to form a large synonym dictionary.
However, the second method can determine the synonym pairs in a co-click search text sequence only when the order of the segmented fields in the sequences is completely aligned. This limits the accuracy of synonym pair determination and leaves the determined synonym pairs with too narrow a semantic range.
Therefore, how to improve the accuracy of determining the synonym pair and expand the semantic range of the synonym pair becomes a problem to be solved urgently.
Disclosure of Invention
The embodiment of the invention provides a synonym determining method, a synonym determining device, a server and a readable storage medium, wherein synonym pairs in a co-click search text sequence are determined based on a synonym discrimination model introducing an attention mechanism, so that synonym pairs corresponding to text sequences with different field sequences can be determined, the accuracy of determining the synonym pairs is improved, and the semantic range of the synonym pairs is enlarged.
In a first aspect, an embodiment of the present invention provides a method for determining synonyms, including:
acquiring a co-click search text sequence set, wherein the co-click search text sequence set comprises a plurality of co-click search text sequences, and each search text sequence in each co-click search text sequence corresponds to a related search result;
determining attention distribution probability of each field contained in each search text sequence based on an attention model;
training a synonym discrimination model according to the attention distribution probability of each field contained in each search text sequence to obtain a synonym discrimination model introducing an attention mechanism;
and inputting the co-click search text sequence to be distinguished into the synonym distinguishing model introducing the attention mechanism so as to determine the synonym pair in the co-click search text sequence to be distinguished.
In a second aspect, an embodiment of the present invention provides a synonym determination device, including:
the acquisition module is used for acquiring a co-click search text sequence set, wherein the co-click search text sequence set comprises a plurality of co-click search text sequences, and each search text sequence in each co-click search text sequence corresponds to a search result with correlation;
the distribution module is used for determining attention distribution probability of each field contained in each search text sequence based on an attention model;
the training module is used for training the synonym discrimination model according to the attention distribution probability of each field contained in each search text sequence to obtain the synonym discrimination model introducing the attention mechanism;
and the determining module is used for inputting the co-click search text sequence to be determined into the synonym determination model introducing the attention mechanism so as to determine the synonym pair in the co-click search text sequence to be determined.
In a third aspect, an embodiment of the present invention further provides a server, including: a processor and a storage device; the storage device is used for storing program instructions; the processor calls the program instructions to perform: acquiring a co-click search text sequence set, wherein the co-click search text sequence set comprises a plurality of co-click search text sequences, and each search text sequence in each co-click search text sequence corresponds to a related search result; determining attention distribution probability of each field contained in each search text sequence based on an attention model; training a synonym discrimination model according to the attention distribution probability of each field contained in each search text sequence to obtain a synonym discrimination model introducing an attention mechanism; and inputting the co-click search text sequence to be distinguished into the synonym distinguishing model introducing the attention mechanism so as to determine the synonym pair in the co-click search text sequence to be distinguished.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where program instructions are stored, and when the program instructions are executed, the computer-readable storage medium is configured to implement the method described in the first aspect.
In the embodiment of the invention, a co-click search text sequence set can be obtained, and the attention distribution probability of each field contained in each search text sequence in the set is determined based on an attention model. A synonym discrimination model is then trained according to the attention distribution probability of each field contained in each search text sequence to obtain a synonym discrimination model introducing an attention mechanism, and the co-click search text sequence to be discriminated is input into that model to determine the synonym pairs in it. This embodiment improves the accuracy of determining synonym pairs and expands their semantic range.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic structural diagram of a synonym discrimination model introducing an attention mechanism according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a framework of a synonym discrimination model introducing an attention mechanism according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a heat map of attention distribution probabilities provided by an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a method for determining synonyms according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a synonym determination apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence is a comprehensive discipline that covers a broad range of fields, spanning both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, machine learning/deep learning and the like.
Machine Learning (ML) is a multidisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other subjects. It studies how a computer can simulate or realize human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence, is the fundamental way to make computers intelligent, and is applied across all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration. Synonym determination is an important application of machine learning.
The synonym determination scheme provided by the embodiment of the invention relates to machine learning technology in artificial intelligence. A co-click search text sequence set is obtained; a synonym discrimination model is trained according to the attention distribution probability of each field contained in each search text sequence in the set to obtain a synonym discrimination model introducing an attention mechanism; and the co-click search text sequence to be discriminated is input into that model to determine the synonym pairs in it, so that the accuracy of determining synonym pairs is improved and their semantic range is expanded. This is illustrated by the following embodiments.
The technical solutions in the embodiments of the present invention will be described clearly below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. All other embodiments that a person skilled in the art can obtain without creative effort based on the embodiments of the present invention fall within the protection scope of the present invention.
Some embodiments of the invention are described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
The method for determining synonyms provided in the embodiment of the present invention may be executed by a server, and in particular, may be executed by a synonym determination device in the server.
Synonym pairs determined by current methods typically cover a narrow semantic range and fail to include new words in time. In particular, the method that obtains synonym pairs from co-click search text sequences using order-based complete word alignment can determine a synonym pair only when the words of the co-click search text sequences are aligned in order, so the resulting synonym pairs do not cover a wide semantic range.
For example, assume that the search engine log contains two search text sequences that both mean "how to roast chicken wings so that they taste good" but use different words for "how". Segmenting the two sequences gives "how/roast/chicken wings/tasty" in both cases, differing only in the word for "how", and order-based complete word alignment finally yields the two words for "how" as synonyms. However, for two search text sequences whose words are not aligned in order, such as "how to make chaoshou tasty" and "wonton how to make tasty" (chaoshou being another name for wonton), the context of the potential synonyms appears in a different order in the two segmented sequences, so the synonym pair cannot be obtained by order-based complete word alignment.
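The limitation described above can be illustrated with a minimal sketch of order-based complete word alignment. The function name, the token strings (English glosses of the segmented queries, with `how_1` and `how_2` standing for two different words meaning "how") and the "exactly one differing position" rule are assumptions made for illustration, not details taken from the patent.

```python
from typing import List, Optional, Tuple


def aligned_synonym_candidate(a: List[str], b: List[str]) -> Optional[Tuple[str, str]]:
    """Compare two segmented queries position by position.

    Returns the differing word pair when the queries have the same length
    and differ in exactly one position; otherwise returns None.
    """
    if len(a) != len(b):
        return None
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return diffs[0] if len(diffs) == 1 else None


# Same word order: only the word for "how" differs, so a candidate is found.
same_order = aligned_synonym_candidate(
    ["how_1", "roast", "chicken wings", "tasty"],
    ["how_2", "roast", "chicken wings", "tasty"],
)

# Different word order: three positions differ, so no candidate is found.
shuffled = aligned_synonym_candidate(
    ["how", "make", "chaoshou", "tasty"],
    ["wonton", "how", "make", "tasty"],
)

print(same_order)  # ('how_1', 'how_2')
print(shuffled)    # None
```

The second call fails even though the two queries share a co-clicked meaning, which is the gap the attention-based model is meant to close.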
In addition, when current alignment techniques translate a text sequence containing the Chinese word "wonton", every word in the text sequence contributes equally to the translation target word "chaoshou", although "wonton" is obviously the most important word for producing "chaoshou". In a model without attention this is not a serious problem when the input sentence is short. If the input sentence is long, however, all of its semantics must be expressed by a single intermediate semantic vector, the information carried by the individual words disappears, and a lot of detail is lost.
In order to solve the above problems, an embodiment of the present invention provides a method for determining a synonym pair through a synonym discrimination model introducing an attention mechanism, in which a co-click search text sequence set is obtained, the co-click search text sequence set includes a plurality of co-click search text sequences, and each search text sequence in each co-click search text sequence has a corresponding search result; determining the attention distribution probability of each field contained in each search text sequence based on an attention model, training a synonym discrimination model according to the attention distribution probability of each field contained in each search text sequence to obtain a synonym discrimination model introducing the attention mechanism, and inputting the co-click search text sequence to be discriminated into the synonym discrimination model introducing the attention mechanism to determine synonym pairs in the co-click search text sequence to be discriminated. By the implementation mode, the synonym pair can be determined from the search text sequences with completely different field sequences, the semantic range of the synonym is expanded, and the accuracy of determining the synonym pair is improved.
In one embodiment, the synonym discrimination model introducing the attention mechanism proposed by the present invention may include, but is not limited to, an attention-based sequence-to-sequence (Seq2Seq) model.
In some embodiments, the structure of the synonym discrimination model introducing the attention mechanism is shown in FIG. 1, which is a schematic structural diagram of such a model according to an embodiment of the present invention. As shown in FIG. 1, the model includes an encoding end 11, a decoding end 12, and an attention model 13 (Attention). It follows the common encoder-decoder framework, which can be used in scenarios such as machine translation, text summarization, conversation modeling, and image captioning. In one embodiment, a first text sequence is input via the encoding end 11 and a second text sequence corresponding to the first text sequence is output via the decoding end 12.
In one embodiment, the encoding end 11 converts a variable-length first text sequence into a fixed-length vector representation, and the decoding end 12 converts the fixed-length vector into a variable-length second text sequence. In some embodiments, this is equivalent to translating the first text sequence into a second text sequence with the same meaning and finding the synonym pairs between them. For example, assume that the first text sequence is "how to make chaoshou tasty" and the second text sequence is "wonton how to make tasty". If the encoding end 11 converts the first text sequence into a fixed-length vector representation and the decoding end 12 converts that vector into the second text sequence, it can be determined that the synonym pair in the first text sequence and the second text sequence is "chaoshou" and "wonton".
In one embodiment, after the text sequence "how to make chaoshou tasty" is input into the synonym discrimination model introducing the attention mechanism, the model may split it into the fields "how", "make", "chaoshou", and "tasty", determine the attention allocation probability of each field (i.e., word), and determine the fields whose attention allocation probability is greater than a preset probability threshold as a synonym pair.
For example, assume that the synonym discrimination model introducing the attention mechanism determines the attention distribution probabilities of the fields in the input text sequence "how to make chaoshou tasty" to be (how, 0.1), (make, 0.1), (chaoshou, 0.7), (tasty, 0.1), and those of the fields in the output text sequence "wonton how to make tasty" to be (wonton, 0.7), (how, 0.1), (make, 0.1), (tasty, 0.1). If the preset probability threshold is 0.5, it can be determined that "chaoshou" and "wonton" are a synonym pair.
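The threshold rule in this example can be written out as a short sketch. The probabilities and the 0.5 threshold come from the example above; the function name, the dictionary layout and the English glosses used as keys are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def fields_above_threshold(probs: Dict[str, float], threshold: float) -> List[str]:
    """Return the fields whose attention probability exceeds the threshold."""
    return [field for field, p in probs.items() if p > threshold]


# Attention distribution probabilities from the example (keys are English glosses).
input_probs = {"how": 0.1, "make": 0.1, "chaoshou": 0.7, "tasty": 0.1}
output_probs = {"wonton": 0.7, "how": 0.1, "make": 0.1, "tasty": 0.1}
THRESHOLD = 0.5  # preset probability threshold from the example

# Pair every high-attention input field with every high-attention output field.
pairs: List[Tuple[str, str]] = [
    (src, tgt)
    for src in fields_above_threshold(input_probs, THRESHOLD)
    for tgt in fields_above_threshold(output_probs, THRESHOLD)
]
print(pairs)  # [('chaoshou', 'wonton')]
```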
In an embodiment, the synonym discrimination model introducing the attention mechanism may be illustrated by FIG. 2, which is a schematic diagram of a framework of such a model according to an embodiment of the present invention. As shown in FIG. 2, a text sequence "X1X2X3X4" is input at the encoding end 21, which encodes it to obtain a vector "C1C2C3"; the decoding end 22 then decodes the vector "C1C2C3" and outputs a text sequence "Y1Y2Y3". Based on the attention distribution probability of each word in "X1X2X3X4" and in "Y1Y2Y3", the words whose attention distribution probability is greater than a preset probability threshold are determined as synonym pairs.
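As a rough illustration of how an attention model can assign a probability to each input field, the sketch below computes softmax-normalized dot-product scores between one decoder state and the encoder states, and then a weighted context vector. The patent does not specify the scoring function, so dot-product scoring, the toy vectors and all names here are assumptions; the context vector may correspond to one of C1, C2, C3 in FIG. 2, but that reading is also an assumption.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_distribution(decoder_state: List[float],
                           encoder_states: List[List[float]]) -> List[float]:
    """Dot-product score of the decoder state against each encoder state, then softmax."""
    scores = [sum(d * h for d, h in zip(decoder_state, state))
              for state in encoder_states]
    return softmax(scores)


def context_vector(weights: List[float],
                   encoder_states: List[List[float]]) -> List[float]:
    """Attention-weighted sum of the encoder states."""
    dim = len(encoder_states[0])
    return [sum(w * state[i] for w, state in zip(weights, encoder_states))
            for i in range(dim)]


# Toy 2-dimensional encoder states for X1..X4 and one decoder state (made-up values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
decoder_state = [1.0, 1.0]

weights = attention_distribution(decoder_state, encoder_states)
context = context_vector(weights, encoder_states)
print(weights)  # the third input position receives the largest weight
print(context)
```

Because the weights are recomputed for each output step, the decoder is not forced to rely on a single fixed summary of the whole input.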
In one embodiment, the attention allocation probability may be represented by a heat map as shown in FIG. 3, which is a schematic diagram of a heat map of attention allocation probabilities provided by an embodiment of the present invention. As shown in FIG. 3, the abscissa lists the fields of the input text sequence "how to make chaoshou tasty", and the ordinate lists the fields of the output text sequence "wonton how to make tasty", namely "wonton", "how", "make", and "tasty". In the heat map shown in FIG. 3, the darker the color, the greater the attention allocation probability. As can be seen from FIG. 3, the darkest region 31 is the one where the abscissa "chaoshou" meets the ordinate "wonton". Therefore, "chaoshou" and "wonton" can be determined to be a synonym pair according to the colors in the heat map.
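Reading the darkest cell off the heat map (thermodynamic diagram) of FIG. 3 amounts to finding the largest entry of an attention matrix. The sketch below does this for a made-up 4x4 matrix; the numeric values, the English glosses and the function name are illustrative assumptions, since the figure's actual values are not given in the text.

```python
from typing import List, Tuple


def darkest_cell(matrix: List[List[float]]) -> Tuple[int, int]:
    """Return (row, column) of the largest value in a non-empty matrix."""
    best = (0, 0)
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            if value > matrix[best[0]][best[1]]:
                best = (r, c)
    return best


input_fields = ["how", "make", "chaoshou", "tasty"]   # abscissa (columns)
output_fields = ["wonton", "how", "make", "tasty"]    # ordinate (rows)

# Hypothetical attention matrix: each row is one output field's distribution
# over the input fields.
attention = [
    [0.05, 0.05, 0.85, 0.05],  # wonton
    [0.60, 0.20, 0.10, 0.10],  # how
    [0.20, 0.60, 0.10, 0.10],  # make
    [0.10, 0.10, 0.10, 0.70],  # tasty
]

row, col = darkest_cell(attention)
print(input_fields[col], output_fields[row])  # chaoshou wonton
```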
In other embodiments, the synonym discrimination model provided by the present invention may instead use other machine translation models, such as IBM Model 1 or IBM Model 2, to construct and mine the synonym dictionary, which is not specifically limited herein.
The method for determining synonyms provided by the embodiment of the invention is schematically described below with reference to the drawings.
Specifically, referring to fig. 4, fig. 4 is a schematic flowchart of a method for determining synonyms according to an embodiment of the present invention, where the method may be executed by a synonym determination device in a server, and a specific explanation of the server is as described above. Specifically, the method of the embodiment of the present invention includes the following steps.
S401: acquiring a co-click search text sequence set, wherein the co-click search text sequence set comprises a plurality of co-click search text sequences, and each search text sequence in each co-click search text sequence has a corresponding search result.
In the embodiment of the invention, the server can obtain a co-click search text sequence set. The set comprises a plurality of co-click search text sequences, and each search text sequence in each co-click search text sequence has a related search result.
For example, assume that the co-click search text sequence set includes 2 co-click search text sequences, both meaning "recipe of tomato" but using different words for "tomato". The search result corresponding to each of them is "tomato scrambled eggs", written with the respective word for "tomato", and the two results are associated search results.
In some embodiments, the co-click search text sequences may be search text sequences for which a large number of associated search results were clicked in WeChat search. For example, if the search results associated with both a first search text sequence and a second search text sequence are clicked in WeChat search more than 50 times, the first search text sequence and the second search text sequence may be determined to be semantically similar.
In one embodiment, the similarity between search results may be calculated, and the search results whose similarity is greater than a similarity threshold may be determined to be associated search results.
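A minimal sketch of this similarity test follows. The patent does not name a similarity measure, so Jaccard similarity over whitespace-separated tokens, the 0.5 threshold and the example result titles are all assumptions chosen for illustration.

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def are_associated(result_a: str, result_b: str, threshold: float = 0.5) -> bool:
    """Treat two search results as associated when similarity exceeds the threshold."""
    return jaccard(set(result_a.split()), set(result_b.split())) > threshold


# Hypothetical result titles.
print(are_associated("tomato scrambled eggs", "tomato scrambled eggs with rice"))  # True
print(are_associated("tomato scrambled eggs", "roast chicken wings"))              # False
```

In practice the measure could equally be an embedding or URL-based similarity; only the comparison against a threshold is taken from the text.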
In one embodiment, when acquiring the co-click search text sequence set, the server may determine the number of times that the different search text sequences have associated search results, and determine the co-click search text sequence set according to the number of times that the different search text sequences have associated search results.
In one embodiment, the different search text sequences include a first search text sequence and a second search text sequence; and when the server determines the common click search text sequence set according to the times of the associated search results corresponding to the different search text sequences, the server can acquire the search result of the first search text sequence and the search result of the second search text sequence. The server may determine the number of times that there is an associated search result in the search result of the first search text sequence and the search result of the second search text sequence, and if the number of times of the associated search result is greater than a preset number threshold, may determine that the first search text sequence and the second search text sequence are the set of co-click search text sequences.
For example, assume that the first search text sequence and the second search text sequence both mean "recipe of tomato" (using different words for "tomato"), that the search result "tomato scrambled eggs" is obtained 20 times for the first search text sequence, and that it is obtained 25 times for the second search text sequence. If the preset number threshold is 18, both counts (20 and 25) are greater than the threshold, so the two search text sequences can be determined to belong to the co-click search text sequence set.
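The count-threshold check in this example can be sketched as follows. The counts (20 and 25) and the threshold (18) come from the example; the function name, the click-log layout and the simplification that associated results share one identifier are assumptions.

```python
from typing import Dict


def is_co_click_pair(clicks_a: Dict[str, int],
                     clicks_b: Dict[str, int],
                     threshold: int) -> bool:
    """Return True when some shared result exceeds the threshold for both queries.

    Simplification: associated search results are represented by the same key.
    """
    shared = set(clicks_a) & set(clicks_b)
    return any(clicks_a[r] > threshold and clicks_b[r] > threshold for r in shared)


# Hypothetical click logs: result identifier -> number of times it was obtained.
first_query_clicks = {"tomato scrambled eggs": 20}
second_query_clicks = {"tomato scrambled eggs": 25}

print(is_co_click_pair(first_query_clicks, second_query_clicks, threshold=18))  # True
print(is_co_click_pair(first_query_clicks, second_query_clicks, threshold=22))  # False
```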
S402: determining attention allocation probabilities for fields contained in the search text sequences based on an attention model.
In this embodiment of the present invention, the server may determine the attention allocation probability of each field included in each search text sequence based on the attention model.
In an embodiment, when the server determines the attention allocation probability of each field included in each search text sequence based on an attention model, the server may split each search text sequence to obtain at least one field corresponding to each search text sequence, determine an initial attention allocation probability of each field in the at least one field, and update the initial attention allocation probability according to the number of times that each field appears in each search text sequence to determine the attention allocation probability of each field.
In an embodiment, when determining the initial attention allocation probability of each field in the at least one field, the server may determine a part of speech of each field in the at least one field, and determine the initial attention allocation probability corresponding to the part of speech of each field according to a preset correspondence between the part of speech and the attention allocation probability.
For example, the search text sequence "recipe of tomato" is split into three fields, "tomato", "of", and "recipe", where "tomato" is a noun, "of" is an adverb, and "recipe" is a noun. Based on the preset correspondence between part of speech and attention allocation probability, the initial attention allocation probability is determined to be 0.45 for the noun "tomato", 0.1 for the adverb "of", and 0.45 for the noun "recipe".
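One way to realize such a correspondence is a table of relative weights per part of speech that is normalized within each sequence. The weights below were chosen so that the example's 0.45 / 0.1 / 0.45 split is reproduced; the table, the default weight, the hand-supplied tags and the normalization step are assumptions rather than details from the patent.

```python
from typing import Dict, List, Tuple

# Hypothetical relative weights per part of speech.
POS_WEIGHTS: Dict[str, float] = {"noun": 4.5, "adverb": 1.0}
DEFAULT_WEIGHT = 1.0  # assumed weight for parts of speech not in the table


def initial_attention_by_pos(tagged_fields: List[Tuple[str, str]]) -> List[Tuple[str, float]]:
    """Map each (field, part of speech) to a probability; probabilities sum to 1."""
    weights = [POS_WEIGHTS.get(pos, DEFAULT_WEIGHT) for _, pos in tagged_fields]
    total = sum(weights)
    return [(field, w / total) for (field, _), w in zip(tagged_fields, weights)]


# Tags are supplied by hand here; a real system would use a part-of-speech tagger.
tagged = [("tomato", "noun"), ("of", "adverb"), ("recipe", "noun")]
initial = initial_attention_by_pos(tagged)
print(initial)
```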
In one embodiment, when determining the initial attention allocation probability of each field in the at least one field, the server may determine the number of fields in each search text sequence, and set an equal initial attention allocation probability for each field according to the number of fields in each search text sequence, where the sum of the initial attention allocation probabilities of each field is 1.
For example, the search text sequence "recipe of tomato" is split into three fields, "tomato", "of", and "recipe", and an equal initial attention allocation probability of 1/3 is set for each of the three fields.
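The equal-share initialization is simple enough to state directly. The sketch below uses exact fractions so that the probabilities sum to exactly 1; the function name is an illustrative assumption.

```python
from fractions import Fraction
from typing import List, Tuple


def uniform_initial_attention(fields: List[str]) -> List[Tuple[str, Fraction]]:
    """Give each of the n fields an initial attention probability of 1/n."""
    if not fields:
        raise ValueError("a search text sequence must contain at least one field")
    share = Fraction(1, len(fields))
    return [(field, share) for field in fields]


uniform = uniform_initial_attention(["tomato", "of", "recipe"])
total = sum(p for _, p in uniform)
print(uniform)
print(total)  # 1
```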
In one embodiment, when determining the initial attention allocation probability of each field in the at least one field, the server may further set the initial attention allocation probability of each field in each search text sequence to 0.
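The text states that the initial probabilities are updated according to how often each field appears but gives no formula. The following sketch shows one hypothetical rule that works with any of the three initializations, including all zeros: add each field's occurrence count to its initial value and renormalize. The rule, the function name and the sample fields are assumptions, not the patent's method.

```python
from collections import Counter
from typing import Dict, List


def update_attention(initial: Dict[str, float], fields: List[str]) -> Dict[str, float]:
    """Add occurrence counts to the initial values, then normalize to sum to 1."""
    counts = Counter(fields)
    raw = {field: initial.get(field, 0.0) + counts[field] for field in counts}
    total = sum(raw.values())
    return {field: value / total for field, value in raw.items()}


# Hypothetical segmented sequence in which "wonton" appears twice.
fields = ["wonton", "how", "make", "wonton"]
zero_initial = {field: 0.0 for field in fields}  # the all-zero initialization

updated = update_attention(zero_initial, fields)
print(updated)  # {'wonton': 0.5, 'how': 0.25, 'make': 0.25}
```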
S403: training the synonym discrimination model according to the attention distribution probability of each field contained in each search text sequence to obtain a synonym discrimination model introducing the attention mechanism.
In the embodiment of the present invention, the server may train the synonym discrimination model according to the attention distribution probability of each field included in each search text sequence, so as to obtain the synonym discrimination model introducing the attention mechanism.
In an embodiment, when the server trains the synonym discrimination model according to the attention distribution probability of each field included in each search text sequence to obtain the synonym discrimination model introducing the attention mechanism, the server may train the synonym discrimination model according to the attention distribution probability of each field in the co-click search text sequence and obtain the attention distribution probability that the model outputs for each field. If the attention distribution probability of the fields forming a synonym pair in the co-click search text sequence is not greater than a preset probability threshold, the server adjusts the corresponding parameters in the synonym discrimination model and trains the model again according to the attention distribution probability of each field of the co-click search text sequence. When the attention distribution probability of the synonym fields is greater than the preset probability threshold, the synonym discrimination model introducing the attention mechanism is obtained.
S404: inputting the co-click search text sequence to be distinguished into the synonym distinguishing model introducing the attention mechanism so as to determine the synonym pair in the co-click search text sequence to be distinguished.
In the embodiment of the invention, the server can input the co-click search text sequence to be distinguished into the synonym distinguishing model introducing the attention mechanism so as to determine the synonym pair in the co-click search text sequence to be distinguished.
In an embodiment, when inputting the co-click search text sequence to be determined into the synonym discrimination model introducing the attention mechanism to determine a synonym pair in that sequence, the server may input the sequence into the model to obtain the attention distribution probability corresponding to each field in the co-click search text sequence to be determined, and determine the fields whose attention distribution probability is greater than a preset probability threshold as a synonym pair.
For example, assume that the co-click search text sequence to be discriminated consists of the text sequence "how to do to make a transcription and eat well" and the sequence "chaos how to make a good eat". The synonym discrimination model introducing the attention mechanism outputs the attention distribution probabilities (how, 0.1), (do, 0.1), (transcription, 0.7) and (good eat, 0.1) for the fields of the first sequence, and (chaos, 0.7), (how, 0.1), (do, 0.1) and (good eat, 0.1) for the fields of the second sequence. Assuming that the preset probability threshold is 0.5, "transcription" and "chaos" are determined to be a synonym pair.
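The threshold step of this example can be illustrated with a short sketch. The attention probabilities are copied from the example rather than produced by a model, the function names are hypothetical, and pairing every above-threshold field of one sequence with every above-threshold field of the other is an assumption about how the pair is formed.

```python
def fields_above_threshold(attention, threshold):
    """Return the fields whose attention probability exceeds the threshold."""
    return [field for field, prob in attention if prob > threshold]


def synonym_pairs(attention_a, attention_b, threshold=0.5):
    """Pair the above-threshold fields of two co-click search text sequences.

    ASSUMPTION: every high-attention field of the first sequence is paired
    with every different high-attention field of the second sequence.
    """
    high_a = fields_above_threshold(attention_a, threshold)
    high_b = fields_above_threshold(attention_b, threshold)
    return [(a, b) for a in high_a for b in high_b if a != b]


# Attention distribution probabilities taken from the example in the text.
first = [("how", 0.1), ("do", 0.1), ("transcription", 0.7), ("good eat", 0.1)]
second = [("chaos", 0.7), ("how", 0.1), ("do", 0.1), ("good eat", 0.1)]
pairs = synonym_pairs(first, second)
print(pairs)
```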
In this way, synonym pairs can be determined even when the fields of the search text sequences to be distinguished appear in different orders, which expands the range of synonyms and improves the accuracy of determining synonyms.
In the embodiment of the invention, a server can obtain a co-click search text sequence set, determine the attention distribution probability of each field contained in each search text sequence in the co-click search text sequence set based on an attention model, train a synonym discrimination model according to the attention distribution probability of each field contained in each search text sequence to obtain a synonym discrimination model introducing an attention mechanism, and input the co-click search text sequence to be discriminated into the synonym discrimination model introducing the attention mechanism to determine a synonym pair in the co-click search text sequence to be discriminated. By the embodiment, the accuracy of determining the synonym pair is improved, and the semantic range of the synonym pair is expanded.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a synonym determination apparatus according to an embodiment of the present invention. Specifically, the apparatus comprises: an obtaining module 501, an assignment module 502, a training module 503 and a determining module 504;
an obtaining module 501, configured to obtain a co-click search text sequence set, where the co-click search text sequence set includes multiple co-click search text sequences, and each search text sequence in each co-click search text sequence has a corresponding search result;
an assignment module 502, configured to determine, based on an attention model, the attention allocation probability of each field included in each search text sequence;
a training module 503, configured to train a synonym discrimination model according to the attention allocation probability of each field included in each search text sequence, to obtain a synonym discrimination model introducing an attention mechanism;
a determining module 504, configured to input the co-click search text sequence to be determined into the synonym discrimination model introducing the attention mechanism, so as to determine a synonym pair in the co-click search text sequence to be determined.
Further, when the obtaining module 501 obtains the co-click search text sequence set, the obtaining module is specifically configured to:
determining the number of times that different search text sequences correspond to associated search results;
and determining the co-click search text sequence set according to the number of times that the different search text sequences correspond to associated search results.
Further, the different search text sequences include a first search text sequence and a second search text sequence; when determining the co-click search text sequence set according to the number of times that the different search text sequences correspond to associated search results, the obtaining module 501 is specifically configured to:
obtaining a search result of the first search text sequence;
obtaining a search result of the second search text sequence;
determining the number of times that the search result of the first search text sequence and the search result of the second search text sequence have an associated search result;
and if the number of times of associated search results is greater than a preset count threshold, determining that the first search text sequence and the second search text sequence belong to the co-click search text sequence set.
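The co-click test described in these steps can be sketched as follows. The sketch assumes that two search results are "associated" when they are the same clicked result, and that each search text sequence comes with a list of clicked result identifiers. The function names and the identifiers doc1 to doc4 are hypothetical.

```python
from collections import Counter


def count_associated_results(results_a, results_b):
    """Count how many times the two result lists share a result.

    ASSUMPTION: results are 'associated' when their identifiers are equal.
    Counter intersection keeps the smaller multiplicity of each shared result.
    """
    shared = Counter(results_a) & Counter(results_b)
    return sum(shared.values())


def is_co_click_pair(results_a, results_b, count_threshold):
    """Return True when the shared-result count is greater than the threshold."""
    return count_associated_results(results_a, results_b) > count_threshold


# Hypothetical clicked results for a first and a second search text sequence.
first_results = ["doc1", "doc2", "doc3", "doc1"]
second_results = ["doc1", "doc3", "doc4", "doc1"]
shared_count = count_associated_results(first_results, second_results)
print(shared_count, is_co_click_pair(first_results, second_results, 2))
```

The comparison is strict, matching the "greater than" wording of the step: a count equal to the threshold does not qualify.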
Further, when the assignment module 502 determines the attention assignment probability of each field included in each search text sequence based on the attention model, it is specifically configured to:
splitting each search text sequence to obtain at least one field corresponding to each search text sequence;
determining an initial attention allocation probability for each of the at least one field;
and updating the initial attention distribution probability according to the times of the fields appearing in the search text sequences so as to determine the attention distribution probability of the fields.
Further, when the assignment module 502 determines the initial attention assignment probability of each field of the at least one field, it is specifically configured to:
determining the part of speech of each field in the at least one field;
and determining the initial attention distribution probability corresponding to the part of speech of each field according to the corresponding relation between the preset part of speech and the attention distribution probability.
Further, when the assignment module 502 determines the initial attention allocation probability of each field of the at least one field, it is specifically configured to:
determining the number of fields in each search text sequence;
and setting equal initial attention distribution probability for each field according to the number of the fields in each search text sequence, wherein the sum of the initial attention distribution probabilities of each field is 1.
Further, when the determining module 504 inputs the co-click search text sequence to be determined into the synonym discrimination model introducing the attention mechanism to determine the synonym pair in the co-click search text sequence to be determined, it is specifically configured to:
inputting the co-click search text sequence to be judged into the synonym judgment model introducing the attention mechanism to obtain the attention distribution probability corresponding to each field in the co-click search text sequence to be judged;
and determining the fields whose corresponding attention distribution probability is greater than a preset probability threshold as a synonym pair.
The method and the device can obtain a co-click search text sequence set, determine the attention distribution probability of each field contained in each search text sequence in the co-click search text sequence set based on an attention model, train a synonym discrimination model according to the attention distribution probability of each field contained in each search text sequence to obtain a synonym discrimination model introducing an attention mechanism, and input the co-click search text sequence to be discriminated into the synonym discrimination model introducing the attention mechanism to determine a synonym pair in the co-click search text sequence to be discriminated. By the embodiment, the accuracy of determining the synonym pair is improved, and the semantic range of the synonym pair is expanded.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a server according to an embodiment of the present invention. Specifically, the server includes: memory 601, processor 602.
In one embodiment, the server further comprises a data interface 603, the data interface 603 being configured to communicate data information between the synonym determination device and other devices.
The memory 601 may include a volatile memory (volatile memory); the memory 601 may also include a non-volatile memory (non-volatile memory); the memory 601 may also comprise a combination of the above kinds of memories. The processor 602 may be a Central Processing Unit (CPU). The processor 602 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), or any combination thereof.
The memory 601 is used for storing program instructions, and the processor 602 can call the program instructions stored in the memory 601 for executing the following steps:
acquiring a co-click search text sequence set, wherein the co-click search text sequence set comprises a plurality of co-click search text sequences, and each search text sequence in each co-click search text sequence has a corresponding associated search result;
determining attention distribution probability of each field contained in each search text sequence based on an attention model;
training a synonym discrimination model according to the attention distribution probability of each field contained in each search text sequence to obtain a synonym discrimination model introducing an attention mechanism;
and inputting the co-click search text sequence to be distinguished into the synonym distinguishing model introducing the attention mechanism so as to determine the synonym pair in the co-click search text sequence to be distinguished.
Further, when the processor 602 acquires the co-click search text sequence set, it is specifically configured to:
determining the number of times that different search text sequences correspond to associated search results;
and determining the co-click search text sequence set according to the number of times that the different search text sequences correspond to associated search results.
Further, the different search text sequences include a first search text sequence and a second search text sequence; the processor 602, when determining the co-click search text sequence set according to the number of times that the different search text sequences have associated search results, is specifically configured to:
obtaining a search result of the first search text sequence;
obtaining a search result of the second search text sequence;
determining the number of times that the search result of the first search text sequence and the search result of the second search text sequence have an associated search result;
and if the number of times of associated search results is greater than a preset count threshold, determining that the first search text sequence and the second search text sequence belong to the co-click search text sequence set.
Further, when the processor 602 determines the attention allocation probability of each field included in each search text sequence based on the attention model, it is specifically configured to:
splitting each search text sequence to obtain at least one field corresponding to each search text sequence;
determining an initial attention allocation probability for each of the at least one field;
and updating the initial attention distribution probability according to the times of the fields appearing in the search text sequences so as to determine the attention distribution probability of the fields.
Further, when the processor 602 determines the initial attention allocation probability of each field of the at least one field, it is specifically configured to:
determining the part of speech of each field in the at least one field;
and determining the initial attention distribution probability corresponding to the part of speech of each field according to the corresponding relation between the preset part of speech and the attention distribution probability.
Further, when the processor 602 determines the initial attention allocation probability of each field of the at least one field, it is specifically configured to:
determining the number of fields in each search text sequence;
and setting equal initial attention distribution probability for each field according to the number of the fields in each search text sequence, wherein the sum of the initial attention distribution probabilities of each field is 1.
Further, when the processor 602 inputs the co-click search text sequence to be determined into the synonym discrimination model introducing the attention mechanism to determine a synonym pair in the co-click search text sequence to be determined, it is specifically configured to:
inputting the co-click search text sequence to be judged into the synonym judgment model introducing the attention mechanism to obtain the attention distribution probability corresponding to each field in the co-click search text sequence to be judged;
and determining the fields whose corresponding attention distribution probability is greater than a preset probability threshold as a synonym pair.
In the embodiment of the invention, a server can obtain a co-click search text sequence set, determine the attention distribution probability of each field contained in each search text sequence in the co-click search text sequence set based on an attention model, train a synonym discrimination model according to the attention distribution probability of each field contained in each search text sequence to obtain a synonym discrimination model introducing an attention mechanism, and input the co-click search text sequence to be discriminated into the synonym discrimination model introducing the attention mechanism to determine a synonym pair in the co-click search text sequence to be discriminated. By the embodiment, the accuracy of determining the synonym pair is improved, and the semantic range of the synonym pair is expanded.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the method described in the embodiment corresponding to fig. 4 of the present invention, and may also implement the apparatus described in the embodiment corresponding to fig. 5 of the present invention, which is not described herein again.
The computer readable storage medium may be an internal storage unit of the device according to any of the foregoing embodiments, for example, a hard disk or a memory of the device. The computer readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may also include both an internal storage unit and an external storage device of the device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the terminal. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the invention has been described with reference to a number of embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for determining synonyms, comprising:
acquiring a co-click search text sequence set, wherein the co-click search text sequence set comprises a plurality of co-click search text sequences, and each search text sequence in each co-click search text sequence corresponds to a related search result;
determining attention distribution probability of each field contained in each search text sequence based on an attention model;
training a synonym discrimination model according to the attention distribution probability of each field contained in each search text sequence to obtain a synonym discrimination model introducing an attention mechanism;
inputting the co-click search text sequence to be distinguished into the synonym distinguishing model introducing the attention mechanism to determine the synonym pair in the co-click search text sequence to be distinguished.
2. The method of claim 1, wherein obtaining the set of co-click search text sequences comprises:
determining the times of corresponding associated search results of different search text sequences;
and determining the co-click search text sequence set according to the times of the associated search results corresponding to the different search text sequences.
3. The method of claim 2, wherein the different search text sequences comprise a first search text sequence and a second search text sequence; determining the co-click search text sequence set according to the times of the associated search results corresponding to the different search text sequences, including:
obtaining a search result of the first search text sequence;
obtaining a search result of the second search text sequence;
determining the number of times that the search result of the first search text sequence and the search result of the second search text sequence have an associated search result;
and if the times of the associated search results are greater than a preset time threshold, determining that the first search text sequence and the second search text sequence are the co-click search text sequence set.
4. The method of claim 1, wherein said determining an attention allocation probability for each field included in each search text sequence based on an attention model comprises:
splitting each search text sequence to obtain at least one field corresponding to each search text sequence;
determining an initial attention allocation probability for each of the at least one field;
and updating the initial attention distribution probability according to the times of the fields appearing in the search text sequences so as to determine the attention distribution probability of the fields.
5. The method of claim 4, wherein determining an initial attention allocation probability for each of the at least one field comprises:
determining the part of speech of each field in the at least one field;
and determining the initial attention distribution probability corresponding to the part of speech of each field according to the corresponding relation between the preset part of speech and the attention distribution probability.
6. The method of claim 4, wherein determining an initial attention allocation probability for each of the at least one field comprises:
determining the number of fields in each search text sequence;
and setting equal initial attention distribution probability for each field according to the number of the fields in each search text sequence, wherein the sum of the initial attention distribution probabilities of each field is 1.
7. The method according to claim 1, wherein the inputting the co-click search text sequence to be distinguished into the synonym distinguishing model of the attention-introducing mechanism to determine the synonym pair in the co-click search text sequence to be distinguished comprises:
inputting the co-click search text sequence to be distinguished into the synonym distinguishing model introducing the attention mechanism to obtain the attention distribution probability corresponding to each field in the co-click search text sequence to be distinguished;
and determining the fields with the attention distribution probability larger than a preset probability threshold value corresponding to each field as synonym pairs.
8. An apparatus for determining synonyms, the apparatus comprising:
the acquisition module is used for acquiring a co-click search text sequence set, wherein the co-click search text sequence set comprises a plurality of co-click search text sequences, and each search text sequence in each co-click search text sequence has a corresponding search result;
the distribution module is used for determining the attention distribution probability of each field contained in each search text sequence based on an attention model;
the training module is used for training the synonym discrimination model according to the attention distribution probability of each field contained in each search text sequence to obtain the synonym discrimination model introducing an attention mechanism;
and the determining module is used for inputting the co-click search text sequence to be determined into the synonym determination model introducing the attention mechanism so as to determine the synonym pair in the co-click search text sequence to be determined.
9. A server, characterized in that it comprises a processor and a storage device, said processor and storage device being connected to each other, wherein said storage device is used to store a computer program comprising program instructions, said processor being configured to invoke said program instructions to execute the method according to any of claims 1-7.
10. A computer-readable storage medium, having stored thereon program instructions for implementing the method of any one of claims 1-7 when executed.
CN201910699704.5A 2019-07-29 2019-07-29 Synonym determination method, synonym determination device, server and readable storage medium Active CN110413737B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910699704.5A CN110413737B (en) 2019-07-29 2019-07-29 Synonym determination method, synonym determination device, server and readable storage medium

Publications (2)

Publication Number Publication Date
CN110413737A CN110413737A (en) 2019-11-05
CN110413737B true CN110413737B (en) 2022-10-14

Family

ID=68364503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910699704.5A Active CN110413737B (en) 2019-07-29 2019-07-29 Synonym determination method, synonym determination device, server and readable storage medium

Country Status (1)

Country Link
CN (1) CN110413737B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046966B (en) * 2019-12-18 2022-04-05 江南大学 Image subtitle generating method based on measurement attention mechanism
CN111881255B (en) * 2020-06-24 2023-10-27 百度在线网络技术(北京)有限公司 Synonymous text acquisition method and device, electronic equipment and storage medium

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101226532A (en) * 2007-12-28 2008-07-23 腾讯科技(北京)有限公司 Method and system for extracting homoionym in network
CN102750282A (en) * 2011-04-19 2012-10-24 北京百度网讯科技有限公司 Synonym template mining method and device as well as synonym mining method and device
CN102760127A (en) * 2011-04-26 2012-10-31 北京百度网讯科技有限公司 Method, device and equipment for determining resource type based on extended text information
CN105095203A (en) * 2014-04-17 2015-11-25 阿里巴巴集团控股有限公司 Methods for determining and searching synonym, and server
CN105279252A (en) * 2015-10-12 2016-01-27 广州神马移动信息科技有限公司 Related word mining method, search method and search system
CN105426379A (en) * 2014-10-22 2016-03-23 武汉理工大学 Keyword weight calculation method based on position of word
CN106202382A (en) * 2016-07-08 2016-12-07 南京缘长信息科技有限公司 Link instance method and system
CN107491518A (en) * 2017-08-15 2017-12-19 北京百度网讯科技有限公司 Method and apparatus, server, storage medium are recalled in one kind search
CN107679030A (en) * 2017-09-04 2018-02-09 北京京东尚科信息技术有限公司 Method and apparatus based on user's operation behavior data extraction synonym
CN108052520A (en) * 2017-11-01 2018-05-18 平安科技(深圳)有限公司 Conjunctive word analysis method, electronic device and storage medium based on topic model
CN108509474A (en) * 2017-09-15 2018-09-07 腾讯科技(深圳)有限公司 Search for the synonym extended method and device of information
CN109241294A (en) * 2018-08-29 2019-01-18 国信优易数据有限公司 A kind of entity link method and device
CN109902154A (en) * 2018-11-30 2019-06-18 华为技术有限公司 Information processing method, device, service equipment and computer readable storage medium
CN109902144A (en) * 2019-01-11 2019-06-18 杭州电子科技大学 A kind of entity alignment schemes based on improvement WMD algorithm
CN109918661A (en) * 2019-03-04 2019-06-21 腾讯科技(深圳)有限公司 Synonym acquisition methods and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160004766A1 (en) * 2006-10-10 2016-01-07 Abbyy Infopoisk Llc Search technology using synonims and paraphrasing
US9998481B2 (en) * 2015-09-16 2018-06-12 Mastercard International Incorporated Systems and methods for use in scoring entities in connection with preparedness of the entities for cyber-attacks

Also Published As

Publication number Publication date
CN110413737A (en) 2019-11-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant