WO2020074022A1 - Synonym search method and device

Info

Publication number: WO2020074022A1
Application number: PCT/CN2019/124513
Authority: WO
Inventors: 赵荣生; 宋再伟; 刘爽; 马悦; 周旻
Original assignee: 北京大学第三医院; 北京诺道认知医学科技有限公司
Priority date: 2018-10-11
Filing date: 2019-12-11
Publication date: 2020-04-16
Also published as: CN109543175A; CN109543175B

Abstract

A synonym search method and device, said method comprising: inputting into an optimized word vector matrix a component word to be matched, said optimized word vector matrix being obtained using pre-set models, said pre-set models including a Word2vec model used for obtaining the component word and a Skip-Gram model trained using said component word as a training sample, said component word to be matched being a component word in a pre-set word library (S101); obtaining from the optimized word vector matrix a target word vector corresponding to the component word to be matched; calculating separately the cosine distances between the target word vector and other vectors in the optimized word vector matrix (S102); obtaining n synonyms of the component word to be matched according to all cosine distances and the pre-set word library (S103). Said device implements the method, and enhances the accuracy of synonym search.

Description

Method and device for searching synonyms

Cross-reference of related applications

This application claims priority to the Chinese patent application filed on October 11, 2018 with application number 2018111816859 and entitled "a method and device for finding synonyms", which is incorporated into this disclosure by reference in its entirety.

Technical field

Embodiments of the present disclosure relate to the field of word processing technology, and in particular, to a method and device for searching synonyms.

Background art

Synonym search is an important research topic. Existing synonym search methods count how often each word occurs in the current text and in the entire text collection, use this word-frequency information to model the text as a vector with an algorithm such as one-hot encoding or tf-idf, and then compute the similarity between words with measures such as the cosine similarity or Jaccard similarity between vectors. In other words, the existing technology finds synonyms with similarity methods based on word-frequency information.

However, studying the semantics of a word actually requires working out how people use that word when they describe objective things and express their own ideas: where it is used, when it is used, and which words it is used with. In other words, for communication to be meaningful, a discussion or description of something must include, in addition to the thing itself, a certain context, and the intended semantics are expressed through the interaction between the thing and the other elements in that context. In the prior art, however, synonym search relies on word frequency alone, so the accuracy of the synonyms found is not high.

Therefore, how to avoid the above defects and improve the accuracy of synonym searching has become an urgent problem to be solved.

Summary of the invention

In response to the problems in the prior art, embodiments of the present disclosure provide a method and device for searching synonyms.

In a first aspect, an embodiment of the present disclosure provides a method for finding synonyms, the method includes:

Inputting a word segment to be searched into an optimized word vector matrix, wherein the optimized word vector matrix is obtained by using preset models, the preset models include a Word2vec model used to obtain word segments and a SKIP-GRAM model trained with the word segments as training samples, and the word segment to be searched is a word segment in a preset lexicon;

Obtaining, in the optimized word vector matrix, the target word vector corresponding to the word segment to be searched, and separately calculating the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix;

Obtaining n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon.

In a second aspect, an embodiment of the present disclosure provides a device for finding synonyms, the device includes:

An input unit, configured to input a word segment to be searched into an optimized word vector matrix, wherein the optimized word vector matrix is obtained by using preset models, the preset models include a Word2vec model used to obtain word segments and a SKIP-GRAM model trained with the word segments as training samples, and the word segment to be searched is a word segment in a preset lexicon;

A calculation unit, configured to obtain, in the optimized word vector matrix, the target word vector corresponding to the word segment to be searched, and to separately calculate the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix;

A search unit, configured to obtain n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon.

In a third aspect, an embodiment of the present disclosure provides an electronic device, including: a processor, a memory, and a bus, wherein,

The processor and the memory complete communication with each other through the bus;

The memory stores program instructions executable by the processor, and the processor can execute the following methods by calling the program instructions:

Obtaining the target word vector corresponding to the word segmentation to be searched in the optimized word vector matrix; and calculating the cosine distance of the target word vector and other vectors in the optimized word vector matrix separately;

According to a fourth aspect, an embodiment of the present disclosure provides a non-transitory computer-readable storage medium, including:

The non-transitory computer-readable storage medium stores computer instructions that cause the computer to perform the following methods:

The method and device for searching synonyms provided by the embodiments of the present disclosure obtain the optimized word vector matrix through the Word2vec model and the SKIP-GRAM model, and calculate the cosine distances between the target word vector of the word segment to be searched and the other vectors in the optimized word vector matrix. Based on all the cosine distances, unrelated word segments are removed with the help of the preset lexicon to obtain n synonyms, which improves the accuracy of the synonym search.

Brief description of the drawings

In order to explain the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present disclosure, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of a method for finding synonyms in an embodiment of the present disclosure;

FIG. 2 is a screenshot of sliding window word extraction according to an embodiment of the present disclosure;

FIG. 3 is a graph of word segmentation search results according to an embodiment of the present disclosure;

FIG. 4 is a schematic structural diagram of a device for searching synonyms according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present disclosure.

Detailed description

To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present disclosure without creative effort fall within the protection scope of the present disclosure.

FIG. 1 is a schematic flowchart of a method for searching for synonyms in an embodiment of the present disclosure. As shown in FIG. 1, a method for searching for synonyms in an embodiment of the present disclosure includes the following steps:

S101: Input the word segment to be searched into the optimized word vector matrix. The optimized word vector matrix is obtained by using preset models; the preset models include a Word2vec model used to obtain word segments and a SKIP-GRAM model trained with the word segments as training samples; the word segment to be searched is a word segment in a preset lexicon.

Specifically, the device inputs the word segment to be searched into the optimized word vector matrix. The optimized word vector matrix is obtained by using preset models; the preset models include a Word2vec model used to obtain word segments and a SKIP-GRAM model trained with the word segments as training samples. The word segment to be searched is a word segment in a preset lexicon, and the preset lexicon may be a medical lexicon containing medical professional words. Obtaining the optimized word vector matrix may include: segmenting the corpus into word segments (the jieba library may be used for this), where the corpus includes, but is not limited to, the word segments in the preset lexicon; obtaining, from the resulting word segments, the target word segments contained in the preset lexicon; merging the target word segments according to the preset lexicon to obtain merged words, where the preset lexicon includes the correspondence between preset merged words and preset word segments; constructing an initial word vector matrix from the merged words and the remaining unmerged word segments, where the initial word vector matrix is an N × M matrix, N is the total number of word segments (the merged words plus the remaining unmerged word segments), and M is the vector dimension of each word segment; using the Word2vec model to perform sliding-window word extraction on the corpus to obtain training samples; and using the SKIP-GRAM model to train on the training samples, starting from the initial word vector matrix, to obtain the optimized word vector matrix. An example follows. Example sentence: Objective: to study the adverse reactions of high-dose methotrexate (hd-mtx, 5g/m2) plus calcium tetrahydrofolate (cf) rescue scheme in the treatment of childhood acute lymphoblastic leukemia (all). Word segmentation result:

['Purpose', 'Research', 'High dose', 'Methotrexate', '(', 'hd', '-', 'mtx', '5g', '/', 'm2', ')', 'Add', 'Tetrahydro', 'Folic Acid', 'Calcium', '(', 'cf', ')', 'Rescue', 'Scheme', 'Treatment', 'Children', 'Acute', 'Lymphocyte', 'leukemia', '(', 'all', ')', '', 'adverse reaction'].

The preset lexicon contains the correspondence between 'tetrahydro', 'folic acid', 'calcium' and 'calcium tetrahydrofolate', so the target word segments are 'tetrahydro', 'folic acid' and 'calcium', and the merged word 'calcium tetrahydrofolate' is obtained. The word segments 'hd', '-', 'mtx' are merged into 'hd-mtx' in the same way, which is not repeated here. The following content (the merged words plus the remaining unmerged word segments) is then used to build the initial word vector matrix:

['Purpose', 'Research', 'High dose', 'Methotrexate', '(', 'hd-mtx', '5g', '/', 'm2', ')', 'Add', 'Calcium tetrahydrofolate', '(', 'cf', ')', 'Rescue', 'Scheme', 'Treatment', 'Children', 'Acute', 'Lymphocyte', 'Leukemia', '(', 'all', ')', '', 'adverse reaction'].

The vector dimension can be set as needed, for example 128, and each vector element may be a random number in [-1, 1]. FIG. 2 is a screenshot of sliding-window word extraction according to an embodiment of the present disclosure. The window width is 2, and the sliding-window process is shown in FIG. 2.
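The merging step and the construction of the initial matrix can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the function names `merge_tokens` and `init_word_vectors`, the dictionary layout of the lexicon correspondence, the lowercase tokens, and the choice of one matrix row per distinct word segment are all assumptions made for the example.

```python
import random

def merge_tokens(tokens, lexicon_merges):
    """Replace runs of sub-tokens with the merged word listed in the lexicon.

    lexicon_merges maps a tuple of sub-tokens to its merged word
    (a hypothetical layout for the correspondence held in the preset lexicon).
    """
    max_len = max((len(key) for key in lexicon_merges), default=1)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate run first so that longer terms win.
        for size in range(min(max_len, len(tokens) - i), 1, -1):
            run = tuple(tokens[i:i + size])
            if run in lexicon_merges:
                merged.append(lexicon_merges[run])
                i += size
                break
        else:
            # No run starting here is in the lexicon: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged

def init_word_vectors(vocab, dim=128, seed=0):
    """Build the initial N x M matrix with elements drawn uniformly from [-1, 1]."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in vocab]

tokens = ["high dose", "methotrexate", "(", "hd", "-", "mtx", ")", "plus",
          "tetrahydro", "folic acid", "calcium", "(", "cf", ")"]
lexicon_merges = {
    ("tetrahydro", "folic acid", "calcium"): "calcium tetrahydrofolate",
    ("hd", "-", "mtx"): "hd-mtx",
}
merged = merge_tokens(tokens, lexicon_merges)
vocab = sorted(set(merged))        # assumption: one row per distinct word segment
matrix = init_word_vectors(vocab)  # N = len(vocab), M = 128
```

Trying the longest run first is only one possible policy; the disclosure does not say how overlapping lexicon entries are resolved.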

The training process is a mature technology in the field. The word segments in the context can be defined as positive samples. Assuming that 64 negative samples are used, the negative samples are selected as follows: 64 word segments are chosen at random from the remaining word segments that are not in the context. When optimizing the loss function, the principle is to make the predicted probability of the positive samples increasingly high and that of the negative samples increasingly low, which reduces the amount of calculation and speeds up model training. All word segments are traversed with the sliding window, the word vectors are trained and optimized with the SKIP-GRAM model, and the final optimized word vector matrix is obtained.

It should be noted that, based on the principle of the SKIP-GRAM model, the prediction results take the probability of the context word segments into account, which improves the accuracy of finding synonyms. The Word2vec model obtains the word segments and then the word segment vectors, and the word segment vectors are then trained.
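The positive and negative sample principle described above can be illustrated with a small skip-gram negative-sampling update. This is a sketch under stated assumptions rather than the training code of the disclosure: it keeps separate input and output vectors (`w_in`, `w_out`) as is common in word2vec implementations, uses a plain gradient step with a hypothetical learning rate, and draws only 2 negatives because the toy vocabulary has 6 entries (the text above assumes 64).

```python
import math
import random

def sigmoid(x):
    x = max(-30.0, min(30.0, x))  # clamp so that exp() cannot overflow
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sample_negatives(vocab_size, exclude, count, rng):
    """Pick `count` word ids at random from those outside the context set."""
    candidates = [i for i in range(vocab_size) if i not in exclude]
    return rng.sample(candidates, min(count, len(candidates)))

def sgns_step(center, context, negatives, w_in, w_out, lr=0.05):
    """One update: raise the score of the true context, lower it for negatives."""
    v = w_in[center]
    grad_v = [0.0] * len(v)
    for word, label in [(context, 1.0)] + [(neg, 0.0) for neg in negatives]:
        u = w_out[word]
        g = lr * (label - sigmoid(dot(v, u)))
        for k in range(len(v)):
            grad_v[k] += g * u[k]  # accumulate with the old value of u[k]
            u[k] += g * v[k]
    for k in range(len(v)):
        v[k] += grad_v[k]

# Toy run: ids 0 and 1 form a positive pair, ids 2..5 supply negatives.
rng = random.Random(0)
vocab_size, dim = 6, 8
w_in = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]
w_out = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]
center, context = 0, 1
for _ in range(200):
    negatives = sample_negatives(vocab_size, {center, context}, 2, rng)
    sgns_step(center, context, negatives, w_in, w_out)
positive_score = sigmoid(dot(w_in[center], w_out[context]))
```

Each update touches only the sampled rows instead of the whole vocabulary, which is where the reduction in the amount of calculation comes from.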

S102: Obtain a target word vector corresponding to the word segmentation to be found in the optimized word vector matrix; and calculate the cosine distance of the target word vector and other vectors in the optimized word vector matrix, respectively.

Specifically, the device obtains, in the optimized word vector matrix, the target word vector corresponding to the word segment to be searched, and separately calculates the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix. For example, suppose the word segment to be searched is "cell" and "cell" corresponds to the tenth row of the optimized word vector matrix; the 128-dimensional word vector in the tenth row is then the target word vector. If the optimized word vector matrix has N rows, the N-1 cosine distances between the target word vector and the other N-1 row vectors are calculated one by one. The specific cosine distance calculation is a mature technology in the art and is not repeated here.
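The calculation in S102 can be sketched as below. The sketch assumes the common definition of cosine distance as one minus cosine similarity, which agrees with the later step of sorting distances from small to large; the function names and the tiny 2-dimensional matrix are illustrative only.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(a, b); a value near 0 means the directions nearly coincide."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

def distances_to_others(matrix, target_row):
    """Map every other row index to its cosine distance from the target row."""
    target = matrix[target_row]
    return {i: cosine_distance(target, row)
            for i, row in enumerate(matrix) if i != target_row}

# Tiny stand-in for the optimized matrix (real rows would have 128 dimensions).
matrix = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
distances = distances_to_others(matrix, 0)  # N - 1 = 3 distances
```

With N rows the result always holds N-1 entries, one per row other than the target.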

S103: Acquire n synonyms of the word segmentation to be searched based on all cosine distances and the preset lexicon.

Specifically, the device obtains the n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon. This may include: sorting the other vectors in ascending order of their cosine distances; obtaining the word segment corresponding to the first vector in the sorted order and determining whether it is in the preset lexicon; if it is, taking the word segment corresponding to the first vector as a synonym, then determining whether the word segment corresponding to the second vector is in the preset lexicon, and repeating until n synonyms are obtained. If it is not, the word segment corresponding to the first vector is discarded; it is then determined whether the word segment corresponding to the second vector is in the preset lexicon, and the process is repeated until n synonyms are obtained.
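The selection loop of S103 can be sketched as follows. The function name `top_n_synonyms`, the example vocabulary, the distance values, and the lexicon contents are hypothetical and serve only to show the filtering order.

```python
def top_n_synonyms(distances, vocab, lexicon, n):
    """Walk candidates from the smallest cosine distance upward and keep those in the lexicon."""
    synonyms = []
    for idx in sorted(distances, key=distances.get):
        word = vocab[idx]
        if word not in lexicon:
            continue  # not a lexicon word: discard and move on to the next vector
        synonyms.append(word)
        if len(synonyms) == n:
            break
    return synonyms

vocab = ["cell", "lymphocyte", "stem cell", "research", "tumor"]
distances = {1: 0.10, 2: 0.20, 3: 0.25, 4: 0.40}  # distances from "cell" (row 0)
lexicon = {"cell", "lymphocyte", "stem cell", "tumor"}
result = top_n_synonyms(distances, vocab, lexicon, n=3)
# "research" is closer than "tumor" but is skipped because it is not in the lexicon.
```

If fewer than n lexicon words exist among the candidates, this sketch simply returns the ones it found; the disclosure does not describe that case.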

FIG. 3 is a graph of word segment search results according to an embodiment of the present disclosure. Referring to the above example, suppose the sorted order is vector A, vector B, vector C, and so on. The value of n can be set as needed, for example 5. It is first determined whether the word segment corresponding to vector A is in the preset lexicon; if it is, that word segment is taken as a synonym of "cell", such as "lymphocyte" in FIG. 3, and one synonym has been found. It is then determined whether the word segment corresponding to vector B is in the preset lexicon; if it is, that word segment is also a synonym of "cell", such as "stem cell" in FIG. 3, and two synonyms have been found. It is then determined whether the word segment corresponding to vector C is in the preset lexicon; if it is not, that is, it is not a medical professional word, it cannot be used as a synonym of "cell" (and is not shown in FIG. 3), so the count remains two. These steps are repeated until 5 synonyms are found. FIG. 3 also shows overlapping pairs such as cell and lymphocyte, tumor and osteosarcoma, and leukemia and lymphoma; that is, the closer the points corresponding to two word segments are in FIG. 3, the closer the meanings of the words.

It should be noted that the preset models used in the embodiments of the present disclosure can find synonyms accurately with relatively few vector dimensions, such as 128. Compared with the models used in the prior art, the number of vector dimensions required for an accurate search is greatly reduced. The method of the embodiments of the present disclosure therefore also has the technical effect of saving computing resources and improving computing efficiency.

After this step, the method may further include: reducing the vectors corresponding to the n synonyms to two dimensions, and displaying the n synonyms in a plane. The dimensionality reduction can be performed with PCA. As FIG. 3 shows, this makes the degree of synonymy between word segments easier to see.

The method for finding synonyms provided by the embodiments of the present disclosure obtains the optimized word vector matrix through the Word2vec model and the SKIP-GRAM model, and calculates the cosine distances between the target word vector of the word segment to be searched and the other vectors in the optimized word vector matrix. Based on all the cosine distances, unrelated word segments are removed with the help of the preset lexicon to obtain n synonyms, which improves the accuracy of the synonym search.
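A dependency-free sketch of the two-dimensional reduction is given below. The disclosure only states that PCA may be used; computing the two leading principal directions by power iteration with deflation is a choice made here for brevity, and the function name `pca_2d` and the sample points are assumptions.

```python
import math
import random

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pca_2d(vectors, iters=200, seed=0):
    """Project vectors onto their two leading principal directions."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(dim)]
    centered = [[v[k] - mean[k] for k in range(dim)] for v in vectors]
    rng = random.Random(seed)
    components = []
    for _ in range(2):
        w = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for _ in range(iters):
            # Apply X^T X to w without forming the covariance matrix.
            scores = [_dot(row, w) for row in centered]
            w = [sum(scores[i] * centered[i][k] for i in range(n))
                 for k in range(dim)]
            # Deflation: remove the directions that were already found.
            for c in components:
                proj = _dot(w, c)
                w = [a - proj * b for a, b in zip(w, c)]
            norm = math.sqrt(_dot(w, w))
            if norm < 1e-12:
                w = [0.0] * dim  # no variance left in any remaining direction
                break
            w = [a / norm for a in w]
        components.append(w)
    return [[_dot(row, c) for c in components] for row in centered]

# Four 3-dimensional points whose variance lies mostly along the first axis.
points = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
coords = pca_2d(points)  # one (x, y) pair per input vector, ready for plotting
```

The sign of each axis is arbitrary, as with any PCA, so only relative positions in the resulting plane are meaningful.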

Based on the above embodiments, the obtaining n synonyms of the word segmentation to be searched for based on all cosine distances and the preset lexicon includes:

The other vectors corresponding to all cosine distances are sorted in order of the values of all cosine distances from small to large.

Specifically, the device sorts the other vectors corresponding to all cosine distances in the order of the values of all cosine distances from small to large. Reference may be made to the above embodiment, and no further description will be given.

The word segmentation corresponding to the first vector in the sorting is acquired, and it is determined whether the word segmentation corresponding to the first vector is in the preset thesaurus.

Specifically, the device obtains the word segmentation corresponding to the first vector in the sorting, and determines whether the word segmentation corresponding to the first vector is in the preset word library. Reference may be made to the above embodiment, and no further description will be given.

If it is, the word segment corresponding to the first vector is taken as a synonym; it is then determined whether the word segment corresponding to the second vector is in the preset lexicon, and the process is repeated until n synonyms are obtained.

Specifically, if the device determines that it is, the device takes the word segment corresponding to the first vector as a synonym, then determines whether the word segment corresponding to the second vector is in the preset lexicon, and repeats the process until n synonyms are obtained. Reference may be made to the above embodiment, and no further description will be given.

The method for searching synonyms provided by the embodiments of the present disclosure can further improve the accuracy of searching for synonyms.

Based on the above embodiment, the method further includes:

If it is not, the word segment corresponding to the first vector is discarded; it is then determined whether the word segment corresponding to the second vector is in the preset lexicon, and the process is repeated until n synonyms are obtained.

Specifically, if the device determines that it is not, the device discards the word segment corresponding to the first vector, then determines whether the word segment corresponding to the second vector is in the preset lexicon, and repeats the process until n synonyms are obtained. Reference may be made to the above embodiment, and no further description will be given.

The method for searching for synonyms provided by the embodiments of the present disclosure can further improve the accuracy of searching for synonyms by excluding irrelevant participles.

Based on the above embodiment, after the step of obtaining n synonyms of the word segmentation to be searched for, the method further includes:

All vector dimensions corresponding to the n synonyms are reduced to two dimensions, and the n synonyms are displayed in a plane.

Specifically, the device reduces the vector dimensions corresponding to the n synonyms to two dimensions, and displays the n synonyms in a plane. Reference may be made to the above embodiment, and no further description will be given.

The method for finding synonyms provided by the embodiments of the present disclosure can visually display synonyms.

Based on the above embodiment, the obtaining of the optimized word vector matrix includes:

Segment the corpus.

Specifically, the device performs word segmentation on the corpus. Reference may be made to the above embodiment, and no further description will be given.

Obtain the target word segment included in the preset word library from the obtained word segmentation.

Specifically, the device obtains the target word segment included in the preset word library from the obtained word segmentation. Reference may be made to the above embodiment, and no further description will be given.

Merging the target word segmentation according to the preset word library to obtain a merged word; wherein the preset word library includes a correspondence between the preset merged word and the preset word segmentation.

Specifically, the device merges the target word segmentation according to the preset word library to obtain a merged word; wherein, the preset word library includes a correspondence between the preset merged word and the preset word segmentation. Reference may be made to the above embodiment, and no further description will be given.

An initial word vector matrix is constructed from the merged words and the remaining unmerged word segments, wherein the initial word vector matrix is an N × M matrix, N is the total number of word segments, M is the vector dimension of each word segment, and the total number of word segments is the number of merged words plus the number of remaining unmerged word segments.

Specifically, the device constructs an initial word vector matrix from the merged words and the remaining unmerged word segments, wherein the initial word vector matrix is an N × M matrix, N is the total number of word segments, M is the vector dimension of each word segment, and the total number of word segments is the number of merged words plus the number of remaining unmerged word segments. Reference may be made to the above embodiment, and no further description will be given.

Use the Word2vec model to perform sliding window word extraction on the corpus to obtain training samples.

Specifically, the device uses the Word2vec model to perform sliding window word retrieval on the corpus to obtain training samples. Reference may be made to the above embodiment, and no further description will be given.
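Sliding-window word extraction can be sketched as below. The sketch assumes that a window width of 2 means up to two word segments on each side of the center word, which is the usual skip-gram convention; FIG. 2 may define the window differently, and the function name `skipgram_pairs` is illustrative.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) training pairs taken from a sliding window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["children", "acute", "lymphocyte", "leukemia"]
pairs = skipgram_pairs(tokens, window=2)
# The first center word "children" contributes ("children", "acute")
# and ("children", "lymphocyte").
```

Each pair serves as a positive sample for SKIP-GRAM training; the window is simply truncated at sentence boundaries.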

The SKIP-GRAM model is used to train the training samples to obtain an optimized word vector matrix based on the initial word vector matrix.

Specifically, the device uses the SKIP-GRAM model to train the training samples to obtain an optimized word vector matrix based on the initial word vector matrix. Reference may be made to the above embodiment, and no further description will be given.

The method for finding synonyms provided by the embodiment of the present disclosure ensures that the method is performed normally by reasonably obtaining the optimized word vector matrix.

Based on the above embodiments, the word segmentation of the corpus includes:

Use jieba library to segment the corpus.

Specifically, the device uses the jieba library to segment the corpus. Reference may be made to the above embodiment, and no further description will be given.

The method for searching synonyms provided by the embodiments of the present disclosure can efficiently segment the corpus.

Based on the above embodiment, the preset lexicon is a medical lexicon containing medical professional words.

Specifically, the preset thesaurus in the device is a medical thesaurus containing medical professional words. Reference may be made to the above embodiment, and no further description will be given.

The method for searching synonyms provided by the embodiments of the present disclosure can improve the accuracy of searching synonyms related to medical professional words.

FIG. 4 is a schematic structural diagram of a device for searching synonyms according to an embodiment of the present disclosure. As shown in FIG. 4, an embodiment of the present disclosure provides a device for searching synonyms, which includes an input unit 401, a calculation unit 402, and a search unit 403, where:

The input unit 401 is configured to input a word segment to be searched into an optimized word vector matrix, wherein the optimized word vector matrix is obtained by using preset models, the preset models include a Word2vec model used to obtain word segments and a SKIP-GRAM model trained with the word segments as training samples, and the word segment to be searched is a word segment in a preset lexicon; the calculation unit 402 is configured to obtain, in the optimized word vector matrix, the target word vector corresponding to the word segment to be searched, and to separately calculate the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix; the search unit 403 is configured to obtain n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon.

Specifically, the input unit 401 inputs the word segment to be searched into the optimized word vector matrix, where the optimized word vector matrix is obtained by using preset models, the preset models include a Word2vec model used to obtain word segments and a SKIP-GRAM model trained with the word segments as training samples, and the word segment to be searched is a word segment in a preset lexicon; the calculation unit 402 obtains, in the optimized word vector matrix, the target word vector corresponding to the word segment to be searched, and separately calculates the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix; the search unit 403 obtains n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon.

The device for searching synonyms provided by the embodiments of the present disclosure obtains the optimized word vector matrix through the Word2vec model and the SKIP-GRAM model, and calculates the cosine distances between the target word vector of the word segment to be searched and the other vectors in the optimized word vector matrix. Based on all the cosine distances, unrelated word segments are removed with the help of the preset lexicon to obtain n synonyms, which improves the accuracy of the synonym search.

The device for searching synonyms provided in the embodiments of the present disclosure may be specifically used to execute the processing flow of each method embodiment described above, and the functions thereof are not repeated here, and reference may be made to the detailed description of the method embodiments described above.

FIG. 5 is a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present disclosure. As shown in FIG. 5, the electronic device includes: a processor (processor) 501, a memory (memory) 502, and a bus 503;

Wherein, the processor 501 and the memory 502 communicate with each other through the bus 503;

The processor 501 is configured to call program instructions in the memory 502 to execute the methods provided in the above method embodiments, for example including: inputting a word segment to be searched into an optimized word vector matrix, wherein the optimized word vector matrix is obtained by using preset models, the preset models include a Word2vec model used to obtain word segments and a SKIP-GRAM model trained with the word segments as training samples, and the word segment to be searched is a word segment in a preset lexicon; obtaining, in the optimized word vector matrix, the target word vector corresponding to the word segment to be searched, and separately calculating the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix; and obtaining n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon.

This embodiment discloses a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer can perform the methods provided by the above method embodiments, for example including: inputting a word segment to be searched into an optimized word vector matrix, wherein the optimized word vector matrix is obtained by using preset models, the preset models include a Word2vec model used to obtain word segments and a SKIP-GRAM model trained with the word segments as training samples, and the word segment to be searched is a word segment in a preset lexicon; obtaining, in the optimized word vector matrix, the target word vector corresponding to the word segment to be searched, and separately calculating the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix; and obtaining n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon.

This embodiment provides a non-transitory computer-readable storage medium. The non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to perform the methods provided by the foregoing method embodiments, for example including: inputting a word segment to be searched into an optimized word vector matrix, wherein the optimized word vector matrix is obtained by using preset models, the preset models include a Word2vec model used to obtain word segments and a SKIP-GRAM model trained with the word segments as training samples, and the word segment to be searched is a word segment in a preset lexicon; obtaining, in the optimized word vector matrix, the target word vector corresponding to the word segment to be searched, and separately calculating the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix; and obtaining n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon.

Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

The embodiments of the electronic device and the like described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.

From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software together with a necessary general-purpose hardware platform or, of course, by hardware. On this basis, the above technical solutions, in essence or in the part that contributes over the prior art, may be embodied as a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the embodiments of the present disclosure, not to limit them. Although the embodiments of the present disclosure have been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present disclosure.

Claims

A method for finding synonyms, characterized by comprising:

inputting a word segment to be searched into an optimized word vector matrix, wherein the optimized word vector matrix is obtained using preset models, the preset models include a Word2vec model for obtaining word segments and a SKIP-GRAM model trained using the word segments as training samples, and the word segment to be searched is a word segment in a preset lexicon;

obtaining, from the optimized word vector matrix, a target word vector corresponding to the word segment to be searched, and calculating the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix; and

obtaining n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon.
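
By way of non-limiting illustration, the distance calculation recited above might be sketched as follows. The sketch assumes that "cosine distance" means 1 minus the cosine similarity, so that a smaller value indicates a closer word; the source does not define the term. The function names, the dictionary representation of the matrix, and the three-dimensional toy vectors are hypothetical and are not taken from the disclosure.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(u, v); 0 means same direction, 2 means opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)

def distances_to_others(word, matrix):
    """Map every other word in `matrix` to its cosine distance from `word`."""
    target = matrix[word]  # the target word vector
    return {w: cosine_distance(target, vec)
            for w, vec in matrix.items() if w != word}

# Hypothetical word vectors, for illustration only.
toy_matrix = {
    "aspirin": [1.0, 0.0, 0.0],
    "acetylsalicylic acid": [2.0, 0.0, 0.0],
    "tablet": [0.0, 1.0, 0.0],
}
toy_distances = distances_to_others("aspirin", toy_matrix)
```

Because the cosine distance ignores vector length, the two parallel toy vectors are at distance 0 even though their magnitudes differ.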
The method according to claim 1, wherein obtaining the n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon comprises:

sorting the other vectors corresponding to the cosine distances in ascending order of cosine distance;

obtaining the word segment corresponding to the first vector in the sorted order, and determining whether the word segment corresponding to the first vector is in the preset lexicon; and

if so, taking the word segment corresponding to the first vector as a synonym, then determining whether the word segment corresponding to the second vector is in the preset lexicon, and repeating until n synonyms are obtained.
The method according to claim 2, wherein the method further comprises:

if not, eliminating the word segment corresponding to the first vector, then determining whether the word segment corresponding to the second vector is in the preset lexicon, and repeating until n synonyms are obtained.
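
By way of non-limiting illustration, the sort-and-filter selection recited in the two preceding claims might be sketched as follows. The function name, the dictionary of distances, and the toy lexicon are hypothetical; the sketch simply walks the candidates in ascending order of cosine distance, keeps those found in the preset lexicon, and eliminates the rest.

```python
def select_synonyms(distances, lexicon, n):
    """Walk candidates from smallest to largest distance, keeping lexicon words."""
    synonyms = []
    for word, _ in sorted(distances.items(), key=lambda item: item[1]):
        if len(synonyms) >= n:
            break  # n synonyms have been obtained
        if word in lexicon:
            synonyms.append(word)  # in the preset lexicon: keep as a synonym
        # otherwise the candidate is eliminated and the next one is checked
    return synonyms

# Hypothetical distances and lexicon, for illustration only.
toy_distances = {
    "headache": 0.40,
    "the": 0.05,
    "acetylsalicylic acid": 0.10,
    "ibuprofen": 0.25,
}
toy_lexicon = {"acetylsalicylic acid", "ibuprofen", "headache"}
toy_result = select_synonyms(toy_distances, toy_lexicon, 2)
```

In this toy example the nearest candidate, "the", is eliminated because it is not in the lexicon, so the search continues with the next nearest candidates. If fewer than n lexicon words exist, the sketch returns all that were found.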
The method according to any one of claims 1 to 3, wherein after the step of obtaining the n synonyms of the word segment to be searched, the method further comprises:

reducing the vectors corresponding to the n synonyms to two dimensions, and displaying the n synonyms in a plane.
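
By way of non-limiting illustration, the reduction to two dimensions might be sketched as follows. The source does not name a reduction technique; this sketch assumes principal component analysis (PCA), computed with power iteration and deflation so that it needs only the standard library. In practice a library implementation would normally be used. All function names and the toy vectors are hypothetical.

```python
import math
import random

def dominant_direction(sym, iterations=200, seed=0):
    """Approximate the top eigenvector of a symmetric matrix by power iteration."""
    dim = len(sym)
    rng = random.Random(seed)
    vec = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    for _ in range(iterations):
        nxt = [sum(sym[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:
            return [0.0] * dim  # no variance left in any direction
        vec = [x / norm for x in nxt]
    return vec

def reduce_to_2d(vectors, iterations=200):
    """Project the vectors onto their first two principal components."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(dim)]
    centered = [[v[j] - mean[j] for j in range(dim)] for v in vectors]
    cov = [[sum(row[i] * row[j] for row in centered) / n for j in range(dim)]
           for i in range(dim)]
    axes = []
    for _ in range(2):
        axis = dominant_direction(cov, iterations)
        eigenvalue = sum(axis[i] * cov[i][j] * axis[j]
                         for i in range(dim) for j in range(dim))
        axes.append(axis)
        # Deflation: remove the component just found so the next pass
        # converges to the following principal direction.
        cov = [[cov[i][j] - eigenvalue * axis[i] * axis[j] for j in range(dim)]
               for i in range(dim)]
    return [[sum(r * a for r, a in zip(row, axis)) for axis in axes]
            for row in centered]

# Hypothetical 3-dimensional vectors that happen to lie in one plane.
toy_vectors = [
    [2.0, 0.0, 0.0], [-2.0, 0.0, 0.0],
    [0.0, 0.6, 0.8], [0.0, -0.6, -0.8],
    [1.0, 0.3, 0.4], [-1.0, -0.3, -0.4],
]
toy_points = reduce_to_2d(toy_vectors)
```

The resulting two-dimensional points can then be passed to any plotting tool to display the n synonyms in a plane.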
The method according to any one of claims 1 to 3, wherein obtaining the optimized word vector matrix comprises:

segmenting a corpus into word segments;

obtaining, from the resulting word segments, target word segments that are included in the preset lexicon;

merging the target word segments according to the preset lexicon to obtain merged words, wherein the preset lexicon includes correspondences between preset merged words and preset word segments;

constructing an initial word vector matrix from the merged words and the remaining unmerged word segments, wherein the initial word vector matrix is an N × M matrix, N is the total number of word segments, M is the vector dimension of each word segment, and the total number of word segments is the sum of the number of merged words and the number of remaining unmerged word segments;

using the Word2vec model to perform sliding-window word extraction on the corpus to obtain training samples; and

training the SKIP-GRAM model on the training samples to obtain the optimized word vector matrix from the initial word vector matrix.
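
By way of non-limiting illustration, the merging, initial-matrix, and sliding-window steps recited above might be sketched as follows. The sketch stops before the training itself. The random initialization, the window size, the representation of the N × M matrix as a dictionary of rows, and all names (including the toy merge map) are assumptions made for illustration and are not specified in the source.

```python
import random

def merge_tokens(tokens, merge_map):
    """Replace each token that has a preset merged word; leave the rest as-is."""
    return [merge_map.get(tok, tok) for tok in tokens]

def build_initial_matrix(vocabulary, dim, seed=0):
    """Build an N x M matrix (one row per word) of small random values."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]
            for word in vocabulary}

def skip_gram_pairs(tokens, window=2):
    """Slide a window over the tokens and emit (center, context) pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical segmented sentence and merge map, for illustration only.
toy_tokens = ["ASA", "relieves", "mild", "headache"]
toy_merge_map = {"ASA": "aspirin"}
toy_merged = merge_tokens(toy_tokens, toy_merge_map)
toy_pairs = skip_gram_pairs(toy_merged, window=1)
toy_matrix = build_initial_matrix(sorted(set(toy_merged)), dim=4)
```

Here N is the number of distinct words after merging (4) and M is the chosen dimension (4). Each (center, context) pair is one training sample from which a SKIP-GRAM model would learn to predict the context word from the center word.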
The method according to claim 5, wherein segmenting the corpus comprises:

segmenting the corpus using the jieba library.
The method according to claim 1, wherein the preset lexicon is a medical lexicon containing specialized medical terms.
A device for finding synonyms, characterized by comprising:

an input unit, configured to input a word segment to be searched into an optimized word vector matrix, wherein the optimized word vector matrix is obtained using preset models, the preset models include a Word2vec model for obtaining word segments and a SKIP-GRAM model trained using the word segments as training samples, and the word segment to be searched is a word segment in a preset lexicon;

a calculation unit, configured to obtain, from the optimized word vector matrix, a target word vector corresponding to the word segment to be searched, and to calculate the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix; and

a search unit, configured to obtain n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon.
An electronic device, characterized by comprising a processor, a memory, and a bus, wherein:

the processor and the memory communicate with each other through the bus; and

the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the method according to any one of claims 1 to 7.
A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions that cause a computer to perform the method according to any one of claims 1 to 7.