WO2020074022A1 - Synonym search method and device - Google Patents


Info

Publication number
WO2020074022A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
preset
word segmentation
vector
synonyms
Prior art date
Application number
PCT/CN2019/124513
Other languages
French (fr)
Chinese (zh)
Inventor
赵荣生
宋再伟
刘爽
马悦
周旻
Original Assignee
北京大学第三医院
北京诺道认知医学科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京大学第三医院 and 北京诺道认知医学科技有限公司
Publication of WO2020074022A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Definitions

  • Embodiments of the present disclosure relate to the field of word processing technology, and in particular, to a method and device for searching synonyms.
  • Synonym search is an important research topic.
  • Existing synonym search methods count how often each word occurs in the current text and in the entire text collection, use this word-frequency information to model the text as a vector with an algorithm such as one-hot encoding or tf-idf, and then compute the similarity between words with measures such as the cosine similarity or Jaccard similarity between vectors. In other words, the prior art finds synonyms with similarity methods based on word-frequency information.
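  • Embodiments of the present disclosure provide a method and device for searching synonyms.
  • In a first aspect, an embodiment of the present disclosure provides a method for finding synonyms, the method including:
  • inputting a word segment to be searched into an optimized word vector matrix, where the optimized word vector matrix is obtained by using a preset model;
  • the preset model includes a Word2vec model for obtaining word segments and a SKIP-GRAM model that is trained with those word segments as training samples;
  • the word segment to be searched is a word segment in a preset lexicon;
  • obtaining, in the optimized word vector matrix, the target word vector corresponding to the word segment to be searched, and calculating the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix;
  • obtaining n synonyms of the word segment to be searched according to all cosine distances and the preset lexicon.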
  • In a second aspect, an embodiment of the present disclosure provides a device for finding synonyms, the device including:
  • an input unit, configured to input a word segment to be searched into an optimized word vector matrix, where the optimized word vector matrix is obtained by using a preset model;
  • the preset model includes a Word2vec model for obtaining word segments and a SKIP-GRAM model that is trained with those word segments as training samples;
  • the word segment to be searched is a word segment in a preset lexicon;
  • a calculation unit, configured to obtain, in the optimized word vector matrix, the target word vector corresponding to the word segment to be searched, and to calculate the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix;
  • a search unit, configured to obtain n synonyms of the word segment to be searched according to all cosine distances and the preset lexicon.
  • In a third aspect, an embodiment of the present disclosure provides an electronic device, including a processor, a memory, and a bus, wherein:
  • the processor and the memory communicate with each other through the bus;
  • the memory stores program instructions executable by the processor, and by calling the program instructions the processor can perform the following method: inputting a word segment to be searched into the optimized word vector matrix, which is obtained by using the preset model (a Word2vec model for obtaining word segments and a SKIP-GRAM model trained with those word segments as training samples), the word segment to be searched being a word segment in a preset lexicon; obtaining the corresponding target word vector and calculating its cosine distance to each of the other vectors in the matrix; and obtaining n synonyms according to all cosine distances and the preset lexicon.
  • In a fourth aspect, an embodiment of the present disclosure provides a non-transitory computer-readable storage medium:
  • the non-transitory computer-readable storage medium stores computer instructions that cause a computer to perform the following method: inputting a word segment to be searched into the optimized word vector matrix, which is obtained by using the preset model (a Word2vec model for obtaining word segments and a SKIP-GRAM model trained with those word segments as training samples), the word segment to be searched being a word segment in a preset lexicon; obtaining the corresponding target word vector and calculating its cosine distance to each of the other vectors in the matrix; and obtaining n synonyms according to all cosine distances and the preset lexicon.
  • The method and device for searching synonyms provided by the embodiments of the present disclosure obtain the optimized word vector matrix through the Word2vec model and the SKIP-GRAM model, and calculate the cosine distance between the target word vector of the word segment to be searched and each of the other vectors in the optimized word vector matrix. Based on all cosine distances, and using the preset lexicon to remove unrelated word segments, n synonyms are obtained, which improves the accuracy of synonym search.
  • FIG. 1 is a schematic flowchart of a method for finding synonyms in an embodiment of the present disclosure
  • FIG. 2 is a screenshot of sliding window word extraction according to an embodiment of the present disclosure
  • FIG. 3 is a graph of word segmentation search results according to an embodiment of the present disclosure.
  • FIG. 4 is a schematic structural diagram of a device for searching synonyms according to an embodiment of the present disclosure
  • FIG. 5 is a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present disclosure.
  • FIG. 1 is a schematic flowchart of a method for searching for synonyms in an embodiment of the present disclosure. As shown in FIG. 1, a method for searching for synonyms in an embodiment of the present disclosure includes the following steps:
  • S101: Input the word segment to be searched into the optimized word vector matrix. The optimized word vector matrix is obtained by using a preset model; the preset model includes a Word2vec model for obtaining word segments and a SKIP-GRAM model that is trained with those word segments as training samples; the word segment to be searched is a word segment in a preset lexicon.
  • Specifically, the device inputs the word segment to be searched into the optimized word vector matrix.
  • the preset thesaurus may be a medical thesaurus containing medical professional words.
  • Obtaining the optimized word vector matrix may include: performing word segmentation on a corpus, for which the jieba library may be used. The corpus includes, but is not limited to, the word segments in the preset lexicon.
  • The target word segments contained in the preset lexicon are then identified among the resulting word segments, and the target word segments are merged according to the preset lexicon to obtain merged words; the preset lexicon includes the correspondence between preset merged words and preset word segments.
  • An initial word vector matrix is constructed from the merged words and the remaining unmerged word segments. The initial word vector matrix is an N × M matrix, where N is the total number of word segments, M is the vector dimension corresponding to each word segment, and the total number of word segments is the sum of the number of merged words and the number of remaining unmerged word segments.
  • Example sentence: "Objective: To study the adverse effects of high-dose methotrexate (hd-mtx, 5 g/m2) plus calcium tetrahydrofolate (cf) and the rescue plan in the treatment of childhood acute lymphoblastic leukemia (all)."
  • Word segmentation results:
  • If the preset lexicon contains the correspondence between 'tetrahydro', 'folate', 'calcium' and 'calcium tetrahydrofolate', then the target word segments are 'tetrahydro', 'folate', and 'calcium', and merging them yields the merged word 'calcium tetrahydrofolate'. The word segments 'hd', '-', and 'mtx' are handled in the same way and are not described again here.
  • The resulting content (the merged words and the remaining unmerged word segments) is then used to build the initial word vector matrix. Examples are as follows:
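The merging step can be illustrated with a minimal Python sketch. This is not the patent's implementation: the function name `merge_segments`, the representation of the lexicon as a dictionary from segment tuples to merged words, the greedy longest-match strategy, and the sample segment list are all assumptions made for illustration.

```python
# Hypothetical lexicon entry: a run of consecutive word segments and its merged word.
MERGE_RULES = {
    ("tetrahydro", "folate", "calcium"): "calcium tetrahydrofolate",
}


def merge_segments(segments, rules):
    """Greedily replace the longest matching run of segments with its merged word."""
    max_len = max((len(key) for key in rules), default=1)
    merged = []
    i = 0
    while i < len(segments):
        # Try the longest possible run first, down to runs of two segments.
        for size in range(min(max_len, len(segments) - i), 1, -1):
            key = tuple(segments[i:i + size])
            if key in rules:
                merged.append(rules[key])
                i += size
                break
        else:
            # No rule matched here: keep the segment as an unmerged remainder.
            merged.append(segments[i])
            i += 1
    return merged


# Illustrative segments only; the source does not list the actual segmentation output.
segments = ["high-dose", "methotrexate", "plus", "tetrahydro", "folate", "calcium"]
merged = merge_segments(segments, MERGE_RULES)
```

Under these assumptions, each entry of `merged` would become one row of the N × M initial word vector matrix.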
  • FIG. 2 is a screenshot of sliding-window word extraction according to an embodiment of the present disclosure.
  • The window width is 2, and the sliding-window process is shown in FIG. 2.
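A minimal sketch of sliding-window extraction follows. It assumes that "window width 2" means up to two word segments on each side of the center segment; the source does not spell this out, and the function name `sliding_window_pairs` is hypothetical.

```python
def sliding_window_pairs(segments, window=2):
    """Return (center, context) pairs for every position in the segment list."""
    pairs = []
    for i, center in enumerate(segments):
        lo = max(0, i - window)                   # left edge of the window
        hi = min(len(segments), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                            # the center is not its own context
                pairs.append((center, segments[j]))
    return pairs


pairs = sliding_window_pairs(["a", "b", "c", "d"], window=2)
```

Each pair is one positive training sample: the center word segment together with one word segment from its context.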
  • The training process is a mature technique in the field. The word segments in the context window can be defined as positive samples. Assuming 64 negative samples are used, they are chosen by randomly selecting 64 word segments from the remaining word segments, excluding the context word segments. When optimizing the loss function, the principle is to make the probability of the positive samples higher and higher and the probability of the negative samples lower and lower; sampling negatives in this way reduces the amount of computation and speeds up model training. By traversing all word segments with the sliding window and training with the SKIP-GRAM model, the word vectors are optimized and the final optimized word vector matrix is obtained.
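The negative-sample selection rule can be sketched as follows. The function name `draw_negative_samples`, the `seed` parameter, and the choice to also exclude the center word segment are assumptions; the source only states that 64 segments are drawn at random from those outside the context.

```python
import random


def draw_negative_samples(vocabulary, center, context, k=64, seed=None):
    """Randomly pick up to k word segments that are neither the center nor in its context."""
    rng = random.Random(seed)
    excluded = set(context) | {center}   # excluding the center is an assumption
    candidates = [word for word in vocabulary if word not in excluded]
    return rng.sample(candidates, min(k, len(candidates)))


# Hypothetical vocabulary of 200 word segments.
vocabulary = [f"w{i}" for i in range(200)]
negatives = draw_negative_samples(vocabulary, "w0", ["w1", "w2"], k=64, seed=0)
```

During training, the loss would push the model's score up for the (center, context) pairs and down for the (center, negative) pairs, so only 64 negatives per step are evaluated rather than the whole vocabulary.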
  • Because the prediction takes the probability of the context word segments into account, the accuracy of synonym search is improved.
  • The Word2vec model obtains the word segments and their word vectors, and the word vectors are then trained.
  • S102: Obtain, in the optimized word vector matrix, the target word vector corresponding to the word segment to be searched, and calculate the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix.
  • Specifically, the device performs this lookup and calculation.
  • For example, suppose the word segment to be searched is "cell" and it corresponds to the tenth row of the optimized word vector matrix. The 128-dimensional word vector in the tenth row is then the target word vector. If the optimized word vector matrix has N rows, the N-1 cosine distances between the target word vector and the vectors in the other N-1 rows are calculated.
  • The specific cosine distance calculation is a mature technique in the art and is not described again here.
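Since the source does not give the formula, the sketch below assumes the common definition of cosine distance as one minus cosine similarity, so that smaller values mean more similar vectors. The function names and the tiny 2-dimensional matrix are illustrative only.

```python
import math


def cosine_distance(u, v):
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def distances_to_others(matrix, target_row):
    """Map every other row index to its cosine distance from the target row."""
    target = matrix[target_row]
    return {
        row: cosine_distance(target, vector)
        for row, vector in enumerate(matrix)
        if row != target_row
    }


# Toy matrix with N = 4 rows; a real matrix would have 128 columns per the example above.
matrix = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
distances = distances_to_others(matrix, 0)
```

For a matrix with N rows this yields N-1 distances, matching the count described in the example.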
  • S103: Obtain n synonyms of the word segment to be searched according to all cosine distances and the preset lexicon.
  • Specifically, the device may sort the other vectors in ascending order of their cosine distances, obtain the word segment corresponding to the first vector in the sorted order, and determine whether that word segment is in the preset lexicon. If it is, the word segment corresponding to the first vector is taken as a synonym; if it is not, that word segment is discarded. In either case the device then makes the same determination for the word segment corresponding to the second vector, and repeats the procedure until n synonyms are obtained.
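The ranking and filtering procedure can be sketched in a few lines. The function name `find_synonyms`, the dictionary input, and all distance values below are hypothetical; only the logic (ascending sort, lexicon check, stop at n) follows the description above.

```python
def find_synonyms(distances, lexicon, n=5):
    """Walk candidates in ascending cosine distance and keep those found in the lexicon."""
    synonyms = []
    for word, _distance in sorted(distances.items(), key=lambda item: item[1]):
        if word in lexicon:          # in the preset lexicon: accept as a synonym
            synonyms.append(word)
            if len(synonyms) == n:   # stop as soon as n synonyms are collected
                break
        # otherwise the candidate is discarded and the next one is examined
    return synonyms


# Hypothetical distances from the target word segment "cell".
distances = {
    "lymphocyte": 0.12,
    "study": 0.15,
    "stem cell": 0.20,
    "plan": 0.30,
    "leukocyte": 0.35,
}
lexicon = {"cell", "lymphocyte", "stem cell", "leukocyte"}
top_two = find_synonyms(distances, lexicon, n=2)
```

In this toy run "study" is closer than "stem cell" but is dropped because it is not in the lexicon, which is the filtering effect the method relies on.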
  • FIG. 3 is a graph of the word segment search result according to an embodiment of the present disclosure.
  • the order is vector A ...
  • The value of n can be set freely, for example to 5. First determine whether the word segment corresponding to vector A is in the preset lexicon; if it is, that word segment is taken as a synonym of "cell", such as "lymphocyte" in FIG. 3, bringing the number of synonyms found to 1. Then determine whether the word segment corresponding to vector B is in the preset lexicon; if it is, that word segment is also taken as a synonym of "cell", such as "stem cell" in FIG. 3, bringing the number of synonyms found to 2. Then determine whether the word segment corresponding to vector C is in the preset lexicon.
  • Compared with the models used in the prior art, the preset model used in the embodiments of the present disclosure can search for synonyms accurately with fewer vector dimensions, for example 128 dimensions; the number of dimensions required for an accurate search is greatly reduced. The method of the embodiments of the present disclosure therefore also saves computing resources and improves computing efficiency.
  • the method may further include: reducing the vector dimensions corresponding to the n synonyms to two dimensions, and displaying the n synonyms in a plane.
  • Vector dimensionality reduction can be performed through PCA. Referring to FIG. 3, the degree of synonym between word segments can be seen more intuitively.
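As a rough illustration of reducing word vectors to two dimensions with PCA, the sketch below uses only the Python standard library. It finds the two leading principal components by power iteration with deflation, which is one of several ways to compute PCA and is an assumption here; the source does not say how PCA is carried out. The function names and sample vectors are hypothetical, and the sketch does not handle degenerate inputs (for example, data with fewer than two directions of variance).

```python
import math


def _power_iteration(matrix, iterations=500):
    """Approximate the dominant eigenvalue and eigenvector of a symmetric matrix."""
    dim = len(matrix)
    vec = [1.0 + 0.1 * i for i in range(dim)]   # arbitrary non-zero start vector
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(matrix[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    eigenvalue = sum(
        vec[a] * sum(matrix[a][b] * vec[b] for b in range(dim)) for a in range(dim)
    )
    return eigenvalue, vec


def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(dim)]
    centered = [[v[j] - mean[j] for j in range(dim)] for v in vectors]
    cov = [
        [sum(row[a] * row[b] for row in centered) / max(n - 1, 1) for b in range(dim)]
        for a in range(dim)
    ]
    components = []
    for _ in range(2):
        eigenvalue, vec = _power_iteration(cov)
        components.append(vec)
        # Deflation: remove the component just found before searching for the next.
        cov = [
            [cov[a][b] - eigenvalue * vec[a] * vec[b] for b in range(dim)]
            for a in range(dim)
        ]
    return [[sum(row[j] * comp[j] for j in range(dim)) for comp in components]
            for row in centered]


# Hypothetical 3-dimensional vectors standing in for 128-dimensional word vectors.
vectors = [[1.0, 0.0, 0.0], [2.0, 0.1, 0.0], [3.0, -0.1, 0.0], [4.0, 0.0, 0.0]]
points = pca_2d(vectors)
```

Each entry of `points` is an (x, y) pair that could be drawn on a plane, so that word segments with similar vectors appear close together, as in FIG. 3.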
  • The method for finding synonyms provided by the embodiments of the present disclosure obtains the optimized word vector matrix through the Word2vec model and the SKIP-GRAM model, and calculates the cosine distance between the target word vector of the word segment to be searched and each of the other vectors in the optimized word vector matrix. Based on all cosine distances, and using the preset lexicon to remove unrelated word segments, n synonyms are obtained, which improves the accuracy of synonym search.
  • On the basis of the above embodiment, obtaining n synonyms of the word segment to be searched according to all cosine distances and the preset lexicon includes:
  • sorting the other vectors in ascending order of their cosine distances. Specifically, the device performs this sorting; reference may be made to the above embodiment, and no further description is given;
  • obtaining the word segment corresponding to the first vector in the sorted order, and determining whether that word segment is in the preset lexicon. Specifically, the device performs this lookup and determination;
  • if it is, taking the word segment corresponding to the first vector as a synonym, then determining whether the word segment corresponding to the second vector is in the preset lexicon, and repeating until n synonyms are obtained. Specifically, the device performs these steps.
  • the method for searching synonyms provided by the embodiments of the present disclosure can further improve the accuracy of searching for synonyms.
  • The method further includes:
  • if the word segment corresponding to the first vector is not in the preset lexicon, discarding that word segment, then determining whether the word segment corresponding to the second vector is in the preset lexicon, and repeating until n synonyms are obtained. Specifically, the device performs these steps.
  • the method for searching for synonyms provided by the embodiments of the present disclosure can further improve the accuracy of searching for synonyms by excluding irrelevant participles.
  • the method further includes:
  • the device reduces the vector dimensions corresponding to the n synonyms to two dimensions, and displays the n synonyms in a plane.
  • the method for finding synonyms provided by the embodiments of the present disclosure can visually display synonyms.
  • Obtaining the optimized word vector matrix includes the following steps, each performed by the device:
  • performing word segmentation on the corpus;
  • obtaining, from the resulting word segments, the target word segments contained in the preset lexicon;
  • merging the target word segments according to the preset lexicon to obtain merged words, where the preset lexicon includes the correspondence between preset merged words and preset word segments;
  • constructing an initial word vector matrix from the merged words and the remaining unmerged word segments, where the initial word vector matrix is an N × M matrix, N is the total number of word segments, M is the vector dimension corresponding to each word segment, and the total number of word segments is the sum of the number of merged words and the number of remaining unmerged word segments;
  • using the Word2vec model to perform sliding-window word extraction on the corpus to obtain training samples;
  • using the SKIP-GRAM model to train on the training samples to obtain the optimized word vector matrix based on the initial word vector matrix.
  • the method for finding synonyms provided by the embodiment of the present disclosure ensures that the method is performed normally by reasonably obtaining the optimized word vector matrix.
  • Performing word segmentation on the corpus includes:
  • using the jieba library to segment the corpus. Specifically, the device uses the jieba library for this step.
  • the method for searching synonyms provided by the embodiments of the present disclosure can efficiently segment the corpus.
  • The preset lexicon in the device is a medical lexicon containing professional medical terms.
  • the method for searching synonyms provided by the embodiments of the present disclosure can improve the accuracy of searching synonyms related to medical professional words.
  • FIG. 4 is a schematic structural diagram of an apparatus for searching synonyms according to an embodiment of the present disclosure. As shown in FIG. 4, an embodiment of the present disclosure provides an apparatus for searching synonyms, which includes an input unit 401, a calculation unit 402, and a search unit 403, where:
  • the input unit 401 is configured to input the word segment to be searched into the optimized word vector matrix;
  • the optimized word vector matrix is obtained by using a preset model;
  • the preset model includes a Word2vec model for obtaining word segments and a SKIP-GRAM model that is trained with those word segments as training samples;
  • the word segment to be searched is a word segment in a preset lexicon;
  • the calculation unit 402 is configured to obtain, in the optimized word vector matrix, the target word vector corresponding to the word segment to be searched, and to calculate the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix;
  • the search unit 403 is configured to obtain n synonyms of the word segment to be searched according to all cosine distances and the preset lexicon.
  • The device for searching synonyms provided by the embodiments of the present disclosure obtains the optimized word vector matrix through the Word2vec model and the SKIP-GRAM model, and calculates the cosine distance between the target word vector of the word segment to be searched and each of the other vectors in the optimized word vector matrix. Based on all cosine distances, and using the preset lexicon to remove unrelated word segments, n synonyms are obtained, which improves the accuracy of synonym search.
  • the device for searching synonyms provided in the embodiments of the present disclosure may be specifically used to execute the processing flow of each method embodiment described above, and the functions thereof are not repeated here, and reference may be made to the detailed description of the method embodiments described above.
  • FIG. 5 is a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present disclosure. As shown in FIG. 5, the electronic device includes a processor 501, a memory 502, and a bus 503;
  • the processor 501 and the memory 502 communicate with each other through the bus 503;
  • the processor 501 is configured to call program instructions in the memory 502 to perform the methods provided in the above method embodiments, for example: inputting a word segment to be searched into an optimized word vector matrix, where the optimized word vector matrix is obtained by using a preset model, the preset model includes a Word2vec model for obtaining word segments and a SKIP-GRAM model that is trained with those word segments as training samples, and the word segment to be searched is a word segment in a preset lexicon; obtaining, in the optimized word vector matrix, the target word vector corresponding to the word segment to be searched, and calculating the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix; and obtaining n synonyms of the word segment to be searched according to all cosine distances and the preset lexicon.
  • The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer can perform the methods provided by the above method embodiments, for example: inputting a word segment to be searched into an optimized word vector matrix, where the optimized word vector matrix is obtained by using a preset model, the preset model includes a Word2vec model for obtaining word segments and a SKIP-GRAM model that is trained with those word segments as training samples, and the word segment to be searched is a word segment in a preset lexicon; obtaining, in the optimized word vector matrix, the target word vector corresponding to the word segment to be searched, and calculating the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix; and obtaining n synonyms of the word segment to be searched according to all cosine distances and the preset lexicon.
  • This embodiment provides a non-transitory computer-readable storage medium. The non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to perform the methods provided by the foregoing method embodiments, for example: inputting a word segment to be searched into an optimized word vector matrix, where the optimized word vector matrix is obtained by using a preset model, the preset model includes a Word2vec model for obtaining word segments and a SKIP-GRAM model that is trained with those word segments as training samples, and the word segment to be searched is a word segment in a preset lexicon; obtaining, in the optimized word vector matrix, the target word vector corresponding to the word segment to be searched, and calculating the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix; and obtaining n synonyms of the word segment to be searched according to all cosine distances and the preset lexicon.
  • Each embodiment can be implemented by software plus a necessary general-purpose hardware platform, and of course can also be implemented by hardware.
  • The above technical solutions, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that enable a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in some parts of the embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

A synonym search method and device, said method comprising: inputting into an optimized word vector matrix a component word to be matched, said optimized word vector matrix being obtained using pre-set models, said pre-set models including a Word2vec model used for obtaining the component word and a Skip-Gram model trained using said component word as a training sample, said component word to be matched being a component word in a pre-set word library (S101); obtaining from the optimized word vector matrix a target word vector corresponding to the component word to be matched; calculating separately the cosine distances between the target word vector and other vectors in the optimized word vector matrix (S102); obtaining n synonyms of the component word to be matched according to all cosine distances and the pre-set word library (S103). Said device implements the method, and enhances the accuracy of synonym search.

Description

Method and device for searching synonyms
Cross-reference to related applications
This application claims priority to Chinese patent application No. 2018111816859, filed on October 11, 2018 and entitled "A method and device for finding synonyms", which is incorporated into this disclosure by reference in its entirety.
Technical field
Embodiments of the present disclosure relate to the field of word processing technology, and in particular to a method and device for searching synonyms.
Background
Synonym search is an important research topic. Existing synonym search methods count how often each word occurs in the current text and in the entire text collection, use this word-frequency information to model the text as a vector with an algorithm such as one-hot encoding or tf-idf, and then compute the similarity between words with measures such as the cosine similarity or Jaccard similarity between vectors. In other words, the prior art finds synonyms with similarity methods based on word-frequency information.
However, studying the semantics of a word really means working out how people use it when they describe objective things and express their ideas: where it is used, when it is used, and which words it is used with. In other words, for communication to be meaningful, a discussion or description of a thing must attach a context to the thing itself, and the intended meaning is expressed through the interaction between the thing and the other elements of that context. The prior art searches for synonyms by word frequency alone, so the synonyms it finds are not very accurate.
How to avoid this defect and improve the accuracy of synonym search has therefore become an urgent problem.
Summary of the invention
In view of the problems in the prior art, embodiments of the present disclosure provide a method and device for searching synonyms.
In a first aspect, an embodiment of the present disclosure provides a method for finding synonyms, the method including:
inputting a word segment to be searched into an optimized word vector matrix, where the optimized word vector matrix is obtained by using a preset model; the preset model includes a Word2vec model for obtaining word segments and a SKIP-GRAM model that is trained with those word segments as training samples; and the word segment to be searched is a word segment in a preset lexicon;
obtaining, in the optimized word vector matrix, the target word vector corresponding to the word segment to be searched, and calculating the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix;
obtaining n synonyms of the word segment to be searched according to all cosine distances and the preset lexicon.
In a second aspect, an embodiment of the present disclosure provides a device for searching for synonyms, the device including:
an input unit, configured to input a word segment to be searched into an optimized word vector matrix, wherein the optimized word vector matrix is obtained by using a preset model, the preset model includes a Word2vec model for acquiring word segments and a SKIP-GRAM model for training with the word segments as training samples, and the word segment to be searched is a word segment in a preset lexicon;
a calculation unit, configured to obtain, from the optimized word vector matrix, a target word vector corresponding to the word segment to be searched, and to separately calculate the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix;
a search unit, configured to obtain n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including a processor, a memory, and a bus, wherein:
the processor and the memory communicate with each other through the bus;
the memory stores program instructions executable by the processor, and by calling the program instructions the processor can execute the following method:
inputting a word segment to be searched into an optimized word vector matrix, wherein the optimized word vector matrix is obtained by using a preset model, the preset model includes a Word2vec model for acquiring word segments and a SKIP-GRAM model for training with the word segments as training samples, and the word segment to be searched is a word segment in a preset lexicon;
obtaining, from the optimized word vector matrix, a target word vector corresponding to the word segment to be searched, and separately calculating the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix;
obtaining n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon.
In a fourth aspect, an embodiment of the present disclosure provides a non-transitory computer-readable storage medium, wherein:
the non-transitory computer-readable storage medium stores computer instructions that cause a computer to execute the following method:
inputting a word segment to be searched into an optimized word vector matrix, wherein the optimized word vector matrix is obtained by using a preset model, the preset model includes a Word2vec model for acquiring word segments and a SKIP-GRAM model for training with the word segments as training samples, and the word segment to be searched is a word segment in a preset lexicon;
obtaining, from the optimized word vector matrix, a target word vector corresponding to the word segment to be searched, and separately calculating the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix;
obtaining n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon.
The method and device for searching for synonyms provided by the embodiments of the present disclosure obtain an optimized word vector matrix through the Word2vec model and the SKIP-GRAM model, calculate the cosine distances between the target vector of the word segment to be searched in that matrix and the other vectors, and then, according to all the cosine distances and in combination with the preset lexicon, eliminate some irrelevant word segments to obtain n synonyms, which can improve the accuracy of synonym search.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present disclosure or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present disclosure, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for searching for synonyms according to an embodiment of the present disclosure;
FIG. 2 is a screenshot of sliding-window word extraction according to an embodiment of the present disclosure;
FIG. 3 shows word segment search results according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a device for searching for synonyms according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present disclosure.
Detailed description
To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present disclosure without creative effort fall within the protection scope of the present disclosure.
FIG. 1 is a schematic flowchart of a method for searching for synonyms according to an embodiment of the present disclosure. As shown in FIG. 1, the method includes the following steps:
S101: Input a word segment to be searched into an optimized word vector matrix. The optimized word vector matrix is obtained by using a preset model; the preset model includes a Word2vec model for acquiring word segments and a SKIP-GRAM model for training with the word segments as training samples; the word segment to be searched is a word segment in a preset lexicon.
Specifically, the device inputs the word segment to be searched into the optimized word vector matrix. The optimized word vector matrix is obtained by using a preset model; the preset model includes a Word2vec model for acquiring word segments and a SKIP-GRAM model for training with the word segments as training samples; the word segment to be searched is a word segment in a preset lexicon. The preset lexicon may be a medical lexicon containing medical terms. Obtaining the optimized word vector matrix may include: segmenting a corpus into word segments, for which the jieba library may further be used, where the corpus includes but is not limited to the word segments in the preset lexicon; obtaining, from the resulting word segments, the target word segments contained in the preset lexicon; merging the target word segments according to the preset lexicon to obtain merged words, where the preset lexicon includes the correspondence between preset merged words and preset word segments; constructing an initial word vector matrix from the merged words and the remaining unmerged word segments, where the initial word vector matrix is an N×M matrix, N is the total number of word segments, M is the vector dimension of each word segment, and the total number of word segments is the sum of the merged words and the remaining unmerged word segments; performing sliding-window word extraction on the corpus with the Word2vec model to obtain training samples; and training on the training samples with the SKIP-GRAM model to obtain the optimized word vector matrix based on the initial word vector matrix.
This is illustrated as follows. Example sentence: 目的研究大剂量甲氨喋呤(hd-mtx,5g/m2)加四氢叶酸钙(cf),解救方案治疗儿童急性淋巴细胞白血病(all)的不良反应。 (Objective: to study the adverse reactions of high-dose methotrexate (hd-mtx, 5g/m2) plus calcium tetrahydrofolate (cf) rescue regimen in the treatment of childhood acute lymphoblastic leukemia (all).) Word segmentation result:
['目的','研究','大剂量','甲氨喋呤','(','hd','-','mtx','5g','/','m2',')','加','四氢','叶酸','钙','(','cf',')','解救','方案','治疗','儿童','急性','淋巴细胞','白血病','(','all',')','的','不良反应'] (in English: 'objective', 'study', 'high dose', 'methotrexate', '(', 'hd', '-', 'mtx', '5g', '/', 'm2', ')', 'plus', 'tetrahydro', 'folic acid', 'calcium', '(', 'cf', ')', 'rescue', 'regimen', 'treatment', 'children', 'acute', 'lymphocyte', 'leukemia', '(', 'all', ')', 'of', 'adverse reaction').
The preset lexicon contains the correspondence between '四氢' (tetrahydro), '叶酸' (folic acid), '钙' (calcium) and '四氢叶酸钙' (calcium tetrahydrofolate), so the target word segments are '四氢', '叶酸' and '钙', and the merged word '四氢叶酸钙' is obtained. 'hd', '-' and 'mtx' are merged in the same way, which is not described again. The initial word vector matrix is then built from the following content (the merged words and the remaining unmerged word segments). For example:
['目的','研究','大剂量','甲氨喋呤','(','hd-mtx','5g','/','m2',')','加','四氢叶酸钙','(','cf',')','解救','方案','治疗','儿童','急性','淋巴细胞','白血病','(','all',')','的','不良反应']. The vector dimension can be set as needed and is optionally 128; the vector elements can be random numbers in [-1, 1]. FIG. 2 is a screenshot of sliding-window word extraction according to an embodiment of the present disclosure; the window width is 2, and the sliding-window process is shown in FIG. 2.
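The merging, matrix initialization, and sliding-window steps above can be sketched roughly as follows. This is a minimal illustration, not the disclosed implementation: the function names, the representation of the lexicon correspondence as a dictionary of token tuples, and the reading of "window width 2" as two word segments on each side of the center are all assumptions.

```python
import random

def merge_tokens(tokens, merge_rules):
    """Replace runs of tokens that match a lexicon rule with the merged word."""
    # Try longer rules first so that the longest match wins (an assumption).
    rules = sorted(merge_rules.items(), key=lambda kv: len(kv[0]), reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for parts, whole in rules:
            if tuple(tokens[i:i + len(parts)]) == parts:
                merged.append(whole)
                i += len(parts)
                break
        else:
            merged.append(tokens[i])  # no rule matched: keep the token as-is
            i += 1
    return merged

def init_matrix(vocab, dim=128, seed=0):
    """Build an N x M matrix of random values in [-1, 1], one row per word segment."""
    rng = random.Random(seed)
    row_of = {word: row for row, word in enumerate(vocab)}
    matrix = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in vocab]
    return row_of, matrix

def window_pairs(tokens, width=2):
    """Yield (center, context) training pairs from a sliding window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - width), min(len(tokens), i + width + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

# Hypothetical rule table standing in for the preset lexicon's correspondence.
rules = {('四氢', '叶酸', '钙'): '四氢叶酸钙', ('hd', '-', 'mtx'): 'hd-mtx'}
tokens = ['加', '四氢', '叶酸', '钙', '(', 'hd', '-', 'mtx', ')']
merged = merge_tokens(tokens, rules)
row_of, matrix = init_matrix(sorted(set(merged)), dim=128)
pairs = window_pairs(merged, width=2)
```

Here `matrix` plays the role of the initial N×M word vector matrix (N distinct word segments after merging, M = 128), and `pairs` are the raw samples that a skip-gram trainer would consume.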
The training process is a mature technique in the field. The word segments in the context can be defined as positive samples. Assuming 64 negative samples are defined, they are chosen by randomly selecting 64 of the remaining word segments, excluding the context word segments. When optimizing the loss function, the principle is to make the probability of the positive samples increasingly high and that of the negative samples increasingly low; because only the sampled word segments take part in each update, this reduces the amount of computation and speeds up model training. All word segments are traversed with the sliding window, and the word vectors are repeatedly trained and optimized with the SKIP-GRAM model to obtain the final optimized word vector matrix.
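As a rough sketch of the negative-sampling idea described above, one training step might look like the following. The disclosure does not give the loss or update rule, so this uses the standard skip-gram-with-negative-sampling logistic loss as a stand-in; the two separate matrices (`w_in`, `w_out`), the learning rate, and all function names are assumptions, and the demo uses 4 negatives rather than 64 to stay small.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_negatives(vocab_size, exclude, k, rng):
    """Pick k distinct row indices at random, skipping the rows in `exclude`."""
    candidates = [i for i in range(vocab_size) if i not in exclude]
    return rng.sample(candidates, k)

def sgns_step(w_in, w_out, center, context, negatives, lr=0.05):
    """One gradient-descent step; returns the loss measured before the update."""
    v = w_in[center]
    grad_v = [0.0] * len(v)
    loss = 0.0
    # Label 1 for the positive (context) row, 0 for each negative row.
    for row, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = w_out[row]
        p = sigmoid(sum(a * b for a, b in zip(v, u)))
        loss -= math.log(max(p if label == 1.0 else 1.0 - p, 1e-12))
        g = p - label  # derivative of the logistic loss w.r.t. the dot product
        for d in range(len(v)):
            grad_v[d] += g * u[d]   # accumulate with the old u
            u[d] -= lr * g * v[d]   # update the output vector
    for d in range(len(v)):
        v[d] -= lr * grad_v[d]      # update the center vector last
    return loss

rng = random.Random(0)
vocab_size, dim, k = 10, 8, 4
w_in = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
center, context = 0, 1
# Excluding the center row as well as the context row is an assumption.
negatives = sample_negatives(vocab_size, {center, context}, k, rng)
losses = [sgns_step(w_in, w_out, center, context, negatives) for _ in range(50)]
```

Repeating the step on the same sample drives the positive pair's probability up and the negatives' down, which is why `losses` shrinks over the 50 iterations.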
It should be noted that, by the principle of the SKIP-GRAM model, the prediction takes into account the probability of the context word segments occurring, which improves the accuracy of synonym search. The Word2vec model acquires the word segments and from them the word segment vectors, which are then trained.
S102: Obtain, from the optimized word vector matrix, a target word vector corresponding to the word segment to be searched, and separately calculate the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix.
Specifically, the device obtains, from the optimized word vector matrix, the target word vector corresponding to the word segment to be searched, and separately calculates the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix. For example, suppose the word segment to be searched is "细胞" (cell). If "细胞" corresponds to row 10 of the optimized word vector matrix, the 128-dimensional word vector in row 10 is the target word vector. If the optimized word vector matrix has N rows, N-1 cosine distances are calculated between the target word vector and the vectors of the other N-1 rows. The calculation of cosine distance is a mature technique in the field and is not described here.
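For illustration, the N-1 distances could be computed as in the sketch below. The disclosure does not define cosine distance, so the common convention of one minus the cosine similarity is assumed here; the function names and the tiny 2-dimensional matrix are purely illustrative.

```python
import math

def cosine_distance(a, b):
    """Cosine distance under the assumed convention: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

def distances_to_others(matrix, target_row):
    """Map every other row index to its cosine distance from the target row."""
    target = matrix[target_row]
    return {
        row: cosine_distance(target, vector)
        for row, vector in enumerate(matrix)
        if row != target_row
    }

# Toy matrix: row 1 points the same way as row 0, row 2 is orthogonal,
# and row 3 points the opposite way.
toy_matrix = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
toy_distances = distances_to_others(toy_matrix, target_row=0)
```

Under this convention a smaller distance means a more similar direction, which matches the later step of sorting the distances in ascending order.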
S103: Obtain n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon.
Specifically, the device obtains the n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon. This may include: sorting the other vectors corresponding to the cosine distances in ascending order of the cosine distance values; obtaining the word segment corresponding to the first vector in the sorted order and determining whether it is in the preset lexicon; if it is, taking that word segment as a synonym, then determining whether the word segment corresponding to the second vector is in the preset lexicon, and repeating until n synonyms have been obtained; if it is not, eliminating that word segment, then determining whether the word segment corresponding to the second vector is in the preset lexicon, and repeating until n synonyms have been obtained.
FIG. 3 shows word segment search results according to an embodiment of the present disclosure. Continuing the above example, suppose the sorted vectors are vector A, …. The value of n can be set as needed and is optionally 5. It is determined whether the word segment corresponding to vector A is in the preset lexicon; if it is, that word segment is taken as a synonym of "细胞" (cell), for example "淋巴细胞" (lymphocyte) in FIG. 3, and one synonym has been found. It is then determined whether the word segment corresponding to vector B is in the preset lexicon; if it is, that word segment is also taken as a synonym of "细胞", for example "干细胞" (stem cell) in FIG. 3, and two synonyms have been found. It is then determined whether the word segment corresponding to vector C is in the preset lexicon; if it is not, that is, it is not a medical term, it cannot be taken as a synonym of "细胞" (not shown in FIG. 3), and the count remains two. These steps are repeated until 5 synonyms have been found. FIG. 3 also shows the overlapping pairs cell-lymphocyte, tumor-osteosarcoma, and leukemia-lymphoma; that is, the closer the points of two word segments are in FIG. 3, the closer their meanings.
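The sort-then-filter loop in this example can be sketched as below. The function name and the dictionary input are assumptions, and the labels 'A' to 'D' with their distance values are invented placeholders that mirror "vector A, vector B, vector C" in the text, not results from the disclosure.

```python
def top_n_synonyms(distances, lexicon, n=5):
    """Walk candidates from nearest to farthest, keeping only lexicon entries.

    distances: maps each candidate word segment to its cosine distance
               from the word segment being searched.
    lexicon:   set of word segments in the preset lexicon.
    """
    synonyms = []
    for word, _ in sorted(distances.items(), key=lambda kv: kv[1]):
        if word not in lexicon:
            continue  # eliminated: not a lexicon term
        synonyms.append(word)
        if len(synonyms) == n:
            break
    return synonyms

# Placeholder data: 'C' is close to the query but is not in the lexicon.
example_distances = {'A': 0.10, 'B': 0.20, 'C': 0.30, 'D': 0.40}
example_lexicon = {'A', 'B', 'D'}
found = top_n_synonyms(example_distances, example_lexicon, n=3)
```

If fewer than n lexicon entries exist among the candidates, this sketch simply returns the ones it found; how the disclosed method handles that case is not stated.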
It should be noted that the preset model used in the embodiments of the present disclosure can find synonyms accurately with a small vector dimension, for example 128. Compared with the models used in the prior art, the vector dimension required for accurate search is greatly reduced, so the method of the embodiments also has the technical effect of saving computing resources and improving computational efficiency.
After this step, the method may further include: reducing the vectors corresponding to the n synonyms to two dimensions so that the n synonyms can be displayed on a plane. The dimensionality reduction can be performed with PCA; as FIG. 3 shows, this makes the degree of synonymy between word segments more intuitive.
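To make the PCA step concrete, the sketch below projects a handful of vectors onto their top two principal components using only the standard library. It finds the components by power iteration with deflation, which is one of several ways to compute PCA and is an assumption here; the function names and the toy 3-dimensional points are illustrative, and in practice a numerical library routine would normally be used instead.

```python
import math
import random

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pca_2d(vectors, iters=200, seed=0):
    """Project row vectors onto their top two principal components."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    centered = [[v[d] - mean[d] for d in range(dim)] for v in vectors]
    rng = random.Random(seed)
    components = []
    for _ in range(2):
        w = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for _ in range(iters):
            # Multiply by the unnormalized covariance matrix: X^T (X w).
            scores = [_dot(row, w) for row in centered]
            w = [sum(s * row[d] for s, row in zip(scores, centered))
                 for d in range(dim)]
            # Deflation: remove directions captured by earlier components.
            for c in components:
                proj = _dot(w, c)
                w = [wi - proj * ci for wi, ci in zip(w, c)]
            norm = math.sqrt(_dot(w, w))
            if norm < 1e-12:  # no variance left in any remaining direction
                w = [0.0] * dim
                break
            w = [wi / norm for wi in w]
        components.append(w)
    return [[_dot(row, c) for c in components] for row in centered]

# Toy data: most of the variance lies along the direction (1, 2, 0).
points = [[float(i), 2.0 * i, 0.1 * (-1) ** i] for i in range(6)]
coords = pca_2d(points)
```

Each entry of `coords` is an (x, y) pair that could be plotted on a plane; nearby points correspond to vectors that were close in the original space along the directions of greatest variance.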
The method for searching for synonyms provided by the embodiments of the present disclosure obtains an optimized word vector matrix through the Word2vec model and the SKIP-GRAM model, calculates the cosine distances between the target vector of the word segment to be searched in that matrix and the other vectors, and then, according to all the cosine distances and in combination with the preset lexicon, eliminates some irrelevant word segments to obtain n synonyms, which can improve the accuracy of synonym search.
On the basis of the above embodiment, obtaining the n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon includes:
sorting the other vectors corresponding to the cosine distances in ascending order of the cosine distance values.
Specifically, the device sorts the other vectors corresponding to the cosine distances in ascending order of the cosine distance values. For details, refer to the above embodiment; they are not repeated here.
obtaining the word segment corresponding to the first vector in the sorted order, and determining whether that word segment is in the preset lexicon.
Specifically, the device obtains the word segment corresponding to the first vector in the sorted order and determines whether that word segment is in the preset lexicon. For details, refer to the above embodiment; they are not repeated here.
if it is, taking the word segment corresponding to the first vector as a synonym, then determining whether the word segment corresponding to the second vector is in the preset lexicon, and repeating until n synonyms have been obtained.
Specifically, if the device determines that it is, the device takes the word segment corresponding to the first vector as a synonym, then determines whether the word segment corresponding to the second vector is in the preset lexicon, and repeats until n synonyms have been obtained. For details, refer to the above embodiment; they are not repeated here.
The method for searching for synonyms provided by the embodiments of the present disclosure can further improve the accuracy of synonym search.
On the basis of the above embodiment, the method further includes:
if it is not, eliminating the word segment corresponding to the first vector, then determining whether the word segment corresponding to the second vector is in the preset lexicon, and repeating until n synonyms have been obtained.
Specifically, if the device determines that it is not, the device eliminates the word segment corresponding to the first vector, then determines whether the word segment corresponding to the second vector is in the preset lexicon, and repeats until n synonyms have been obtained. For details, refer to the above embodiment; they are not repeated here.
The method for searching for synonyms provided by the embodiments of the present disclosure can further improve the accuracy of synonym search by eliminating irrelevant word segments.
On the basis of the above embodiment, after the step of obtaining the n synonyms of the word segment to be searched, the method further includes:
reducing the vectors corresponding to the n synonyms to two dimensions so that the n synonyms can be displayed on a plane.
Specifically, the device reduces the vectors corresponding to the n synonyms to two dimensions so that the n synonyms can be displayed on a plane. For details, refer to the above embodiment; they are not repeated here.
The method for searching for synonyms provided by the embodiments of the present disclosure can display synonyms intuitively.
On the basis of the above embodiment, obtaining the optimized word vector matrix includes:
segmenting a corpus into word segments.
Specifically, the device segments the corpus into word segments. For details, refer to the above embodiment; they are not repeated here.
obtaining, from the resulting word segments, the target word segments contained in the preset lexicon.
Specifically, the device obtains, from the resulting word segments, the target word segments contained in the preset lexicon. For details, refer to the above embodiment; they are not repeated here.
merging the target word segments according to the preset lexicon to obtain merged words, wherein the preset lexicon includes the correspondence between preset merged words and preset word segments.
Specifically, the device merges the target word segments according to the preset lexicon to obtain merged words, wherein the preset lexicon includes the correspondence between preset merged words and preset word segments. For details, refer to the above embodiment; they are not repeated here.
constructing an initial word vector matrix from the merged words and the remaining unmerged word segments, wherein the initial word vector matrix is an N×M matrix, N is the total number of word segments, M is the vector dimension of each word segment, and the total number of word segments is the sum of the merged words and the remaining unmerged word segments.
Specifically, the device constructs an initial word vector matrix from the merged words and the remaining unmerged word segments, wherein the initial word vector matrix is an N×M matrix, N is the total number of word segments, M is the vector dimension of each word segment, and the total number of word segments is the sum of the merged words and the remaining unmerged word segments. For details, refer to the above embodiment; they are not repeated here.
performing sliding-window word extraction on the corpus with the Word2vec model to obtain training samples.
Specifically, the device performs sliding-window word extraction on the corpus with the Word2vec model to obtain training samples. For details, refer to the above embodiment; they are not repeated here.
training on the training samples with the SKIP-GRAM model to obtain the optimized word vector matrix based on the initial word vector matrix.
Specifically, the device trains on the training samples with the SKIP-GRAM model to obtain the optimized word vector matrix based on the initial word vector matrix. For details, refer to the above embodiment; they are not repeated here.
The method for searching for synonyms provided by the embodiments of the present disclosure ensures that the method proceeds normally by obtaining the optimized word vector matrix in a reasonable way.
On the basis of the above embodiment, segmenting the corpus into word segments includes:
segmenting the corpus with the jieba library.
Specifically, the device segments the corpus with the jieba library. For details, refer to the above embodiment; they are not repeated here.
The method for searching for synonyms provided by the embodiments of the present disclosure can segment the corpus efficiently.
On the basis of the above embodiment, the preset lexicon is a medical lexicon containing medical terms.
Specifically, the preset lexicon in the device is a medical lexicon containing medical terms. For details, refer to the above embodiment; they are not repeated here.
The method for searching for synonyms provided by the embodiments of the present disclosure can improve the accuracy of synonym search for medical terms.
图4为本公开实施例查找同义词的装置结构示意图,如图4所示,本公开实施例提供了一种查找同义词的装置,包括输入单元401、计算单元402和查找单元403,其中:4 is a schematic structural diagram of an apparatus for searching synonyms according to an embodiment of the present disclosure. As shown in FIG. 4, an embodiment of the present disclosure provides an apparatus for searching synonyms, which includes an input unit 401, a calculation unit 402, and a search unit 403, where:
输入单元401用于输入待查找分词至优化词向量矩阵;所述优化词向量矩阵是采用预设模型得到的;所述预设模型包括用于获取分词的Word2vec模型和用于将所述分词作为训练样本,并进行训练的SKIP-GRAM模型;所述待查找分词为预设词库中的分词;计算单元402用于在所述优化词向量矩阵中获取与所述待查找分词对应的目标词向量;并分别计算所述目标词向量和所述优化词向量矩阵中的其它向量的余弦距离;查找单元403用于根据所有余弦距离和所述预设词库,获取所述待查找分词的n个同义词。The input unit 401 is used to input the word segmentation to be searched into the optimized word vector matrix; the optimized word vector matrix is obtained by using a preset model; the preset model includes a Word2vec model for acquiring a word segmentation and the word segmentation as Training samples, and trained SKIP-GRAM model; the word segmentation to be searched is a word segmentation in a preset vocabulary; the calculation unit 402 is used to obtain the target word corresponding to the word segmentation to be searched in the optimized word vector matrix Vector; and separately calculate the cosine distance of the target word vector and the other vectors in the optimized word vector matrix; the search unit 403 is used to obtain the n of the word segmentation to be searched based on all cosine distances and the preset word bank Synonyms.
具体的,输入单元401用于输入待查找分词至优化词向量矩阵;所述优化词向量矩阵是采用预设模型得到的;所述预设模型包括用于获取分词 的Word2vec模型和用于将所述分词作为训练样本,并进行训练的SKIP-GRAM模型;所述待查找分词为预设词库中的分词;计算单元402用于在所述优化词向量矩阵中获取与所述待查找分词对应的目标词向量;并分别计算所述目标词向量和所述优化词向量矩阵中的其它向量的余弦距离;查找单元403用于根据所有余弦距离和所述预设词库,获取所述待查找分词的n个同义词。Specifically, the input unit 401 is used to input the word segmentation to be searched into the optimized word vector matrix; the optimized word vector matrix is obtained by using a preset model; the preset model includes a Word2vec model for acquiring word segmentation and a The word segmentation is used as a training sample and the trained SKIP-GRAM model; the word segmentation to be searched is a word segmentation in a preset vocabulary; the calculation unit 402 is used to obtain the corresponding word segmentation to be found in the optimized word vector matrix Target word vector; and calculate the cosine distance of the target word vector and the other vectors in the optimized word vector matrix separately; the search unit 403 is used to obtain the to-be-searched based on all cosine distances and the preset word library N synonyms of participle.
本公开实施例提供的查找同义词的装置,通过Word2vec模型和SKIP-GRAM模型获取优化词向量矩阵,并计算待查找分词在该优化词向量矩阵中的目标向量与其他向量的余弦距离,根据所有余弦距离,再结合预设词库剔除部分无关的分词,从而获取n个同义词,能够提高同义词查找的准确性。The device for searching synonyms provided by an embodiment of the present disclosure obtains the optimized word vector matrix through the Word2vec model and the SKIP-GRAM model, and calculates the cosine distance between the target vector and other vectors of the word segmentation to be searched for in the optimized word vector matrix, according to all cosines Distance, combined with the preset thesaurus to eliminate part of the unrelated word segmentation, so as to obtain n synonyms, which can improve the accuracy of the search for synonyms.
The device for searching synonyms provided by this embodiment of the present disclosure may be specifically used to execute the processing flows of the foregoing method embodiments. Its functions are not repeated here; reference may be made to the detailed description of the foregoing method embodiments.
FIG. 5 is a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present disclosure. As shown in FIG. 5, the electronic device includes a processor 501, a memory 502, and a bus 503;
wherein the processor 501 and the memory 502 communicate with each other through the bus 503;
The processor 501 is configured to call program instructions in the memory 502 to execute the methods provided by the foregoing method embodiments, for example including: inputting a word segment to be searched into an optimized word vector matrix, wherein the optimized word vector matrix is obtained by using a preset model, the preset model includes a Word2vec model for obtaining word segments and a SKIP-GRAM model that is trained using the word segments as training samples, and the word segment to be searched is a word segment in a preset lexicon; obtaining, from the optimized word vector matrix, a target word vector corresponding to the word segment to be searched, and calculating the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix; and obtaining n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon.
This embodiment discloses a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer can execute the methods provided by the foregoing method embodiments, for example including: inputting a word segment to be searched into an optimized word vector matrix, wherein the optimized word vector matrix is obtained by using a preset model, the preset model includes a Word2vec model for obtaining word segments and a SKIP-GRAM model that is trained using the word segments as training samples, and the word segment to be searched is a word segment in a preset lexicon; obtaining, from the optimized word vector matrix, a target word vector corresponding to the word segment to be searched, and calculating the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix; and obtaining n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon.
This embodiment provides a non-transitory computer-readable storage medium. The non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to execute the methods provided by the foregoing method embodiments, for example including: inputting a word segment to be searched into an optimized word vector matrix, wherein the optimized word vector matrix is obtained by using a preset model, the preset model includes a Word2vec model for obtaining word segments and a SKIP-GRAM model that is trained using the word segments as training samples, and the word segment to be searched is a word segment in a preset lexicon; obtaining, from the optimized word vector matrix, a target word vector corresponding to the word segment to be searched, and calculating the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix; and obtaining n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon.
Those of ordinary skill in the art will understand that all or some of the steps of the foregoing method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the foregoing method embodiments. The storage medium includes any medium capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
The embodiments of the electronic device and the like described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
From the description of the foregoing embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software together with a necessary general-purpose hardware platform, or alternatively by hardware. Based on this understanding, the essence of the foregoing technical solutions, or the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the foregoing embodiments are intended only to illustrate the technical solutions of the embodiments of the present disclosure, not to limit them. Although the embodiments of the present disclosure have been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some or all of the technical features therein, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present disclosure.

Claims (10)

  1. A method for searching synonyms, characterized by comprising:
    inputting a word segment to be searched into an optimized word vector matrix, wherein the optimized word vector matrix is obtained by using a preset model; the preset model comprises a Word2vec model for obtaining word segments and a SKIP-GRAM model that is trained using the word segments as training samples; and the word segment to be searched is a word segment in a preset lexicon;
    obtaining, from the optimized word vector matrix, a target word vector corresponding to the word segment to be searched, and calculating the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix; and
    obtaining n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon.
  2. The method according to claim 1, characterized in that the obtaining n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon comprises:
    sorting the other vectors corresponding to all the cosine distances in ascending order of the values of the cosine distances;
    obtaining the word segment corresponding to the first vector in the sorted order, and determining whether the word segment corresponding to the first vector is in the preset lexicon; and
    if so, taking the word segment corresponding to the first vector as a synonym, then determining whether the word segment corresponding to the second vector is in the preset lexicon, and repeating these steps until n synonyms are obtained.
  3. The method according to claim 2, characterized in that the method further comprises:
    if not, eliminating the word segment corresponding to the first vector, then determining whether the word segment corresponding to the second vector is in the preset lexicon, and repeating these steps until n synonyms are obtained.
  4. The method according to any one of claims 1 to 3, characterized in that, after the step of obtaining n synonyms of the word segment to be searched, the method further comprises:
    reducing the dimensionality of the vectors corresponding to the n synonyms to two dimensions, so as to display the n synonyms on a plane.
  5. The method according to any one of claims 1 to 3, characterized in that obtaining the optimized word vector matrix comprises:
    performing word segmentation on a corpus;
    obtaining, from the resulting word segments, target word segments that are contained in the preset lexicon;
    merging the target word segments according to the preset lexicon to obtain merged words, wherein the preset lexicon comprises a correspondence between preset merged words and preset word segments;
    constructing an initial word vector matrix from the merged words and the remaining unmerged word segments, wherein the initial word vector matrix is an N×M matrix, N is the total number of word segments, M is the vector dimension corresponding to each word segment, and the total number of word segments is the sum of the number of merged words and the number of remaining unmerged word segments;
    performing sliding-window word extraction on the corpus by using the Word2vec model to obtain training samples; and
    training on the training samples by using the SKIP-GRAM model to obtain the optimized word vector matrix based on the initial word vector matrix.
  6. The method according to claim 5, characterized in that the performing word segmentation on a corpus comprises:
    performing word segmentation on the corpus by using the jieba library.
  7. The method according to claim 1, characterized in that the preset lexicon is a medical lexicon containing specialized medical terms.
  8. A device for searching synonyms, characterized by comprising:
    an input unit, configured to input a word segment to be searched into an optimized word vector matrix, wherein the optimized word vector matrix is obtained by using a preset model; the preset model comprises a Word2vec model for obtaining word segments and a SKIP-GRAM model that is trained using the word segments as training samples; and the word segment to be searched is a word segment in a preset lexicon;
    a calculation unit, configured to obtain, from the optimized word vector matrix, a target word vector corresponding to the word segment to be searched, and to calculate the cosine distance between the target word vector and each of the other vectors in the optimized word vector matrix; and
    a search unit, configured to obtain n synonyms of the word segment to be searched according to all the cosine distances and the preset lexicon.
  9. An electronic device, characterized by comprising a processor, a memory, and a bus, wherein
    the processor and the memory communicate with each other through the bus; and
    the memory stores program instructions executable by the processor, and the processor, by calling the program instructions, is able to execute the method according to any one of claims 1 to 7.
  10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to execute the method according to any one of claims 1 to 7.
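
Two of the preparation steps recited in claim 5, merging target word segments according to the preset lexicon and sliding-window extraction of training samples, can be illustrated with the following minimal Python sketch. It is only an illustration under stated assumptions: the function names `merge_segments` and `skipgram_pairs`, the `merge_map` structure (a tuple of adjacent segments mapped to a merged word), the greedy longest-match strategy, the window size, and the (center, context) pair format are hypothetical and are not specified by the claims.

```python
def merge_segments(tokens, merge_map):
    """Greedily merge adjacent segments that the lexicon maps to a merged word.

    merge_map: hypothetical stand-in for the correspondence between preset
               word segments (a tuple of adjacent segments) and a merged word.
    """
    max_len = max((len(parts) for parts in merge_map), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to a length of two segments.
        for size in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in merge_map:
                merged.append(merge_map[candidate])
                i += size
                break
        else:
            # No merge applies here, so keep the segment unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


def skipgram_pairs(tokens, window):
    """Slide a window over the tokens and emit (center, context) pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Toy example with placeholder English segments.
segments = ["patient", "takes", "aspirin", "tablet", "daily"]
merge_map = {("aspirin", "tablet"): "aspirin tablet"}
merged = merge_segments(segments, merge_map)
print(merged)
print(skipgram_pairs(merged, window=1))
```

In practice the segmentation would come from a segmenter such as the jieba library named in claim 6, and the pairs would feed a skip-gram trainer; both are outside the scope of this sketch.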
PCT/CN2019/124513 2018-10-11 2019-12-11 Synonym search method and device WO2020074022A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811181685.9 2018-10-11
CN201811181685.9A CN109543175B (en) 2018-10-11 2018-10-11 Method and device for searching synonyms

Publications (1)

Publication Number Publication Date
WO2020074022A1 true WO2020074022A1 (en) 2020-04-16

Family

ID=65843573

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/124513 WO2020074022A1 (en) 2018-10-11 2019-12-11 Synonym search method and device

Country Status (2)

Country Link
CN (1) CN109543175B (en)
WO (1) WO2020074022A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543175B (en) * 2018-10-11 2020-06-02 北京诺道认知医学科技有限公司 Method and device for searching synonyms
CN111191454A (en) * 2020-01-06 2020-05-22 精硕科技(北京)股份有限公司 Entity matching method and device
CN111241833A (en) * 2020-01-16 2020-06-05 支付宝(杭州)信息技术有限公司 Word segmentation method and device for text data and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106844346A (en) * 2017-02-09 2017-06-13 北京红马传媒文化发展有限公司 Short text Semantic Similarity method of discrimination and system based on deep learning model Word2Vec
CN107291914A (en) * 2017-06-27 2017-10-24 达而观信息科技(上海)有限公司 A kind of method and system for generating search engine inquiry expansion word
CN108133045A (en) * 2018-01-12 2018-06-08 广州杰赛科技股份有限公司 Keyword extracting method and system, keyword extraction model generating method and system
CN109543175A (en) * 2018-10-11 2019-03-29 北京诺道认知医学科技有限公司 A kind of method and device for searching synonym

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050033568A1 (en) * 2003-08-08 2005-02-10 Hong Yu Methods and systems for extracting synonymous gene and protein terms from biological literature
CN105718586B (en) * 2016-01-26 2018-12-28 中国人民解放军国防科学技术大学 The method and device of participle
CN105786782B (en) * 2016-03-25 2018-10-19 北京搜狗信息服务有限公司 A kind of training method and device of term vector
CN107451126B (en) * 2017-08-21 2020-07-28 广州多益网络股份有限公司 Method and system for screening similar meaning words
CN107748755B (en) * 2017-09-19 2019-11-05 华为技术有限公司 Synonym method for digging, device, equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN109543175A (en) 2019-03-29
CN109543175B (en) 2020-06-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19871591

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19871591

Country of ref document: EP

Kind code of ref document: A1