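WO2021042526A1 - Search method, device, computer equipment and storage medium based on similarity value - Google Patents

Search method, device, computer equipment and storage medium based on similarity value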

Info

Publication number
WO2021042526A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
sequence
data
search
similarity value
Application number
PCT/CN2019/117213
Other languages
English (en)
French (fr)
Inventor
刘伟
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2021042526A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Definitions

  • This application relates to the computer field, and in particular to a search method, device, computer equipment and storage medium based on similarity value.
  • Searching is a method of using the high performance of the computer to purposefully enumerate part or all of the possible situations in the solution space of a problem, so as to find the solution of the problem.
  • Traditional search segments the input into words and then matches the resulting keywords against stored entries.
  • Although this search method is simple, it depends on a large number of correct entries. If the search target has multi-dimensional attributes, the number of entries grows exponentially, and subsequent maintenance becomes costly and error-prone. For example, if the search target has M attributes and each attribute has N possible values, the total number of entries is as high as N to the power of M. The search method in the prior art therefore requires too many entries and too many computer resources.
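
To make the N-to-the-power-of-M growth concrete, the arithmetic can be sketched as follows. The function name and the sample sizes are illustrative assumptions, not values taken from the application.

```python
def keyword_entries(n_values: int, m_attributes: int) -> int:
    """Entries needed if every combination of attribute values gets its own entry."""
    return n_values ** m_attributes


# Hypothetical sizes: N = 10 possible values per attribute.
for m in (2, 4, 6):
    print(f"M={m}: {keyword_entries(10, m):,} entries")
```

Each additional attribute multiplies the entry count by N, which is the maintenance burden the passage above describes.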
  • The main purpose of this application is to provide a search method, device, computer equipment and storage medium based on a similarity value, aiming to achieve accurate search while using only a small amount of computer resources.
  • To this end, this application proposes a search method based on similarity values, which includes the following steps:
  • the target data strips are sorted to obtain a data strip sequence, and the data strip sequence is output.
  • The search method, device, computer equipment and storage medium based on the similarity value of the present application acquire data strips and preprocess them to obtain data strip word sequences; retrieve a pre-stored designated standard sentence and calculate the similarity value between each data strip word sequence and the designated standard sentence; store the data strips in a preset database and add a similarity field to the database; obtain the search sentence entered by the user and preprocess it to obtain a search word sequence; calculate the similarity value for search between the search word sequence and the designated standard sentence; generate a hit range [similarity value for search - a, similarity value for search + a], retrieve from the database the data strips whose similarity values fall within the hit range, and record them as target data strips; and sort the target data strips to obtain a data strip sequence, which is output. Search is thereby realized using only a small amount of computer resources.
  • FIG. 1 is a schematic flowchart of a search method based on a similarity value according to an embodiment of the application
  • FIG. 2 is a schematic block diagram of the structure of a search device based on a similarity value according to an embodiment of the application;
  • FIG. 3 is a schematic block diagram of the structure of a computer device according to an embodiment of the application.
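  • An embodiment of the present application provides a search method based on similarity values, including the following steps:
  • S1 Acquire data strips, and preprocess the data strips according to a preset preprocessing method to obtain a data strip word sequence;
  • S2 Retrieve a pre-stored designated standard sentence, and calculate the similarity value between the data strip word sequence and the designated standard sentence according to a preset similarity algorithm;
  • S3 Store the data strip in a preset database and add a similarity field to the database, wherein the similarity field of the data strip records the similarity value;
  • S4 Obtain the search sentence input by the user, and preprocess the search sentence to obtain a search word sequence;
  • S5 Calculate the similarity value for search between the search word sequence and the designated standard sentence according to a preset similarity algorithm;
  • S6 Generate a hit range [similarity value for search - a, similarity value for search + a], retrieve from the database the data strips whose similarity values fall within the hit range, and record them as target data strips, where a is a preset range parameter and a is greater than 0;
  • S7 Sort the target data strips according to a preset sorting rule to obtain a data strip sequence, and output the data strip sequence.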
  • This application uses the similarity value as the basis for search and matching. Compared with related technical solutions, only one field is required to record the similarity value. If the number of similarity values is N, then only N entries (i.e., a one-field, one-value structure) are sufficient, so that only a small amount of computer resources is needed to achieve the purpose of searching.
  • the data strip is obtained, and the data strip is preprocessed according to the preset preprocessing method to obtain the word sequence of the data strip.
  • The data strip (also rendered as "data bar", "data item" or "data entry" in this text) refers to a sentence containing information, for example: the price of pork in Beijing on day A is B yuan; the transaction volume of financial product C in Shanghai on day A is D; and so on.
  • The preprocessing is, for example: performing word segmentation on the data strip to obtain an initial word sequence composed of multiple words; determining, by querying a preset meaningless word library, whether the initial word sequence contains meaningless words; if so, removing the meaningless words from the initial word sequence to obtain an intermediate word sequence; determining, by querying a preset thesaurus, whether the intermediate word sequence contains a synonym group; and if so, replacing all words in the synonym group with any one word of that group to obtain the data strip word sequence. For example, preprocessing "the price of pork in Beijing on day A is B yuan" gives:
  • In step S2, the pre-stored designated standard sentence is retrieved, and the similarity value between the data strip word sequence and the designated standard sentence is calculated according to a preset similarity algorithm.
  • A designated standard sentence is used as the reference standard for obtaining a similarity value, which is subsequently used to determine whether a search is a hit.
  • The preset similarity algorithm can be any algorithm that can be used to calculate the similarity between sentences, for example: retrieve a pre-stored designated standard sentence; query a preset word vector library to obtain the word vector corresponding to each word in the designated standard sentence, so as to obtain the standard word vector sequence corresponding to the designated standard sentence; query the preset word vector library to obtain the word vector corresponding to each word in the data strip word sequence, so as to obtain the data strip word vector sequence corresponding to the data strip word sequence; and, using a preset distance calculation formula, calculate the distance value between the standard word vector sequence and the data strip word vector sequence and record the distance value as the similarity value.
  • word2vec is a tool for training word vectors that includes two models: CBOW (Continuous Bag of Words) and Skip-Gram.
  • CBOW predicts the target word from its surrounding context words.
  • Skip-Gram predicts the surrounding context words from the target word.
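
The difference between the two training directions can be illustrated by the training examples each model would see. This is a minimal sketch that only builds the examples and does not train anything; the function names, the window size and the sample tokens are assumptions made for illustration.

```python
from typing import List, Tuple


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """(context words -> target word) examples, the CBOW direction."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, target))
    return examples


def skipgram_examples(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """(target word -> one context word) examples, the Skip-Gram direction."""
    pairs = []
    for context, target in cbow_examples(tokens, window):
        pairs.extend((target, c) for c in context)
    return pairs


# Hypothetical, already segmented sentence.
tokens = ["beijing", "pork", "price", "today"]
print(cbow_examples(tokens, window=1))
print(skipgram_examples(tokens, window=1))
```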
  • The data strip is stored in a preset database, and a similarity field is added to the database, wherein the similarity field of the data strip records the similarity value.
  • the data bar is stored in the database, and since the similarity field of the data bar records the similarity value, the data bar can be found by searching based on the similarity value.
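
One way to picture the storage step is a single table with one extra numeric column. The sketch below uses SQLite purely as an assumption; the application does not name a database, and the table name, column names and index are hypothetical.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) data strip table with a similarity column."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS data_strips ("
        " id INTEGER PRIMARY KEY,"
        " text TEXT NOT NULL,"
        " similarity REAL NOT NULL)"
    )
    # An index on the similarity column keeps value and range lookups cheap.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_similarity ON data_strips(similarity)"
    )
    return conn


def store_strip(conn: sqlite3.Connection, text: str, similarity: float) -> int:
    """Store one data strip together with its precomputed similarity value."""
    cur = conn.execute(
        "INSERT INTO data_strips (text, similarity) VALUES (?, ?)",
        (text, similarity),
    )
    conn.commit()
    return cur.lastrowid


conn = open_store()
store_strip(conn, "the price of pork in Beijing on day A is B yuan", 0.42)
```

The similarity value is computed once at insert time, so a later search only compares numbers rather than re-reading the text.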
  • the search sentence input by the user is obtained, and the search sentence is preprocessed to obtain the search word sequence.
  • The preprocessing method may be the same as or different from the one applied to the data strips, but preprocessing the search sentence at least includes segmenting the search sentence to obtain the search word sequence.
  • this application preferably adopts the aforementioned method of preprocessing the data bar to preprocess the search sentence.
  • the search sentences are, for example, the price of pork in Beijing; the trading volume of financial product C in Shanghai, and so on.
  • the search similarity value between the search word sequence and the designated standard sentence is calculated according to a preset similarity algorithm.
  • the preset similarity algorithm may be the same as or different from the aforementioned method of calculating the similarity value between the data strip word sequence and the specified standard sentence.
  • This application preferably uses the same method as the aforementioned method for calculating the similarity value between the data strip word sequence and the designated standard sentence.
  • the calculated similarity value for search reflects the degree of matching between the search sentence and the specified standard sentence, and will be used as a basis for determining the search hit target in the follow-up.
  • A hit range [similarity value for search - a, similarity value for search + a] is generated, and the data strips whose similarity values fall within the hit range are retrieved from the database and recorded as target data strips, where a is a preset range parameter and a is greater than 0.
  • Generating the hit range [similarity value for search - a, similarity value for search + a] reduces the missed detection rate of the search method of this application. If a hit were declared only when the similarity value for search exactly equals a similarity value stored in the database, some similar data strips would be missed, resulting in poor search results.
  • the method of generating the hit range [similarity value for search-a, similarity value for search + a] is used to expand the hit range and avoid missed detection.
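
The hit range lookup can be sketched as a simple range query. SQLite, the table layout and the sample similarity values below are assumptions for illustration only.

```python
import sqlite3
from typing import List, Tuple


def hit_range(search_similarity: float, a: float) -> Tuple[float, float]:
    """Return [search_similarity - a, search_similarity + a]; a must be > 0."""
    if a <= 0:
        raise ValueError("range parameter a must be positive")
    return search_similarity - a, search_similarity + a


def fetch_targets(conn: sqlite3.Connection, search_similarity: float,
                  a: float) -> List[Tuple[str, float]]:
    """Retrieve the data strips whose stored similarity lies in the hit range."""
    low, high = hit_range(search_similarity, a)
    rows = conn.execute(
        "SELECT text, similarity FROM data_strips"
        " WHERE similarity BETWEEN ? AND ? ORDER BY similarity",
        (low, high),
    )
    return rows.fetchall()


# Tiny in-memory demo with made-up similarity values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_strips (text TEXT, similarity REAL)")
conn.executemany(
    "INSERT INTO data_strips VALUES (?, ?)",
    [("pork price, Beijing", 0.42), ("fund volume, Shanghai", 0.55), ("unrelated", 0.90)],
)
targets = fetch_targets(conn, search_similarity=0.5, a=0.1)
```

A larger a returns more candidates at the cost of precision, which is why the text treats a as a preset parameter.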
  • the target data bars are sorted to obtain a data bar sequence, and the data bar sequence is output.
  • the preset sorting rule may be any sorting rule, for example, sorting in ascending or descending order according to the absolute value of the difference between the similarity value for search and the similarity value of the similar field record, so as to obtain a data bar sequence.
  • The preset sorting rule is, for example: acquire the user's search records, in which search keywords are recorded; classify the target data strips into first data strips and second data strips according to whether they contain the search keywords, wherein the first data strips contain the search keywords; calculate the absolute value of the difference between the similarity value for search and the similarity value recorded in the similarity field of each target data strip; arrange the first data strips and the second data strips separately in descending or ascending order of that absolute value to obtain a first data strip sequence and a second data strip sequence; and combine the two sequences so that the first data strip sequence is displayed first, thereby obtaining the data strip sequence, which is output.
  • the search purpose is realized under the premise of only relying on a small amount of resources.
  • the step S1 of preprocessing the data strip according to the preset preprocessing method to obtain the word sequence of the data strip includes:
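  • S101 Perform word segmentation processing on the data strip, thereby obtaining an initial word sequence composed of multiple words;
  • S102 Determine whether there is a meaningless word in the initial word sequence by querying a preset meaningless word library;
  • S103 If there are meaningless words in the initial word sequence, remove the meaningless words from the initial word sequence to obtain an intermediate word sequence;
  • S104 Determine whether there is a synonym group in the intermediate word sequence by querying a preset thesaurus;
  • S105 If there is a synonym group in the intermediate word sequence, replace all words in the synonym group with any one word of the synonym group, thereby obtaining the data strip word sequence.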
  • the data strip is preprocessed according to the preset preprocessing method to obtain the word sequence of the data strip.
  • word segmentation can use open source word segmentation tools, such as jieba, THULAC, NLPIR, etc.
  • the pork price in Beijing on A day is B yuan, divided into:
  • Further preprocessing includes: meaningless word removal and synonym replacement, so as to complete the preprocessing to obtain the data strip word sequence.
  • By querying the preset meaningless word library, it is determined whether there are meaningless words in the initial word sequence; if so, the meaningless words are removed from the initial word sequence to obtain the intermediate word sequence, which completes the meaningless word removal step.
  • For example, if words such as "in" and "are" are meaningless words, they are removed.
  • By querying the preset thesaurus, it is determined whether there is a synonym group in the intermediate word sequence; if so, all words in the synonym group are replaced with any one word of that group to obtain the data strip word sequence, which completes the synonym replacement.
  • the thesaurus includes multiple synonym entries.
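
The three preprocessing stages (segmentation, meaningless word removal, synonym replacement) can be sketched as below. Whitespace splitting stands in for a real segmenter such as jieba, and the word library and thesaurus contents are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical meaningless word library and thesaurus (variant -> chosen group word).
MEANINGLESS_WORDS: Set[str] = {"the", "of", "in", "on", "is"}
SYNONYMS: Dict[str, str] = {"cost": "price", "prices": "price"}


def segment(text: str) -> List[str]:
    """Stand-in segmenter: lowercase and split on whitespace."""
    return text.lower().split()


def preprocess(text: str) -> List[str]:
    """Segment, drop meaningless words, then map synonyms to one group word."""
    initial = segment(text)                                             # initial word sequence
    intermediate = [w for w in initial if w not in MEANINGLESS_WORDS]   # intermediate word sequence
    return [SYNONYMS.get(w, w) for w in intermediate]                   # data strip word sequence


words = preprocess("The cost of pork in Beijing on day A is B yuan")
```

Mapping every variant to one fixed word of its group is one simple way to satisfy "replace all words in the synonym group with any one of the group".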
  • The method includes:
  • S11 Count the number of occurrences of each word in the word sequences of multiple data strips, obtain the word with the most occurrences, and record it as the designated word;
  • S12 Determine whether the number of occurrences of the designated word is greater than a preset number threshold;
  • S13 If the number of occurrences of the designated word is greater than the preset number threshold, obtain the designated standard sentence corresponding to the designated word according to the preset correspondence between words and standard sentences.
  • This application uses the designated standard sentence as the reference standard for similarity values, so its selection is particularly important and directly affects the accuracy of search results. This application therefore counts the number of occurrences of each word in the word sequences of multiple data strips, obtains the word with the most occurrences, and records it as the designated word; if the number of occurrences of the designated word is greater than a preset number threshold, the designated standard sentence corresponding to the designated word is obtained according to the preset correspondence between words and standard sentences. The designated standard sentence is thus obtained on the principle of maximizing its correlation with the data strips to be stored.
  • the specified standard sentence is, for example, a sentence including designated words.
  • Counting the number of occurrences of each word in the word sequences of the multiple data strips, obtaining the word with the most occurrences, and recording it as the designated word can also be replaced by: counting the number of occurrences of each word in the word sequences of the multiple data strips, obtaining the words whose number of occurrences exceeds a preset value, and recording them as designated words. The correspondence between multiple designated words and standard sentences is then used to obtain the designated standard sentence, further improving search accuracy.
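
Selecting the designated standard sentence from word counts can be sketched as follows. The word-to-sentence mapping, the threshold and the sample sequences are hypothetical, and only the single-most-frequent-word variant is shown.

```python
from collections import Counter
from typing import Dict, List, Optional


def designated_standard_sentence(word_sequences: List[List[str]],
                                 word_to_sentence: Dict[str, str],
                                 threshold: int) -> Optional[str]:
    """Pick the most frequent word; if it occurs more than `threshold` times,
    return the standard sentence mapped to it, otherwise None."""
    counts = Counter(w for seq in word_sequences for w in seq)
    if not counts:
        return None
    word, occurrences = counts.most_common(1)[0]
    if occurrences > threshold:
        return word_to_sentence.get(word)
    return None


# Hypothetical data strip word sequences and word-to-sentence mapping.
sequences = [["price", "pork", "beijing"], ["price", "fund", "shanghai"], ["price", "pork"]]
mapping = {"price": "the price of goods in a city"}
chosen = designated_standard_sentence(sequences, mapping, threshold=2)
```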
  • the step S2 of retrieving the pre-stored designated standard sentence and calculating the similarity value between the word sequence of the data strip and the designated standard sentence according to a preset similarity algorithm includes:
  • S201 Retrieve the pre-stored designated standard sentence;
  • S202 Query a preset word vector library to obtain a word vector corresponding to each word in the designated standard sentence, so as to obtain a standard word vector sequence corresponding to the designated standard sentence;
  • S203 Query a preset word vector library to obtain a word vector corresponding to each word in the data strip word sequence, so as to obtain a data strip word vector sequence corresponding to the data strip word sequence;
  • S204 Using a preset distance calculation formula, calculate the distance value between the standard word vector sequence and the data strip word vector sequence, and record the distance value as the similarity value.
  • the preset word vector database refers to a database that stores the mapping relationship between words and vectors, and is used to map words into vectors, thereby realizing the conversion of natural language that cannot be recognized by the computer into numbers.
  • The word vector database can be obtained in any way, such as directly using an already trained word vector database, or using the word2vec tool to train on a corpus prepared in advance.
  • word2vec includes the CBOW (Continuous Bag of Words) model.
  • CBOW predicts the target word from its surrounding context words, and this application preferably adopts the CBOW model for word vector training.
  • the distance calculation formula is to calculate the distance value between the standard word vector sequence and the data strip word vector sequence, and record the distance value as the similarity value.
  • the distance calculation formula in this application is used to calculate the distance (similarity) between two word vector sequences, and any feasible distance algorithm can be used, such as an algorithm based on Euclidean distance or an algorithm based on cosine similarity.
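
As a rough illustration of "any feasible distance algorithm", the sketch below averages each word vector sequence into one vector and then compares the two averages by Euclidean distance or cosine similarity. Averaging is a simplification chosen here for brevity, not the formula of the application, and the two-dimensional vectors are made up.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def lookup(words: List[str], vectors: Dict[str, Vector]) -> List[Vector]:
    """Map words to vectors, skipping words missing from the vector library."""
    return [vectors[w] for w in words if w in vectors]


def mean_vector(seq: List[Vector]) -> List[float]:
    """Component-wise average of a non-empty word vector sequence."""
    dim = len(seq[0])
    return [sum(v[k] for v in seq) / len(seq) for k in range(dim)]


def euclidean(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical two-dimensional word vector library.
vectors = {"price": [1.0, 0.0], "pork": [0.0, 1.0], "fund": [1.0, 1.0]}
standard = mean_vector(lookup(["price", "pork"], vectors))
strip = mean_vector(lookup(["fund"], vectors))
distance = euclidean(standard, strip)
similarity = cosine_similarity(standard, strip)
```

Note that the two measures run in opposite directions: a smaller Euclidean distance and a larger cosine similarity both indicate closer sentences, so whichever is stored must be used consistently for data strips and search sentences.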
  • Step S204, in which the preset distance calculation formula is used to calculate the distance value between the standard word vector sequence and the data strip word vector sequence and the distance value is recorded as the similarity value, includes:
  • In the distance formula, Distance(I,R) is the distance between the standard word vector sequence I and the data strip word vector sequence R; I is the standard word vector sequence; R is the data strip word vector sequence; Tij is the weight transfer amount from the i-th word in the standard word vector sequence I to the j-th word in the data strip word vector sequence R; di is the word frequency of the i-th word in the standard word vector sequence I; d'j is the word frequency of the j-th word in the data strip word vector sequence R; c(i,j) is the Euclidean distance between the i-th word in the standard word vector sequence I and the j-th word in the data strip word vector sequence R; m is the number of words with a word vector in the standard word vector sequence I; n is the number of words with a word vector in the data strip word vector sequence R.
  • In this way, the distance value between the standard word vector sequence and the data strip word vector sequence is calculated.
  • The above formula uses the Euclidean distance between word vectors.
  • The calculation formula of the Euclidean distance is:
  • From this, the distance value between the standard word vector sequence and the data strip word vector sequence can be calculated.
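
The formulas referred to in this passage are not present in this text. The symbol definitions given (a weight transfer amount Tij, word frequencies di and d'j, and a Euclidean cost c(i,j)) match the standard Word Mover's Distance, so, as an assumption rather than a quotation of the application, the distance would read as follows. The symbols x_i and x'_j for the word vectors and p for their dimension are introduced here for illustration.

```latex
\mathrm{Distance}(I, R) \;=\; \min_{T \ge 0} \; \sum_{i=1}^{m} \sum_{j=1}^{n} T_{ij}\, c(i, j)
\quad \text{subject to} \quad
\sum_{j=1}^{n} T_{ij} = d_i \;\; (i = 1, \dots, m), \qquad
\sum_{i=1}^{m} T_{ij} = d'_j \;\; (j = 1, \dots, n)

c(i, j) \;=\; \lVert x_i - x'_j \rVert_2 \;=\; \sqrt{\sum_{k=1}^{p} \left( x_{ik} - x'_{jk} \right)^2}
```

Under this reading, the distance is the minimum total cost of moving the word weights of I onto the words of R, with each move priced by the Euclidean distance between the two word vectors.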
  • a is a preset range parameter, and a is a positive number greater than 0.
  • the range parameter a corresponding to the specified standard sentence is obtained.
  • This application uses the generation of the hit range [similarity value for search-a, similarity value for search+a] to expand the search range (fuzzy search) and avoid missed detection.
  • An exact search can improve search efficiency and user experience. Therefore, before generating the hit range [similarity value for search - a, similarity value for search + a], this application first determines whether the database contains a data strip whose similarity value equals the similarity value for search.
  • If the exact search misses, that is, if the database contains no data strip whose similarity value equals the similarity value for search, the range parameter a corresponding to the designated standard sentence is obtained according to the preset correspondence between standard sentences and range parameters.
  • The range parameter a corresponding to the designated standard sentence is then used to perform a fuzzy search. In this way an exact search is performed first and a fuzzy search second; because the exact search consumes few computer resources, search efficiency is improved while using few computer resources.
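
The exact-then-fuzzy order can be sketched over an in-memory list. The tuple layout, the sentence key and the range parameter table are hypothetical; strict equality follows the wording of the text, although a real system comparing floating-point values might prefer a small tolerance.

```python
from typing import Dict, List, Tuple

Strip = Tuple[str, float]  # (text, stored similarity value)


def search(strips: List[Strip], search_similarity: float,
           standard_sentence: str, range_params: Dict[str, float]) -> List[Strip]:
    """Try an exact match on the similarity value first; fall back to the hit range."""
    exact = [s for s in strips if s[1] == search_similarity]
    if exact:
        return exact
    # Exact search missed: look up the range parameter for this standard sentence.
    a = range_params[standard_sentence]
    low, high = search_similarity - a, search_similarity + a
    return [s for s in strips if low <= s[1] <= high]


# Made-up stored values and a hypothetical range parameter table.
strips = [("a", 0.40), ("b", 0.50), ("c", 0.58)]
params = {"std": 0.1}
exact_hits = search(strips, 0.50, "std", params)
fuzzy_hits = search(strips, 0.55, "std", params)
```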
  • the step S7 of sorting the target data bar according to a preset sorting rule to obtain a data bar sequence, and outputting the data bar sequence includes:
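  • S701 Obtain the search records of the user, wherein search keywords are recorded in the search records;
  • S702 According to whether the target data strip has the search keyword, classify the target data strip into a first data strip and a second data strip, wherein the first data strip has the search keyword;
  • S703 Calculate the absolute value of the difference between the similarity value for search and the similarity value recorded in the similarity field of the target data strip;
  • S704 Arrange the first data strips and the second data strips in descending order or ascending order according to the absolute value, respectively, to obtain a first data strip sequence and a second data strip sequence;
  • S705 Combine the first data strip sequence and the second data strip sequence in a manner of preferentially displaying the first data strip sequence, thereby obtaining the data strip sequence, and output the data strip sequence.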
  • the target data strips are sorted to obtain a data strip sequence, and the data strip sequence is output.
  • the sorting of search results is very important, and the data bar that best meets the needs of users should be displayed to users first.
  • This application first obtains the user's search records, in which search keywords are recorded, and classifies the target data strips into first data strips and second data strips, wherein the first data strips contain the search keywords. The first data strips should be displayed first, because they better match the user's search habits and therefore the user's needs.
  • The first data strips and the second data strips are then each sorted in descending or ascending order of the absolute value to obtain the first data strip sequence and the second data strip sequence. Sorting is performed on this basis because "the absolute value of the difference between the similarity value for search and the similarity value recorded in the similarity field of the target data strip" reflects the degree of matching between the search sentence and the target data strip.
  • The first data strip sequence and the second data strip sequence are combined so that the first data strip sequence is displayed first, thereby obtaining the data strip sequence, which is output.
  • In other words, whether a data strip contains a search keyword is the first sorting priority and the absolute value is the second sorting priority, which yields the data strip sequence.
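
The two-level sorting rule can be sketched as below. Keyword matching by whitespace-split words, the tuple layout and the sample data are assumptions; ascending order (closest match first) is shown as the default, with descending available as the text allows.

```python
from typing import List, Set, Tuple

Strip = Tuple[str, float]  # (text, stored similarity value)


def sort_targets(targets: List[Strip], search_similarity: float,
                 history_keywords: Set[str], ascending: bool = True) -> List[Strip]:
    """First priority: contains a keyword from the user's search records.
    Second priority: |search similarity - stored similarity|."""
    def has_keyword(strip: Strip) -> bool:
        return bool(set(strip[0].lower().split()) & history_keywords)

    def gap(strip: Strip) -> float:
        return abs(search_similarity - strip[1])

    first = sorted((s for s in targets if has_keyword(s)), key=gap, reverse=not ascending)
    second = sorted((s for s in targets if not has_keyword(s)), key=gap, reverse=not ascending)
    return first + second  # first data strip sequence is displayed first


# Made-up target data strips and search history keywords.
targets = [("fund volume shanghai", 0.52), ("pork price beijing", 0.58),
           ("pork price shanghai", 0.45), ("bond yield", 0.50)]
ordered = sort_targets(targets, search_similarity=0.5, history_keywords={"pork"})
```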
  • The search method based on the similarity value of the present application acquires data strips and preprocesses them to obtain data strip word sequences; retrieves a pre-stored designated standard sentence and calculates the similarity value between the data strip word sequence and the designated standard sentence; stores the data strips in a preset database and adds a similarity field to the database; obtains the search sentence entered by the user and preprocesses it to obtain a search word sequence; calculates the similarity value for search between the search word sequence and the designated standard sentence; generates a hit range [similarity value for search - a, similarity value for search + a], retrieves from the database the data strips whose similarity values fall within the hit range, and records them as target data strips; and sorts the target data strips to obtain a data strip sequence, which is output. In this way, search is realized using only a small amount of computer resources.
  • an embodiment of the present application provides a search device based on a similarity value, including:
  • the data strip word sequence acquisition unit 10 is configured to acquire data strips, and preprocess the data strips according to a preset preprocessing method to obtain a data strip word sequence;
  • the first similarity value calculation unit 20 is configured to retrieve a pre-stored designated standard sentence, and calculate the similarity value between the word sequence of the data strip and the designated standard sentence according to a preset similarity algorithm;
  • the storage unit 30 is configured to store the data strip in a preset database and add a similar field to the database, wherein the similar field of the data strip records the similarity value;
  • the search word sequence obtaining unit 40 is configured to obtain the search sentence input by the user, and preprocess the search sentence to obtain the search word sequence;
  • the second similarity value calculation unit 50 is configured to calculate the similarity value for search between the search word sequence and the designated standard sentence according to a preset similarity algorithm;
  • the target data strip acquiring unit 60 is configured to generate a hit range [similarity value for search - a, similarity value for search + a], retrieve from the database the data strips whose similarity values fall within the hit range, and record them as target data strips, where a is a preset range parameter and a is greater than 0;
  • the data strip sequence output unit 70 is configured to sort the target data strips to obtain a data strip sequence according to a preset sorting rule, and output the data strip sequence.
  • the data strip word sequence acquiring unit 10 includes:
  • the initial word sequence acquisition subunit is used to perform word segmentation processing on the data strip, so as to obtain an initial word sequence composed of multiple words;
  • the meaningless word judgment subunit is used for judging whether there are meaningless words in the initial word sequence by querying a preset meaningless word library;
  • the intermediate word sequence acquisition subunit is used to remove the meaningless words in the initial word sequence if there are meaningless words in the initial word sequence, so as to obtain the intermediate word sequence;
  • the synonym group judgment subunit is used to judge whether there is a synonym group in the intermediate word sequence by querying a preset thesaurus;
  • the data strip word sequence acquisition subunit is used to replace all words in the synonym group with any one of the synonym groups if there is a synonym group in the intermediate word sequence, thereby obtaining the data strip word sequence.
  • the device includes:
  • the designated word acquisition unit is used to count the number of occurrences of each word in the word sequence of multiple data strips, acquire the word with the most occurrences, and record it as the designated word;
  • the frequency threshold judging unit is used to judge whether the occurrence frequency of the specified word is greater than a preset frequency threshold
  • a designated standard sentence acquiring unit is configured to, if the number of occurrences of the designated word is greater than a preset threshold of times, acquire a designated standard sentence corresponding to the designated word according to the correspondence relationship between the preset word and the standard sentence.
  • the first similarity value calculation unit 20 includes:
  • the designated standard sentence retrieval subunit is used to retrieve the pre-stored designated standard sentence
  • the standard word vector sequence obtaining subunit is used to query a preset word vector library to obtain the word vector corresponding to each word in the specified standard sentence, so as to obtain the standard word vector sequence corresponding to the specified standard sentence;
  • the first similarity value calculation subunit is configured to use a preset distance calculation formula to calculate the distance value between the standard word vector sequence and the data strip word vector sequence, and record the distance value as the similarity value.
  • the first similarity value calculation subunit includes:
  • the first similarity value calculation module is used to adopt the formula:
  • In the formula, Distance(I,R) is the distance between the standard word vector sequence I and the data strip word vector sequence R; I is the standard word vector sequence; R is the data strip word vector sequence; Tij is the weight transfer amount from the i-th word in the standard word vector sequence I to the j-th word in the data strip word vector sequence R; di is the word frequency of the i-th word in the standard word vector sequence I; d'j is the word frequency of the j-th word in the data strip word vector sequence R; c(i,j) is the Euclidean distance between the i-th word in the standard word vector sequence I and the j-th word in the data strip word vector sequence R; m is the number of words with a word vector in the standard word vector sequence I; n is the number of words with a word vector in the data strip word vector sequence R.
  • the device includes:
  • a data strip judging unit for judging whether there is a data strip with a similarity value equal to the similarity value for search in the database;
  • a range parameter obtaining unit configured to, if there is no data strip in the database with a similarity value equal to the similarity value for search, obtain the range parameter a corresponding to the designated standard sentence according to the preset correspondence between standard sentences and range parameters;
  • the hit range generating instruction generating unit is configured to generate a hit range generating instruction, wherein the hit range generating instruction is used to instruct to generate a hit range according to the range parameter a and the similarity value for search.
  • the data bar sequence output unit 70 includes:
  • the search record obtaining subunit is used to obtain the search record of the user, wherein search keywords are recorded in the search record;
  • the data strip classification subunit is used to classify the target data strip into a first data strip and a second data strip according to whether the target data strip has the search keyword, wherein the first data strip has the search keyword;
  • An absolute value obtaining subunit configured to calculate the absolute value of the difference between the similarity value for search and the similarity value of the similar field record of the target data strip;
  • a data strip arranging subunit for arranging the first data strip and the second data strip in descending order or ascending order according to the absolute value size, respectively, to obtain a first data strip sequence and a second data strip sequence;
  • the data strip sequence output subunit is configured to combine the first data strip sequence and the second data strip sequence in a manner of preferentially displaying the first data strip sequence to obtain a data strip sequence and output the data strip sequence.
  • an embodiment of the present application also provides a computer device.
  • the computer device may be a server, and its internal structure may be as shown in the figure.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, a computer program, and a database.
  • the internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium.
  • the database of the computer equipment is used to store the data used in the search method based on the similarity value.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer program is executed by the processor to realize a search method based on the similarity value.
  • the above-mentioned processor executes the above-mentioned similarity value-based search method, wherein the steps included in the method respectively correspond to the steps of executing the similarity value-based search method of the foregoing embodiment, and will not be repeated here.
  • An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed by a processor, the search method based on the similarity value is implemented, wherein the steps of the method correspond one-to-one to the steps of the search method based on the similarity value in the foregoing embodiments, and will not be repeated here.
  • the computer-readable storage medium is, for example, a non-volatile computer-readable storage medium or a volatile computer-readable storage medium.

Abstract

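A search method, device, computer equipment and storage medium based on similarity value. The method includes: obtaining a data strip and preprocessing it according to a preset preprocessing method to obtain a data strip word sequence (S1); retrieving a pre-stored specified standard sentence and calculating, according to a preset similarity algorithm, the similarity value between the data strip word sequence and the specified standard sentence (S2); storing the data strip in a preset database and adding a similar field to the database, wherein the similar field of the data strip records the similarity value (S3); obtaining a search sentence entered by a user and preprocessing it to obtain a search word sequence (S4); calculating, according to the preset similarity algorithm, the similarity value for search between the search word sequence and the specified standard sentence (S5); generating a hit range [similarity value for search - a, similarity value for search + a], retrieving from the database the data strips whose similarity values fall within the hit range, and recording them as target data strips, where a is a preset range parameter and a is a positive number (S6); and sorting the target data strips according to a preset sorting rule to obtain a data strip sequence, and outputting the data strip sequence (S7). Search is thereby achieved while requiring only a small amount of computer resources.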

Description

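Search method, device, computer equipment and storage medium based on similarity value
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 6, 2019, with application number 201910844343.9 and the invention title "Search method, device, computer equipment and storage medium based on similarity value", the entire contents of which are incorporated herein by reference.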
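Technical Field
This application relates to the computer field, and in particular to a search method, device, computer equipment and storage medium based on similarity value.
Background
Searching is a method that uses the high performance of a computer to purposefully enumerate some or all of the possible cases in the solution space of a problem in order to find its solution. A traditional search segments the input into words and then matches the resulting keywords against entries. Although simple, this approach depends on a large number of correct entries. If the search target has multi-dimensional attributes, the number of entries grows explosively, and later maintenance is costly and error-prone. For example, if the search target has M attributes and each attribute has N possible values, the total number of entries can be as high as N to the power of M. Prior-art search methods therefore need too many entries and too many computer resources.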
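Technical Problem
The main purpose of this application is to provide a search method, device, computer equipment and storage medium based on similarity value, aiming to achieve accurate search while requiring only a small amount of computer resources.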
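Technical Solution
To achieve the above purpose, this application proposes a search method based on similarity value, including the following steps:
obtaining a data strip, and preprocessing the data strip according to a preset preprocessing method to obtain a data strip word sequence;
retrieving a pre-stored specified standard sentence, and calculating, according to a preset similarity algorithm, the similarity value between the data strip word sequence and the specified standard sentence;
storing the data strip in a preset database, and adding a similar field to the database, wherein the similar field of the data strip records the similarity value;
obtaining a search sentence entered by a user, and preprocessing the search sentence to obtain a search word sequence;
calculating, according to the preset similarity algorithm, the similarity value for search between the search word sequence and the specified standard sentence;
generating a hit range [similarity value for search - a, similarity value for search + a], retrieving from the database the data strips whose similarity values fall within the hit range, and recording them as target data strips, where a is a preset range parameter and a is a positive number;
sorting the target data strips according to a preset sorting rule to obtain a data strip sequence, and outputting the data strip sequence.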
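Beneficial Effects
The search method, device, computer equipment and storage medium based on similarity value of this application obtain a data strip and preprocess it to obtain a data strip word sequence; retrieve a pre-stored specified standard sentence and calculate the similarity value between the data strip word sequence and the specified standard sentence; store the data strip in a preset database and add a similar field to the database; obtain a search sentence entered by a user and preprocess it to obtain a search word sequence; calculate the similarity value for search between the search word sequence and the specified standard sentence; generate a hit range [similarity value for search - a, similarity value for search + a], retrieve from the database the data strips whose similarity values fall within the hit range, and record them as target data strips; and sort the target data strips to obtain a data strip sequence and output it. Search is thereby achieved while requiring only a small amount of computer resources.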
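Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a search method based on similarity value according to an embodiment of this application;
FIG. 2 is a schematic structural block diagram of a search device based on similarity value according to an embodiment of this application;
FIG. 3 is a schematic structural block diagram of a computer device according to an embodiment of this application.
The realization of the purpose, functional characteristics and advantages of this application will be further described with reference to the embodiments and the accompanying drawings.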
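Best Mode for Carrying Out the Application
To make the purpose, technical solutions and advantages of this application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain this application and do not limit it.
Referring to FIG. 1, an embodiment of this application provides a search method based on similarity value, including the following steps:
S1. Obtaining a data strip, and preprocessing the data strip according to a preset preprocessing method to obtain a data strip word sequence;
S2. Retrieving a pre-stored specified standard sentence, and calculating, according to a preset similarity algorithm, the similarity value between the data strip word sequence and the specified standard sentence;
S3. Storing the data strip in a preset database, and adding a similar field to the database, wherein the similar field of the data strip records the similarity value;
S4. Obtaining a search sentence entered by a user, and preprocessing the search sentence to obtain a search word sequence;
S5. Calculating, according to the preset similarity algorithm, the similarity value for search between the search word sequence and the specified standard sentence;
S6. Generating a hit range [similarity value for search - a, similarity value for search + a], retrieving from the database the data strips whose similarity values fall within the hit range, and recording them as target data strips, where a is a preset range parameter and a is a positive number;
S7. Sorting the target data strips according to a preset sorting rule to obtain a data strip sequence, and outputting the data strip sequence.
This application uses the similarity value as the basis for search matching. Compared with existing technical solutions, only one field is needed to record the similarity value: if there are N distinct similarity values, only N entries (each consisting of one field value) are needed, so the search can be carried out with only a small amount of computer resources.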
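As described in step S1 above, a data strip is obtained and preprocessed according to a preset preprocessing method to obtain a data strip word sequence. A data strip is a sentence that carries information, for example "北京在A日的猪肉价格为B元" (the pork price in Beijing on day A is B yuan) or "上海在A日的金融产品C的成交数量为D" (the trading volume of financial product C in Shanghai on day A is D). The preprocessing is, for example: performing word segmentation on the data strip to obtain an initial word sequence composed of multiple words; querying a preset meaningless-word library to determine whether the initial word sequence contains meaningless words; if it does, removing the meaningless words to obtain an intermediate word sequence; querying a preset synonym library to determine whether the intermediate word sequence contains a synonym group; and if it does, replacing all words of the synonym group with any one word of that group to obtain the data strip word sequence. For example, preprocessing "北京在A日的猪肉价格为B元" yields |北京|A日|猪肉价格|B元|, where 在 ("on"), 的 ("of") and 为 ("is") are treated as meaningless words and removed.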
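As described in step S2 above, a pre-stored specified standard sentence is retrieved, and the similarity value between the data strip word sequence and the specified standard sentence is calculated according to a preset similarity algorithm. This application uses the specified standard sentence as the reference for obtaining the similarity value, and the similarity value serves as the criterion for later deciding whether a search hits. The preset similarity algorithm may be any algorithm that computes the similarity between two sentences, for example: retrieving the pre-stored specified standard sentence; querying a preset word vector library to obtain the word vector of each word in the specified standard sentence, thereby obtaining the standard word vector sequence corresponding to the specified standard sentence; querying the preset word vector library to obtain the word vector of each word in the data strip word sequence, thereby obtaining the data strip word vector sequence corresponding to the data strip word sequence; and using a preset distance formula to calculate the distance value between the standard word vector sequence and the data strip word vector sequence, and recording the distance value as the similarity value. The word vector library can be obtained by training with the word2vec tool. word2vec is a tool for training word vectors and includes two models, CBOW (Continuous Bag of Words) and Skip-Gram. CBOW predicts the target word from the surrounding original sentence (its context), whereas Skip-Gram predicts the surrounding sentence (the context) from the target word.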
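As described in step S3 above, the data strip is stored in a preset database, and a similar field is added to the database, wherein the similar field of the data strip records the similarity value. This completes storing the data strip in the database. Because the similar field of the data strip records the similarity value, the data strip can later be found by searching on that similarity value.
As described in step S4 above, a search sentence entered by a user is obtained and preprocessed to obtain a search word sequence. The preprocessing method may or may not be the same as the one applied to the data strip, but it at least includes word segmentation of the search sentence, which yields the search word sequence. To keep data processing consistent and thereby improve search accuracy, this application preferably preprocesses the search sentence with the same method used for the data strip. Examples of search sentences are "北京的猪肉价格" (the pork price in Beijing) and "上海的金融产品C的成交量" (the trading volume of financial product C in Shanghai).
As described in step S5 above, the similarity value for search between the search word sequence and the specified standard sentence is calculated according to a preset similarity algorithm. This algorithm may or may not be the same as the one used to calculate the similarity value between the data strip word sequence and the specified standard sentence; this application preferably uses the same one. The resulting similarity value for search reflects how well the search sentence matches the specified standard sentence, and is used later as the basis for determining the search hits.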
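As described in step S6 above, a hit range [similarity value for search - a, similarity value for search + a] is generated, and the data strips whose similarity values fall within the hit range are retrieved from the database and recorded as target data strips, where a is a preset range parameter and a is a positive number. Generating a hit range reduces the miss rate of the search method. If a hit were declared only when the similarity value for search exactly equals a similarity value stored in the database, some closely related data strips would be missed and the search results would suffer. Generating the hit range [similarity value for search - a, similarity value for search + a] widens the set of hits and so avoids such misses.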
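The storage step (S3) and the hit-range lookup (S6) can be sketched with an ordinary relational table. The following is a minimal illustration, not the applicant's implementation: the table name `data_strips`, the column name `similar`, the index, and the sample values are all assumptions made for the example, and SQLite is used only because it ships with Python.

```python
import sqlite3

def build_database(strips):
    """Store (text, similarity value) pairs; the 'similar' column plays the role of the similar field."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data_strips (id INTEGER PRIMARY KEY, text TEXT, similar REAL)"
    )
    conn.executemany("INSERT INTO data_strips (text, similar) VALUES (?, ?)", strips)
    # Hypothetical index on the similar field so the range lookup can use it.
    conn.execute("CREATE INDEX idx_similar ON data_strips (similar)")
    return conn

def fetch_targets(conn, search_similarity, a):
    """Return the data strips whose similarity value lies in [s - a, s + a]."""
    if a <= 0:
        raise ValueError("range parameter a must be positive")
    low, high = search_similarity - a, search_similarity + a
    rows = conn.execute(
        "SELECT text, similar FROM data_strips "
        "WHERE similar BETWEEN ? AND ? ORDER BY similar",
        (low, high),
    )
    return rows.fetchall()

# Illustrative values only.
conn = build_database([("strip A", 0.42), ("strip B", 0.55), ("strip C", 0.91)])
targets = fetch_targets(conn, search_similarity=0.5, a=0.1)
```

With these sample values the hit range is [0.4, 0.6], so strips A and B are returned and strip C is not. The lookup touches a single numeric column, which is the property the description relies on for its low resource cost.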
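As described in step S7 above, the target data strips are sorted according to a preset sorting rule to obtain a data strip sequence, and the data strip sequence is output. The preset sorting rule may be any sorting rule, for example sorting in ascending or descending order by the absolute value of the difference between the similarity value for search and the similarity value recorded in the similar field. Further, the preset sorting rule is, for example: obtaining the search record of the user, wherein search keywords are recorded in the search record; classifying the target data strips into first data strips and second data strips according to whether they contain the search keywords, wherein the first data strips contain the search keywords; calculating the absolute value of the difference between the similarity value for search and the similarity value recorded in the similar field of each target data strip; sorting the first data strips and the second data strips separately in descending or ascending order of that absolute value to obtain a first data strip sequence and a second data strip sequence; and combining the two sequences so that the first data strip sequence is displayed first, thereby obtaining the data strip sequence, and outputting it. The search is thus accomplished with only a small amount of resources.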
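In one embodiment, step S1 of preprocessing the data strip according to a preset preprocessing method to obtain a data strip word sequence includes:
S101. Performing word segmentation on the data strip to obtain an initial word sequence composed of multiple words;
S102. Querying a preset meaningless-word library to determine whether the initial word sequence contains meaningless words;
S103. If the initial word sequence contains meaningless words, removing them from the initial word sequence to obtain an intermediate word sequence;
S104. Querying a preset synonym library to determine whether the intermediate word sequence contains a synonym group;
S105. If the intermediate word sequence contains a synonym group, replacing all words of the synonym group with any one word of that group to obtain the data strip word sequence.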
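As described above, the data strip is preprocessed according to a preset preprocessing method to obtain the data strip word sequence. Word segmentation can use open-source tools such as jieba, THULAC or NLPIR. For example, "北京在A日的猪肉价格为B元" is segmented into |北京|在|A日|的|猪肉价格|为|B元|. The further preprocessing consists of meaningless-word removal and synonym replacement, which completes the preprocessing and yields the data strip word sequence. Meaningless-word removal works as follows: a preset meaningless-word library is queried to determine whether the initial word sequence contains meaningless words, and if it does, they are removed to obtain the intermediate word sequence. In the example above, 在, 的 and 为 are meaningless words and are removed. Synonym replacement works as follows: a preset synonym library is queried to determine whether the intermediate word sequence contains a synonym group, and if it does, all words of the synonym group are replaced with any one word of that group to obtain the data strip word sequence. The synonym library contains multiple synonym entries; if two or more words of the word sequence appear in the same synonym entry, those words form a synonym group. In general, replacing synonyms does not change the original meaning of a sentence, so synonym replacement is used to reduce the amount of computation and data storage. For example, 北京 (Beijing) and 首都 (the capital) can form a synonym group.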
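Steps S102 to S105 can be sketched as follows. This is a minimal illustration under stated assumptions: the input is taken to be already segmented (for example by a tool such as jieba), and the contents of `MEANINGLESS_WORDS` and `SYNONYM_GROUPS` are hypothetical stand-ins for the preset libraries, filled with the words from the example in the text.

```python
# Hypothetical stand-ins for the preset meaningless-word library and synonym library.
MEANINGLESS_WORDS = {"在", "的", "为"}
SYNONYM_GROUPS = [{"北京", "首都"}]

def remove_meaningless(words):
    """S102/S103: drop every word found in the meaningless-word library."""
    return [w for w in words if w not in MEANINGLESS_WORDS]

def replace_synonyms(words):
    """S104/S105: if two or more distinct words share a synonym entry, map them to one word."""
    result = list(words)
    for group in SYNONYM_GROUPS:
        present = [w for w in result if w in group]
        if len(set(present)) >= 2:
            representative = present[0]  # any member is allowed; take the first one seen
            result = [representative if w in group else w for w in result]
    return result

def preprocess(initial_words):
    """Turn an initial word sequence into a data strip word sequence."""
    return replace_synonyms(remove_meaningless(initial_words))

# Initial word sequence from the segmentation example in the text.
initial = ["北京", "在", "A日", "的", "猪肉价格", "为", "B元"]
word_sequence = preprocess(initial)
```

Applying the same `preprocess` function to search sentences keeps the two sides consistent, which is the preference the description states for step S4.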
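In one embodiment, there are multiple data strips and multiple data strip word sequences, and before step S2 of retrieving a pre-stored specified standard sentence and calculating, according to a preset similarity algorithm, the similarity value between the data strip word sequence and the specified standard sentence, the method includes:
S11. Counting the number of occurrences of each word in the multiple data strip word sequences, obtaining the word with the most occurrences, and recording it as the specified word;
S12. Determining whether the number of occurrences of the specified word is greater than a preset count threshold;
S13. If the number of occurrences of the specified word is greater than the preset count threshold, obtaining the specified standard sentence corresponding to the specified word according to a preset correspondence between words and standard sentences.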
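As described above, the specified standard sentence corresponding to the specified word is obtained. Because this application uses the specified standard sentence as the reference for the similarity value, the choice of the specified standard sentence is particularly important and affects the accuracy of the search results. This application therefore counts the number of occurrences of each word in the multiple data strip word sequences, takes the word with the most occurrences as the specified word, and, if its number of occurrences is greater than the preset count threshold, obtains the corresponding specified standard sentence according to the preset correspondence between words and standard sentences. The guiding principle is to maximize the relevance between the specified standard sentence and the data strips to be stored. If the number of occurrences of the specified word is greater than the preset count threshold, the specified word is a representative feature of the data strips, and the specified standard sentence found from it is more appropriate. The specified standard sentence is, for example, a sentence that includes the specified word. Further, the step of taking the single most frequent word as the specified word may be replaced by: counting the number of occurrences of each word in the multiple data strip word sequences and recording every word whose number of occurrences exceeds a preset value as a specified word, so that the specified standard sentence is obtained from a correspondence between multiple specified words and standard sentences, which further improves search accuracy.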
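The selection in steps S11 to S13 amounts to a frequency count followed by a table lookup. The sketch below assumes the correspondence between words and standard sentences is a plain dictionary; the function name, the English placeholder words and the sample mapping are hypothetical and serve only to illustrate the flow.

```python
from collections import Counter

def pick_standard_sentence(word_sequences, word_to_sentence, count_threshold):
    """S11-S13: find the most frequent word and map it to a standard sentence.

    Returns None when no word occurs more often than the threshold or when the
    word has no entry in the (assumed) word-to-sentence mapping.
    """
    counts = Counter(w for seq in word_sequences for w in seq)
    if not counts:
        return None
    specified_word, occurrences = counts.most_common(1)[0]
    if occurrences > count_threshold:
        return word_to_sentence.get(specified_word)
    return None

# Illustrative data strip word sequences and mapping.
sequences = [
    ["beijing", "day_a", "pork_price"],
    ["shanghai", "day_a", "pork_price"],
    ["guangzhou", "day_b", "pork_price"],
]
mapping = {"pork_price": "the pork price in a city on a given day"}
standard = pick_standard_sentence(sequences, mapping, count_threshold=2)
```

Here "pork_price" occurs three times, which exceeds the threshold of two, so its mapped sentence is selected. The source does not say what happens when the threshold is not exceeded; returning `None` is a choice made for this sketch.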
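In one embodiment, step S2 of retrieving a pre-stored specified standard sentence and calculating, according to a preset similarity algorithm, the similarity value between the data strip word sequence and the specified standard sentence includes:
S201. Retrieving the pre-stored specified standard sentence;
S202. Querying a preset word vector library to obtain the word vector of each word in the specified standard sentence, thereby obtaining the standard word vector sequence corresponding to the specified standard sentence;
S203. Querying the preset word vector library to obtain the word vector of each word in the data strip word sequence, thereby obtaining the data strip word vector sequence corresponding to the data strip word sequence;
S204. Using a preset distance formula to calculate the distance value between the standard word vector sequence and the data strip word vector sequence, and recording the distance value as the similarity value.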
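The preset word vector library is a database that stores the mapping between words and vectors. It maps words to vectors and thereby turns natural language, which a computer cannot process directly, into numbers. The word vector library can be obtained in any way, for example by directly using an already trained word vector library, or by training on a prepared corpus with the word2vec tool, which includes the CBOW (Continuous Bag of Words) model. CBOW predicts the target word from the original sentence, and this application preferably uses the CBOW model for word vector training. The preset word vector library is thus queried to map words to word vectors, yielding the standard word vector sequence corresponding to the specified standard sentence and the data strip word vector sequence corresponding to the data strip word sequence. A preset distance formula is then used to calculate the distance value between the two sequences, and the distance value is recorded as the similarity value. The distance formula calculates the distance (similarity) between two word vector sequences and may be any feasible distance algorithm, for example one based on Euclidean distance or one based on cosine similarity.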
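The lookup in S202/S203 and a very simple choice of distance for S204 can be sketched as below. Two things here are assumptions rather than the method of the source: the toy two-dimensional `WORD_VECTORS` dictionary stands in for a trained word vector library, and the distance is the Euclidean distance between the mean vectors of the two sequences, which is only one of the "any feasible" Euclidean-based algorithms the text allows and is not the formula of step S2041.

```python
import math

# Toy stand-in for a trained word vector library (values are illustrative).
WORD_VECTORS = {
    "beijing": [1.0, 0.0],
    "shanghai": [0.8, 0.2],
    "pork_price": [0.0, 1.0],
}

def to_vector_sequence(words, library=WORD_VECTORS):
    """S202/S203: look up each word; words without a vector are skipped."""
    return [library[w] for w in words if w in library]

def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("no word in the sequence has a word vector")
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def mean_euclidean_distance(seq_a, seq_b):
    """S204 (simplified): Euclidean distance between the two mean vectors."""
    return math.dist(mean_vector(seq_a), mean_vector(seq_b))

standard_vectors = to_vector_sequence(["beijing", "pork_price"])
strip_vectors = to_vector_sequence(["shanghai", "unknown_word", "pork_price"])
similarity_value = mean_euclidean_distance(standard_vectors, strip_vectors)
```

Because the recorded value is a distance, a smaller number means the data strip is closer to the specified standard sentence.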
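In one embodiment, step S204 of using a preset distance formula to calculate the distance value between the standard word vector sequence and the data strip word vector sequence and recording the distance value as the similarity value includes:
S2041. Using the formula:
Figure PCTCN2019117213-appb-000001
, subject to
Figure PCTCN2019117213-appb-000002
to calculate the distance value between the standard word vector sequence and the data strip word vector sequence, and recording the distance value as the similarity value; where Distance(I,R) is the distance between the standard word vector sequence I and the data strip word vector sequence R; I is the standard word vector sequence; R is the data strip word vector sequence; T_ij is the amount of weight transferred from the i-th word in the standard word vector sequence I to the j-th word in the data strip word vector sequence R; d_i is the word frequency of the i-th word in the standard word vector sequence I; d'_j is the word frequency of the j-th word in the data strip word vector sequence R; c(i,j) is the Euclidean distance between the i-th word in the standard word vector sequence I and the j-th word in the data strip word vector sequence R; m is the number of words in the standard word vector sequence I that have word vectors; and n is the number of words in the data strip word vector sequence R that have word vectors.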
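The two formula figures are not reproduced in this text. Judging from the symbol definitions (a weight transfer amount T_ij, word frequencies d_i and d'_j, and a Euclidean cost c(i,j)), they appear to correspond to the word mover's distance, whose standard form is given below. This is a reconstruction under that assumption, not a transcription of the figures.

```latex
\mathrm{Distance}(I,R) \;=\; \min_{T \ge 0} \sum_{i=1}^{m} \sum_{j=1}^{n} T_{ij}\, c(i,j)
\quad \text{subject to} \quad
\sum_{j=1}^{n} T_{ij} = d_i \;\; (i = 1,\dots,m), \qquad
\sum_{i=1}^{m} T_{ij} = d'_j \;\; (j = 1,\dots,n)
```

Read this way, the first constraint says that all of the weight d_i of each word in I is sent somewhere in R, and the second says that each word in R receives exactly its own weight d'_j.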
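As described above, the distance value between the standard word vector sequence and the data strip word vector sequence is calculated. The above formula uses the Euclidean distance between word vectors, which is calculated as:
Figure PCTCN2019117213-appb-000003
where d(x,y) is the Euclidean distance between the word vector x=(x1,x2,x3…,xn) and the word vector y=(y1,y2,y3…,yn), and n is the dimension of the word vectors. Substituting the Euclidean distance formula into the formula for the distance value between the standard word vector sequence and the data strip word vector sequence gives the distance value between the two sequences.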
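The Euclidean distance between two n-dimensional word vectors is the square root of the sum of squared component differences, and the sketch below implements exactly that. The second function is a deliberately simplified substitute for the full formula of step S2041: it computes a relaxed lower bound in which each word of I sends all of its weight to its nearest word in R, so it is not the optimal-transport distance itself. Normalizing word frequencies to sum to 1 is also an assumption of this sketch.

```python
import math
from collections import Counter

def euclidean(x, y):
    """d(x, y) = sqrt(sum_k (x_k - y_k)^2) for two vectors of equal dimension."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

def normalized_frequencies(words):
    """Word frequency d_i of each distinct word, scaled so the frequencies sum to 1."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def relaxed_distance(words_i, words_r, vectors):
    """Lower bound on the transport cost: sum_i d_i * min_j c(i, j).

    Only words that have a vector take part, mirroring the definitions of m and n.
    """
    d = normalized_frequencies([w for w in words_i if w in vectors])
    targets = {w for w in words_r if w in vectors}
    if not d or not targets:
        raise ValueError("each sequence needs at least one word with a vector")
    return sum(
        d_i * min(euclidean(vectors[w], vectors[t]) for t in targets)
        for w, d_i in d.items()
    )

# Toy vectors chosen so the arithmetic is easy to follow.
toy_vectors = {"a": [0.0, 0.0], "b": [3.0, 4.0], "c": [0.0, 1.0]}
bound = relaxed_distance(["a", "b"], ["a", "c"], toy_vectors)
```

In this toy case word "a" matches itself at cost 0 and word "b" is nearest to "c" at cost sqrt(18), so the bound is half of sqrt(18). An exact value would require solving the transport problem, for instance with a linear programming solver.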
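In one embodiment, before step S6 of generating a hit range [similarity value for search - a, similarity value for search + a], retrieving from the database the data strips whose similarity values fall within the hit range, and recording them as target data strips, where a is a preset range parameter and a is a positive number, the method includes:
S51. Determining whether the database contains a data strip whose similarity value equals the similarity value for search;
S52. If the database contains no data strip whose similarity value equals the similarity value for search, obtaining the range parameter a corresponding to the specified standard sentence according to a preset correspondence between standard sentences and range parameters;
S53. Generating a hit range generation instruction, wherein the hit range generation instruction instructs generation of the hit range according to the range parameter a and the similarity value for search.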
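As described above, the range parameter a corresponding to the specified standard sentence is obtained. Generating the hit range [similarity value for search - a, similarity value for search + a] widens the search (fuzzy retrieval) and avoids misses. However, if the user already knows exactly which data strip they want to retrieve and enters a search sentence identical to that data strip, running an exact search first improves search efficiency and the user experience. Therefore, before generating the hit range, this application first performs exact retrieval by determining whether the database contains a data strip whose similarity value equals the similarity value for search. If exact retrieval misses, that is, if the database contains no such data strip, the range parameter a corresponding to the specified standard sentence is obtained according to the preset correspondence between standard sentences and range parameters, and fuzzy retrieval is then performed. Exact retrieval thus precedes fuzzy retrieval, and because exact retrieval consumes few computer resources, retrieval efficiency improves at a low resource cost.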
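The exact-then-fuzzy order of steps S51 to S53 can be sketched over an in-memory list of records. The function name, the `range_for_standard` dictionary (standing in for the preset correspondence between standard sentences and range parameters) and the sample values are assumptions for illustration. The sketch follows the source in testing for strict equality; a real system working with floating-point similarity values might prefer a small tolerance.

```python
def retrieve(records, search_similarity, standard_sentence, range_for_standard):
    """Exact retrieval first (S51); fall back to the hit range on a miss (S52, S53).

    records: list of (text, similarity value) pairs.
    """
    exact = [r for r in records if r[1] == search_similarity]
    if exact:
        return exact
    # Exact retrieval missed: look up the range parameter a for this standard sentence.
    a = range_for_standard[standard_sentence]
    low, high = search_similarity - a, search_similarity + a
    return [r for r in records if low <= r[1] <= high]

# Illustrative records and range parameter.
records = [("x", 0.30), ("y", 0.50), ("z", 0.58)]
ranges = {"std": 0.1}
exact_hits = retrieve(records, 0.50, "std", ranges)
fuzzy_hits = retrieve(records, 0.52, "std", ranges)
```

The first call finds an exact match and never consults the range parameter; the second finds none and falls back to the hit range [0.42, 0.62].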
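In one embodiment, step S7 of sorting the target data strips according to a preset sorting rule to obtain a data strip sequence and outputting the data strip sequence includes:
S701. Obtaining the search record of the user, wherein search keywords are recorded in the search record;
S702. Classifying the target data strips into first data strips and second data strips according to whether they contain the search keywords, wherein the first data strips contain the search keywords;
S703. Calculating the absolute value of the difference between the similarity value for search and the similarity value recorded in the similar field of each target data strip;
S704. Sorting the first data strips and the second data strips separately in descending or ascending order of the absolute value to obtain a first data strip sequence and a second data strip sequence;
S705. Combining the first data strip sequence and the second data strip sequence so that the first data strip sequence is displayed first, thereby obtaining the data strip sequence, and outputting the data strip sequence.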
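As described above, the target data strips are sorted to obtain the data strip sequence, and the data strip sequence is output. The ordering of search results matters: the data strips that best meet the user's needs should be shown first. This application first obtains the search record of the user, in which search keywords are recorded, and classifies the target data strips into first data strips and second data strips, where the first data strips contain the search keywords. The first data strips should be shown first because they better match the user's search habits and therefore the user's needs. Next, the absolute value of the difference between the similarity value for search and the similarity value recorded in the similar field of each target data strip is calculated, and the first data strips and the second data strips are sorted separately in descending or ascending order of that absolute value to obtain the first data strip sequence and the second data strip sequence. The sorting uses this absolute value because it reflects how well the search sentence matches the target data strip. Finally, the two sequences are combined so that the first data strip sequence is displayed first, which gives the data strip sequence, and the data strip sequence is output. The ordering thus uses the presence of search keywords as the first priority and the absolute value as the second priority.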
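The two-level ordering of steps S701 to S705 can be sketched as below. The source permits ascending or descending order; this sketch picks ascending, so that within each group the data strip closest to the similarity value for search comes first. The substring test used to decide whether a data strip "has" a keyword, the function name and the sample data are assumptions for illustration.

```python
def rank_targets(targets, search_similarity, search_keywords):
    """Keyword-bearing strips first; within each group, smallest |difference| first.

    targets: list of (text, similarity value) pairs.
    """
    def has_keyword(text):
        return any(keyword in text for keyword in search_keywords)

    def gap(item):
        # S703: absolute difference between the two similarity values.
        return abs(search_similarity - item[1])

    first = sorted((t for t in targets if has_keyword(t[0])), key=gap)
    second = sorted((t for t in targets if not has_keyword(t[0])), key=gap)
    return first + second  # S705: the first sequence is displayed first

# Illustrative target data strips and search history keywords.
targets = [
    ("Shanghai product C volume", 0.49),
    ("Beijing pork price day A", 0.60),
    ("Beijing pork price day B", 0.52),
]
ranked = rank_targets(targets, search_similarity=0.5, search_keywords=["pork"])
```

Note that the Shanghai strip has the smallest absolute difference overall but is still listed last, because keyword presence takes priority over the difference.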
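The search method based on similarity value of this application obtains a data strip and preprocesses it to obtain a data strip word sequence; retrieves a pre-stored specified standard sentence and calculates the similarity value between the data strip word sequence and the specified standard sentence; stores the data strip in a preset database and adds a similar field to the database; obtains a search sentence entered by a user and preprocesses it to obtain a search word sequence; calculates the similarity value for search between the search word sequence and the specified standard sentence; generates a hit range [similarity value for search - a, similarity value for search + a], retrieves from the database the data strips whose similarity values fall within the hit range, and records them as target data strips; and sorts the target data strips to obtain a data strip sequence and outputs it. Search is thereby achieved while requiring only a small amount of computer resources.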
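Referring to FIG. 2, an embodiment of this application provides a search device based on similarity value, including:
a data strip word sequence obtaining unit 10, configured to obtain a data strip and preprocess the data strip according to a preset preprocessing method to obtain a data strip word sequence;
a first similarity value calculation unit 20, configured to retrieve a pre-stored specified standard sentence and calculate, according to a preset similarity algorithm, the similarity value between the data strip word sequence and the specified standard sentence;
a storage unit 30, configured to store the data strip in a preset database and add a similar field to the database, wherein the similar field of the data strip records the similarity value;
a search word sequence obtaining unit 40, configured to obtain a search sentence entered by a user and preprocess the search sentence to obtain a search word sequence;
a second similarity value calculation unit 50, configured to calculate, according to the preset similarity algorithm, the similarity value for search between the search word sequence and the specified standard sentence;
a target data strip obtaining unit 60, configured to generate a hit range [similarity value for search - a, similarity value for search + a], retrieve from the database the data strips whose similarity values fall within the hit range, and record them as target data strips, where a is a preset range parameter and a is a positive number;
a data strip sequence output unit 70, configured to sort the target data strips according to a preset sorting rule to obtain a data strip sequence, and output the data strip sequence.
The operations performed by the above units correspond one-to-one to the steps of the search method based on similarity value in the foregoing embodiments, and are not repeated here.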
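In one embodiment, the data strip word sequence obtaining unit 10 includes:
an initial word sequence obtaining subunit, configured to perform word segmentation on the data strip to obtain an initial word sequence composed of multiple words;
a meaningless word judging subunit, configured to query a preset meaningless-word library to determine whether the initial word sequence contains meaningless words;
an intermediate word sequence obtaining subunit, configured to, if the initial word sequence contains meaningless words, remove them from the initial word sequence to obtain an intermediate word sequence;
a synonym group judging subunit, configured to query a preset synonym library to determine whether the intermediate word sequence contains a synonym group;
a data strip word sequence obtaining subunit, configured to, if the intermediate word sequence contains a synonym group, replace all words of the synonym group with any one word of that group to obtain the data strip word sequence.
The operations performed by the above subunits correspond one-to-one to the steps of the search method based on similarity value in the foregoing embodiments, and are not repeated here.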
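In one embodiment, there are multiple data strips and multiple data strip word sequences, and the device includes:
a specified word obtaining unit, configured to count the number of occurrences of each word in the multiple data strip word sequences, obtain the word with the most occurrences, and record it as the specified word;
a count threshold judging unit, configured to determine whether the number of occurrences of the specified word is greater than a preset count threshold;
a specified standard sentence obtaining unit, configured to, if the number of occurrences of the specified word is greater than the preset count threshold, obtain the specified standard sentence corresponding to the specified word according to a preset correspondence between words and standard sentences.
The operations performed by the above units correspond one-to-one to the steps of the search method based on similarity value in the foregoing embodiments, and are not repeated here.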
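In one embodiment, the first similarity value calculation unit 20 includes:
a specified standard sentence retrieval subunit, configured to retrieve the pre-stored specified standard sentence;
a standard word vector sequence obtaining subunit, configured to query a preset word vector library to obtain the word vector of each word in the specified standard sentence, thereby obtaining the standard word vector sequence corresponding to the specified standard sentence;
a data strip word vector sequence obtaining subunit, configured to query the preset word vector library to obtain the word vector of each word in the data strip word sequence, thereby obtaining the data strip word vector sequence corresponding to the data strip word sequence;
a first similarity value calculation subunit, configured to use a preset distance formula to calculate the distance value between the standard word vector sequence and the data strip word vector sequence, and record the distance value as the similarity value.
The operations performed by the above subunits correspond one-to-one to the steps of the search method based on similarity value in the foregoing embodiments, and are not repeated here.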
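In one embodiment, the first similarity value calculation subunit includes:
a first similarity value calculation module, configured to use the formula:
Figure PCTCN2019117213-appb-000004
, subject to
Figure PCTCN2019117213-appb-000005
to calculate the distance value between the standard word vector sequence and the data strip word vector sequence, and record the distance value as the similarity value; where Distance(I,R) is the distance between the standard word vector sequence I and the data strip word vector sequence R; I is the standard word vector sequence; R is the data strip word vector sequence; T_ij is the amount of weight transferred from the i-th word in the standard word vector sequence I to the j-th word in the data strip word vector sequence R; d_i is the word frequency of the i-th word in the standard word vector sequence I; d'_j is the word frequency of the j-th word in the data strip word vector sequence R; c(i,j) is the Euclidean distance between the i-th word in the standard word vector sequence I and the j-th word in the data strip word vector sequence R; m is the number of words in the standard word vector sequence I that have word vectors; and n is the number of words in the data strip word vector sequence R that have word vectors.
The operations performed by the above module correspond one-to-one to the steps of the search method based on similarity value in the foregoing embodiments, and are not repeated here.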
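In one embodiment, the device includes:
a data strip judging unit, configured to determine whether the database contains a data strip whose similarity value equals the similarity value for search;
a range parameter a obtaining unit, configured to, if the database contains no data strip whose similarity value equals the similarity value for search, obtain the range parameter a corresponding to the specified standard sentence according to a preset correspondence between standard sentences and range parameters;
a hit range generation instruction generating unit, configured to generate a hit range generation instruction, wherein the hit range generation instruction instructs generation of the hit range according to the range parameter a and the similarity value for search.
The operations performed by the above units correspond one-to-one to the steps of the search method based on similarity value in the foregoing embodiments, and are not repeated here.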
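In one embodiment, the data strip sequence output unit 70 includes:
a search record obtaining subunit, configured to obtain the search record of the user, wherein search keywords are recorded in the search record;
a data strip classification subunit, configured to classify the target data strips into first data strips and second data strips according to whether they contain the search keywords, wherein the first data strips contain the search keywords;
an absolute value obtaining subunit, configured to calculate the absolute value of the difference between the similarity value for search and the similarity value recorded in the similar field of each target data strip;
a data strip arranging subunit, configured to sort the first data strips and the second data strips separately in descending or ascending order of the absolute value to obtain a first data strip sequence and a second data strip sequence;
a data strip sequence output subunit, configured to combine the first data strip sequence and the second data strip sequence so that the first data strip sequence is displayed first, thereby obtaining the data strip sequence, and output the data strip sequence.
The operations performed by the above subunits correspond one-to-one to the steps of the search method based on similarity value in the foregoing embodiments, and are not repeated here.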
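Referring to FIG. 3, an embodiment of this application also provides a computer device. The computer device may be a server, and its internal structure may be as shown in the figure. The computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device stores the data used by the search method based on similarity value. The network interface of the computer device communicates with external terminals through a network connection. When executed by the processor, the computer program implements a search method based on similarity value.
The above processor executes the above search method based on similarity value, wherein the steps of the method correspond one-to-one to the steps of the search method based on similarity value in the foregoing embodiments, and are not repeated here.
Those skilled in the art will understand that the structure shown in the figure is only a block diagram of the part of the structure related to the solution of this application, and does not limit the computer device to which the solution is applied.
An embodiment of this application also provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the search method based on similarity value, wherein the steps of the method correspond one-to-one to the steps of the search method based on similarity value in the foregoing embodiments, and are not repeated here. The computer-readable storage medium is, for example, a non-volatile computer-readable storage medium or a volatile computer-readable storage medium.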

Claims (20)

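  1. A search method based on similarity value, characterized by comprising:
    obtaining a data strip, and preprocessing the data strip according to a preset preprocessing method to obtain a data strip word sequence;
    retrieving a pre-stored specified standard sentence, and calculating, according to a preset similarity algorithm, the similarity value between the data strip word sequence and the specified standard sentence;
    storing the data strip in a preset database, and adding a similar field to the database, wherein the similar field of the data strip records the similarity value;
    obtaining a search sentence entered by a user, and preprocessing the search sentence to obtain a search word sequence;
    calculating, according to the preset similarity algorithm, the similarity value for search between the search word sequence and the specified standard sentence;
    generating a hit range [similarity value for search - a, similarity value for search + a], retrieving from the database the data strips whose similarity values fall within the hit range, and recording them as target data strips, where a is a preset range parameter and a is a positive number;
    sorting the target data strips according to a preset sorting rule to obtain a data strip sequence, and outputting the data strip sequence.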
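  2. The search method based on similarity value according to claim 1, characterized in that the step of preprocessing the data strip according to a preset preprocessing method to obtain a data strip word sequence comprises:
    performing word segmentation on the data strip to obtain an initial word sequence composed of multiple words;
    querying a preset meaningless-word library to determine whether the initial word sequence contains meaningless words;
    if the initial word sequence contains meaningless words, removing them from the initial word sequence to obtain an intermediate word sequence;
    querying a preset synonym library to determine whether the intermediate word sequence contains a synonym group;
    if the intermediate word sequence contains a synonym group, replacing all words of the synonym group with any one word of that group to obtain the data strip word sequence.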
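  3. The search method based on similarity value according to claim 1, characterized in that there are multiple data strips and multiple data strip word sequences, and before the step of retrieving a pre-stored specified standard sentence and calculating, according to a preset similarity algorithm, the similarity value between the data strip word sequence and the specified standard sentence, the method comprises:
    counting the number of occurrences of each word in the multiple data strip word sequences, obtaining the word with the most occurrences, and recording it as the specified word;
    determining whether the number of occurrences of the specified word is greater than a preset count threshold;
    if the number of occurrences of the specified word is greater than the preset count threshold, obtaining the specified standard sentence corresponding to the specified word according to a preset correspondence between words and standard sentences.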
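  4. The search method based on similarity value according to claim 1, characterized in that the step of retrieving a pre-stored specified standard sentence and calculating, according to a preset similarity algorithm, the similarity value between the data strip word sequence and the specified standard sentence comprises:
    retrieving the pre-stored specified standard sentence;
    querying a preset word vector library to obtain the word vector of each word in the specified standard sentence, thereby obtaining the standard word vector sequence corresponding to the specified standard sentence;
    querying the preset word vector library to obtain the word vector of each word in the data strip word sequence, thereby obtaining the data strip word vector sequence corresponding to the data strip word sequence;
    using a preset distance formula to calculate the distance value between the standard word vector sequence and the data strip word vector sequence, and recording the distance value as the similarity value.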
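  5. The search method based on similarity value according to claim 4, characterized in that the step of using a preset distance formula to calculate the distance value between the standard word vector sequence and the data strip word vector sequence and recording the distance value as the similarity value comprises:
    using the formula:
    Figure PCTCN2019117213-appb-100001
    , subject to
    Figure PCTCN2019117213-appb-100002
    to calculate the distance value between the standard word vector sequence and the data strip word vector sequence, and recording the distance value as the similarity value; where Distance(I,R) is the distance between the standard word vector sequence I and the data strip word vector sequence R; I is the standard word vector sequence; R is the data strip word vector sequence; T_ij is the amount of weight transferred from the i-th word in the standard word vector sequence I to the j-th word in the data strip word vector sequence R; d_i is the word frequency of the i-th word in the standard word vector sequence I; d'_j is the word frequency of the j-th word in the data strip word vector sequence R; c(i,j) is the Euclidean distance between the i-th word in the standard word vector sequence I and the j-th word in the data strip word vector sequence R; m is the number of words in the standard word vector sequence I that have word vectors; and n is the number of words in the data strip word vector sequence R that have word vectors.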
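  6. The search method based on similarity value according to claim 1, characterized in that before the step of generating a hit range [similarity value for search - a, similarity value for search + a], retrieving from the database the data strips whose similarity values fall within the hit range, and recording them as target data strips, where a is a preset range parameter and a is a positive number, the method comprises:
    determining whether the database contains a data strip whose similarity value equals the similarity value for search;
    if the database contains no data strip whose similarity value equals the similarity value for search, obtaining the range parameter a corresponding to the specified standard sentence according to a preset correspondence between standard sentences and range parameters;
    generating a hit range generation instruction, wherein the hit range generation instruction instructs generation of the hit range according to the range parameter a and the similarity value for search.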
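  7. The search method based on similarity value according to claim 1, characterized in that the step of sorting the target data strips according to a preset sorting rule to obtain a data strip sequence and outputting the data strip sequence comprises:
    obtaining the search record of the user, wherein search keywords are recorded in the search record;
    classifying the target data strips into first data strips and second data strips according to whether they contain the search keywords, wherein the first data strips contain the search keywords;
    calculating the absolute value of the difference between the similarity value for search and the similarity value recorded in the similar field of each target data strip;
    sorting the first data strips and the second data strips separately in descending or ascending order of the absolute value to obtain a first data strip sequence and a second data strip sequence;
    combining the first data strip sequence and the second data strip sequence so that the first data strip sequence is displayed first, thereby obtaining the data strip sequence, and outputting the data strip sequence.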
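  8. A search device based on similarity value, characterized by comprising:
    a data strip word sequence obtaining unit, configured to obtain a data strip and preprocess the data strip according to a preset preprocessing method to obtain a data strip word sequence;
    a first similarity value calculation unit, configured to retrieve a pre-stored specified standard sentence and calculate, according to a preset similarity algorithm, the similarity value between the data strip word sequence and the specified standard sentence;
    a storage unit, configured to store the data strip in a preset database and add a similar field to the database, wherein the similar field of the data strip records the similarity value;
    a search word sequence obtaining unit, configured to obtain a search sentence entered by a user and preprocess the search sentence to obtain a search word sequence;
    a second similarity value calculation unit, configured to calculate, according to the preset similarity algorithm, the similarity value for search between the search word sequence and the specified standard sentence;
    a target data strip obtaining unit, configured to generate a hit range [similarity value for search - a, similarity value for search + a], retrieve from the database the data strips whose similarity values fall within the hit range, and record them as target data strips, where a is a preset range parameter and a is a positive number;
    a data strip sequence output unit, configured to sort the target data strips according to a preset sorting rule to obtain a data strip sequence, and output the data strip sequence.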
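  9. The search device based on similarity value according to claim 8, characterized in that the data strip word sequence obtaining unit comprises:
    an initial word sequence obtaining subunit, configured to perform word segmentation on the data strip to obtain an initial word sequence composed of multiple words;
    a meaningless word judging subunit, configured to query a preset meaningless-word library to determine whether the initial word sequence contains meaningless words;
    an intermediate word sequence obtaining subunit, configured to, if the initial word sequence contains meaningless words, remove them from the initial word sequence to obtain an intermediate word sequence;
    a synonym group judging subunit, configured to query a preset synonym library to determine whether the intermediate word sequence contains a synonym group;
    a data strip word sequence obtaining subunit, configured to, if the intermediate word sequence contains a synonym group, replace all words of the synonym group with any one word of that group to obtain the data strip word sequence.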
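  10. The search device based on similarity value according to claim 8, characterized in that there are multiple data strips and multiple data strip word sequences, and the device comprises:
    a specified word obtaining unit, configured to count the number of occurrences of each word in the multiple data strip word sequences, obtain the word with the most occurrences, and record it as the specified word;
    a count threshold judging unit, configured to determine whether the number of occurrences of the specified word is greater than a preset count threshold;
    a specified standard sentence obtaining unit, configured to, if the number of occurrences of the specified word is greater than the preset count threshold, obtain the specified standard sentence corresponding to the specified word according to a preset correspondence between words and standard sentences.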
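  11. The search device based on similarity value according to claim 8, characterized in that the first similarity value calculation unit comprises:
    a specified standard sentence retrieval subunit, configured to retrieve the pre-stored specified standard sentence;
    a standard word vector sequence obtaining subunit, configured to query a preset word vector library to obtain the word vector of each word in the specified standard sentence, thereby obtaining the standard word vector sequence corresponding to the specified standard sentence;
    a data strip word vector sequence obtaining subunit, configured to query the preset word vector library to obtain the word vector of each word in the data strip word sequence, thereby obtaining the data strip word vector sequence corresponding to the data strip word sequence;
    a first similarity value calculation subunit, configured to use a preset distance formula to calculate the distance value between the standard word vector sequence and the data strip word vector sequence, and record the distance value as the similarity value.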
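  12. The search device based on similarity value according to claim 11, characterized in that the first similarity value calculation subunit comprises:
    a first similarity value calculation module, configured to use the formula:
    Figure PCTCN2019117213-appb-100003
    , subject to
    Figure PCTCN2019117213-appb-100004
    to calculate the distance value between the standard word vector sequence and the data strip word vector sequence, and record the distance value as the similarity value; where Distance(I,R) is the distance between the standard word vector sequence I and the data strip word vector sequence R; I is the standard word vector sequence; R is the data strip word vector sequence; T_ij is the amount of weight transferred from the i-th word in the standard word vector sequence I to the j-th word in the data strip word vector sequence R; d_i is the word frequency of the i-th word in the standard word vector sequence I; d'_j is the word frequency of the j-th word in the data strip word vector sequence R; c(i,j) is the Euclidean distance between the i-th word in the standard word vector sequence I and the j-th word in the data strip word vector sequence R; m is the number of words in the standard word vector sequence I that have word vectors; and n is the number of words in the data strip word vector sequence R that have word vectors.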
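  13. The search device based on similarity value according to claim 8, characterized in that the device comprises:
    a data strip judging unit, configured to determine whether the database contains a data strip whose similarity value equals the similarity value for search;
    a range parameter a obtaining unit, configured to, if the database contains no data strip whose similarity value equals the similarity value for search, obtain the range parameter a corresponding to the specified standard sentence according to a preset correspondence between standard sentences and range parameters;
    a hit range generation instruction generating unit, configured to generate a hit range generation instruction, wherein the hit range generation instruction instructs generation of the hit range according to the range parameter a and the similarity value for search.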
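  14. The search device based on similarity value according to claim 8, characterized in that the data strip sequence output unit comprises:
    a search record obtaining subunit, configured to obtain the search record of the user, wherein search keywords are recorded in the search record;
    a data strip classification subunit, configured to classify the target data strips into first data strips and second data strips according to whether they contain the search keywords, wherein the first data strips contain the search keywords;
    an absolute value obtaining subunit, configured to calculate the absolute value of the difference between the similarity value for search and the similarity value recorded in the similar field of each target data strip;
    a data strip arranging subunit, configured to sort the first data strips and the second data strips separately in descending or ascending order of the absolute value to obtain a first data strip sequence and a second data strip sequence;
    a data strip sequence output subunit, configured to combine the first data strip sequence and the second data strip sequence so that the first data strip sequence is displayed first, thereby obtaining the data strip sequence, and output the data strip sequence.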
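  15. A computer device, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements a search method based on similarity value when executing the computer program, the search method based on similarity value comprising:
    obtaining a data strip, and preprocessing the data strip according to a preset preprocessing method to obtain a data strip word sequence;
    retrieving a pre-stored specified standard sentence, and calculating, according to a preset similarity algorithm, the similarity value between the data strip word sequence and the specified standard sentence;
    storing the data strip in a preset database, and adding a similar field to the database, wherein the similar field of the data strip records the similarity value;
    obtaining a search sentence entered by a user, and preprocessing the search sentence to obtain a search word sequence;
    calculating, according to the preset similarity algorithm, the similarity value for search between the search word sequence and the specified standard sentence;
    generating a hit range [similarity value for search - a, similarity value for search + a], retrieving from the database the data strips whose similarity values fall within the hit range, and recording them as target data strips, where a is a preset range parameter and a is a positive number;
    sorting the target data strips according to a preset sorting rule to obtain a data strip sequence, and outputting the data strip sequence.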
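  16. The computer device according to claim 15, characterized in that the step of preprocessing the data strip according to a preset preprocessing method to obtain a data strip word sequence comprises:
    performing word segmentation on the data strip to obtain an initial word sequence composed of multiple words;
    querying a preset meaningless-word library to determine whether the initial word sequence contains meaningless words;
    if the initial word sequence contains meaningless words, removing them from the initial word sequence to obtain an intermediate word sequence;
    querying a preset synonym library to determine whether the intermediate word sequence contains a synonym group;
    if the intermediate word sequence contains a synonym group, replacing all words of the synonym group with any one word of that group to obtain the data strip word sequence.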
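  17. The computer device according to claim 15, characterized in that there are multiple data strips and multiple data strip word sequences, and before the step of retrieving a pre-stored specified standard sentence and calculating, according to a preset similarity algorithm, the similarity value between the data strip word sequence and the specified standard sentence, the method comprises:
    counting the number of occurrences of each word in the multiple data strip word sequences, obtaining the word with the most occurrences, and recording it as the specified word;
    determining whether the number of occurrences of the specified word is greater than a preset count threshold;
    if the number of occurrences of the specified word is greater than the preset count threshold, obtaining the specified standard sentence corresponding to the specified word according to a preset correspondence between words and standard sentences.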
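  18. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program implements a search method based on similarity value when executed by a processor, the search method based on similarity value comprising:
    obtaining a data strip, and preprocessing the data strip according to a preset preprocessing method to obtain a data strip word sequence;
    retrieving a pre-stored specified standard sentence, and calculating, according to a preset similarity algorithm, the similarity value between the data strip word sequence and the specified standard sentence;
    storing the data strip in a preset database, and adding a similar field to the database, wherein the similar field of the data strip records the similarity value;
    obtaining a search sentence entered by a user, and preprocessing the search sentence to obtain a search word sequence;
    calculating, according to the preset similarity algorithm, the similarity value for search between the search word sequence and the specified standard sentence;
    generating a hit range [similarity value for search - a, similarity value for search + a], retrieving from the database the data strips whose similarity values fall within the hit range, and recording them as target data strips, where a is a preset range parameter and a is a positive number;
    sorting the target data strips according to a preset sorting rule to obtain a data strip sequence, and outputting the data strip sequence.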
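  19. The computer-readable storage medium according to claim 18, characterized in that the step of preprocessing the data strip according to a preset preprocessing method to obtain a data strip word sequence comprises:
    performing word segmentation on the data strip to obtain an initial word sequence composed of multiple words;
    querying a preset meaningless-word library to determine whether the initial word sequence contains meaningless words;
    if the initial word sequence contains meaningless words, removing them from the initial word sequence to obtain an intermediate word sequence;
    querying a preset synonym library to determine whether the intermediate word sequence contains a synonym group;
    if the intermediate word sequence contains a synonym group, replacing all words of the synonym group with any one word of that group to obtain the data strip word sequence.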
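  20. The computer-readable storage medium according to claim 18, characterized in that there are multiple data strips and multiple data strip word sequences, and before the step of retrieving a pre-stored specified standard sentence and calculating, according to a preset similarity algorithm, the similarity value between the data strip word sequence and the specified standard sentence, the method comprises:
    counting the number of occurrences of each word in the multiple data strip word sequences, obtaining the word with the most occurrences, and recording it as the specified word;
    determining whether the number of occurrences of the specified word is greater than a preset count threshold;
    if the number of occurrences of the specified word is greater than the preset count threshold, obtaining the specified standard sentence corresponding to the specified word according to a preset correspondence between words and standard sentences.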
PCT/CN2019/117213 2019-09-06 2019-11-11 Search method, device, computer equipment and storage medium based on similarity value WO2021042526A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910844343.9A CN110737751B (zh) 2019-09-06 2019-09-06 Search method, device, computer equipment and storage medium based on similarity value
CN201910844343.9 2019-09-06

Publications (1)

Publication Number Publication Date
WO2021042526A1 true WO2021042526A1 (zh) 2021-03-11

Family

ID=69267513

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117213 WO2021042526A1 (zh) 2019-09-06 2019-11-11 Search method, device, computer equipment and storage medium based on similarity value

Country Status (2)

Country Link
CN (1) CN110737751B (zh)
WO (1) WO2021042526A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112668632B (zh) * 2020-12-25 2022-04-08 浙江大华技术股份有限公司 一种数据处理方法、装置、计算机设备及存储介质
CN113535824A (zh) * 2021-07-27 2021-10-22 杭州海康威视数字技术股份有限公司 数据搜索方法、装置、电子设备及存储介质
CN114064738B (zh) * 2022-01-14 2022-04-29 杭州捷配信息科技有限公司 电子元器件替料查找方法、装置及应用

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105678324A (zh) * 2015-12-31 2016-06-15 上海智臻智能网络科技股份有限公司 基于相似度计算的问答知识库的建立方法、装置及系统
CN106372208A (zh) * 2016-09-05 2017-02-01 东南大学 一种基于语句相似度的话题观点聚类方法
US20170262530A1 (en) * 2016-03-09 2017-09-14 Fujitsu Limited Search apparatus and search method
CN109635275A (zh) * 2018-11-06 2019-04-16 交控科技股份有限公司 文献内容检索与识别方法及装置
CN109740143A (zh) * 2018-11-28 2019-05-10 平安科技(深圳)有限公司 基于机器学习的句子距离映射方法、装置和计算机设备
CN109766429A (zh) * 2019-02-19 2019-05-17 北京奇艺世纪科技有限公司 一种语句检索方法及装置

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7334195B2 (en) * 2003-10-14 2008-02-19 Microsoft Corporation System and process for presenting search results in a histogram/cluster format
US8606779B2 (en) * 2006-09-14 2013-12-10 Nec Corporation Search method, similarity calculation method, similarity calculation, same document matching system, and program thereof
CN106649868B (zh) * 2016-12-30 2019-03-26 首都师范大学 问答匹配方法及装置
CN109582966A (zh) * 2018-12-03 2019-04-05 北京容联易通信息技术有限公司 一种信息匹配方法及装置
CN110059155A (zh) * 2018-12-18 2019-07-26 阿里巴巴集团控股有限公司 文本相似度的计算、智能客服系统的实现方法和装置

Also Published As

Publication number Publication date
CN110737751B (zh) 2023-10-20
CN110737751A (zh) 2020-01-31

Similar Documents

Publication Publication Date Title
WO2020108608A1 (zh) 搜索结果处理方法、装置、终端、电子设备及存储介质
WO2020182019A1 (zh) 图像检索方法、装置、设备及计算机可读存储介质
WO2020143326A1 (zh) 知识数据存储方法、装置、计算机设备和存储介质
WO2020143184A1 (zh) 知识融合方法、装置、计算机设备和存储介质
JP6073345B2 (ja) 検索結果をランク付けする方法および装置ならびに検索方法および装置
WO2021042526A1 (zh) 基于相似度值的搜索方法、装置、计算机设备和存储介质
US20060248074A1 (en) Term-statistics modification for category-based search
CN106021364A (zh) 图片搜索相关性预测模型的建立、图片搜索方法和装置
CN110765277B (zh) 一种基于知识图谱的移动端的在线设备故障诊断方法
CN110569289B (zh) 基于大数据的列数据处理方法、设备及介质
CN108509521B (zh) 一种自动生成文本索引的图像检索方法
US20090307209A1 (en) Term-statistics modification for category-based search
US20220277005A1 (en) Semantic parsing of natural language query
WO2018090468A1 (zh) 视频节目的搜索方法和装置
CN111625621B (zh) 一种文档检索方法、装置、电子设备及存储介质
CN108595546B (zh) 基于半监督的跨媒体特征学习检索方法
CN112182145A (zh) 文本相似度确定方法、装置、设备和存储介质
CN113342923A (zh) 数据查询方法、装置、电子设备及可读存储介质
CN106570196B (zh) 视频节目的搜索方法和装置
CN117149804A (zh) 数据处理方法、装置、电子设备及存储介质
WO2023130688A1 (zh) 一种自然语言处理方法、装置、设备及可读存储介质
CN110688559A (zh) 一种检索方法及装置
CN113420564B (zh) 一种基于混合匹配的电力铭牌语义结构化方法及系统
CN111339303B (zh) 一种基于聚类与自动摘要的文本意图归纳方法及装置
CN114328863A (zh) 一种基于高斯核函数的长文本检索方法及系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19944476

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19944476

Country of ref document: EP

Kind code of ref document: A1