WO2021051557A1 - Keyword determination method, device and storage medium based on semantic recognition - Google Patents

Keyword determination method, device and storage medium based on semantic recognition

Publication number
WO2021051557A1
Authority
WO
WIPO (PCT)
Prior art keywords
search
preset
search term
word
candidate index
Prior art date
Application number
PCT/CN2019/117577
Other languages
English (en)
French (fr)
Inventor
张师琲
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021051557A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3346: Query execution using probabilistic model
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/355: Class or cluster creation or modification

Definitions

  • This application relates to the field of natural language processing technology, and in particular to a method, device and storage medium for determining keywords based on semantic recognition.
  • the mainstream method for determining keywords is to extract keywords in sentences input by users, and use keyword matching technology to extract data with the highest matching degree from a database as search results and feed them back to users.
  • However, the above search methods have certain defects in how keywords are defined. If the keywords are words with similar glyphs or polysemous words, the keywords cannot be accurately defined, resulting in deviations in search results.
  • The main purpose of this application is to provide a method, device and storage medium for keyword determination based on semantic recognition, aiming to solve the technical problem that existing keyword determination methods cannot accurately define keywords, which results in low accuracy.
  • this application provides a method for determining keywords based on semantic recognition, which includes the following steps:
  • The present application also provides a device, the device including: a memory, a processor, and computer-readable instructions stored on the memory and capable of running on the processor. When the computer-readable instructions are executed by the processor, the steps of the keyword determination method based on semantic recognition as described above are implemented.
  • The present application also provides a non-volatile computer-readable storage medium that stores computer-readable instructions. When the computer-readable instructions are executed by a processor, the steps of the keyword determination method based on semantic recognition as described above are implemented.
  • This application discloses a method, a device and a storage medium for determining keywords based on semantic recognition.
  • The method first obtains a search sentence input by a user, performs word segmentation on the search sentence, and extracts the feature vector of each word after segmentation; inputs the feature vectors into a trained multi-class perceptron to obtain the corresponding character tagging results, and obtains the corresponding search terms according to the character tagging results; and inputs the search terms into a preset index library for query to obtain the corresponding candidate index items.
  • It then determines the inverse document frequency (IDF) of the search terms in the preset index library according to the candidate index items; inputs the inverse document frequency, the search terms and the candidate index items into a preset similarity algorithm to determine the similarity value between each candidate index item and the corresponding search terms; and determines the keywords based on the similarity value.
  • FIG. 1 is a schematic diagram of the device structure of the hardware operating environment involved in the solution of the embodiment of the present application;
  • FIG. 2 is a schematic flowchart of an embodiment of a method for determining keywords based on semantic recognition according to this application;
  • FIG. 3 is a schematic flowchart of another embodiment of a keyword determination method based on semantic recognition in this application.
  • FIG. 4 is a detailed flow diagram of the steps of inputting the search term into the preset index database for query to obtain the corresponding candidate index item according to the application;
  • FIG. 5 is a detailed flow diagram of the step of determining the inverse document frequency of the search term in the preset index library according to the candidate index item.
  • FIG. 1 is a schematic diagram of a terminal structure of a hardware operating environment involved in a solution of an embodiment of the present application.
  • the terminal of this application is a device, and the device may be a terminal device with a storage function such as a mobile phone, a computer, or a mobile computer.
  • the terminal may include: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005.
  • the communication bus 1002 is used to implement connection and communication between these components.
  • the user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface.
  • the network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • the memory 1005 can be a high-speed RAM memory or a stable memory (non-volatile memory), such as disk storage.
  • the memory 1005 may also be a storage device independent of the aforementioned processor 1001.
  • the terminal may also include a camera, a Wi-Fi module, etc., which will not be repeated here.
  • Those skilled in the art will understand that the terminal structure shown in FIG. 1 does not constitute a limitation on the terminal, which may include more or fewer components than shown in the figure, combine some components, or arrange the components differently.
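  • In the terminal shown in FIG. 1, the network interface 1004 is mainly used to connect to a back-end server and exchange data with it; the user interface 1003 mainly includes an input unit such as a keyboard (wireless or wired) and is used to connect to a client and exchange data with it; and the processor 1001 can be used to call computer-readable instructions stored in the memory 1005 and perform the following operations:
  • obtain the search sentence input by the user, perform word segmentation on the search sentence, and extract the feature vector of each word after segmentation;
  • input the feature vector into the trained multi-class perceptron to obtain the corresponding character tagging result, and obtain the corresponding search term according to the character tagging result;
  • input the search term into the preset index library for query to obtain the corresponding candidate index item;
  • determine the inverse document frequency of the search term in the preset index library according to the candidate index item;
  • input the inverse document frequency, the search term and the candidate index item into a preset similarity algorithm, determine the similarity value between the candidate index item and the corresponding search term, and determine the keyword according to the similarity value.
  • Further, the processor 1001 may call computer-readable instructions stored in the memory 1005 and also perform the following operations:
  • input the training sentence into a preset feature module to extract the training feature vector of the training sentence;
  • use the training feature vector of the training sentence as the training sample of the multi-class perceptron to obtain the trained multi-class perceptron.
  • Further, the processor 1001 may call computer-readable instructions stored in the memory 1005 and also perform the following operations:
  • input the feature vector into the trained multi-class perceptron to obtain the label position corresponding to each feature vector;
  • at the label position corresponding to each feature vector, label each feature vector with the preset word formation position information to obtain the corresponding character tagging result.
  • Further, the processor 1001 may call computer-readable instructions stored in the memory 1005 and also perform the following operations:
  • perform word segmentation on the search sentence according to the word formation position information to obtain the corresponding search term set;
  • input the search term set into a preset part-of-speech tagging algorithm, determine the part of speech of each word in the search term set, and determine the words whose part of speech is a preset search part of speech as search terms.
  • Further, the processor 1001 may call computer-readable instructions stored in the memory 1005 and also perform the following operations:
  • input the search term into the preset index library and determine the core word in the index library that corresponds to the search term;
  • use the index item corresponding to the core word in the index library as the candidate index item.
  • Further, the processor 1001 may call computer-readable instructions stored in the memory 1005 and also perform the following operations:
  • determine the number of candidate index items and the number of all index items in the preset index library;
  • divide the number of candidate index items by the number of all index items, and take the logarithm of the obtained quotient to obtain the inverse document frequency corresponding to the search term.
  • Further, the processor 1001 may call computer-readable instructions stored in the memory 1005 and also perform the following operations:
  • determine the number of search terms contained in the candidate index item, and use that number as the search term count;
  • calculate the similarity value of the candidate index item according to the search term count and the inverse document frequency.
  • Further, the processor 1001 may call computer-readable instructions stored in the memory 1005 and also perform the following operations:
  • determine the similarity value of each candidate index item, and determine the candidate index item with the highest similarity value as the keyword.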
  • the optional embodiments of the device are basically the same as the following embodiments of the keyword determination method based on semantic recognition, and will not be repeated here.
  • FIG. 2 is a schematic flowchart of an embodiment of a method for determining keywords based on semantic recognition according to this application.
  • the method for determining keywords based on semantic recognition provided in this embodiment includes the following steps:
  • Step S10 Obtain the search sentence input by the user, perform word segmentation on the search sentence, and extract the feature vector of each word after the word segmentation;
  • In this embodiment, the search sentence input by the user is obtained first. The sentence typed by the user on the search interface can be used as the search sentence, or the search sentence can be obtained by running speech recognition on audio recorded by the user.
  • The search sentence input by the user may also be obtained in other ways, which is not limited in this embodiment.
  • Optionally, after the search sentence is obtained, an NLP algorithm or a feature template extraction algorithm can be used to segment it, and the feature vector corresponding to each word after segmentation is constructed.
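  • As a rough illustration of feature-template extraction, the sketch below builds a sparse feature vector for each character from unigram and bigram context templates. The template names (`U0`, `U-1`, `B-1` and so on), the `<PAD>` marker, and the function names are assumptions made for illustration; the application does not say which templates are used.

```python
from typing import Dict, List

PAD = "<PAD>"  # hypothetical marker for positions outside the sentence


def char_at(sentence: str, j: int) -> str:
    """Return the character at position j, or PAD when j is out of range."""
    return sentence[j] if 0 <= j < len(sentence) else PAD


def char_features(sentence: str, i: int) -> Dict[str, int]:
    """Sparse feature vector for the character at position i."""
    prev_c, cur_c, next_c = (char_at(sentence, i - 1),
                             char_at(sentence, i),
                             char_at(sentence, i + 1))
    return {
        f"U0={cur_c}": 1,            # current character
        f"U-1={prev_c}": 1,          # previous character
        f"U+1={next_c}": 1,          # next character
        f"B-1={prev_c}{cur_c}": 1,   # previous + current bigram
        f"B+1={cur_c}{next_c}": 1,   # current + next bigram
    }


def extract_features(sentence: str) -> List[Dict[str, int]]:
    """One feature vector per character of the search sentence."""
    return [char_features(sentence, i) for i in range(len(sentence))]
```

  • The same templates would have to be applied to the training sentences and to the search sentence, so that both produce features in the same space.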
  • Step S20 Input the feature vector into the trained multi-class perceptron to obtain the corresponding character tagging result, and obtain the corresponding search term according to the character tagging result;
  • In this embodiment, a plurality of perceptrons of different classes are also preset.
  • After the feature vector corresponding to the search sentence is obtained, it is input into the multi-class perceptron. Because each perceptron treats only one class of target as positive examples and all other targets as negative examples, the multi-class perceptron can first be trained on sample data.
  • The feature vector is input into the trained multi-class perceptron to obtain the corresponding character tagging result, and the corresponding search term is obtained according to the character tagging result. The character tagging result refers to the label assigned to the position of each character in the search sentence.
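  • The one-versus-rest arrangement described above can be sketched as one weight vector per tag, with the predicted tag being the one whose perceptron scores highest. The weight layout, the tag names, and the function names below are illustrative assumptions rather than details taken from the application.

```python
from typing import Dict

Features = Dict[str, int]
Weights = Dict[str, Dict[str, float]]  # tag -> feature name -> weight


def score(tag_weights: Dict[str, float], features: Features) -> float:
    """Dot product between one perceptron's weights and a sparse feature vector."""
    return sum(tag_weights.get(name, 0.0) * value
               for name, value in features.items())


def predict_tag(weights: Weights, features: Features) -> str:
    """Pick the tag whose perceptron gives the highest score.

    Tags are visited in sorted order so that ties are broken deterministically.
    """
    return max(sorted(weights), key=lambda tag: score(weights[tag], features))
```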
  • Step S30 input the search term into the preset index database for query, and obtain the corresponding candidate index item;
  • In this embodiment, an index library is also preset, and the mapping relationship between search terms and candidate index items is stored in the index library.
  • The search terms are input into the preset index library to obtain the candidate index items corresponding to the search terms.
  • Step S40 Determine the inverse document frequency (IDF) of the search term in the preset index library according to the candidate index item;
  • The inverse document frequency reflects how important the obtained candidate index items are in the overall retrieval process. Therefore, after the candidate index items are obtained, the inverse document frequency corresponding to them is obtained from the number of all index items in the preset index library, in order to determine the importance of the search term.
  • Step S50 Input the inverse document frequency, the search term and the candidate index item into a preset similarity algorithm, determine the similarity value between the candidate index item and the corresponding search term, and determine the keyword according to the similarity value.
  • A similarity algorithm is also preset in this embodiment. The similarity value of each candidate index item is calculated from the inverse document frequency, the search term, and the candidate index item; optionally, the candidate index item with the highest similarity value is determined as the keyword.
  • Further, the multi-class perceptron includes a plurality of training sentences, and after step S10 extracts the feature vector of each word after segmentation, the method further includes:
  • Step S60 Input the training sentence into a preset feature module to extract the training feature vector of the training sentence;
  • The perceptron has corresponding training samples.
  • The training samples take the form of training sentences.
  • The feature template used to train the perceptron should be of the same type as the feature template used to obtain the word feature vectors.
  • Step S70 Use the training feature vector of the training sentence as the training sample of the multi-class perceptron to obtain the trained multi-class perceptron.
  • the keywords in the search sentence can be accurately determined.
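  • Steps S60 and S70 can be pictured with the standard multi-class perceptron update: when the predicted tag is wrong, the weights of the correct tag are raised and those of the wrongly predicted tag are lowered. This is a minimal sketch under assumptions; the tag set `TAGS`, the number of epochs, and the sample format are not specified in the application.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

TAGS = ("A", "M", "E", "I")  # assumed tag set: initial, middle, final, single
Features = Dict[str, int]
Sample = Tuple[Features, str]  # (training feature vector, gold tag)


def predict(weights: Dict[str, Dict[str, float]], feats: Features) -> str:
    """Return the tag with the highest score (first tag in TAGS wins ties)."""
    def score(tag: str) -> float:
        return sum(weights[tag].get(f, 0.0) * v for f, v in feats.items())
    return max(TAGS, key=score)


def train(samples: List[Sample], epochs: int = 5) -> Dict[str, Dict[str, float]]:
    """Train one weight vector per tag with the perceptron update rule."""
    weights: Dict[str, Dict[str, float]] = {tag: defaultdict(float) for tag in TAGS}
    for _ in range(epochs):
        for feats, gold in samples:
            guess = predict(weights, feats)
            if guess != gold:
                for f, v in feats.items():
                    weights[gold][f] += v    # reward the correct tag
                    weights[guess][f] -= v   # penalize the wrong guess
    return weights
```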
  • the step of inputting the feature vector into the trained multi-class perceptron to obtain the corresponding word tagging result includes:
  • Step S21 Input the feature vector into the trained multi-type perceptron to obtain the label position corresponding to each feature vector;
  • In this embodiment, the label position of each feature vector is obtained first, and the feature vector is then labeled at that position to obtain its character tagging result.
  • The number of label positions for each character corresponds to the number of kinds of word formation position information.
  • For example, if four kinds of word formation position information are preset, namely word-initial, word-middle, word-final, and single-character-word position information, each character in the feature vector corresponds to 4 label positions.
  • Step S22 At the labeling position corresponding to each feature vector, use the preset word formation position information to label each feature vector to obtain the corresponding character labeling result.
  • The word formation position information is word-initial position information (A), word-middle position information (M), word-final position information (E), and single-character-word position information (I). It should be understood that the word formation position information in this embodiment may also include other position information suitable for labeling feature vectors, which is not limited in this embodiment.
  • After the label position of the feature vector is obtained, the feature vector is labeled at that position with the word-initial, word-middle, word-final, or single-character-word position information to obtain the character tagging result of the search sentence. To illustrate this embodiment further, consider the following example:
  • The search sentence is: "What is the amount of fixed asset investment completed this quarter?" (本季度固定资产投资完成额是多少).
  • The character tagging result obtained through the multi-class perceptron is: 本/I 季/A 度/E 固/A 定/M 资/M 产/E 投/A 资/E 完/A 成/M 额/E 是/I 多/A 少/E.
  • The character tagging result corresponding to the search sentence is obtained in this way, and the multi-class perceptron classifier makes a preliminary division of the segmented words. Compared with traditional word segmentation, this better reflects the contextual semantics of the words in the sentence, so the division into words is more precise.
  • the step of obtaining the corresponding search term according to the character tagging result includes:
  • Step S23 Perform word segmentation on the search sentence according to the word formation position information to obtain the corresponding search term set;
  • The search sentence is segmented according to the word formation position information and the character tagging result, and the words obtained after segmentation are used as the search term set.
  • Here the word formation position information consists of word-initial position information A, word-middle position information M, word-final position information E, and single-character-word position information I.
  • Take the search sentence "What is the amount of fixed asset investment completed this quarter?" as an example. After it passes through the multi-class perceptron, the character tagging result is: 本/I 季/A 度/E 固/A 定/M 资/M 产/E 投/A 资/E 完/A 成/M 额/E 是/I 多/A 少/E.
  • A character marked "I" can be used as a search term by itself, and the two, three, or more characters marked "AE", "AME", or "AM...ME" together form one search term.
  • The search term set corresponding to the above search sentence is: current, quarter, fixed assets, investment completed, yes, how much.
  • Optionally, the characters marked "I" may be left out of the search term set.
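  • Step S23 amounts to decoding the per-character tags back into words. The sketch below assumes the four tags A (word-initial), M (word-middle), E (word-final) and I (single-character word); the function name and the handling of malformed tag sequences are illustrative choices, and the Chinese characters in the usage line are assumed from the example sentence.

```python
from typing import List


def decode_tags(chars: str, tags: str) -> List[str]:
    """Group characters into words according to their A/M/E/I tags."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words: List[str] = []
    buffer = ""  # characters of the word currently being built
    for ch, tag in zip(chars, tags):
        if tag == "I":            # single-character word
            if buffer:            # flush an unfinished word, if any
                words.append(buffer)
                buffer = ""
            words.append(ch)
        elif tag == "A":          # start of a multi-character word
            if buffer:
                words.append(buffer)
            buffer = ch
        elif tag == "M":          # middle of a word
            buffer += ch
        elif tag == "E":          # end of a word
            words.append(buffer + ch)
            buffer = ""
        else:
            raise ValueError(f"unknown tag: {tag}")
    if buffer:                    # sequence ended without an E tag
        words.append(buffer)
    return words


# Usage: the first seven characters of the example sentence.
print(decode_tags("本季度固定资产", "IAEAMME"))
```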
  • Step S24 input the search term set into a preset part of speech tagging algorithm, determine the part of speech of each word in the search term set, and determine the word whose part of speech is the preset search part of speech as the search term.
  • the search sentence is generally a complete sentence, containing many words of different parts of speech. Among them, some key words often represent the main meaning of a sentence, such as nouns and adjectives. These words of part of speech are likely to be search terms. Therefore, in this proposal, it is necessary to perform part-of-speech analysis on the words in the search term set to obtain the key words of the search sentence, that is, the search term.
  • This embodiment also presets a part-of-speech tagging algorithm.
  • The part-of-speech tagging of an NLP algorithm can be used to determine the part of speech of each word; alternatively, the CLAWS (Constituent-Likelihood Automatic Word-tagging System) algorithm or the VOLSUNGA algorithm can be used to determine the part of speech of each word in the search term set.
  • CLAWS and VOLSUNGA are statistical part-of-speech tagging algorithms, which assign parts of speech according to co-occurrence probability.
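  • Once each word has a part of speech, Step S24 reduces to a filter. The sketch below stands in for the tagger with a hand-written lookup table, so `TOY_POS`, the part-of-speech labels, and `SEARCH_POS` are all hypothetical; a real system would obtain the tags from a tagger such as the ones named above.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical preset search parts of speech (nouns and adjectives).
SEARCH_POS: Set[str] = {"noun", "adjective"}

# Hypothetical tagger output for the example search term set.
TOY_POS: Dict[str, str] = {
    "current": "determiner",
    "quarter": "noun",
    "fixed assets": "noun",
    "investment completed": "noun",
    "yes": "verb",
    "how much": "pronoun",
}


def select_search_terms(words: Iterable[str],
                        pos_of: Dict[str, str],
                        search_pos: Set[str] = SEARCH_POS) -> List[str]:
    """Keep only the words whose part of speech is a preset search part of speech."""
    return [w for w in words if pos_of.get(w) in search_pos]
```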
  • a plurality of index items and corresponding core words are stored in the index library, and the step of inputting the search terms into a preset index library for query, and obtaining corresponding candidate index items includes:
  • Step S31 input the search term into a preset index library, and determine the core word corresponding to the search term in the index library;
  • In this embodiment, an index library is preset, and the index items and their corresponding core words are stored in the index library.
  • Index items and core words are not in one-to-one correspondence: multiple index items may correspond to the same core word. A core word can be a word extracted directly from an index item, or a word that the user has assigned to the index item.
  • For example, the core word corresponding to the index item "Fixed Asset Investment Completed" is "investment completed".
  • Step S32 Use the index item corresponding to the core word in the index library as the candidate index item.
  • After the core word corresponding to the search term is determined, the index item corresponding to that core word is looked up in the preset index library and used as the candidate index item. Because one core word in the index library may correspond to multiple index items, there may also be multiple candidate index items.
  • the candidate index items corresponding to the search words are determined in the above-mentioned manner, avoiding directly using multiple search words to determine the keywords of the search sentence, thereby reducing the amount of calculation in the keyword determination process.
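  • Steps S31 and S32 can be sketched as a lookup through a core-word map. The sketch assumes that a search term corresponds to a core word when the two strings are equal, which is only one possible reading; the library contents (in particular the "Gross Domestic Product" entry) and the function names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical index library: index item -> core word.
INDEX_LIBRARY: Dict[str, str] = {
    "Fixed Asset Investment Completed": "investment completed",
    "Fixed Asset Investment Completed in the Whole Society": "investment completed",
    "Gross Domestic Product": "gross domestic product",
}


def build_core_word_map(library: Dict[str, str]) -> Dict[str, List[str]]:
    """Invert the library so each core word lists all of its index items."""
    core_map: Dict[str, List[str]] = defaultdict(list)
    for index_item, core_word in library.items():
        core_map[core_word].append(index_item)
    return dict(core_map)


def candidate_index_items(search_terms: Iterable[str],
                          core_map: Dict[str, List[str]]) -> List[str]:
    """Collect the index items of every core word matched by a search term."""
    candidates: List[str] = []
    for term in search_terms:
        for item in core_map.get(term, []):
            if item not in candidates:  # keep order, avoid duplicates
                candidates.append(item)
    return candidates
```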
  • The step of determining the inverse document frequency of the search term in a preset index library according to the candidate index item includes:
  • Step S41 Determine the number of candidate index items and the number of all index items in the preset index library;
  • After the candidate index items are obtained, the keywords in the search sentence are determined.
  • The degree of similarity between a candidate index item and the search sentence is determined by the number of search terms the candidate index item contains and by the importance of those search terms, where the number of search terms contained in the candidate index item is related to the inverse document frequency. To obtain the inverse document frequency corresponding to the candidate index items and the search sentence, the number of candidate index items and the number of all index items in the preset index library are determined first.
  • Step S42 Divide the number of candidate index items by the number of all index items, and take the logarithm of the obtained quotient to obtain the inverse document frequency corresponding to the search term.
  • The inverse document frequency reflects how well a candidate index item discriminates.
  • The higher the discrimination of a candidate index item, the higher its importance, and the more likely it is to be determined as a keyword.
  • The inverse document frequency can be obtained by dividing the total number of index items in the index item set by the number of index items in that set that contain the search term, and then taking the logarithm of the obtained quotient.
  • The inverse document frequency corresponding to the candidate index item is determined in this way, so as to determine the importance of the search term and then the similarity of each candidate index item.
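  • The calculation can be sketched as follows. The sketch follows the explanatory wording above (total number of index items divided by the number of index items containing the search term, then the logarithm), which matches the usual definition of IDF; Step S42 as worded divides in the opposite order, which would give the negative of this value. The substring test for "contains", the natural logarithm, and the return value of 0.0 for an unmatched term are assumptions.

```python
import math
from typing import Sequence


def inverse_document_frequency(term: str, index_items: Sequence[str]) -> float:
    """log(total index items / index items containing the term)."""
    containing = sum(1 for item in index_items if term in item)
    if containing == 0:
        return 0.0  # assumed convention: an unmatched term carries no weight
    return math.log(len(index_items) / containing)


# Usage with a made-up four-item index library.
items = [
    "fixed assets investment completed",
    "whole society fixed assets investment completed",
    "gross domestic product",
    "consumer price index",
]
print(inverse_document_frequency("fixed assets", items))   # log(4 / 2)
print(inverse_document_frequency("whole society", items))  # log(4 / 1)
```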
  • The step of inputting the inverse document frequency, the search term, and the candidate index item into a preset similarity algorithm to obtain the corresponding similarity value includes:
  • Step S51 Determine the number of search terms contained in the candidate index item, and use that number as the search term count;
  • The number of matches between each candidate index item and the search terms is counted.
  • The more search terms a candidate index item matches, the higher the similarity of that candidate index item.
  • The number of search terms contained in each candidate index item is determined and taken as the search term count.
  • For example, the candidate index item "Fixed Asset Investment Completed in the Whole Society" contains the search terms "whole society", "fixed assets" and "investment completed", while the candidate index item "Fixed Asset Investment Completed" contains only the search terms "fixed assets" and "investment completed"; the former therefore contains more search terms than the latter.
  • Step S52 Calculate the similarity value of the candidate index item according to the search term count and the inverse document frequency.
  • The similarity value of the candidate index item is obtained from these two quantities.
  • Optionally, the TF-IDF algorithm can be used to calculate the similarity of each candidate index item.
  • The TF-IDF algorithm works by raising the weight of words that carry more information, as judged from contextual semantics, and lowering the weight of frequently repeated words, thereby emphasizing the information entropy carried by the vocabulary itself.
  • This embodiment determines the similarity value of each candidate index item from the number of search terms it contains and the inverse document frequency. Compared with traditional keyword matching, using these two measures to determine the similarity of candidate index items makes the result of keyword determination more accurate.
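  • The application does not spell out the similarity formula, so the sketch below shows only one plausible combination of the two measures: the IDF values of the search terms that a candidate index item contains are summed, so both the number of matched terms and their importance raise the score. The formula, the substring matching, and all names are assumptions.

```python
import math
from typing import Sequence


def idf(term: str, index_items: Sequence[str]) -> float:
    """log(total index items / index items containing the term); 0.0 if none."""
    containing = sum(1 for item in index_items if term in item)
    return math.log(len(index_items) / containing) if containing else 0.0


def similarity(candidate: str,
               search_terms: Sequence[str],
               index_items: Sequence[str]) -> float:
    """Sum of IDF values over the search terms contained in the candidate."""
    matched = [term for term in search_terms if term in candidate]
    return sum(idf(term, index_items) for term in matched)


# Usage with a made-up four-item index library.
items = [
    "fixed assets investment completed",
    "whole society fixed assets investment completed",
    "gross domestic product",
    "consumer price index",
]
terms = ["whole society", "fixed assets", "investment completed"]
for candidate in items[:2]:
    print(candidate, similarity(candidate, terms, items))
```

  • With this toy data the candidate that also contains "whole society" scores higher, which mirrors the example above in which it contains more search terms.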
  • The step of determining keywords according to the similarity value includes:
  • Step S53 Determine the similarity value of each candidate index item, and determine the candidate index item with the highest similarity value as a keyword;
  • The candidate index item with the highest similarity value is used as the keyword, which completes the determination of the keyword in the search sentence.
  • Where several candidate index items share the highest similarity value, they can all be used as keywords of the search sentence at the same time.
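  • Step S53 can be sketched as a maximum selection that keeps ties, on the reading that several candidates with the same top similarity value are all retained as keywords. The tolerance `tol` used to compare floating-point scores and the function name are assumptions.

```python
from typing import Dict, List


def select_keywords(scores: Dict[str, float], tol: float = 1e-9) -> List[str]:
    """Return every candidate index item whose score equals the maximum."""
    if not scores:
        return []
    best = max(scores.values())
    return [item for item, value in scores.items() if abs(value - best) <= tol]
```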
  • The embodiment of the present application also proposes a non-volatile computer-readable storage medium that stores computer-readable instructions; when the computer-readable instructions are executed by a processor, the operations of the keyword determination method based on semantic recognition as described above are implemented.
  • the optional embodiments of the non-volatile computer-readable storage medium of the present application are basically the same as the above embodiments of the keyword determination method based on semantic recognition, and will not be repeated here.
  • The methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, or by hardware alone, but in many cases the former is the better implementation.
  • On this understanding, the technical solution of this application, in essence or in the part that contributes over the existing technology, can be embodied in the form of a software product.
  • The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and includes several instructions that enable a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the method described in each embodiment of the present application.

Abstract

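A keyword determination method, device and storage medium based on semantic recognition. The method includes the following steps: obtaining a search sentence input by a user, performing word segmentation on the search sentence, and extracting the feature vector of each word after segmentation (S10); inputting the feature vectors into a trained multi-class perceptron to obtain the corresponding character tagging results, and obtaining the corresponding search terms according to the character tagging results (S20); inputting the search terms into a preset index library for query to obtain the corresponding candidate index items (S30); determining the inverse document frequency of the search terms in the preset index library according to the candidate index items (S40); and inputting the inverse document frequency, the search terms and the candidate index items into a preset similarity algorithm, determining the similarity value between each candidate index item and the corresponding search terms, and determining the keyword according to the similarity value (S50).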

Description

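Keyword determination method, device and storage medium based on semantic recognition
This application claims priority to Chinese patent application No. 201910884362.4, filed with the Chinese Patent Office on September 18, 2019 and entitled "Keyword determination method, device and storage medium based on semantic recognition", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of natural language processing technology, and in particular to a keyword determination method, device and storage medium based on semantic recognition.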
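Background
With the expansion of online information and the growth in the number of Internet users, people place higher demands on the timeliness and accuracy of the information they obtain online, and search software and search engines have emerged in response. At present, the mainstream keyword determination method is to extract the keywords from the sentence input by the user and to use keyword matching to retrieve the best-matching data from a database, which is returned to the user as the search result.
However, this search approach has certain defects in how keywords are defined. If the keywords are words with similar glyphs or polysemous words, they cannot be defined accurately, which leads to deviations in the search results.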
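Summary
The main purpose of this application is to provide a keyword determination method, device and storage medium based on semantic recognition, aiming to solve the technical problem that existing keyword determination methods cannot accurately define keywords, which results in low accuracy.
To achieve the above purpose, this application provides a keyword determination method based on semantic recognition, including the following steps:
obtaining a search sentence input by a user, performing word segmentation on the search sentence using an NLP algorithm or a feature template extraction algorithm, and extracting the feature vector of each word after segmentation;
inputting the training sentences of the multi-class perceptron into a preset feature module to extract the training feature vectors of the training sentences;
using the training feature vectors of the training sentences as the training samples of the multi-class perceptron to obtain a trained multi-class perceptron;
inputting the feature vectors into the trained multi-class perceptron to obtain the corresponding character tagging results, and obtaining the corresponding search terms according to the character tagging results;
inputting the search terms into a preset index library for query to obtain the corresponding candidate index items, wherein the preset index library stores the mapping relationship between search terms and candidate index items;
determining the inverse document frequency of the search terms in the preset index library according to the candidate index items;
inputting the inverse document frequency, the search terms and the candidate index items into a preset similarity algorithm, determining the similarity value between each candidate index item and the corresponding search terms, and determining the keyword according to the similarity value.
In addition, to achieve the above purpose, this application also provides a device, the device including: a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor, wherein the computer-readable instructions, when executed by the processor, implement the steps of the keyword determination method based on semantic recognition as described above.
In addition, to achieve the above purpose, this application also provides a non-volatile computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the steps of the keyword determination method based on semantic recognition as described above.
This application discloses a keyword determination method, device and storage medium based on semantic recognition. The method first obtains a search sentence input by a user, performs word segmentation on the search sentence, and extracts the feature vector of each word after segmentation; inputs the feature vectors into a trained multi-class perceptron to obtain the corresponding character tagging results, and obtains the corresponding search terms according to the character tagging results; inputs the search terms into a preset index library for query to obtain the corresponding candidate index items; determines the inverse document frequency of the search terms in the preset index library according to the candidate index items; and inputs the inverse document frequency, the search terms and the candidate index items into a preset similarity algorithm, determines the similarity value between each candidate index item and the corresponding search terms, and determines the keyword according to the similarity value. The character tagging method based on the multi-class perceptron segments the search sentence precisely; the preset index library then determines the candidate index items corresponding to the segmented words; finally, the calculated inverse document frequency is combined with the preset similarity algorithm to determine the similarity of each candidate index item, from which the keyword is determined. Keyword determination therefore conforms to the overall semantics of the search sentence, so that keywords are defined accurately and the accuracy of the search results improves.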
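Brief Description of the Drawings
FIG. 1 is a schematic diagram of the device structure of the hardware operating environment involved in the solution of the embodiments of this application;
FIG. 2 is a schematic flowchart of an embodiment of the keyword determination method based on semantic recognition of this application;
FIG. 3 is a schematic flowchart of another embodiment of the keyword determination method based on semantic recognition of this application;
FIG. 4 is a detailed flowchart of the step of inputting the search terms into the preset index library for query to obtain the corresponding candidate index items;
FIG. 5 is a detailed flowchart of the step of determining the inverse document frequency of the search terms in the preset index library according to the candidate index items.
The realization of the purpose, the functional characteristics and the advantages of this application will be further described with reference to the embodiments and the accompanying drawings.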
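Detailed Description
To make the purpose, technical solutions and advantages of this application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the optional embodiments described here are intended only to explain this application and not to limit it.
As shown in FIG. 1, FIG. 1 is a schematic diagram of the terminal structure of the hardware operating environment involved in the solution of the embodiments of this application.
The terminal of this application is a device, which may be a terminal device with a storage function such as a mobile phone, a computer, or a mobile computer.
As shown in FIG. 1, the terminal may include: a processor 1001 (for example a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as a disk memory. Optionally, the memory 1005 may also be a storage device independent of the aforementioned processor 1001.
Optionally, the terminal may also include a camera, a Wi-Fi module, and so on, which are not described in detail here.
Those skilled in the art will understand that the terminal structure shown in FIG. 1 does not constitute a limitation on the terminal, which may include more or fewer components than shown in the figure, combine some components, or arrange the components differently.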
在图1所示的终端中,网络接口1004主要用于连接后台服务器,与后台服务器进行数据通信;用户接口1003主要包括输入单元比如键盘,键盘包括无线键盘和有线键盘,用于连接客户端,与客户端进行数据通信;而处理器1001可以用于调用存储器1005中存储的计算机可读指令,并执行以下操作:
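obtaining a search sentence input by a user, segmenting the search sentence into words, and extracting a feature vector for each word obtained by segmentation;
inputting the feature vectors into the trained multi-class perceptron to obtain a corresponding character tagging result, and obtaining corresponding search terms according to the character tagging result;
inputting the search terms into a preset index library for querying to obtain corresponding candidate index items;
determining the inverse document frequency of the search terms in the preset index library according to the candidate index items;
inputting the inverse document frequency, the search terms, and the candidate index items into a preset similarity algorithm, determining a similarity value between each candidate index item and the corresponding search terms, and determining a keyword according to the similarity values.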
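Further, the processor 1001 may call the computer-readable instructions stored in the memory 1005 and also perform the following operations:
inputting the training sentences into a preset feature module to extract training feature vectors of the training sentences;
using the training feature vectors of the training sentences as training samples of the multi-class perceptron to obtain a trained multi-class perceptron.
Further, the processor 1001 may call the computer-readable instructions stored in the memory 1005 and also perform the following operations:
inputting the feature vectors into the trained multi-class perceptron to obtain a tagging position corresponding to each feature vector;
at the tagging position corresponding to each feature vector, tagging each feature vector with preset word-formation position information to obtain the corresponding character tagging result.
Further, the processor 1001 may call the computer-readable instructions stored in the memory 1005 and also perform the following operations:
segmenting the search sentence according to the word-formation position information to obtain a corresponding set of search terms;
inputting the set of search terms into a preset part-of-speech tagging algorithm, determining the part of speech of each word in the set, and determining the words whose part of speech is a preset search part of speech as search terms.
Further, the processor 1001 may call the computer-readable instructions stored in the memory 1005 and also perform the following operations:
inputting the search terms into the preset index library and determining the core words in the index library that correspond to the search terms;
taking the index items in the index library that correspond to the core words as the candidate index items.
Further, the processor 1001 may call the computer-readable instructions stored in the memory 1005 and also perform the following operations:
determining the number of candidate index items and the number of all index items in the preset index library;
dividing the number of candidate index items by the number of all index items and taking the logarithm of the quotient to obtain the inverse document frequency corresponding to the search terms.
Further, the processor 1001 may call the computer-readable instructions stored in the memory 1005 and also perform the following operations:
determining the number of search terms contained in each candidate index item and taking that number as the search term count;
calculating the similarity value of each candidate index item according to the search term count and the inverse document frequency.
Further, the processor 1001 may call the computer-readable instructions stored in the memory 1005 and also perform the following operations:
determining the similarity value of each candidate index item and determining the candidate index item with the highest similarity value as the keyword.
The optional embodiments of this apparatus are basically the same as the embodiments of the keyword determination method based on semantic recognition described below and are not repeated here.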
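Referring to FIG. 2, FIG. 2 is a schematic flowchart of one embodiment of the keyword determination method based on semantic recognition of this application. The method provided in this embodiment includes the following steps:
Step S10: obtain a search sentence input by a user, segment the search sentence into words, and extract a feature vector for each word obtained by segmentation;
In this embodiment, the search sentence input by the user is obtained first. The sentence typed by the user on a search interface may be used as the search sentence, the search sentence may be obtained by performing speech recognition on a voice recording from the user, or the search sentence may be obtained in other ways; this embodiment imposes no limitation here.
Optionally, after the search sentence input by the user is obtained, it may be segmented using an NLP algorithm or using a feature template extraction algorithm, and a feature vector is constructed for each word obtained by segmentation.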
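As an illustration of what a feature-template extraction might look like, the following minimal Python sketch builds context features for each character of a sentence. The specific templates (the current character, its two neighbours, and two character bigrams), the `<PAD>` marker, and the function name `char_features` are assumptions made for this sketch; the application does not specify them.

```python
def char_features(sentence, i):
    """Return the set of template features for the character at index i."""

    def char_at(k):
        # Positions outside the sentence are replaced by a padding marker.
        return sentence[k] if 0 <= k < len(sentence) else "<PAD>"

    prev_c, cur_c, next_c = char_at(i - 1), char_at(i), char_at(i + 1)
    return {
        f"C-1={prev_c}",            # previous character
        f"C0={cur_c}",              # current character
        f"C+1={next_c}",            # next character
        f"C-1C0={prev_c}{cur_c}",   # bigram ending at the current character
        f"C0C+1={cur_c}{next_c}",   # bigram starting at the current character
    }


# One sparse feature set per character of the input.
features = [char_features("固定资产", i) for i in range(4)]
```

Each feature set can be read as a sparse binary feature vector in which the listed features are 1 and all others are 0.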
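Step S20: input the feature vectors into the trained multi-class perceptron to obtain a corresponding character tagging result, and obtain corresponding search terms according to the character tagging result;
In this embodiment, multiple perceptrons of different classes are also preset. After the feature vectors corresponding to the search sentence are obtained, they are input into the multi-class perceptron. Because each perceptron treats only one class of target as positive examples and all other targets as negative examples, the multi-class perceptron is first trained on its sample data. The feature vectors are then input into the trained multi-class perceptron to obtain the corresponding character tagging result, and the corresponding search terms are obtained according to the character tagging result. The character tagging result refers to the tags assigned at the position of each character in the search sentence.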
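The arrangement of one perceptron per class can be sketched as follows: each tag keeps its own weight table, every tag is scored against the active features, and the highest-scoring tag wins. The tag names A, M, E, and I follow the example given later in this description; the weight values and the function name `predict_tag` are hypothetical and only serve the illustration.

```python
TAGS = ["A", "M", "E", "I"]  # word-begin, word-middle, word-end, single-character word


def predict_tag(weights, features):
    """Score each tag as the sum of its weights over the active features."""
    scores = {
        tag: sum(weights[tag].get(f, 0.0) for f in features)
        for tag in TAGS
    }
    # Ties are broken by the order of TAGS.
    return max(TAGS, key=lambda tag: scores[tag])


# Hand-set toy weights; a real model would learn these from training sentences.
weights = {tag: {} for tag in TAGS}
weights["A"]["C0=固"] = 1.0
weights["E"]["C0=产"] = 1.0

tag_first = predict_tag(weights, {"C0=固"})
tag_last = predict_tag(weights, {"C0=产"})
```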
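Step S30: input the search terms into a preset index library for querying to obtain corresponding candidate index items;
In this embodiment, an index library is also preset. The index library stores a mapping between search terms and candidate index items; the search terms are input into the preset index library to obtain the candidate index items corresponding to them.
Step S40: determine the inverse document frequency of the search terms in the preset index library according to the candidate index items;
The inverse document frequency reflects how important the obtained candidate index items are in the overall search. Therefore, after the candidate index items are obtained, the inverse document frequency corresponding to them is derived from the number of all index items in the preset index library, in order to determine the importance of the search terms.
Step S50: input the inverse document frequency, the search terms, and the candidate index items into a preset similarity algorithm, determine a similarity value between each candidate index item and the corresponding search terms, and determine a keyword according to the similarity values.
In this embodiment, a similarity algorithm is also preset. The similarity value of each candidate index item is calculated from the inverse document frequency, the search terms, and the candidate index items; optionally, the candidate index item with the highest similarity value is determined as the keyword.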
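Further, the multi-class perceptron includes multiple training sentences, and after the feature vector of each word is extracted in step S10, the method further includes:
Step S60: input the training sentences into a preset feature module to extract training feature vectors of the training sentences;
Based on the above embodiment, after the feature vector of each word in the search sentence is obtained, the multi-class perceptron needs to be trained in order to determine the character tagging result of each search term. A perceptron comes with corresponding training samples, which generally take the form of training sentences. The training sentences of the perceptron are input into the preset feature template, and the corresponding training feature vectors are extracted. It should be understood that if the feature vectors of the words were obtained from a feature template, the type of feature template used to train the perceptron should be the same as the type used to obtain the word feature vectors.
Step S70: use the training feature vectors of the training sentences as training samples of the multi-class perceptron to obtain a trained multi-class perceptron.
After the training feature vectors of the training sentences are obtained, they replace the training sentences as the new training samples of the perceptron, which yields the trained multi-class perceptron. The trained multi-class perceptron then produces the character tagging result of the search sentence, so that the keywords in the search sentence are determined precisely.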
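A minimal sketch of such training, assuming the standard mistake-driven perceptron update (add the features to the correct tag's weights and subtract them from the wrongly predicted tag's weights). The application does not state the update rule; the feature template, the three toy training sentences, and all function names here are hypothetical.

```python
TAGS = ["A", "M", "E", "I"]


def features_at(sentence, i):
    """Same template for training and prediction, as the description requires."""
    prev_c = sentence[i - 1] if i > 0 else "<PAD>"
    next_c = sentence[i + 1] if i + 1 < len(sentence) else "<PAD>"
    return [f"C0={sentence[i]}", f"C-1={prev_c}", f"C+1={next_c}"]


def predict(weights, feats):
    return max(TAGS, key=lambda t: sum(weights[t].get(f, 0.0) for f in feats))


def train(samples, epochs=5):
    """samples: list of (sentence, tags) pairs with one tag per character."""
    weights = {t: {} for t in TAGS}
    for _ in range(epochs):
        for sentence, tags in samples:
            for i, gold in enumerate(tags):
                feats = features_at(sentence, i)
                guess = predict(weights, feats)
                if guess != gold:
                    # Reward the correct tag and penalise the wrong guess.
                    for f in feats:
                        weights[gold][f] = weights[gold].get(f, 0.0) + 1.0
                        weights[guess][f] = weights[guess].get(f, 0.0) - 1.0
    return weights


samples = [("固定资产", "AMME"), ("季度", "AE"), ("是", "I")]
weights = train(samples)
tagged = "".join(predict(weights, features_at("固定资产", i)) for i in range(4))
```

On this toy data the weights stop changing after the second pass, and the trained model reproduces the tags of the training sentence. Practical systems often average the weights over all updates, which this sketch omits.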
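Further, the step of inputting the feature vectors into the trained multi-class perceptron to obtain the corresponding character tagging result includes:
Step S21: input the feature vectors into the trained multi-class perceptron to obtain a tagging position corresponding to each feature vector;
In this embodiment, the tagging position of each feature vector is obtained first, and tagging is performed at that position to obtain the character tagging result of the feature vector.
Generally, the number of tagging positions for each character in a feature vector corresponds to the word-formation position information. For example, if four kinds of word-formation position information are preset, namely word-beginning, word-middle, word-end, and single-character-word position information, then each character in the feature vector has four tagging positions.
Step S22: at the tagging position corresponding to each feature vector, tag each feature vector with preset word-formation position information to obtain the corresponding character tagging result.
As described above, assume the word-formation position information consists of word-beginning, word-middle, word-end, and single-character-word position information. It should be understood that the word-formation position information in this embodiment may also include other position information capable of tagging the feature vectors; this embodiment imposes no limitation here. After the tagging positions of the feature vectors are obtained, the word-beginning, word-middle, word-end, and single-character-word position information is used to tag the feature vectors at those positions, yielding the character tagging result of the search sentence. To explain this embodiment in more detail, an example follows: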
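Let the word-beginning position information be A, the word-middle position information be M, the word-end position information be E, and the single-character-word position information be I, and let the search sentence be: 本季度固定资产投资完成额是多少 ("What is the completed fixed-asset investment for this quarter?"). The character tagging result obtained from the multi-class perceptron is then: 本/I季/A度/E固/A定/M资/M产/E投/A资/M完/M成/M额/E是/I多/A少/E.
In this way, this embodiment obtains the character tagging result corresponding to the search sentence. The multi-class perceptron classifier makes a preliminary division of the words obtained by segmentation; compared with traditional word segmentation techniques, it better reflects the contextual semantics of each word within the sentence, so the division of words is more precise.
Further, the step of obtaining the corresponding search terms according to the character tagging result includes:
Step S23: segment the search sentence according to the word-formation position information to obtain a corresponding set of search terms;
The search sentence is segmented according to the word-formation position information and the character tagging result, yielding multiple different words, and these words are taken as the set of search terms.
To explain this embodiment in further detail, take as an example the word-formation position information A (word beginning), M (word middle), E (word end), and I (single-character word), and the search sentence 本季度固定资产投资完成额是多少. After the multi-class perceptron is applied, the character tagging result corresponding to the search sentence is: 本/I季/A度/E固/A定/M资/M产/E投/A资/M完/M成/M额/E是/I多/A少/E. A character tagged {I} can be taken as a search term by itself, and two, three, or more characters tagged {AE}, {AME}, or {AM...ME} can be taken together as one search term. The set of search terms corresponding to the above search sentence is therefore: 本, 季度, 固定资产, 投资完成额, 是, 多少. As another implementation, to reduce the amount of computation, characters tagged {I} may be left out of the set of search terms.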
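The grouping rule just described can be sketched in a few lines of Python. The tag string below is chosen so that 投资完成额 decodes as a single word, matching the set of search terms listed above; the function name `decode` and the defensive handling of malformed tag sequences are assumptions of this sketch.

```python
def decode(chars, tags):
    """Group characters into words using A (begin), M (middle), E (end), I (single)."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "I":
            if current:  # flush an unfinished word from a malformed sequence
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "A":
            if current:
                words.append(current)
            current = ch
        else:  # "M" or "E" extends the current word
            current += ch
            if tag == "E":
                words.append(current)
                current = ""
    if current:
        words.append(current)
    return words


sentence = "本季度固定资产投资完成额是多少"
tags = "IAEAMMEAMMMEIAE"  # one tag per character
words = decode(sentence, tags)
```

Dropping the single-character words, as the alternative implementation suggests, would amount to filtering out the entries whose tag is I before further processing.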
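Step S24: input the set of search terms into a preset part-of-speech tagging algorithm, determine the part of speech of each word in the set, and determine the words whose part of speech is a preset search part of speech as search terms.
A search sentence is generally a complete sentence containing many words with different parts of speech. Words of certain key parts of speech, such as nouns and adjectives, often carry the main meaning of the sentence and are likely to be the search terms. Therefore, in this proposal, part-of-speech analysis is performed on the words in the set of search terms to obtain the key words of the search sentence, namely the search terms.
In this embodiment, a part-of-speech tagging algorithm is also preset. When an NLP algorithm is used to segment the search sentence, the part-of-speech tagging of that NLP algorithm can be used to determine the part of speech of each word. Alternatively, the CLAWS (Constituent-Likelihood Automatic Word-tagging System) algorithm or the VOLSUNGA algorithm can be used to determine the part of speech of each word in the set of search terms; both are statistics-based part-of-speech tagging algorithms that tag parts of speech according to co-occurrence probabilities. Rule-based algorithms can also be used, that is, predefined rules are applied to disambiguate words that have multiple possible parts of speech so that a single correct part of speech is retained. This embodiment does not limit the part-of-speech tagging algorithm.
In this way, this embodiment performs precise word segmentation according to the character tagging result and analyzes the part of speech of each word to determine the keywords, thereby removing words such as modal particles from the search sentence and preventing them from affecting the final keyword determination.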
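The filtering step can be illustrated with a small Python sketch. The lookup table `POS_LEXICON` is a hypothetical stand-in for a real part-of-speech tagger (it is not CLAWS or VOLSUNGA), and the tag letters and the choice of nouns and adjectives as the preset search parts of speech are assumptions taken from the example in the text.

```python
# Hypothetical lexicon standing in for a tagger: n = noun, v = verb, r = pronoun.
POS_LEXICON = {
    "本": "r",
    "季度": "n",
    "固定资产": "n",
    "投资完成额": "n",
    "是": "v",
    "多少": "r",
}
SEARCH_POS = {"n", "a"}  # assumed preset search parts of speech: nouns, adjectives


def select_search_terms(words, lexicon=POS_LEXICON, keep=SEARCH_POS):
    """Keep only the words whose part of speech is in the preset search set."""
    return [w for w in words if lexicon.get(w) in keep]


terms = select_search_terms(["本", "季度", "固定资产", "投资完成额", "是", "多少"])
```

Words missing from the lexicon are dropped here; a real tagger would assign them a part of speech instead.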
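Further, the index library stores multiple index items and corresponding core words, and the step of inputting the search terms into the preset index library for querying to obtain the corresponding candidate index items includes:
Step S31: input the search terms into the preset index library and determine the core words in the index library that correspond to the search terms;
In this embodiment, an index library is preset, and it stores index items and their corresponding core words. It should be understood that index items and core words are not in one-to-one correspondence: multiple index items may share the same core word. A core word may be a word extracted directly from an index item, or a word specified by the user for that index item. For example, the core word of the index item 固定资产投资完成额 ("completed fixed-asset investment") is 投资完成额 ("completed investment").
Step S32: take the index items in the index library that correspond to the core words as the candidate index items.
After the core words corresponding to the search terms are determined, the index items in the preset index library that correspond to those core words are found and taken as the candidate index items. Because a core word in the index library may correspond to multiple index items, there may also be multiple candidate index items.
In this way, this embodiment determines the candidate index items corresponding to the search terms and avoids determining the keyword of the search sentence directly from multiple search terms, thereby reducing the amount of computation in keyword determination.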
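A minimal sketch of the core-word lookup, assuming the index library can be represented as a mapping from each index item to its core word. The first two entries come from the examples in this description; the third entry and the names `INDEX_LIBRARY` and `candidate_items` are hypothetical.

```python
# Each index item is stored together with its core word (many items may share one).
INDEX_LIBRARY = {
    "固定资产投资完成额": "投资完成额",
    "全社会固定资产投资完成额": "投资完成额",
    "居民消费价格指数": "价格指数",  # hypothetical extra entry
}


def candidate_items(search_terms, library=INDEX_LIBRARY):
    """Return the index items whose core word equals one of the search terms."""
    terms = set(search_terms)
    return [item for item, core in library.items() if core in terms]


candidates = candidate_items(["季度", "固定资产", "投资完成额"])
```

Matching a search term to a core word by exact equality is an assumption; the description leaves the matching rule open.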
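Further, the step of determining the inverse document frequency of the search terms in the preset index library according to the candidate index items includes:
Step S41: determine the number of candidate index items and the number of all index items in the preset index library;
After the candidate index items are obtained, the keyword of the search sentence is determined. The degree of similarity between a candidate index item and the search sentence is determined jointly by the number of search terms the candidate index item contains and the importance of those search terms, where the importance of a search term is related to its inverse document frequency. To obtain the inverse document frequency for the candidate index items and the search sentence, the number of candidate index items and the number of all index items in the preset index library are determined first.
Step S42: divide the number of candidate index items by the number of all index items and take the logarithm of the quotient to obtain the inverse document frequency corresponding to the search terms.
The inverse document frequency reflects the distinctiveness of a candidate index item: the more distinctive a candidate index item is, the more important it is and the more likely it is to be determined as the keyword. Among the index items in the preset index library, the fewer index items correspond to a given search term, the more important those index items are. The inverse document frequency can therefore be obtained by dividing the total number of index items in the index item set by the number of index items in that set that contain the search term, and then taking the logarithm of the quotient.
In this way, this embodiment determines the inverse document frequency corresponding to the candidate index items, which gives the importance of the corresponding search terms and in turn the similarity of each candidate index item.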
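The calculation can be sketched as follows. Step S42 words the quotient as the candidate count over the total count, whereas the explanatory paragraph describes the total count over the matching count; this sketch follows the latter, which is the conventional inverse document frequency, log(N / n). Treating "contains the search term" as a substring test, returning 0.0 for a term that matches nothing, and the last two index items in the toy list are assumptions of the sketch.

```python
import math


def inverse_document_frequency(term, index_items):
    """log(total items / items containing the term); 0.0 if nothing matches."""
    matching = sum(1 for item in index_items if term in item)
    if matching == 0:
        return 0.0
    return math.log(len(index_items) / matching)


items = ["固定资产投资完成额", "全社会固定资产投资完成额", "居民消费价格指数", "工业增加值"]
idf_common = inverse_document_frequency("固定资产", items)  # appears in 2 of 4 items
idf_rare = inverse_document_frequency("全社会", items)      # appears in 1 of 4 items
```

The rarer term receives the larger weight, which is the distinctiveness property the paragraph above relies on.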
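Further, the step of inputting the inverse document frequency, the search terms, and the candidate index items into the preset similarity algorithm to obtain the corresponding similarity values includes:
Step S51: determine the number of search terms contained in each candidate index item and take that number as the search term count;
In this embodiment, the number of search terms matched by each candidate index item is counted; the more search terms a candidate index item matches, the higher its similarity. To this end, the number of search terms contained in each candidate index item is determined and taken as the search term count.
For example, consider the candidate index items 全社会固定资产投资完成额 and 固定资产投资完成额 and the search terms 全社会, 固定资产, and 投资完成额. The candidate index item 全社会固定资产投资完成额 contains the search terms 全社会, 固定资产, and 投资完成额, while the candidate index item 固定资产投资完成额 contains only the search terms 固定资产 and 投资完成额. The candidate index item 全社会固定资产投资完成额 therefore contains more search terms than the candidate index item 固定资产投资完成额.
Step S52: calculate the similarity value of each candidate index item according to the search term count and the inverse document frequency.
The similarity value of each candidate index item is obtained from its search term count and the inverse document frequency. Optionally, the TF-IDF algorithm can be used to calculate the similarity of each candidate index item. TF-IDF identifies the terms that carry more information, raises the weight of those terms, and lowers the weight of frequently repeated terms, thereby emphasizing the information carried by the terms themselves.
Based on the number of search terms contained in each candidate index item and the inverse document frequency, this embodiment determines the similarity value of each candidate index item. Compared with traditional keyword matching methods, it uses two measures, the search term count and the inverse document frequency, to determine the similarity of the candidate index items, which makes the keyword determination more accurate.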
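One possible way to combine the two measures is to sum the inverse document frequencies of the search terms that a candidate index item contains, so that both the count and the importance of the matched terms raise the score. This combination rule is an assumption; the application only states that the similarity value is calculated from the search term count and the inverse document frequency. The toy index list and the function names are likewise hypothetical.

```python
import math


def idf(term, items):
    """log(total items / items containing the term); 0.0 if nothing matches."""
    n = sum(1 for it in items if term in it)
    return math.log(len(items) / n) if n else 0.0


def similarity(candidate, search_terms, items):
    """Return (matched term count, sum of the IDF weights of the matched terms)."""
    matched = [t for t in search_terms if t in candidate]
    return len(matched), sum(idf(t, items) for t in matched)


items = ["固定资产投资完成额", "全社会固定资产投资完成额", "居民消费价格指数", "工业增加值"]
terms = ["全社会", "固定资产", "投资完成额"]
count_a, score_a = similarity("全社会固定资产投资完成额", terms, items)
count_b, score_b = similarity("固定资产投资完成额", terms, items)
```

With these inputs the first candidate matches three search terms and the second matches two, so the first receives the higher score, in line with the example above.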
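Further, the step of determining the keyword according to the similarity values includes:
Step S53: determine the similarity value of each candidate index item and determine the candidate index item with the highest similarity value as the keyword.
After the similarity value of each candidate index item is obtained, the candidate index item with the highest similarity value is taken as the keyword, which completes the determination of the keyword in the search sentence. In particular, when two or more candidate index items share the same highest similarity value, all of them can be taken as keywords of the search sentence.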
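The selection rule, including the tie case, can be sketched as follows. The function name `pick_keywords` and the small tolerance used when comparing floating-point scores are assumptions of this sketch.

```python
def pick_keywords(scores, tol=1e-9):
    """Return every candidate whose score is within tol of the maximum score."""
    if not scores:
        return []
    best = max(scores.values())
    return [item for item, s in scores.items() if best - s <= tol]


single = pick_keywords({"固定资产投资完成额": 1.4, "全社会固定资产投资完成额": 2.8})
tied = pick_keywords({"a": 1.0, "b": 2.0, "c": 2.0})
```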
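In addition, the embodiments of this application further provide a non-volatile computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the operations of the keyword determination method based on semantic recognition described above.
The optional embodiments of the non-volatile computer-readable storage medium of this application are basically the same as the embodiments of the keyword determination method based on semantic recognition described above and are not repeated here.
It should be noted that, in this document, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or system. Unless further limited, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or system that includes that element.
The serial numbers of the above embodiments of this application are for description only and do not indicate the relative merits of the embodiments.
From the description of the above implementations, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of this application.
The above are only optional embodiments of this application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of this application.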

Claims (20)

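  1. A keyword determination method based on semantic recognition, comprising the following steps:
    obtaining a search sentence input by a user, segmenting the search sentence into words using an NLP algorithm or a feature template extraction algorithm, and extracting a feature vector for each word obtained by segmentation;
    inputting the training sentences of a multi-class perceptron into a preset feature module to extract training feature vectors of the training sentences;
    using the training feature vectors of the training sentences as training samples of the multi-class perceptron to obtain a trained multi-class perceptron;
    inputting the feature vectors into the trained multi-class perceptron to obtain a corresponding character tagging result, and obtaining corresponding search terms according to the character tagging result;
    inputting the search terms into a preset index library for querying to obtain corresponding candidate index items, wherein the preset index library stores a mapping between search terms and candidate index items;
    determining the inverse document frequency of the search terms in the preset index library according to the candidate index items;
    inputting the inverse document frequency, the search terms, and the candidate index items into a preset similarity algorithm, determining a similarity value between each candidate index item and the corresponding search terms, and determining a keyword according to the similarity values.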
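  2. The keyword determination method based on semantic recognition according to claim 1, wherein the step of inputting the feature vectors into the trained multi-class perceptron to obtain the corresponding character tagging result comprises:
    inputting the feature vectors into the trained multi-class perceptron to obtain a tagging position corresponding to each feature vector;
    at the tagging position corresponding to each feature vector, tagging each feature vector with preset word-formation position information to obtain the corresponding character tagging result.
  3. The keyword determination method based on semantic recognition according to claim 2, wherein the step of obtaining the corresponding search terms according to the character tagging result comprises:
    segmenting the search sentence according to the word-formation position information to obtain a corresponding set of search terms;
    inputting the set of search terms into a preset part-of-speech tagging algorithm, determining the part of speech of each word in the set, and determining the words whose part of speech is a preset search part of speech as search terms.
  4. The keyword determination method based on semantic recognition according to claim 1, wherein the index library stores multiple index items and corresponding core words, and the step of inputting the search terms into the preset index library for querying to obtain the corresponding candidate index items comprises:
    inputting the search terms into the preset index library and determining the core words in the index library that correspond to the search terms;
    taking the index items in the index library that correspond to the core words as the candidate index items.
  5. The keyword determination method based on semantic recognition according to claim 1, wherein the step of determining the inverse document frequency of the search terms in the preset index library according to the candidate index items comprises:
    determining the number of candidate index items and the number of all index items in the preset index library;
    dividing the number of candidate index items by the number of all index items and taking the logarithm of the quotient to obtain the inverse document frequency corresponding to the search terms.
  6. The keyword determination method based on semantic recognition according to claim 1, wherein the step of inputting the inverse document frequency, the search terms, and the candidate index items into the preset similarity algorithm to obtain the corresponding similarity values comprises:
    determining the number of search terms contained in each candidate index item and taking that number as the search term count;
    calculating the similarity value of each candidate index item according to the search term count and the inverse document frequency.
  7. The keyword determination method based on semantic recognition according to claim 6, wherein the step of determining the keyword according to the similarity values comprises:
    determining the similarity value of each candidate index item and determining the candidate index item with the highest similarity value as the keyword.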
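  8. An apparatus, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the computer-readable instructions, when executed by the processor, perform the following steps:
    obtaining a search sentence input by a user, segmenting the search sentence into words using an NLP algorithm or a feature template extraction algorithm, and extracting a feature vector for each word obtained by segmentation;
    inputting the training sentences of a multi-class perceptron into a preset feature module to extract training feature vectors of the training sentences;
    using the training feature vectors of the training sentences as training samples of the multi-class perceptron to obtain a trained multi-class perceptron;
    inputting the feature vectors into the trained multi-class perceptron to obtain a corresponding character tagging result, and obtaining corresponding search terms according to the character tagging result;
    inputting the search terms into a preset index library for querying to obtain corresponding candidate index items, wherein the preset index library stores a mapping between search terms and candidate index items;
    determining the inverse document frequency of the search terms in the preset index library according to the candidate index items;
    inputting the inverse document frequency, the search terms, and the candidate index items into a preset similarity algorithm, determining a similarity value between each candidate index item and the corresponding search terms, and determining a keyword according to the similarity values.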
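  9. The apparatus according to claim 8, wherein the computer-readable instructions, when executed by the processor, further perform the following steps:
    inputting the feature vectors into the trained multi-class perceptron to obtain a tagging position corresponding to each feature vector;
    at the tagging position corresponding to each feature vector, tagging each feature vector with preset word-formation position information to obtain the corresponding character tagging result.
  10. The apparatus according to claim 9, wherein the computer-readable instructions, when executed by the processor, further perform the following steps:
    segmenting the search sentence according to the word-formation position information to obtain a corresponding set of search terms;
    inputting the set of search terms into a preset part-of-speech tagging algorithm, determining the part of speech of each word in the set, and determining the words whose part of speech is a preset search part of speech as search terms.
  11. The apparatus according to claim 8, wherein the computer-readable instructions, when executed by the processor, further perform the following steps:
    inputting the search terms into the preset index library and determining the core words in the index library that correspond to the search terms;
    taking the index items in the index library that correspond to the core words as the candidate index items.
  12. The apparatus according to claim 8, wherein the computer-readable instructions, when executed by the processor, further perform the following steps:
    determining the number of candidate index items and the number of all index items in the preset index library;
    dividing the number of candidate index items by the number of all index items and taking the logarithm of the quotient to obtain the inverse document frequency corresponding to the search terms.
  13. The apparatus according to claim 8, wherein the computer-readable instructions, when executed by the processor, further perform the following steps:
    determining the number of search terms contained in each candidate index item and taking that number as the search term count;
    calculating the similarity value of each candidate index item according to the search term count and the inverse document frequency.
  14. The apparatus according to claim 13, wherein the computer-readable instructions, when executed by the processor, further perform the following steps:
    determining the similarity value of each candidate index item and determining the candidate index item with the highest similarity value as the keyword.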
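  15. A non-volatile computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, perform the following steps:
    obtaining a search sentence input by a user, segmenting the search sentence into words using an NLP algorithm or a feature template extraction algorithm, and extracting a feature vector for each word obtained by segmentation;
    inputting the training sentences of a multi-class perceptron into a preset feature module to extract training feature vectors of the training sentences;
    using the training feature vectors of the training sentences as training samples of the multi-class perceptron to obtain a trained multi-class perceptron;
    inputting the feature vectors into the trained multi-class perceptron to obtain a corresponding character tagging result, and obtaining corresponding search terms according to the character tagging result;
    inputting the search terms into a preset index library for querying to obtain corresponding candidate index items, wherein the preset index library stores a mapping between search terms and candidate index items;
    determining the inverse document frequency of the search terms in the preset index library according to the candidate index items;
    inputting the inverse document frequency, the search terms, and the candidate index items into a preset similarity algorithm, determining a similarity value between each candidate index item and the corresponding search terms, and determining a keyword according to the similarity values.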
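  16. The non-volatile computer-readable storage medium according to claim 15, wherein the computer-readable instructions, when executed by a processor, further perform the following steps:
    inputting the training sentences into a preset feature module to extract training feature vectors of the training sentences;
    using the training feature vectors of the training sentences as training samples of the multi-class perceptron to obtain a trained multi-class perceptron.
  17. The non-volatile computer-readable storage medium according to claim 16, wherein the computer-readable instructions, when executed by a processor, further perform the following steps:
    inputting the feature vectors into the trained multi-class perceptron to obtain a tagging position corresponding to each feature vector;
    at the tagging position corresponding to each feature vector, tagging each feature vector with preset word-formation position information to obtain the corresponding character tagging result.
  18. The non-volatile computer-readable storage medium according to claim 15, wherein the computer-readable instructions, when executed by a processor, further perform the following steps:
    segmenting the search sentence according to the word-formation position information to obtain a corresponding set of search terms;
    inputting the set of search terms into a preset part-of-speech tagging algorithm, determining the part of speech of each word in the set, and determining the words whose part of speech is a preset search part of speech as search terms.
  19. The non-volatile computer-readable storage medium according to claim 15, wherein the computer-readable instructions, when executed by a processor, further perform the following steps:
    inputting the search terms into the preset index library and determining the core words in the index library that correspond to the search terms;
    taking the index items in the index library that correspond to the core words as the candidate index items.
  20. The non-volatile computer-readable storage medium according to claim 15, wherein the computer-readable instructions, when executed by a processor, further perform the following steps:
    determining the number of candidate index items and the number of all index items in the preset index library;
    dividing the number of candidate index items by the number of all index items and taking the logarithm of the quotient to obtain the inverse document frequency corresponding to the search terms.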
PCT/CN2019/117577 2019-09-18 2019-11-12 Keyword determination method, apparatus, and storage medium based on semantic recognition WO2021051557A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910884362.4 2019-09-18
CN201910884362.4A CN110795942B (zh) 2019-09-18 2019-09-18 Keyword determination method, apparatus, and storage medium based on semantic recognition

Publications (1)

Publication Number Publication Date
WO2021051557A1 true WO2021051557A1 (zh) 2021-03-25

Family

ID=69427313

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/117577 WO2021051557A1 (zh) 2019-09-18 2019-11-12 Keyword determination method, apparatus, and storage medium based on semantic recognition

Country Status (2)

Country Link
CN (1) CN110795942B (zh)
WO (1) WO2021051557A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239697A (zh) * 2021-06-01 2021-08-10 平安科技(深圳)有限公司 Entity recognition model training method and apparatus, computer device, and storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753069B (zh) * 2020-06-09 2024-05-07 北京小米松果电子有限公司 Semantic retrieval method, apparatus, device, and storage medium
CN114385890B (zh) * 2022-03-22 2022-05-20 深圳市世纪联想广告有限公司 Internet public opinion monitoring system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090177754A1 (en) * 2008-01-03 2009-07-09 Xobni Corporation Presentation of Organized Personal and Public Data Using Communication Mediums
CN104731797A (zh) * 2013-12-19 2015-06-24 北京新媒传信科技有限公司 Method and apparatus for extracting keywords
CN105893410A (zh) * 2015-11-18 2016-08-24 乐视网信息技术(北京)股份有限公司 Keyword extraction method and apparatus
CN105989040A (zh) * 2015-02-03 2016-10-05 阿里巴巴集团控股有限公司 Intelligent question answering method, apparatus, and system
CN107608960A (zh) * 2017-09-08 2018-01-19 北京奇艺世纪科技有限公司 Method and apparatus for named entity linking

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040002849A1 (en) * 2002-06-28 2004-01-01 Ming Zhou System and method for automatic retrieval of example sentences based upon weighted editing distance
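CN101510221B (zh) * 2009-02-17 2012-05-30 北京大学 Query sentence analysis method and system for information retrieval
CN107122413B (zh) * 2017-03-31 2020-04-10 北京奇艺世纪科技有限公司 Keyword extraction method and apparatus based on a graph model
CN108345672A (zh) * 2018-02-09 2018-07-31 平安科技(深圳)有限公司 Intelligent response method, electronic device, and storage medium
CN108664473A (zh) * 2018-05-11 2018-10-16 平安科技(深圳)有限公司 Method for recognizing key information in text, electronic device, and readable storage medium
CN109992978B (zh) * 2019-03-05 2021-03-26 腾讯科技(深圳)有限公司 Information transmission method, apparatus, and storage medium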

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
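CN113239697A (zh) * 2021-06-01 2021-08-10 平安科技(深圳)有限公司 Entity recognition model training method and apparatus, computer device, and storage medium
CN113239697B (zh) * 2021-06-01 2023-03-24 平安科技(深圳)有限公司 Entity recognition model training method and apparatus, computer device, and storage medium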

Also Published As

Publication number Publication date
CN110795942A (zh) 2020-02-14
CN110795942B (zh) 2022-10-14

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19945600

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19945600

Country of ref document: EP

Kind code of ref document: A1