CN110795942A - Keyword determination method and device based on semantic recognition and storage medium - Google Patents
- Publication number
- CN110795942A CN201910884362.4A
- Authority
- CN
- China
- Prior art keywords
- word
- search
- determining
- preset
- candidate index
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
- G06F16/3346—Query execution using probabilistic model
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Probability & Statistics with Applications (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a keyword determination method, device and storage medium based on semantic recognition. The method comprises the following steps: acquiring a retrieval sentence input by a user, segmenting the retrieval sentence, and extracting a feature vector of each word after segmentation; inputting the feature vectors into trained multi-class perceptrons to obtain corresponding word annotation results, and obtaining corresponding search terms according to the word annotation results; inputting the search terms into a preset index library for query to obtain corresponding candidate index items; determining the reverse file frequency of each search term in the preset index library according to the candidate index items; and inputting the reverse file frequency, the search terms and the candidate index items into a preset similarity algorithm, determining similarity values between the candidate index items and the corresponding search terms, and determining the keywords according to the similarity values. With this method, the determined keywords conform to the overall semantics of the retrieval sentence, so the keywords are defined more accurately.
Description
Technical Field
The invention relates to the technical field of natural language processing, in particular to a keyword determination method and device based on semantic recognition and a storage medium.
Background
With the expansion of network information and the growth in the number of network users, people place higher demands on the timeliness and accuracy of network information acquisition, and search software and search engines have emerged accordingly. At present, the mainstream keyword determination method extracts keywords from a sentence input by a user and uses keyword matching to retrieve the data with the highest matching degree from a database, which is returned to the user as the search result.
However, this search method has defects in defining keywords: if the keywords are visually similar words or ambiguous words, they cannot be defined accurately, which causes the search results to deviate.
Disclosure of Invention
The invention mainly aims to provide a keyword determination method, a keyword determination device and a storage medium based on semantic recognition, and aims to solve the technical problem that the accuracy is too low due to the fact that the existing keyword determination method cannot accurately define keywords.
In order to achieve the above object, the present invention provides a keyword determination method based on semantic recognition, comprising the following steps:
acquiring a retrieval sentence input by a user, segmenting the retrieval sentence, and extracting a feature vector of each word after segmentation;
inputting the feature vectors into trained multi-class perceptrons to obtain corresponding word marking results, and obtaining corresponding search terms according to the word marking results;
inputting the search terms into a preset index library for query to obtain corresponding candidate index items;
determining the reverse file frequency (that is, the inverse document frequency, IDF) of the search term in a preset index library according to the candidate index item;
inputting the reverse file frequency, the search terms and the candidate index items into a preset similarity algorithm, determining similarity numerical values of the candidate index items and the corresponding search terms, and determining keywords according to the similarity numerical values.
Optionally, the multi-class perceptron includes a plurality of training sentences, and after the step of extracting feature vectors of the words after word segmentation, the method further includes:
inputting the training sentences into a preset feature module to extract training feature vectors of the training sentences;
and taking the training feature vector of the training sentence as a training sample of the multi-class perceptron to obtain the multi-class perceptron after training.
Optionally, the step of inputting the feature vectors into the trained multi-class perceptrons to obtain corresponding word annotation results includes:
inputting the feature vectors into trained multi-class perceptrons to obtain a labeling position corresponding to each feature vector;
and labeling each feature vector by using preset word-forming position information at a labeling position corresponding to each feature vector to obtain a corresponding word labeling result.
Optionally, the step of obtaining a corresponding search term according to the word annotation result includes:
segmenting the search sentences according to the word formation position information to obtain a corresponding search word set;
and inputting the search word set into a preset part-of-speech tagging algorithm, determining the part of speech of each word in the search word set, and determining the word with the part of speech being the preset search part of speech as the search word.
Optionally, the index library stores a plurality of index items and corresponding core words, and the step of inputting the search word into a preset index library for querying to obtain corresponding candidate index items includes:
inputting the search word into a preset index library, and determining a core word corresponding to the search word in the index library;
and taking the index item corresponding to the core word in the index library as the candidate index item.
Optionally, the step of determining, according to the candidate index item, a reverse file frequency of the search term in a preset index library includes:
determining the number of the candidate index items and the number of all index items in a preset index library;
and dividing the number of the candidate index items by the number of all the index items, and taking the logarithm of the obtained quotient to obtain the reverse file frequency corresponding to the search term.
Optionally, the step of inputting the reverse file frequency, the search term, and the candidate index item into a preset similarity algorithm to obtain a corresponding similarity value includes:
determining the number of search terms contained in the candidate index item, and taking the number as the number of the search terms;
and calculating to obtain a similarity numerical value of the candidate index item according to the number of the search terms and the reverse file frequency.
Optionally, the step of determining the keyword according to the similarity value includes:
and determining the similarity value of each candidate index item, and determining the candidate index item with the highest similarity value as the keyword.
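The reverse file frequency, similarity, and keyword selection steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the text only says that the similarity value is calculated from the number of search terms and the reverse file frequency, so the scoring rule used here (summing the weights of the search terms found in each candidate) is an assumption, as are all function names and the toy data. The quotient follows the text as written (candidate count over total count), and its magnitude is returned so that rarer terms receive larger weights; the conventional inverse document frequency takes the logarithm of the reciprocal quotient, which gives the same magnitude.

```python
import math

def reverse_file_frequency(num_candidates, num_total):
    """Log of (candidate count / total count), as the text describes.

    The raw logarithm is <= 0, so its magnitude is returned. Taking the
    magnitude is an assumption; it equals log(total / candidates).
    """
    if num_candidates <= 0 or num_total <= 0:
        return 0.0
    return abs(math.log(num_candidates / num_total))

def similarity(candidate, term_weights):
    """Assumed scoring rule: sum the weights of the search terms that
    occur in the candidate index item."""
    return sum(w for term, w in term_weights.items() if term in candidate)

def pick_keyword(candidates_by_term, num_total):
    """candidates_by_term maps each search term to its candidate index items.

    Returns the candidate with the highest similarity value, or None when
    there are no candidates.
    """
    term_weights = {
        term: reverse_file_frequency(len(items), num_total)
        for term, items in candidates_by_term.items()
    }
    all_candidates = {c for items in candidates_by_term.values() for c in items}
    if not all_candidates:
        return None
    # sorted() makes tie-breaking deterministic.
    return max(sorted(all_candidates), key=lambda c: similarity(c, term_weights))

# Hypothetical toy data: 10 index items in total, two search terms.
toy_candidates = {
    "investment": [
        "fixed asset investment completion amount",
        "real estate investment amount",
    ],
    "fixed asset": ["fixed asset investment completion amount"],
}
best = pick_keyword(toy_candidates, num_total=10)
```

In this toy case the candidate that contains both search terms receives the larger sum and is selected.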
Further, to achieve the above object, the present invention also provides an apparatus comprising: the system comprises a memory, a processor and a keyword determination program based on semantic recognition, which is stored on the memory and can run on the processor, wherein the keyword determination program based on semantic recognition realizes the steps of the keyword determination method based on semantic recognition when being executed by the processor.
Further, to achieve the above object, the present invention also provides a computer readable storage medium having stored thereon a keyword determination program based on semantic recognition, which when executed by a processor implements the steps of the keyword determination method based on semantic recognition as described above.
The invention discloses a keyword determination method, device and storage medium based on semantic recognition. The method first acquires a retrieval sentence input by a user, segments the retrieval sentence, and extracts the feature vector of each word after segmentation; inputs the feature vectors into trained multi-class perceptrons to obtain corresponding word annotation results, and obtains corresponding search terms according to the word annotation results; inputs the search terms into a preset index library for query to obtain corresponding candidate index items; determines the reverse file frequency of each search term in the preset index library according to the candidate index items; and inputs the reverse file frequency, the search terms and the candidate index items into a preset similarity algorithm, determines similarity values between the candidate index items and the corresponding search terms, and determines the keywords according to the similarity values. By using a word annotation method based on multi-class perceptrons, the retrieval sentence is segmented accurately; candidate index items corresponding to the segmented words are determined through the preset index library; finally, the similarity of each candidate index item is determined by combining the calculated reverse file frequency with the preset similarity algorithm, and the keywords are determined according to the similarity. In this way, the determined keywords conform to the overall semantics of the retrieval sentence, the keywords are defined accurately, and the accuracy of the search results is improved.
Drawings
FIG. 1 is a schematic diagram of an apparatus in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a keyword determination method based on semantic recognition according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating another embodiment of a keyword determination method based on semantic recognition according to the present invention;
FIG. 4 is a flowchart illustrating a detailed process of the step of inputting the search term into a preset index library for querying to obtain a corresponding candidate index item according to the present invention;
FIG. 5 is a detailed flowchart of the step of determining the reverse file frequency of the search term in the preset index library according to the candidate index items.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
FIG. 1 is a schematic structural diagram of a terminal in the hardware operating environment according to an embodiment of the present invention.
The terminal of the invention may be a mobile phone, a computer, a mobile computer or other terminal equipment with a storage function.
As shown in fig. 1, the terminal may include: a processor 1001 (such as a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Optionally, the terminal may further include a camera, a Wi-Fi module, and the like, which are not described herein again.
Those skilled in the art will appreciate that the terminal structure shown in fig. 1 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
In the terminal shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and performing data communication with the backend server; the user interface 1003 mainly includes an input unit such as a keyboard including a wireless keyboard and a wired keyboard, and is used to connect to the client and perform data communication with the client; and the processor 1001 may be configured to call the keyword determination program based on semantic recognition stored in the memory 1005, and perform the following operations:
acquiring a retrieval sentence input by a user, segmenting the retrieval sentence, and extracting a feature vector of each word after segmentation;
inputting the feature vectors into trained multi-class perceptrons to obtain corresponding word marking results, and obtaining corresponding search terms according to the word marking results;
inputting the search terms into a preset index library for query to obtain corresponding candidate index items;
determining the reverse file frequency of the search term in a preset index library according to the candidate index item;
inputting the reverse file frequency, the search terms and the candidate index items into a preset similarity algorithm, determining similarity numerical values of the candidate index items and the corresponding search terms, and determining keywords according to the similarity numerical values.
Further, the processor 1001 may call the keyword determination program based on semantic recognition stored in the memory 1005, and further perform the following operations:
inputting the training sentences into a preset feature module to extract training feature vectors of the training sentences;
and taking the training feature vector of the training sentence as a training sample of the multi-class perceptron to obtain the multi-class perceptron after training.
Further, the processor 1001 may call the keyword determination program based on semantic recognition stored in the memory 1005, and further perform the following operations:
inputting the feature vectors into trained multi-class perceptrons to obtain a labeling position corresponding to each feature vector;
and labeling each feature vector by using preset word-forming position information at a labeling position corresponding to each feature vector to obtain a corresponding word labeling result.
Further, the processor 1001 may call the keyword determination program based on semantic recognition stored in the memory 1005, and further perform the following operations:
segmenting the search sentences according to the word formation position information to obtain a corresponding search word set;
and inputting the search word set into a preset part-of-speech tagging algorithm, determining the part of speech of each word in the search word set, and determining the word with the part of speech being the preset search part of speech as the search word.
Further, the processor 1001 may call the keyword determination program based on semantic recognition stored in the memory 1005, and further perform the following operations:
inputting the search word into a preset index library, and determining a core word corresponding to the search word in the index library;
and taking the index item corresponding to the core word in the index library as the candidate index item.
Further, the processor 1001 may call the keyword determination program based on semantic recognition stored in the memory 1005, and further perform the following operations:
determining the number of the candidate index items and the number of all index items in a preset index library;
and dividing the number of the candidate index items by the number of all the index items, and taking the logarithm of the obtained quotient to obtain the reverse file frequency corresponding to the search term.
Further, the processor 1001 may call the keyword determination program based on semantic recognition stored in the memory 1005, and further perform the following operations:
determining the number of search terms contained in the candidate index item, and taking the number as the number of the search terms;
and calculating to obtain a similarity numerical value of the candidate index item according to the number of the search terms and the reverse file frequency.
Further, the processor 1001 may call the keyword determination program based on semantic recognition stored in the memory 1005, and further perform the following operations:
and determining the similarity value of each candidate index item, and determining the candidate index item with the highest similarity value as the keyword.
The specific embodiment of the apparatus is basically the same as the following embodiments of the keyword determination method based on semantic recognition, and is not described herein again.
Referring to fig. 2, fig. 2 is a schematic flowchart of an embodiment of a method for determining keywords based on semantic recognition according to the present invention, where the method for determining keywords based on semantic recognition provided in this embodiment includes the following steps:
Step S10, acquiring a retrieval sentence input by a user, segmenting the retrieval sentence, and extracting the feature vector of each word after segmentation;
In this embodiment, the retrieval sentence input by the user is obtained first. The sentence typed by the user on the search interface may be used as the retrieval sentence, the retrieval sentence may be obtained by performing speech recognition on the user's voice input, or it may be obtained in other manners; this embodiment does not specifically limit this.
Optionally, after the retrieval sentence input by the user is obtained, an NLP algorithm or a feature template extraction algorithm may be used to segment the retrieval sentence, and a feature vector is constructed for each word after segmentation.
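The text does not specify the feature template. As an illustration only, the sketch below assumes a character-window template of the kind commonly used for tagging-based segmentation: the previous, current, and next character, plus the two adjacent bigrams. The function name, the feature names, and the boundary marker are all hypothetical.

```python
def extract_features(sentence, i):
    """Return string features for the character at position i.

    Assumed template: previous, current, and next character, plus the
    two bigrams that include the current character.
    """
    pad = "#"  # hypothetical marker for positions outside the sentence
    prev_c = sentence[i - 1] if i > 0 else pad
    cur_c = sentence[i]
    next_c = sentence[i + 1] if i + 1 < len(sentence) else pad
    return [
        "C-1=" + prev_c,
        "C0=" + cur_c,
        "C+1=" + next_c,
        "C-1C0=" + prev_c + cur_c,
        "C0C+1=" + cur_c + next_c,
    ]

# One feature list per character of a placeholder sentence.
feature_lists = [extract_features("abc", i) for i in range(3)]
```

Each feature list can be read as a sparse binary feature vector in which only the listed features are active.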
Step S20, inputting the feature vectors into trained multi-class perceptrons to obtain corresponding word labeling results, and obtaining corresponding search terms according to the word labeling results;
In this embodiment, a plurality of perceptrons for different classes are also preset. Each perceptron treats only one class of target as positive examples and all other targets as negative examples, so the multi-class perceptrons can first be trained on sample data. After the feature vectors corresponding to the retrieval sentence are obtained, they are input into the trained multi-class perceptrons to obtain the corresponding word annotation results, and the corresponding search terms are obtained according to the word annotation results. The word annotation result is the annotation given at the position of each word in the retrieval sentence.
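The one-positive-class-per-perceptron arrangement described above can be sketched as a one-vs-rest perceptron. This is a minimal illustration under assumptions: the class name, the sparse string features, the unit update step, and the number of epochs are all hypothetical and are not taken from the text.

```python
class OneVsRestPerceptron:
    """One binary perceptron per label. Each treats its own label as the
    positive class and every other label as negative."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.weights = {label: {} for label in self.labels}

    def score(self, features, label):
        w = self.weights[label]
        return sum(w.get(f, 0.0) for f in features)

    def predict(self, features):
        # The label whose perceptron gives the highest score wins.
        return max(self.labels, key=lambda label: self.score(features, label))

    def train(self, samples, epochs=5):
        for _ in range(epochs):
            for features, gold in samples:
                for label in self.labels:
                    target = 1.0 if label == gold else -1.0
                    # Update only when this binary perceptron is wrong
                    # or undecided on the sample.
                    if target * self.score(features, label) <= 0.0:
                        w = self.weights[label]
                        for f in features:
                            w[f] = w.get(f, 0.0) + target

# Hypothetical toy training set with two single-feature samples.
model = OneVsRestPerceptron("AMEI")
model.train([(["C0=a"], "A"), (["C0=b"], "E")])
```

At prediction time the four binary scorers are compared and the highest-scoring label is taken as the annotation for that position.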
Step S30, inputting the search terms into a preset index library for query to obtain corresponding candidate index items;
In this embodiment, an index library is also preset, in which a mapping relationship between search terms and candidate index items is stored; the search term is input into the preset index library to obtain the candidate index items corresponding to the search term.
Step S40, determining the reverse file frequency of the search term in a preset index library according to the candidate index item;
The reverse file frequency (that is, the inverse document frequency, IDF) reflects the importance, in the overall retrieval, of the search term from which the candidate index items were obtained. Therefore, after the candidate index items are obtained, the reverse file frequency corresponding to the candidate index items is computed using the number of all index items in the preset index library, so as to determine the importance of the search term.
Step S50, inputting the reverse file frequency, the search term and the candidate index item into a preset similarity algorithm, determining the similarity value of the candidate index item and the corresponding search term, and determining the keyword according to the similarity value.
In this embodiment, a similarity algorithm is also preset. A similarity value of each candidate index item is calculated according to the reverse file frequency, the search term, and the candidate index item; optionally, the candidate index item with the highest similarity value is determined as the keyword.
Further, the multi-class perceptron includes a plurality of training sentences, and after the step S10 extracts the feature vectors of the words after the word segmentation, the method further includes:
Step S60, inputting the training sentences into a preset feature module to extract training feature vectors of the training sentences;
Based on the above embodiment, after the feature vector of each word in the retrieval sentence is obtained, the multi-class perceptron needs to be trained in order to determine the word annotation result of each search term. The perceptron has corresponding training samples, which generally take the form of training sentences; the training sentences of the perceptron are input into the preset feature template to extract the corresponding training feature vectors. It should be understood that if the feature vectors of the words are obtained from a feature template, the type of feature template used for training the perceptron should be the same as the type used for obtaining the word feature vectors.
Step S70, taking the training feature vectors of the training sentences as training samples of the multi-class perceptron to obtain the trained multi-class perceptron.
After the training feature vectors of the training sentences are obtained, they replace the training sentences as the new training samples of the perceptron, and the trained multi-class perceptron is obtained. The word annotation result of the retrieval sentence is then obtained through the trained multi-class perceptron, so that the keywords in the retrieval sentence are determined accurately.
Further, the step of inputting the feature vector into the trained multiple classes of perceptrons to obtain the corresponding word annotation result includes:
Step S21, inputting the feature vectors into the trained multi-class perceptrons to obtain the labeling position corresponding to each feature vector;
In this embodiment, the labeling position of each feature vector is obtained first, and a label is then applied at that position to obtain the word annotation result of the feature vector.
Generally, the number of labeling positions for each character in the feature vector corresponds to the number of kinds of word-formation position information. For example, if 4 kinds of word-formation position information are preset, namely word-initial position information, in-word position information, word-final position information, and single-character-word position information, then each character in the feature vector has 4 corresponding labeling positions.
Step S22, labeling each feature vector with the preset word-formation position information at the labeling position corresponding to each feature vector to obtain the corresponding word annotation result.
As described above, the word-formation position information is assumed to include word-initial, in-word, word-final, and single-character-word position information; it should be understood that the word-formation position information in this embodiment may also include other word-formation position information capable of labeling the feature vectors, which is not limited here. After the labeling positions of the feature vectors are obtained, the feature vectors are labeled at those positions with the word-initial, in-word, word-final, and single-character-word position information to obtain the word annotation result of the retrieval sentence. To describe this embodiment in more detail, a specific example is given below:
Let the word-initial position information be A, the in-word position information be M, the word-final position information be E, and the single-character-word position information be I, and let the retrieval sentence be: "What is the fixed asset investment completion amount this quarter?" The word annotation result obtained by the multi-class perceptron is: this/I season/A degree/E fixed/A fixed/M capital/M productive/E invested/A capital/E complete/A productive/M rated/E is/I excessive/A deficient/E. (Each token is one character of the original sentence, rendered by its literal English gloss.)
In this embodiment, the word annotation result corresponding to the retrieval sentence is obtained in the above manner. Compared with traditional word segmentation techniques, the preliminary classification performed by the multi-class perceptron on the segmented words better reflects the contextual semantics of the words in the sentence, so the words are divided more accurately.
Further, the step of obtaining the corresponding search term according to the word annotation result includes:
Step S23, segmenting the retrieval sentence according to the word-formation position information to obtain a corresponding search word set;
The retrieval sentence is segmented according to the word-formation position information and the word annotation result to obtain a plurality of different words, and the words obtained by the segmentation are taken as the search word set.
To explain this embodiment in further detail, take as an example the word-formation position information being word-initial position information A, in-word position information M, word-final position information E, and single-character-word position information I, and the retrieval sentence being: "What is the fixed asset investment completion amount this quarter?" After passing through the multi-class perceptron, the word annotation result corresponding to the retrieval sentence is: this/I season/A degree/E fixed/A fixed/M capital/M productive/E invested/A capital/E complete/A productive/M rated/E is/I excessive/A deficient/E. A character labeled I can then be used as a search word, and a run of characters labeled AE, AME, or ME can be used as a search word. The search word set corresponding to the retrieval sentence is then: this, quarter, fixed assets, investment completion amount, is. As another embodiment, to reduce the amount of computation, words labeled I may be excluded from the search word set.
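The grouping of labeled characters into words in the example above can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, Latin letters stand in for the characters of the retrieval sentence, and the handling of malformed label sequences (flushing an unfinished word) is an assumption.

```python
def decode_tags(chars, tags):
    """Group characters into words from A/M/E/I position labels.

    I = single-character word, A = first character of a word,
    M = character inside a word, E = last character of a word.
    """
    words, current = [], []
    for ch, tag in zip(chars, tags):
        if tag == "I":
            if current:  # flush an unfinished word (malformed input)
                words.append("".join(current))
                current = []
            words.append(ch)
        elif tag == "A":
            if current:  # flush an unfinished word (malformed input)
                words.append("".join(current))
            current = [ch]
        elif tag == "M":
            current.append(ch)
        elif tag == "E":
            current.append(ch)
            words.append("".join(current))
            current = []
        else:
            raise ValueError(f"unknown tag: {tag}")
    if current:
        words.append("".join(current))
    return words

# Placeholder characters with the first seven labels of the example.
words = decode_tags(list("abcdefg"), list("IAEAMME"))
```

Dropping the single-character words, as the alternative embodiment suggests, would amount to filtering out the entries that came from an I label.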
Step S24, inputting the search word set into a preset part of speech tagging algorithm, determining the part of speech of each word in the search word set, and determining the word with the part of speech being the preset search part of speech as the search word.
The retrieval sentence is generally a complete sentence and includes many words with different parts of speech, wherein some key words often represent the main meaning of the sentence, such as nouns, adjectives, and these parts of speech words are likely to be retrieval words. Therefore, in the present proposal, it is necessary to perform part-of-speech analysis on words in the search term set to obtain a search term, which is a key word of the search term.
Specifically, a part-of-speech tagging algorithm is also preset in this embodiment. When an NLP algorithm is used to segment the search sentence, the part-of-speech tagging function of that NLP algorithm can be used to determine the part of speech of each word. The parts of speech of the words in the search word set may also be determined using the CLAWS (Constituent-Likelihood Automatic Word-tagging System) algorithm or the VOLSUNGA algorithm, both of which are statistics-based part-of-speech tagging algorithms that assign tags according to co-occurrence probability. Alternatively, a rule-based algorithm may be used, in which preset rules disambiguate words that have several possible parts of speech so that only the correct part of speech is retained. It should be readily understood that this embodiment is not limited to a specific part-of-speech tagging algorithm.
In this embodiment, accurate word segmentation is performed according to the word annotation result, and the parts of speech of the resulting words are analyzed to determine the search words. Words of parts of speech such as modal particles are thereby removed from the search sentence, so that they do not affect the final keyword determination.
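The part-of-speech filtering of step S24 can be sketched as below. The embodiment does not fix a tagging algorithm, so this sketch assumes the tags are already available in a plain dictionary; `SEARCH_POS`, `select_search_words`, and the tag assigned to each word are hypothetical and serve only as an illustration.

```python
from typing import Dict, FrozenSet, Iterable, List

# Assumed set of "preset search parts of speech" (nouns and adjectives).
SEARCH_POS: FrozenSet[str] = frozenset({"noun", "adjective"})


def select_search_words(words: Iterable[str],
                        pos_of: Dict[str, str],
                        search_pos: FrozenSet[str] = SEARCH_POS) -> List[str]:
    """Keep only the words whose part of speech is in the preset set."""
    return [w for w in words if pos_of.get(w) in search_pos]


# The dictionary stands in for the output of any part-of-speech tagger.
search_word_set = ["this", "quarter", "fixed assets", "investment completion amount", "is"]
pos_of = {
    "this": "pronoun",
    "quarter": "noun",
    "fixed assets": "noun",
    "investment completion amount": "noun",
    "is": "verb",
}
print(select_search_words(search_word_set, pos_of))
# ['quarter', 'fixed assets', 'investment completion amount']
```

Words with no known tag are dropped here; a real system might instead keep them or fall back to a default tag.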
Further, the index library stores a plurality of index items and corresponding core words, and the step of inputting the search words into a preset index library for querying to obtain corresponding candidate index items includes:
Step S31, inputting the search word into a preset index library, and determining a core word corresponding to the search word in the index library;
In this embodiment, an index library is preset, and index items and their corresponding core words are stored in it. It should be understood that index items and core words are not in one-to-one correspondence: several index items may correspond to the same core word. A core word may be a word extracted directly from an index item, or a word that a user has assigned to the index item. For example, the core word corresponding to the index item "fixed asset investment completion amount" is "investment completion amount".
Step S32, using the index item corresponding to the core word in the index library as the candidate index item.
After the core word corresponding to the search word is determined, the index items corresponding to that core word in the preset index library are used as candidate index items. Because a core word in the index library may correspond to several index items, there may be more than one candidate index item.
In this embodiment, the candidate index items corresponding to the search words are determined in the above manner, which avoids determining the keyword of the search sentence directly from a plurality of search words and thereby reduces the amount of computation in the keyword determination process.
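Steps S31 and S32 amount to a lookup from core word to index items, which can be sketched as follows. The sketch assumes that a search word corresponds to a core word when the two strings are identical; the text does not specify the matching rule. The function names and the third library entry are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_core_word_map(index_library: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group index items by core word; one core word may map to several items."""
    core_to_items: Dict[str, List[str]] = defaultdict(list)
    for index_item, core_word in index_library:
        core_to_items[core_word].append(index_item)
    return dict(core_to_items)


def candidate_index_items(search_words: Iterable[str],
                          core_to_items: Dict[str, List[str]]) -> List[str]:
    """Collect, without duplicates, the index items whose core word matches a search word."""
    candidates: List[str] = []
    for word in search_words:
        for item in core_to_items.get(word, []):
            if item not in candidates:
                candidates.append(item)
    return candidates


# (index item, core word) pairs; the last entry is invented for illustration.
library = [
    ("fixed asset investment completion amount", "investment completion amount"),
    ("whole-society fixed asset investment completion amount", "investment completion amount"),
    ("total retail sales of consumer goods", "retail sales"),
]
core_map = build_core_word_map(library)
print(candidate_index_items(["fixed assets", "investment completion amount"], core_map))
# ['fixed asset investment completion amount',
#  'whole-society fixed asset investment completion amount']
```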
Further, the step of determining the reverse file frequency of the search term in a preset index library according to the candidate index item includes:
Step S41, determining the number of the candidate index items and the number of all index items in a preset index library;
After the candidate index items are obtained, the keyword in the search sentence is determined. The degree of similarity between a candidate index item and the search sentence is determined jointly by the number of search words the candidate index item contains and the importance of those search words, and the importance of a search word is related to its reverse file frequency. To obtain the reverse file frequency, the number of candidate index items and the number of all index items in the preset index library are determined.
Step S42, dividing the number of all index items by the number of the candidate index items, and taking the logarithm of the obtained quotient to obtain the reverse file frequency corresponding to the search term.
The reverse file frequency (that is, the inverse document frequency, IDF) reflects the discriminating power of the candidate index items: the higher the discriminating power, the more important the candidate index items and the more likely they are to be determined as keywords. Among the index items of the preset index library, the fewer index items correspond to a search word, the more important those index items are. The reverse file frequency can therefore be obtained by dividing the total number of index items in the index item set by the number of index items that contain the search word, and then taking the logarithm of the quotient.
In this embodiment, the reverse file frequency corresponding to the candidate index item is determined in the above manner, so as to determine the importance corresponding to the search term, and further determine the similarity of each candidate index item.
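The calculation can be sketched as below, following the formulation in which the total number of index items is divided by the number of index items containing the search word. The function name, the use of the natural logarithm, and the example counts are assumptions; the text does not state the logarithm base.

```python
import math


def reverse_file_frequency(num_candidates: int, num_all_items: int) -> float:
    """Return log(total index items / index items matching the search word)."""
    if num_candidates <= 0 or num_all_items <= 0:
        raise ValueError("both counts must be positive")
    return math.log(num_all_items / num_candidates)


# Hypothetical library of 200 index items.
rare = reverse_file_frequency(2, 200)     # search word matches few index items
common = reverse_file_frequency(50, 200)  # search word matches many index items
print(rare > common)  # True: the rarer search word gets the larger value
```

With this orientation the value is never negative when the candidates are a subset of the library, and it shrinks to zero as a search word comes to match every index item.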
Further, the step of inputting the reverse file frequency, the search term and the candidate index item into a preset similarity algorithm to obtain a corresponding similarity value includes:
Step S51, determining the number of the search terms contained in the candidate index item, and taking the number as the number of the search terms;
In this embodiment, the number of search words matched by each candidate index item is counted; the more search words a candidate index item matches, the higher its similarity. To this end, the number of search words contained in the candidate index item is determined, and this number is taken as the number of search terms.
For example, consider the candidate index items "whole-society fixed asset investment completion amount" and "fixed asset investment completion amount", and the search words "whole society", "fixed asset", and "investment completion amount". The candidate index item "whole-society fixed asset investment completion amount" contains all three search words, whereas the candidate index item "fixed asset investment completion amount" contains only "fixed asset" and "investment completion amount". The former therefore contains more search words than the latter.
Step S52, calculating the similarity value of the candidate index item according to the number of the search terms and the reverse file frequency.
The similarity value of each candidate index item is obtained from its number of search terms and the reverse file frequency. Optionally, the similarity of each candidate index item can be calculated using a TF-IDF algorithm. The TF-IDF algorithm identifies words with high information content based on contextual semantics, raises the weighting coefficient of those words, and lowers the weighting coefficient of frequently repeated words, so that the information entropy carried by each word is emphasized.
Compared with the traditional keyword matching method, determining the similarity of the candidate index items from two measures, the number of search words and the reverse file frequency, makes the keyword determination more accurate.
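One way to combine the two measures is sketched below. The text does not give an explicit formula, so this sketch assumes a simple one: the similarity of a candidate index item is the sum of the reverse file frequencies of the search words it contains, with containment tested by substring match. The function names and the frequency values are hypothetical.

```python
from typing import Dict, List


def count_contained(candidate: str, search_words: List[str]) -> int:
    """Number of search words that occur in the candidate index item (step S51)."""
    return sum(1 for w in search_words if w in candidate)


def similarity(candidate: str, search_words: List[str], rff: Dict[str, float]) -> float:
    """Assumed scoring rule: sum the reverse file frequency of each contained search word."""
    return sum(rff.get(w, 0.0) for w in search_words if w in candidate)


search_words = ["whole society", "fixed asset", "investment completion amount"]
# Illustrative reverse file frequency values, not taken from the patent.
rff = {"whole society": 2.0, "fixed asset": 1.5, "investment completion amount": 1.0}

long_item = "whole society fixed asset investment completion amount"
short_item = "fixed asset investment completion amount"

print(count_contained(long_item, search_words), similarity(long_item, search_words, rff))
# 3 4.5
print(count_contained(short_item, search_words), similarity(short_item, search_words, rff))
# 2 2.5
```

Under this rule a candidate scores higher both when it contains more search words and when the words it contains are rarer in the index library, which matches the two measures named above.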
Further, the step of determining the keyword according to the similarity value includes:
Step S53, determining the similarity value of each candidate index item, and determining the candidate index item with the highest similarity value as the keyword.
After the similarity values of the candidate index items are obtained, the candidate index item with the highest similarity value is used as the keyword, which completes the determination of the keyword in the search sentence. Specifically, when two or more candidate index items share the highest similarity value, all of them may be used as keywords of the search sentence.
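The selection in step S53, including the case of ties, can be sketched as follows. The function name `pick_keywords` and the tolerance parameter `tol`, used to compare floating-point scores, are assumptions introduced for this sketch.

```python
from typing import Dict, List


def pick_keywords(scores: Dict[str, float], tol: float = 1e-9) -> List[str]:
    """Return every candidate index item whose similarity equals the maximum."""
    if not scores:
        return []
    best = max(scores.values())
    return [item for item, score in scores.items() if abs(score - best) <= tol]


print(pick_keywords({"item A": 4.5, "item B": 2.5}))
# ['item A']
print(pick_keywords({"item A": 3.0, "item B": 3.0, "item C": 1.0}))
# ['item A', 'item B']
```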
In addition, an embodiment of the present invention further provides a computer-readable storage medium, where a keyword determination program based on semantic recognition is stored, and when executed by a processor, the keyword determination program based on semantic recognition implements the operations of the keyword determination method based on semantic recognition as described above.
The specific embodiment of the computer-readable storage medium of the present invention is substantially the same as the embodiments of the keyword determination method based on semantic recognition, and is not repeated herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (10)
1. A keyword determination method based on semantic recognition is characterized by comprising the following steps:
acquiring a retrieval sentence input by a user, segmenting the retrieval sentence, and extracting a feature vector of each word after segmentation;
inputting the feature vectors into trained multi-class perceptrons to obtain corresponding word marking results, and obtaining corresponding search terms according to the word marking results;
inputting the search terms into a preset index library for query to obtain corresponding candidate index items;
determining the reverse file frequency of the search term in a preset index library according to the candidate index item;
inputting the reverse file frequency, the search terms and the candidate index items into a preset similarity algorithm, determining similarity numerical values of the candidate index items and the corresponding search terms, and determining keywords according to the similarity numerical values.
2. The method for determining keywords based on semantic recognition according to claim 1, wherein the multi-class perceptron includes a plurality of training sentences, and after the step of extracting feature vectors of the words after word segmentation, the method further comprises:
inputting the training sentences into a preset feature module to extract training feature vectors of the training sentences;
and taking the training feature vector of the training sentence as a training sample of the multi-class perceptron to obtain the multi-class perceptron after training.
3. The method for determining keywords based on semantic recognition according to claim 1, wherein the step of inputting the feature vectors into trained multi-class perceptrons to obtain corresponding word annotation results comprises:
inputting the feature vectors into trained multi-class perceptrons to obtain a labeling position corresponding to each feature vector;
and labeling each feature vector by using preset word-forming position information at a labeling position corresponding to each feature vector to obtain a corresponding word labeling result.
4. The method for determining keywords based on semantic recognition according to claim 3, wherein the step of obtaining corresponding search terms according to the word annotation result comprises:
segmenting the search sentences according to the word formation position information to obtain a corresponding search word set;
and inputting the search word set into a preset part-of-speech tagging algorithm, determining the part of speech of each word in the search word set, and determining the word with the part of speech being the preset search part of speech as the search word.
5. The keyword determination method based on semantic recognition according to claim 1, wherein the index library stores a plurality of index items and corresponding core words, and the step of inputting the search word into a preset index library for query to obtain corresponding candidate index items comprises:
inputting the search word into a preset index library, and determining a core word corresponding to the search word in the index library;
and taking the index item corresponding to the core word in the index library as the candidate index item.
6. The method for determining keywords based on semantic recognition according to claim 1, wherein the step of determining the reverse document frequency of the search term in a preset index library according to the candidate index items comprises:
determining the number of the candidate index items and the number of all index items in a preset index library;
and dividing the number of the candidate index items by the number of all the index items, and taking the logarithm of the obtained quotient to obtain the reverse file frequency corresponding to the search term.
7. The method for determining keywords based on semantic recognition according to claim 1, wherein the step of inputting the reverse document frequency, the search term and the candidate index item into a preset similarity algorithm to obtain a corresponding similarity value comprises:
determining the number of search terms contained in the candidate index item, and taking the number as the number of the search terms;
and calculating to obtain a similarity numerical value of the candidate index item according to the number of the search terms and the reverse file frequency.
8. The method for keyword determination based on semantic recognition according to claim 7, wherein the step of determining keywords according to the similarity value comprises:
and determining the similarity value of each candidate index item, and determining the candidate index item with the highest similarity value as the keyword.
9. An apparatus, characterized in that the apparatus comprises: memory, a processor and a keyword determination program based on semantic recognition stored on the memory and executable on the processor, the keyword determination program based on semantic recognition being configured to implement the steps of the keyword determination method based on semantic recognition according to any of claims 1 to 8.
10. A storage medium, characterized in that the storage medium has stored thereon a keyword determination program based on semantic recognition, which when executed by a processor implements the steps of the keyword determination method based on semantic recognition according to any one of claims 1 to 8.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910884362.4A CN110795942B (en) | 2019-09-18 | 2019-09-18 | Keyword determination method and device based on semantic recognition and storage medium |
PCT/CN2019/117577 WO2021051557A1 (en) | 2019-09-18 | 2019-11-12 | Semantic recognition-based keyword determination method and apparatus, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910884362.4A CN110795942B (en) | 2019-09-18 | 2019-09-18 | Keyword determination method and device based on semantic recognition and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110795942A (en) | 2020-02-14 |
CN110795942B (en) | 2022-10-14 |
Family
ID=69427313
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910884362.4A Active CN110795942B (en) | 2019-09-18 | 2019-09-18 | Keyword determination method and device based on semantic recognition and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110795942B (en) |
WO (1) | WO2021051557A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111753069A (en) * | 2020-06-09 | 2020-10-09 | 北京小米松果电子有限公司 | Semantic retrieval method, device, equipment and storage medium |
CN114385890A (en) * | 2022-03-22 | 2022-04-22 | 深圳市世纪联想广告有限公司 | Internet public opinion monitoring system |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113239697B (en) * | 2021-06-01 | 2023-03-24 | 平安科技(深圳)有限公司 | Entity recognition model training method and device, computer equipment and storage medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040002849A1 (en) * | 2002-06-28 | 2004-01-01 | Ming Zhou | System and method for automatic retrieval of example sentences based upon weighted editing distance |
CN101510221A (en) * | 2009-02-17 | 2009-08-19 | 北京大学 | Enquiry statement analytical method and system for information retrieval |
CN105893410A (en) * | 2015-11-18 | 2016-08-24 | 乐视网信息技术(北京)股份有限公司 | Keyword extraction method and apparatus |
CN105989040A (en) * | 2015-02-03 | 2016-10-05 | 阿里巴巴集团控股有限公司 | Intelligent question-answer method, device and system |
CN107122413A (en) * | 2017-03-31 | 2017-09-01 | 北京奇艺世纪科技有限公司 | A kind of keyword extracting method and device based on graph model |
CN107608960A (en) * | 2017-09-08 | 2018-01-19 | 北京奇艺世纪科技有限公司 | A kind of method and apparatus for naming entity link |
CN108345672A (en) * | 2018-02-09 | 2018-07-31 | 平安科技(深圳)有限公司 | Intelligent response method, electronic device and storage medium |
CN108664473A (en) * | 2018-05-11 | 2018-10-16 | 平安科技(深圳)有限公司 | Recognition methods, electronic device and the readable storage medium storing program for executing of text key message |
CN109992978A (en) * | 2019-03-05 | 2019-07-09 | 腾讯科技(深圳)有限公司 | Transmission method, device and the storage medium of information |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9584343B2 (en) * | 2008-01-03 | 2017-02-28 | Yahoo! Inc. | Presentation of organized personal and public data using communication mediums |
CN104731797B (en) * | 2013-12-19 | 2018-09-18 | 北京新媒传信科技有限公司 | A kind of method and device of extraction keyword |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111753069A (en) * | 2020-06-09 | 2020-10-09 | 北京小米松果电子有限公司 | Semantic retrieval method, device, equipment and storage medium |
CN111753069B (en) * | 2020-06-09 | 2024-05-07 | 北京小米松果电子有限公司 | Semantic retrieval method, device, equipment and storage medium |
CN114385890A (en) * | 2022-03-22 | 2022-04-22 | 深圳市世纪联想广告有限公司 | Internet public opinion monitoring system |
CN114385890B (en) * | 2022-03-22 | 2022-05-20 | 深圳市世纪联想广告有限公司 | Internet public opinion monitoring system |
Also Published As
Publication number | Publication date |
---|---|
CN110795942B (en) | 2022-10-14 |
WO2021051557A1 (en) | 2021-03-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108829893B (en) | Method and device for determining video label, storage medium and terminal equipment | |
CN109933785B (en) | Method, apparatus, device and medium for entity association | |
WO2019184217A1 (en) | Hotspot event classification method and apparatus, and storage medium | |
CN110019732B (en) | Intelligent question answering method and related device | |
CN112069298A (en) | Human-computer interaction method, device and medium based on semantic web and intention recognition | |
CN111198948A (en) | Text classification correction method, device and equipment and computer readable storage medium | |
CN110795942B (en) | Keyword determination method and device based on semantic recognition and storage medium | |
CN108027814B (en) | Stop word recognition method and device | |
CN108573707B (en) | Method, device, equipment and medium for processing voice recognition result | |
CN111563384A (en) | Evaluation object identification method and device for E-commerce products and storage medium | |
CN113094478B (en) | Expression reply method, device, equipment and storage medium | |
CN110929498A (en) | Short text similarity calculation method and device and readable storage medium | |
CN111325033A (en) | Entity identification method, entity identification device, electronic equipment and computer readable storage medium | |
CN117235137B (en) | Professional information query method and device based on vector database | |
CN112487159B (en) | Search method, search device, and computer-readable storage medium | |
CN114003725A (en) | Information annotation model construction method and information annotation generation method | |
CN116644183B (en) | Text classification method, device and storage medium | |
CN114970554B (en) | Document checking method based on natural language processing | |
CN111782789A (en) | Intelligent question and answer method and system | |
CN115563515A (en) | Text similarity detection method, device and equipment and storage medium | |
CN114528851B (en) | Reply sentence determination method, reply sentence determination device, electronic equipment and storage medium | |
CN114647739B (en) | Entity chain finger method, device, electronic equipment and storage medium | |
CN113220824B (en) | Data retrieval method, device, equipment and storage medium | |
CN114118049A (en) | Information acquisition method and device, electronic equipment and storage medium | |
CN117931858B (en) | Data query method, device, computer equipment and storage medium |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |