CN112347339A - Search result processing method and device - Google Patents
Search result processing method and device
- Publication number
- CN112347339A CN202011345102.9A
- Authority
- CN
- China
- Prior art keywords
- search
- word
- search result
- context
- keyword
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9538—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application relates to the field of computer technologies, and in particular to a method and a device for processing search results. A search word input based on a text is obtained, and when the search word is determined to be an ambiguous word, a context expression vector of the search word is determined according to the context information corresponding to the search word in the text. Each search result corresponding to the search word is obtained, and a context expression vector of each search result is determined. The similarity between the search word and each search result is determined according to the context expression vector of the search word and the context expression vector of each search result, and the search results are filtered according to the determined similarity to obtain the filtered search results. The search results are thus filtered according to context information, which improves their accuracy.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for processing search results.
Background
With the development of internet search engines, search needs have become ubiquitous. For example, in a common reading scenario, when a user sees a word of interest and wants to search for it, the user can long-press the word on the screen. The system then determines the corresponding search word from the position of the long press and prompts the user to initiate a search operation; after the user clicks, the search results for the search word are obtained. In the related art, however, the selected word is used directly as the search word, and the returned search results are the same as those returned when the user actively enters the search word in a search box, without any further processing. The search results therefore cannot reflect the specific meaning of the search word in the current reading scenario, which makes them inaccurate.
Disclosure of Invention
The embodiment of the application provides a search result processing method and device, so as to improve the accuracy of a search result.
The embodiment of the application provides the following specific technical scheme:
One embodiment of the present application provides a search result processing method, including:
acquiring a search word input based on a text, and determining a context expression vector of the search word according to corresponding context information of the search word in the text when the search word is determined to be an ambiguous word;
obtaining each search result corresponding to the search word, and respectively determining a context expression vector of each search result;
respectively determining the similarity of the search word and each search result according to the context expression vector of the search word and the context expression vector of each search result;
and filtering the search results according to the determined similarity to obtain the filtered search results.
Another embodiment of the present application provides a search result processing apparatus, including:
an acquisition module, configured to acquire a search word input based on a text;
a first determining module, configured to determine a context expression vector of the search word according to context information corresponding to the search word in the text when the search word is determined to be an ambiguous word;
a second determining module, configured to obtain each search result corresponding to the search word and respectively determine a context expression vector of each search result;
a third determining module, configured to determine the similarity between the search word and each search result according to the context expression vector of the search word and the context expression vector of each search result;
and a filtering module, configured to filter the search results according to the determined similarity to obtain the filtered search results.
In another embodiment of the present application, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of any one of the above-mentioned search result processing methods when executing the program.
In another embodiment of the present application, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, implements the steps of any one of the above-mentioned search result processing methods.
Another embodiment of the present application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute any one of the search result processing methods provided in the various alternative implementations described above.
In the embodiments of the present application, a search word input based on a text is obtained. When the search word is determined to be an ambiguous word, a context expression vector of the search word is determined according to the context information corresponding to the search word in the text. Each search result corresponding to the search word is obtained, and a context expression vector of each search result is determined. The similarity between the search word and each search result is then determined according to the context expression vector of the search word and the context expression vector of each search result, and the search results are filtered according to the determined similarity to obtain the filtered search results. In this way, the obtained search results can be screened and filtered according to the context information of the search word and the context information of each search result, so that the filtered search results are more consistent with the context of the search word and the accuracy of the search results is improved.
Drawings
FIG. 1 is a schematic diagram of an application architecture of a search result processing method according to an embodiment of the present application;
FIG. 2 is a flowchart of a search result processing method according to an embodiment of the present application;
FIG. 3 is a diagram illustrating a word with ambiguous word identifiers crawled from a search engine in an embodiment of the present application;
FIG. 4 is a schematic diagram of a word vector model in an embodiment of the present application;
FIG. 5 is a schematic interface diagram of initiating a search process in an embodiment of the present application;
FIG. 6 is a diagram illustrating a search result returned in the related art;
FIG. 7 is a diagram illustrating a search result returned in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a search result processing apparatus according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For the purpose of facilitating an understanding of the embodiments of the present application, a brief introduction of several concepts is provided below:
Fingertip search word: in the embodiments of the present application, a fingertip search word is a search word invoked with a finger rather than one actively entered by the user in a search box. It typically arises when a word in a passage of text is selected with a finger and searched. For example, in a reading scenario, the user may long-press the position of a word; the corresponding search word is determined from the position of the long press, and the user clicks the search operation to obtain the search results for that search word.
Ambiguous word: in the embodiments of the present application, an ambiguous word is a word having at least two different meanings, and may also be referred to as a polysemous word. For example, "BD" has multiple interpretations, such as Business Development (BD) and Blu-ray Disc (BD), so "BD" is an ambiguous word.
Keyword: keywords are generally words that have substantive meaning and can represent the subject of content such as articles and search results. Various keyword extraction methods exist, and the method for extracting keywords is not limited in the embodiments of the present application.
Context expression vector: in the embodiments of the present application, the context expression vector is mainly used to reflect context information and can be determined according to the extracted keywords.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning, and the like.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering robots, knowledge graphs, and the like. For example, the embodiments of the present application mainly relate to natural language processing technology, which may be used to segment the text in which a search word is located, or a search result, into words and to extract keywords. As another example, contextual semantic analysis may be performed based on the keywords extracted for the search word and for the search results, so as to determine the context expression vectors of the search word and of the search results.
Along with the research and progress of artificial intelligence technology, the artificial intelligence technology develops research and application in a plurality of fields, for example, common intelligent home, intelligent wearable equipment, virtual assistant, intelligent sound box, intelligent marketing, unmanned driving, automatic driving, unmanned aerial vehicle, robot, intelligent medical treatment, intelligent customer service and the like.
The scheme provided by the embodiment of the application mainly relates to an artificial intelligence natural language processing technology, and is specifically explained by the following embodiment:
With the development of internet search engines, search needs have become ubiquitous. Besides actively typing a search word into a search box, a user can also initiate a search by selecting a search word in text content. For example, in a reading scenario on an intelligent terminal, when the user sees a word of interest and wants to search for it, the user can long-press the word on the screen; the system then determines the corresponding search word from the position of the long press and prompts the user to initiate a search operation, and the search results for the search word are obtained after the user clicks. In the related art, however, the selected word is used directly as the search word and passed to the search system without any processing, so the returned search results are the same as those returned when the user actively enters the search word in a search box. The search results cannot reflect the specific meaning of the search word in the current reading scenario. Especially when the search word is ambiguous, the returned search results are inaccurate and cannot meet the user's needs.
Therefore, to solve the above problems, an embodiment of the present application provides a search result processing method. When a search word is determined to be an ambiguous word, a context expression vector of the search word is determined according to the context information corresponding to the search word in the text, and a context expression vector is determined for each search result corresponding to the search word. The similarity between the search word and each search result is determined according to these context expression vectors, and the search results are then filtered according to the determined similarities. In this way, when the search word is an ambiguous word, the search results can be disambiguated and filtered according to the context in which the search word appears, so that the returned search results are semantically closer to the context or scene topic of the search word, and the accuracy of the search results is improved.
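The similarity-and-filter step described above can be sketched as follows. This is a minimal illustration only: the text here does not fix a similarity measure or a filtering rule, so the use of cosine similarity, the fixed `threshold`, and the names `cosine_similarity` and `filter_results` are all assumptions made for the example. The context expression vectors are taken as already computed.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_results(
    query_vec: Sequence[float],
    result_vecs: Dict[str, Sequence[float]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the source
) -> List[str]:
    """Keep the results whose context vector is close enough to the query's.

    Returns result identifiers ordered from most to least similar.
    """
    scored = [(cosine_similarity(query_vec, vec), rid) for rid, vec in result_vecs.items()]
    kept = [(score, rid) for score, rid in scored if score >= threshold]
    kept.sort(key=lambda item: item[0], reverse=True)
    return [rid for _, rid in kept]


# Toy vectors: the query "BD" appears in a business-oriented passage.
query = [0.9, 0.1]
results = {
    "business_development": [0.8, 0.2],
    "blu_ray_disc": [0.1, 0.9],
}
print(filter_results(query, results))
```

With these toy vectors only the business-related result survives the filter; in practice the threshold (or a top-k rule) would need to be tuned.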
Fig. 1 is a schematic diagram of an application architecture of a search result processing method according to an embodiment of the present application, including a terminal 100 and a server 200.
The terminal 100 may be any intelligent device such as a smartphone, a tablet computer, a portable personal computer, a desktop computer, a smart television, a smart robot, or a vehicle-mounted electronic device. Various applications (APPs) may be installed on the terminal 100. For example, when a user reading through an APP, such as reading a novel or an official account article, is interested in a word, the user may invoke the search function by long-pressing the word and click search, whereupon a search request based on the word is sent to the server 200. Further, in this embodiment of the present application, the terminal 100 may also send the context information of the word to the server 200. After receiving the search word, the server 200 determines the search results of the search word and, when determining that the search word is an ambiguous word, filters the search results according to the context information of the search word and the determined context information of each search result. The filtered search results are returned to the terminal 100, which displays them. The returned search results are thus closer to the context of the text in which the search word appears, which improves the accuracy of the search results and better meets the user's search needs.
The server 200 can provide various network services for the terminal 100, and for different applications, the server 200 may be regarded as a corresponding background server, where the server 200 may be a server, a server cluster composed of several servers, or a cloud computing center.
The terminal 100 and the server 200 may be connected via a network to communicate with each other. Optionally, the network uses standard communication techniques and/or protocols. The network is typically the Internet, but can be any network including, but not limited to, a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), or any combination of mobile, wired, or wireless networks, private networks, or virtual private networks. In some embodiments, data exchanged over the network is represented using techniques and/or formats including Hypertext Markup Language (HTML), Extensible Markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec). In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the data communication techniques described above.
It should be noted that, in the embodiments of the present application, the search result processing method may be executed by the server 200 or by the terminal 100, which is not limited in the present application. Taking execution by the server 200 as an example, the server 200 acquires the search word and the context information of the search word from the terminal 100, filters the search results based on the search result processing method in the embodiments of the present application, and returns only the filtered search results. Taking execution by the terminal 100 as another example, the terminal 100 sends a search request including the search word to the server 200, the server 200 returns the search results for the search word, and the terminal 100 performs the filtering; that is, the terminal 100 filters the search results based on the context information of the search word and the context information of each search result, and presents the filtered search results to the user.
It should be noted that the application architecture diagram in the embodiments of the present application is intended to illustrate the technical solution more clearly and does not limit it. The technical solution may be applied, without limitation, to reading on a terminal, to initiating a search flow by pressing with a finger, and to other scenarios in which a search flow is initiated for a word within context information. For other application architectures and business applications, the technical solution provided in the embodiments of the present application is equally applicable to similar problems.
In the embodiments of the present application, the application architecture shown in fig. 1 is taken as an example to schematically illustrate the search result processing method.
Based on the foregoing embodiment, FIG. 2 is a flowchart of a search result processing method in the embodiment of the present application. Taking application to a server as an example, the method specifically includes:
Step 200: obtain a search word input based on a text, and when the search word is determined to be an ambiguous word, determine a context expression vector of the search word according to the context information corresponding to the search word in the text.
The embodiments of the present application are mainly directed at scenarios in which context information is available and the search word is input based on a text rather than typed into a search box. For example, when reading an article, the user determines the search word by long-pressing and initiates a search flow, so the context information of the search word in the article can be determined.
When step 200 is executed, the method specifically includes:
S1: acquire the search word input based on the text.
S2: determine whether the search word is an ambiguous word.
Specifically, according to a preset ambiguous word database, if the search word is determined to be in the preset ambiguous word database, the search word is determined to be an ambiguous word, and if the search word is determined not to be in the preset ambiguous word database, the search word is determined not to be an ambiguous word.
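The database lookup in S2 amounts to a membership test, which can be sketched as below. The in-memory set, the lower-casing and whitespace normalization, and the function names are assumptions for illustration; the source does not say how the database is stored or whether words are normalized.

```python
from typing import Iterable, Set


def build_ambiguous_word_db(words: Iterable[str]) -> Set[str]:
    """Build an in-memory stand-in for the preset ambiguous word database."""
    # Normalizing case and whitespace is an assumption, not taken from the source.
    return {w.strip().lower() for w in words if w.strip()}


def is_ambiguous(search_word: str, db: Set[str]) -> bool:
    """A search word is treated as ambiguous exactly when it is in the database."""
    return search_word.strip().lower() in db


db = build_ambiguous_word_db(["BD", "Apple"])
print(is_ambiguous("BD", db))       # found in the database
print(is_ambiguous("weather", db))  # not found
```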
An ambiguous word is a word with at least two different meanings. For example, "The Legend of the Condor Heroes" is an ambiguous word: it can refer to a novel, a film adaptation, a television series, and so on. The content of the film and television works is a screenplay rather than a novel, so the word is ambiguous.
The preset ambiguous word database is constructed at least from words with ambiguous word identifiers crawled from preset search engines. The method for constructing the ambiguous word database is not limited in the embodiments of the present application, and other methods may also be used to collect ambiguous words, for example determining them based on a dictionary database.
For example, referring to FIG. 3, which is a schematic diagram of a word with an ambiguous word identifier crawled from a search engine in an embodiment of the present application, a crawler may be used offline to crawl information from the search engine, and words carrying an ambiguous word identifier field may be recognized with a HyperText Markup Language (HTML) parsing tool. Taking the word "BD" as an example, as shown in FIG. 3, the search engine describes "BD" as an ambiguous word; by recognizing this field, "BD" can be determined to be an ambiguous word and added to the ambiguous word database.
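Recognizing an identifier field in a crawled page could look roughly like the sketch below, which uses Python's standard `html.parser` module. The actual field used by the search engine is not given in the text, so the CSS class name `polysemy-marker` and the sample markup are purely hypothetical placeholders.

```python
from html.parser import HTMLParser


class AmbiguityMarkerParser(HTMLParser):
    """Sets `found` when any element carries the (hypothetical) marker class."""

    def __init__(self, marker_class: str = "polysemy-marker") -> None:
        super().__init__()
        self.marker_class = marker_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = dict(attrs).get("class") or ""
        if self.marker_class in classes.split():
            self.found = True


def has_ambiguity_marker(page_html: str, marker_class: str = "polysemy-marker") -> bool:
    """Return True if the crawled page contains the assumed identifier field."""
    parser = AmbiguityMarkerParser(marker_class)
    parser.feed(page_html)
    return parser.found


# Hypothetical crawled snippet for the entry "BD".
page = '<div class="entry"><span class="note polysemy-marker">BD is a polysemous word</span></div>'
if has_ambiguity_marker(page):
    print("add 'BD' to the ambiguous word database")
```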
Of course, other ways may also be used to determine whether a search term is an ambiguous term, and several other possible implementations are provided in the embodiments of the present application:
First embodiment: determine whether the search word contains a polyphonic character, and if so, determine that the search word is an ambiguous word.
Second embodiment: determine whether the search word contains a polysemous word, and if so, determine that the search word is an ambiguous word.
Third embodiment: train an ambiguous word recognition model in advance using a machine learning algorithm, and then recognize the search word based on the ambiguous word recognition model to determine whether the search word is an ambiguous word.
It should be noted that different implementations of determining ambiguous words may have different accuracies, and only a few reference examples are provided here, and different implementations of determining ambiguous words may be specifically adopted according to actual situations and requirements.
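As a small illustration of the first alternative listed above, the check below flags a search word that contains any character from a table of polyphonic characters. The four-character table is a tiny hand-picked sample used only for the example; a real table would be much larger, and the source does not specify one.

```python
# A tiny sample of Chinese characters with more than one reading
# (for example 行 can be read "xing" or "hang"). Illustrative only.
POLYPHONIC_CHARS = frozenset("行长重乐")


def contains_polyphonic_char(search_word: str) -> bool:
    """Return True if any character of the search word is in the sample table."""
    return any(ch in POLYPHONIC_CHARS for ch in search_word)


print(contains_polyphonic_char("银行"))  # contains 行
print(contains_polyphonic_char("天气"))  # no character from the table
```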
S3, when the search word is determined to be an ambiguous word, determining the context expression vector of the search word according to the corresponding context information of the search word in the text.
In the embodiments of the present application, the search results are screened and filtered through the context information of the search word when the search word is determined to be an ambiguous word. If the search word is not an ambiguous word, that is, it has a single specific meaning, the interpretation of the search word and its search results are generally the same regardless of the context information, so filtering by context information would bring little benefit while increasing the amount of computation.
Of course, in the embodiments of the present application, the search results may also be screened and filtered by using the context information without determining whether the search word is an ambiguous word, which is not limited in the embodiments of the present application.
Determining a context expression vector of a search word according to context information corresponding to the search word in a text, specifically comprising:
S3.1: extract keywords from the context information corresponding to the search word in the text.
Further, before the keywords are extracted, the context information is segmented into words. Chinese word segmentation methods can generally be classified into three categories: dictionary-based, statistics-based, and semantics-based word segmentation.
1) Dictionary-based word segmentation, also called string matching, requires a Chinese dictionary. The text to be segmented is matched against the entries in the dictionary; if an entry is found, the match succeeds and a word is recognized. Dictionary-based segmentation algorithms can be further subdivided; in general, dictionary-based segmentation is fast and simple to implement.
2) Statistics-based word segmentation mainly counts, in a corpus, the frequency with which adjacent characters co-occur as word combinations, and takes the segmentation with the highest probability as the final result. Common statistics-based methods include Hidden Markov Models (HMMs), Conditional Random Field (CRF) models, and the like.
3) Semantics-based word segmentation generally includes three components: word segmentation (used to obtain candidate words), syntax and semantics (using syntactic and semantic information to resolve word ambiguity), and overall control.
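To make the dictionary-based category concrete, the sketch below implements forward maximum matching, one well-known string-matching algorithm. It is offered as an assumed example of the category; the text does not name it as the algorithm used, and the toy dictionary and `max_len` value are placeholders.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Segment `text` greedily, always taking the longest dictionary entry
    that starts at the current position (falling back to one character)."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


toy_dictionary = {"搜索", "结果", "搜索结果", "处理"}
print(forward_max_match("搜索结果处理", toy_dictionary))
```

Because the match is greedy, the longer entry "搜索结果" wins over "搜索" followed by "结果"; this speed-for-accuracy trade-off is typical of dictionary-based methods.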
Specifically, different word segmentation methods may be adopted to segment the context information according to their characteristics and the actual requirements. A word segmentation tool may also be used; for example, the jieba word segmentation tool may be used to segment the context information corresponding to the search word in the text. jieba mainly builds, based on a prefix dictionary, a Directed Acyclic Graph (DAG) of all possible words in a sentence, and uses dynamic programming to find the maximum-probability path, that is, the most probable segmentation based on word frequency. Unknown words can be handled with a Hidden Markov Model (HMM) and the Viterbi algorithm. In addition, stop words, that is, meaningless words such as "some" and "are", are removed from the segmentation result based on a stop word dictionary, and the nouns in the remaining result are extracted to obtain the final noun set. Keywords that best represent the topic of the text in which the search word appears are then extracted from this noun set.
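The DAG-plus-dynamic-programming idea described above can be illustrated with a small pure-Python sketch. It is not the jieba implementation: the toy frequency table, the fallback frequency of 1 for unseen single characters, and the function names are assumptions, and the HMM handling of unknown words and the noun extraction step are left out. A simple removal of meaningless (stop) words is included at the end.

```python
import math
from typing import Dict, List, Set


def build_dag(sentence: str, freq: Dict[str, int]) -> Dict[int, List[int]]:
    """For each start index, list the inclusive end indices of dictionary words."""
    n = len(sentence)
    dag: Dict[int, List[int]] = {}
    for start in range(n):
        ends = [end for end in range(start, n) if sentence[start:end + 1] in freq]
        dag[start] = ends or [start]  # no dictionary word: fall back to one character
    return dag


def segment(sentence: str, freq: Dict[str, int]) -> List[str]:
    """Pick the path through the DAG with the highest product of word probabilities."""
    n = len(sentence)
    log_total = math.log(sum(freq.values()) or 1)
    dag = build_dag(sentence, freq)
    # best[i] = (best log-probability of sentence[i:], end index of its first word)
    best = {n: (0.0, n)}
    for start in range(n - 1, -1, -1):
        best[start] = max(
            (math.log(freq.get(sentence[start:end + 1], 0) or 1) - log_total
             + best[end + 1][0], end)
            for end in dag[start]
        )
    words, i = [], 0
    while i < n:
        end = best[i][1]
        words.append(sentence[i:end + 1])
        i = end + 1
    return words


def remove_stop_words(words: List[str], stop_words: Set[str]) -> List[str]:
    """Drop words listed in the stop word dictionary."""
    return [w for w in words if w not in stop_words]


toy_freq = {"搜索": 50, "结果": 40, "处理": 30, "搜索结果": 1}
print(segment("搜索结果处理", toy_freq))
print(remove_stop_words(["搜索", "的", "结果"], {"的"}))
```

Unlike a greedy longest match, the frequency-based path here prefers the two common words "搜索" and "结果" over the rare compound "搜索结果".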
The method for extracting keywords from the words is not limited in the embodiments of the present application. Examples include the Term Frequency-Inverse Document Frequency (TF-IDF) keyword extraction method, the topic model keyword extraction method, and the Rapid Automatic Keyword Extraction (RAKE) method.
Taking the TF-IDF keyword extraction method as an example, TF-IDF is a statistical method for evaluating how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency with which it appears across the corpus. The main idea of TF-IDF is that if a word appears with a high frequency (TF) in one document and rarely appears in the other documents of the corpus, the word is considered to have good discriminating power and is suitable for classification.
Generally, the importance degree of the TF-IDF evaluation word can be obtained by the following method: TF-IDF ═ TF ═ IDF, where TF ═ number of occurrences of a word in a document)/(total number of words in a document), IDF ═ log (total number of documents in a corpus/(number of documents containing the word)).
That is, TF reflects how frequently a word occurs in a document, and IDF reflects how common the word is: the more common a word is (that is, the more documents contain it), the lower its IDF value, and vice versa. The TF-IDF value obtained from TF and IDF therefore reflects the importance of the word in the document: the larger the TF-IDF value, the more important the word is to the document and the more likely it is to be a keyword.
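The TF-IDF formula above can be sketched as follows. This is a minimal illustration that follows the unsmoothed formula as stated; the function name and the toy corpus are assumptions, and the document being scored is assumed to be part of the corpus so that every word appears in at least one document.

```python
import math
from collections import Counter


def tf_idf(document, corpus):
    """Score each word of a tokenized document against a tokenized corpus.

    TF  = occurrences of the word in the document / total words in the document
    IDF = log(number of documents in the corpus / number of documents containing the word)
    """
    counts = Counter(document)
    total_words = len(document)
    scores = {}
    for word, count in counts.items():
        tf = count / total_words
        # Assumes the document belongs to the corpus, so this count is at least 1.
        documents_with_word = sum(1 for doc in corpus if word in doc)
        idf = math.log(len(corpus) / documents_with_word)
        scores[word] = tf * idf
    return scores


# Hypothetical toy corpus of already segmented documents.
corpus = [
    ["disc", "blu", "disc", "the"],
    ["the", "business"],
    ["the", "disc"],
]
scores = tf_idf(corpus[0], corpus)
```

In this toy corpus the word "the" occurs in every document, so its IDF, and therefore its TF-IDF value, is zero, which matches the observation that very common words are unlikely to be keywords.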
Further, to improve calculation efficiency, in the embodiment of the present application only the first N keywords may be selected for the subsequent calculation of the context expression vector. That is, before the word vectors of the extracted keywords are determined, the top preset number of keywords, sorted by importance, are screened out. The preset number may be set according to actual conditions and requirements, and is not limited in the embodiment of the present application.
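Screening out the first N keywords can be sketched as below. The function name and the example scores are hypothetical; the scores are assumed to be importance values such as TF-IDF.

```python
import heapq


def top_n_keywords(scores, n):
    """Return the n words with the highest importance score, highest first."""
    best = heapq.nlargest(n, scores.items(), key=lambda item: item[1])
    return [word for word, _ in best]


# Hypothetical importance scores (for example, TF-IDF values).
example_scores = {"blu": 0.27, "disc": 0.20, "the": 0.0, "player": 0.10}
selected = top_n_keywords(example_scores, 2)
```

Using `heapq.nlargest` avoids sorting the whole score table when N is small relative to the number of candidate words.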
And S3.2, respectively determining the word vectors of the extracted keywords.
The significance of a word vector is that it converts natural language into a vector that a computer can process. Word vectors can capture the context and semantics of words and measure the similarity between words, and they play an important role in many natural language processing fields such as text classification and sentiment analysis. In the embodiment of the present application, the word vector of each keyword is determined based on a trained word vector model, for example, a word2vec model or a Bidirectional Encoder Representations from Transformers (BERT) model, which is not limited in the embodiment of the present application.
Taking the word2vec model as an example, word2vec is a group of related models for generating word vectors. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. The network takes a word as input and predicts the words in neighboring positions; under the bag-of-words assumption in word2vec, the order of the words is unimportant. After training is completed, the word2vec model can be used to map each word to a vector that represents word-to-word relationships; this vector is the hidden layer of the neural network. For example, fig. 4 is a schematic diagram of a word vector model in an embodiment of the present application. The word2vec model may be understood as a three-layer neural network whose input is segmented text and whose output is a dense vector for each word. Specifically, as shown in fig. 4, the input is a one-hot vector; the hidden layer has no activation function, that is, it consists of linear units; and the output layer has the same dimension as the input layer and may use a Softmax regression function. After the whole model is trained, the parameters learned in training, such as the weight matrix of the hidden layer, are used to obtain the word vector of a word.
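Because the hidden layer is linear and the input is a one-hot vector, multiplying the input by the hidden-layer weight matrix simply selects one row of that matrix, and this row is the word vector. The following minimal sketch illustrates only this lookup; the vocabulary and the weight values are hypothetical, and no training is performed.

```python
def one_hot(index, size):
    """Return a one-hot vector of the given size with a 1 at the given index."""
    return [1 if i == index else 0 for i in range(size)]


def hidden_layer(x, weights):
    """Linear hidden layer without activation: h = x * W, with W of shape V x k."""
    k = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(k)]


# Hypothetical vocabulary (V = 3) and trained hidden-layer weight matrix (k = 2).
VOCAB = ["blu", "ray", "disc"]
WEIGHTS = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]

index = VOCAB.index("disc")
word_vector = hidden_layer(one_hot(index, len(VOCAB)), WEIGHTS)
# The product equals the row of the weight matrix for the input word.
```

In practice the matrix product is therefore replaced by a direct row lookup.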
Taking the BERT model as an example, the goal of the BERT model is to use large-scale unlabeled corpora to train a vector representation of text that contains rich semantic information, that is, a semantic representation of the text. This semantic representation is then fine-tuned for a specific NLP task and finally applied to that task. Specifically, the BERT model converts each word in the text into a one-dimensional vector by looking it up in a word vector table, and uses this vector as model input; the model output is the vector representation of each input word after full-text semantic information has been fused into it. In addition to the word vector, the input of the BERT model usually also includes a text vector and a position vector. The value of the text vector is learned automatically during model training; it describes the global semantic information of the text and is fused with the semantic information of individual characters or words. The position vector is needed because words appearing at different positions in a text carry different semantic information, so the BERT model adds a different vector to the word at each position to distinguish them. The BERT model can thus take the sum of the word vector, the text vector, and the position vector as model input.
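The way the three vectors are combined into the model input can be sketched as an element-wise sum. This is only an illustration of the summation step: the table contents are hypothetical, the tables are tiny, and the text vector is modeled here as a per-segment lookup, which is an assumption.

```python
def bert_input(token_ids, segment_ids, token_table, segment_table, position_table):
    """Sum word vector, text (segment) vector and position vector for each token."""
    inputs = []
    for position, (token, segment) in enumerate(zip(token_ids, segment_ids)):
        vector = [
            t + s + p
            for t, s, p in zip(token_table[token],
                               segment_table[segment],
                               position_table[position])
        ]
        inputs.append(vector)
    return inputs


# Hypothetical two-dimensional lookup tables.
TOKEN_TABLE = [[1.0, 2.0], [3.0, 4.0]]       # one row per vocabulary entry
SEGMENT_TABLE = [[0.5, 0.5]]                 # one row per text segment
POSITION_TABLE = [[0.0, 0.25], [0.25, 0.0]]  # one row per position

model_input = bert_input([1, 0], [0, 0], TOKEN_TABLE, SEGMENT_TABLE, POSITION_TABLE)
```

Each row of `model_input` has the same dimension as a single word vector, so the three sources of information share one input space.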
And S3.3, obtaining a context expression vector of the search word according to the word vector of each keyword.
Specifically, several possible implementations are provided in the examples of the present application:
1) Averaging the word vectors of the keywords to obtain the context expression vector of the search word.
For example, a word2vec model may represent each keyword as a k-dimensional word vector. The word vectors of all keywords in the context information corresponding to the search word in the text are averaged position by position, giving a k-dimensional vector that represents the whole context, that is, the context expression vector of the search word. For example, suppose there are three keywords whose word vectors are a = (x1, x2, …, xk), b = (y1, y2, …, yk), and c = (z1, z2, …, zk); the averaged context expression vector is then ((x1 + y1 + z1)/3, (x2 + y2 + z2)/3, …, (xk + yk + zk)/3).
2) Determining the occurrence frequency of each keyword, determining the weight of each keyword according to its occurrence frequency, and computing a weighted average of the word vectors of the keywords according to these weights to obtain the context expression vector of the search word.
In addition, after the word vector of each keyword is determined, ways other than plain averaging may be adopted to obtain the context expression vector of each search result, which is not limited in the embodiment of the present application. For example, in a possible implementation, the word vectors of the keywords in each search result may be weighted and averaged in combination with the position at which each keyword appears in the search result, or with the positional relationship between the keyword and the search word, to obtain the context expression vector of each search result.
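The averaging variants described above can be sketched together. The plain average and the frequency-based weights follow the description directly; the position-based weighting formula (weight 1/(1 + distance to the search word)) is purely an illustrative assumption, since no specific formula is given. All function names are hypothetical.

```python
from collections import Counter


def average(vectors):
    """Plain position-by-position mean of equally long vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def weighted_average(vectors, weights):
    """Weighted mean; the weights are normalized so that they sum to 1."""
    total = sum(weights)
    return [
        sum(w * x for w, x in zip(weights, column)) / total
        for column in zip(*vectors)
    ]


def frequency_weights(keywords, tokens):
    """Weight of each keyword = its number of occurrences in the tokenized text."""
    counts = Counter(tokens)
    return [counts[keyword] for keyword in keywords]


def position_weights(keyword_positions, search_position):
    """Assumed scheme: keywords closer to the search word receive a larger weight."""
    return [1.0 / (1 + abs(p - search_position)) for p in keyword_positions]


# Three keywords with two-dimensional word vectors (hypothetical values).
plain = average([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Two keywords, "disc" occurring three times and "player" once.
weights = frequency_weights(["disc", "player"], ["disc", "player", "disc", "disc"])
by_frequency = weighted_average([[1.0, 0.0], [0.0, 1.0]], weights)

# Keywords at token positions 4 and 8, search word at position 5.
by_position = weighted_average([[1.0, 0.0], [0.0, 1.0]], position_weights([4, 8], 5))
```

All three variants return a vector of the same dimension as the individual word vectors, so the later similarity calculation does not depend on which variant is chosen.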
Step 210: obtaining each search result corresponding to the search word, and determining the context expression vector of each search result.
When step 210 is executed, the method specifically includes:
1) Obtaining each search result corresponding to the search word.
In the embodiment of the application, each search result corresponding to the search word may be obtained with the result matching approach used in the related art for a general search scene, in which a search word is actively input in a search box; that is, matching considers only the search word itself. Because the search results obtained in this way may be inaccurate and may not match what the user actually wants to search for, in the embodiment of the present application the search results are disambiguated and filtered after they are obtained.
2) Determining the context expression vector of each search result.
Specifically, the keywords of each search result are extracted; the word vectors of the keywords extracted from each search result are determined; and the context expression vector of each search result is obtained according to the word vectors of the keywords extracted from that search result.
Several possible implementations of obtaining the context expression vector of each search result from the word vectors of its extracted keywords are provided in the embodiment of the application:
The first mode: the word vectors of the keywords extracted from each search result are averaged to obtain the context expression vector of that search result.
The second mode: the occurrence frequency of each keyword extracted from each search result is determined, the weight of each keyword is determined according to its occurrence frequency, and the word vectors of the keywords extracted from each search result are weighted and averaged according to these weights to obtain the context expression vector of that search result.
Further, before the word vectors of the keywords extracted from the search results are determined, the top preset number of keywords sorted by importance may be screened out, so as to improve calculation efficiency.
The manner of extracting the keywords from the search result, determining the word vectors, and calculating the context expression vectors is the same as the manner of determining the context expression vectors of the search words in step 200, and is not repeated here.
Step 220: determining the similarity between the search word and each search result according to the context expression vector of the search word and the context expression vector of each search result.
The similarity between the search word and each search result may be determined with a cosine distance calculation method, a Euclidean distance calculation method, or the like, which is not limited in the embodiment of the present application.
For example, suppose there are n search results, and the set of keywords extracted from each search result is denoted D_K; that is, the keyword set corresponding to search result D1 is D_K1, the keyword set corresponding to search result D2 is D_K2, …, and the keyword set corresponding to search result Dn is D_Kn. The context expression vector of each search result is calculated from the word vectors of the keywords in its keyword set, and the cosine distance between the context expression vector of the search word and the context expression vector of each search result is then calculated to obtain the similarity between the search word and each search result.
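The two similarity measures mentioned above can be sketched as follows. This is a minimal illustration with hypothetical function names; the cosine value is computed in its usual similarity form (1 for identical directions, 0 for orthogonal vectors), and returning 0 for a zero vector is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two vectors; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical context expression vectors.
search_word_vector = [0.9, 0.1]
result_vectors = [[0.8, 0.2], [0.1, 0.9]]
similarities = [cosine_similarity(search_word_vector, r) for r in result_vectors]
```

Note that the two measures point in opposite directions: a larger cosine value means more similar, whereas a larger Euclidean distance means less similar, so a threshold has to be interpreted accordingly.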
Step 230: filtering the search results according to the determined similarity to obtain the filtered search results.
Specifically, search results with a similarity less than a threshold are filtered out.
When the filtered search results are displayed to the user, the terminal may display them in descending order of their determined similarity.
That is to say, in the embodiment of the present application, similarity is calculated based on context information. If the similarity of a search result is determined to be smaller than the threshold, the search result may be considered inconsistent with the context and topic of the text where the search word is located (an ambiguous result), and it may be filtered out rather than displayed to the user. The threshold may be set according to actual needs and situations, which is not limited in the embodiment of the present application.
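Filtering by threshold and ordering the remaining results for display can be sketched as below. The function name, the result labels, and the threshold value are hypothetical.

```python
def filter_and_rank(results, similarities, threshold):
    """Drop results whose similarity is below the threshold, then sort high to low."""
    kept = [(result, score) for result, score in zip(results, similarities)
            if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [result for result, _ in kept]


# Hypothetical search results with their similarity to the search word.
ranked = filter_and_rank(["A", "B", "C"], [0.2, 0.9, 0.6], threshold=0.5)
```

A result whose similarity equals the threshold is kept here, since only results with a similarity less than the threshold are filtered out.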
In the embodiment of the application, a search word input based on a text is obtained. When the search word is determined to be an ambiguous word, the context expression vector of the search word is determined according to the context information corresponding to the search word in the text; the search results corresponding to the search word are obtained and their context expression vectors are determined; the similarity between the search word and each search result is determined; and the search results are filtered according to the determined similarity to obtain the filtered search results. In this way, when the search word is an ambiguous word, the search results can be filtered according to the context information of the search word and the context information of each search result, so that search results whose context does not match that of the search word are filtered out. The final filtered search results better match the context information of the search word and are closer to the topic of the text where the search word is located, which improves the accuracy of the final search results and the user's search experience.
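The overall flow summarized above can be sketched end to end. This is a simplified illustration under several assumptions: keywords are already extracted, word vectors come from a hypothetical lookup table, plain averaging and cosine similarity are used, and search results for a non-ambiguous search word are returned unfiltered.

```python
import math


def mean_vector(words, word_vectors):
    """Average the vectors of the words that have an entry in the table."""
    vectors = [word_vectors[w] for w in words if w in word_vectors]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is empty or zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def process_search_results(search_word, context_keywords, results,
                           ambiguous_words, word_vectors, threshold):
    """results maps each result title to the keywords extracted from that result."""
    if search_word not in ambiguous_words:
        return list(results)  # assumption: no filtering for non-ambiguous words
    query_vector = mean_vector(context_keywords, word_vectors)
    scored = [(title, cosine(query_vector, mean_vector(keywords, word_vectors)))
              for title, keywords in results.items()]
    kept = [(title, score) for title, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [title for title, _ in kept]


# Hypothetical two-dimensional word vectors and search results for "BD".
WORD_VECTORS = {"disc": [1.0, 0.0], "movie": [0.9, 0.1],
                "business": [0.0, 1.0], "partner": [0.1, 0.9]}
RESULTS = {"Blu-ray Disc": ["disc", "movie"],
           "Business Development": ["business", "partner"]}

filtered = process_search_results("BD", ["disc", "movie"], RESULTS,
                                  {"BD"}, WORD_VECTORS, threshold=0.5)
```

With a context about optical discs, only the result whose keywords point in the same direction as the context survives the filter.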
Based on the above embodiments, the search result processing method in the embodiments of the present application is described below from the product side, taking application to a reading scene as an example. A user reads an article on a terminal; during reading, if the user is interested in a certain word, a search flow may be initiated by a long press. For example, fig. 5 is a schematic diagram of an interface for initiating a search flow in the embodiments of the present application. The user long-presses the position of the word "BD"; the system determines the corresponding search word according to the position of the long press and prompts the user to initiate a "search-and-search" operation. When the user then clicks "search-and-search", a search request is sent to the server, and the search results returned by the server are obtained.
For example, fig. 6 is a schematic diagram of the search results returned in the related art. The user initiates a search for the word "BD" in the reading scene, and the context information of the article relates to optical discs, so it can be considered that the user actually wants search results for the "blu-ray disc" meaning. The search results returned in the related art, however, relate to the "business" meaning and are therefore inaccurate. The related art cannot perceive the correct semantic information of the article where the search word is located and does not disambiguate or filter the search results, so the returned search results exhibit semantic drift and do not meet the user's actual search requirement.
In the embodiment of the present application, when the search word is determined to be an ambiguous word, the context expression vector of the search word and the context expression vector of each search result are determined according to the context information of the search word and of each search result, the similarity between the search word and each search result is determined from these vectors, the search results are filtered according to the determined similarities, and the filtered search results are returned and displayed to the user. For example, fig. 7 is a schematic diagram of the search results returned in the embodiment of the present application. As can be seen in fig. 7, the returned search results better match the context information corresponding to the search word, and the accuracy of the search results is improved.
Based on the same inventive concept, the embodiment of the present application further provides a search result processing apparatus, which may be a hardware structure, a software module, or a hardware structure plus a software module. Based on the foregoing embodiments, referring to fig. 8, a search result processing apparatus in an embodiment of the present application specifically includes:
an obtaining module 80, configured to obtain a search term based on text input;
the first determining module 81 is configured to determine a context expression vector of a search word according to context information corresponding to the search word in a text when the search word is determined to be an ambiguous word;
a second determining module 82, configured to obtain search results corresponding to the search terms, and determine context expression vectors of the search results respectively;
a third determining module 83, configured to determine similarity between the search term and each search result according to the context expression vector of the search term and the context expression vector of each search result;
and a filtering module 84, configured to filter the search results according to the determined similarity, so as to obtain filtered search results.
Optionally, when determining that the search word is an ambiguous word, the first determining module 81 is specifically configured to:
determine, according to a preset ambiguous word database, that the search word is an ambiguous word if the search word is in the preset ambiguous word database, wherein an ambiguous word is a word with at least two different semantics, and the preset ambiguous word database is constructed at least from words carrying a polysemous word identifier crawled from each preset search engine.
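The database lookup can be sketched as a set membership test. The representation of crawled entries as (word, flag) pairs and both function names are assumptions made for illustration; the crawling itself is not shown.

```python
def build_ambiguous_word_database(crawled_entries):
    """Keep the words that carry a polysemous word identifier.

    crawled_entries is assumed to be an iterable of (word, has_identifier) pairs.
    """
    return {word for word, has_identifier in crawled_entries if has_identifier}


def is_ambiguous(search_word, database):
    """A search word is treated as ambiguous if it is in the preset database."""
    return search_word in database


# Hypothetical crawled entries.
database = build_ambiguous_word_database([("BD", True), ("disc", False), ("apple", True)])
```

Storing the words in a set keeps the membership check constant-time on average, which matters because the check runs on every search request.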
Optionally, when determining the context expression vector of the search term according to the context information corresponding to the search term in the text, the first determining module 81 is specifically configured to:
extracting keywords from the context information corresponding to the search word in the text;
respectively determining word vectors of the extracted keywords;
and obtaining a context expression vector of the search word according to the word vector of each keyword.
Optionally, when obtaining the context expression vector of the search word according to the word vector of each keyword, the first determining module 81 is specifically configured to: average the word vectors of the keywords to obtain the context expression vector of the search word.
Optionally, the first determining module 81 is specifically configured to:
and respectively determining the occurrence frequency of each keyword, respectively determining the weight of each keyword according to the occurrence frequency of each keyword, and respectively carrying out weighted average on the word vectors of each keyword according to the weight of each keyword to obtain the context expression vector of the search word.
Optionally, when determining the context expression vector of each search result, the second determining module 82 is specifically configured to:
extracting key words of each search result respectively;
respectively determining word vectors of the keywords extracted from the search results;
and obtaining the context expression vector of each search result according to the word vector of each keyword extracted from each search result.
Optionally, when obtaining the context expression vector of each search result according to the word vector of each keyword extracted from each search result, the second determining module 82 is specifically configured to: and respectively averaging the word vectors of the keywords extracted from the search results to obtain the context expression vectors of the search results.
Optionally, the second determining module 82 is specifically configured to: and respectively determining the occurrence frequency of each keyword extracted from each search result, respectively determining the weight of each keyword according to the occurrence frequency of each keyword, and respectively performing weighted average on word vectors of each keyword extracted from each search result according to the weight of each keyword extracted from each search result to obtain a context expression vector of each search result.
Optionally, the screening module 85 is further configured to screen out the top preset number of keywords sorted by importance.
Based on the above embodiments, referring to fig. 9, a schematic structural diagram of an electronic device in an embodiment of the present application is shown.
The present embodiment provides an electronic device, which may be a terminal or a server, and the electronic device is taken as an example in the present embodiment to be described, and the electronic device may include a processor 910 (CPU), a memory 920, an input device 930, an output device 940, and the like.
The processor 910 is configured to call the program instructions stored in the memory 920 and to execute any one of the search result processing methods of the embodiments of the present application according to the obtained program instructions.
Based on the above embodiments, in the embodiments of the present application, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the search result processing method in any of the above method embodiments.
Based on the above embodiments, in the embodiments of the present application, there is also provided a computer program product or a computer program, which includes computer instructions, and the computer instructions are stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the search result processing method in any of the above-described method embodiments.
Those of ordinary skill in the art will understand that all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and various other media capable of storing program code.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
Claims (12)
1. A method for processing search results, comprising:
acquiring a search word input based on a text, and determining a context expression vector of the search word according to corresponding context information of the search word in the text when the search word is determined to be an ambiguous word;
obtaining each search result corresponding to the search word, and respectively determining a context expression vector of each search result;
respectively determining the similarity of the search word and each search result according to the context expression vector of the search word and the context expression vector of each search result;
and filtering the search results according to the determined similarity to obtain the filtered search results.
2. The method of claim 1, wherein determining that the search term is an ambiguous term comprises:
according to a preset ambiguous word database, if the search word is determined to be in the preset ambiguous word database, determining that the search word is an ambiguous word, wherein the ambiguous word represents a word with at least two different semantics, and the preset ambiguous word database is constructed at least according to a word with a polysemous word identifier and crawled from each preset search engine.
3. The method of claim 1, wherein determining the context representation vector of the search term according to the context information corresponding to the search term in the text comprises:
extracting keywords from the context information corresponding to the search word in the text;
respectively determining word vectors of the extracted keywords;
and obtaining a context expression vector of the search word according to the word vector of each keyword.
4. The method of claim 3, wherein obtaining the context representation vector of the search term according to the term vector of each keyword specifically comprises:
averaging the word vectors of the keywords to obtain a context expression vector of the search word; or
and respectively determining the occurrence frequency of each keyword, respectively determining the weight of each keyword according to the occurrence frequency of each keyword, and respectively carrying out weighted average on the word vectors of each keyword according to the weight of each keyword to obtain the context expression vector of the search word.
5. The method of claim 1, wherein determining the context representation vector for each search result separately comprises:
respectively extracting key words of each search result;
respectively determining word vectors of the keywords extracted from the search results;
and obtaining the context expression vector of each search result according to the word vector of each keyword extracted from each search result.
6. The method according to claim 5, wherein obtaining the context expression vector of each search result according to the word vector of each keyword extracted from each search result specifically comprises:
respectively averaging the word vectors of the keywords extracted from each search result to obtain a context expression vector of each search result; or
and respectively determining the occurrence frequency of each keyword extracted from each search result, respectively determining the weight of each keyword according to the occurrence frequency of each keyword, and respectively performing weighted average on word vectors of each keyword extracted from each search result according to the weight of each keyword extracted from each search result to obtain a context expression vector of each search result.
7. The method of any one of claims 3-6, further comprising:
screening out the top preset number of keywords sorted by importance.
8. A search result processing apparatus, comprising:
the acquisition module is used for acquiring search terms input based on the text;
a first determining module, configured to determine a context representation vector of the search word according to context information corresponding to the search word in the text when the search word is determined to be an ambiguous word;
the second determining module is used for obtaining each search result corresponding to the search word and respectively determining the context expression vector of each search result;
a third determining module, configured to determine similarity between the search term and each search result according to the context representation vector of the search term and the context representation vector of each search result;
and the filtering module is used for filtering the search results according to the determined similarity to obtain the filtered search results.
9. The apparatus of claim 8, wherein when determining the context representation vector of the search term according to the context information corresponding to the search term in the text, the first determining module is specifically configured to:
extracting keywords from the context information corresponding to the search word in the text;
respectively determining word vectors of the extracted keywords;
and obtaining a context expression vector of the search word according to the word vector of each keyword.
10. The apparatus as claimed in claim 8, wherein, when determining the context representation vector of each search result separately, the second determining module is specifically configured to:
respectively extracting key words of each search result;
respectively determining word vectors of the keywords extracted from the search results;
and obtaining the context expression vector of each search result according to the word vector of each keyword extracted from each search result.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any of claims 1-7 are implemented when the program is executed by the processor.
12. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program when executed by a processor implements the steps of the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011345102.9A CN112347339A (en) | 2020-11-26 | 2020-11-26 | Search result processing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112347339A true CN112347339A (en) | 2021-02-09 |
Family
ID=74365831
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011345102.9A Pending CN112347339A (en) | 2020-11-26 | 2020-11-26 | Search result processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112347339A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113158091A (en) * | 2021-03-24 | 2021-07-23 | 北京奇艺世纪科技有限公司 | Recall method, apparatus, electronic device and storage medium |
CN113449091A (en) * | 2021-06-29 | 2021-09-28 | 重庆长安汽车股份有限公司 | Intelligent question and answer method, device, terminal and computer readable storage medium based on automobile field label |
CN113486253A (en) * | 2021-07-30 | 2021-10-08 | 北京字节跳动网络技术有限公司 | Search result display method, device, equipment and medium |
CN113486253B (en) * | 2021-07-30 | 2024-03-19 | 抖音视界有限公司 | Search result display method, device, equipment and medium |
CN116910278A (en) * | 2023-09-14 | 2023-10-20 | 深圳市智慧城市科技发展集团有限公司 | Data dictionary generation method, terminal device and storage medium |
CN118093982A (en) * | 2024-04-29 | 2024-05-28 | 福建中科星泰数据科技有限公司 | Internet mass data accurate searching method and system based on AI technology |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112182166B (en) | | Text matching method and device, electronic equipment and storage medium |
CN107133213B (en) | | Method and system for automatically extracting text abstract based on algorithm |
CN109635297B (en) | | Entity disambiguation method and device, computer device and computer storage medium |
CN112347339A (en) | | Search result processing method and device |
KR102491172B1 (en) | | Natural language question-answering system and learning method |
CN112800170A (en) | | Question matching method and device and question reply method and device |
CN112214593A (en) | | Question and answer processing method and device, electronic equipment and storage medium |
CN107315734B (en) | | A kind of method and system to be standardized based on time window and semantic variant word |
Mills et al. | | Graph-based methods for natural language processing and understanding—A survey and analysis |
CN110347790B (en) | | Text duplicate checking method, device and equipment based on attention mechanism and storage medium |
CN111414763A (en) | | Semantic disambiguation method, device, equipment and storage device for sign language calculation |
CN111198946A (en) | | Network news hotspot mining method and device |
CN113392265A (en) | | Multimedia processing method, device and equipment |
CN110674378A (en) | | Chinese semantic recognition method based on cosine similarity and minimum editing distance |
CN113392305A (en) | | Keyword extraction method and device, electronic equipment and computer storage medium |
CN111061939A (en) | | Scientific research academic news keyword matching recommendation method based on deep learning |
WO2012067586A1 (en) | | Database searching |
CN113822038A (en) | | Abstract generation method and related device |
CN113761104A (en) | | Method and device for detecting entity relationship in knowledge graph and electronic equipment |
CN112182159A (en) | | Personalized retrieval type conversation method and system based on semantic representation |
CN111046168A (en) | | Method, apparatus, electronic device, and medium for generating patent summary information |
CN114722774B (en) | | Data compression method, device, electronic equipment and storage medium |
CN114398903B (en) | | Intention recognition method, device, electronic equipment and storage medium |
CN114048319B (en) | | Humor text classification method, device, equipment and medium based on attention mechanism |
CN114595370A (en) | | Model training and sorting method and device, electronic equipment and storage medium |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40038746; Country of ref document: HK |
SE01 | Entry into force of request for substantive examination | |