CN112307314A - Method and device for generating fine selection abstract of search engine - Google Patents

Method and device for generating fine selection abstract of search engine

Info

Publication number
CN112307314A
Authority
CN
China
Prior art keywords
answer
question
query
query words
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910707322.2A
Other languages
Chinese (zh)
Inventor
张晨
李路云
周梁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201910707322.2A priority Critical patent/CN112307314A/en
Publication of CN112307314A publication Critical patent/CN112307314A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for generating a refined abstract for a search engine. The method comprises the following steps: identifying question-and-answer (Q&A) query terms from search log data; obtaining the search results corresponding to each Q&A query term; if the search results do not contain a document of a specified type, taking the Q&A query term and its search results as input data and outputting, according to a machine reading comprehension model, candidate answers for annotation corresponding to the Q&A query term; and obtaining, based on active learning, an annotated answer corresponding to the candidate answers, and using the annotated answer as the refined abstract of the corresponding Q&A query term in the search engine. For the identified Q&A query terms, this technical scheme combines machine reading comprehension with active learning to provide the search engine with a refined abstract in which the user can view the required content directly on the search result page. It balances query cost against answer accuracy and improves user experience and search quality.

Description

Method and device for generating fine selection abstract of search engine
Technical Field
The invention relates to the technical field of search engines, and in particular to a method and a device for generating a refined abstract for a search engine.
Background
A traditional search engine performs operations such as query understanding, user intent analysis, and web page ranking based on the user's query terms (query), and finally returns highly relevant web pages to the user as natural results. This approach, however, increasingly fails to meet user needs. For example, in a real search scenario a user may ask "why is sea water salty". Normally the user has to screen the returned web pages and click through to one of them to get more complete content. If the page turns out not to answer the question well, the user returns to the result page and repeats the screening and clicking a second or even a third time.
Disclosure of Invention
In view of the above problems, the present invention is proposed to provide a method and a device for generating a refined abstract for a search engine that overcome, or at least partially solve, those problems.
According to one aspect of the invention, a method for generating a refined abstract for a search engine is provided, comprising: identifying question-and-answer (Q&A) query terms from search log data; obtaining the search results corresponding to each Q&A query term; if the search results do not contain a document of a specified type, taking the Q&A query term and its search results as input data and outputting, according to a machine reading comprehension model, candidate answers for annotation corresponding to the Q&A query term; and obtaining, based on active learning, an annotated answer corresponding to the candidate answers, and using the annotated answer as the refined abstract of the corresponding Q&A query term in the search engine.
According to another aspect of the present invention, a device for generating a refined abstract for a search engine is provided, comprising: an identification unit adapted to identify Q&A query terms from search log data; a search unit adapted to obtain the search results corresponding to each Q&A query term; a candidate unit adapted to, if the search results do not contain a document of the specified type, take the Q&A query term and its search results as input data and output, according to a machine reading comprehension model, candidate answers for annotation corresponding to the Q&A query term; and a refined abstract generating unit adapted to obtain, based on active learning, an annotated answer corresponding to the candidate answers and to use the annotated answer as the refined abstract of the corresponding Q&A query term in the search engine.
According to still another aspect of the present invention, an electronic device is provided, comprising: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform any one of the methods described above.
According to a further aspect of the invention, a computer-readable storage medium is provided, wherein the computer-readable storage medium stores one or more programs which, when executed by a processor, implement any one of the methods described above.
In summary, the present invention identifies Q&A query terms from search log data; obtains the search results corresponding to each Q&A query term; if the search results do not contain a document of the specified type, takes the Q&A query term and its search results as input data and outputs candidate answers for annotation according to a machine reading comprehension model; and obtains an annotated answer for the candidate answers based on active learning, using the annotated answer as the refined abstract of the corresponding Q&A query term in the search engine. By combining machine reading comprehension with active learning for the identified Q&A query terms, this technical scheme provides the search engine with a refined abstract in which the user can view the required content directly on the search result page, balances query cost against answer accuracy, and improves user experience and search quality.
The foregoing is only an overview of the technical solution of the present invention. Specific embodiments are described below so that the technical means of the invention can be understood more clearly and so that the above and other objects, features, and advantages become more apparent.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 illustrates a flow diagram of a method for generating a refined abstract for a search engine according to one embodiment of the present invention;
FIG. 2 shows a schematic structural diagram of a device for generating a refined abstract for a search engine according to one embodiment of the present invention;
FIG. 3 shows a schematic structural diagram of an electronic device according to one embodiment of the invention;
FIG. 4 shows a schematic structural diagram of a computer-readable storage medium according to one embodiment of the invention;
FIG. 5a shows a search result page obtained for a Q&A query term in the prior art;
FIG. 5b shows a search result page obtained for a Q&A query term according to an embodiment of the present invention;
FIG. 6a shows a Q&A query term of the fact-description class;
FIG. 6b shows a Q&A query term of the fact-entity class;
FIG. 6c shows a Q&A query term of the opinion-description class;
FIG. 6d shows a Q&A query term of the opinion-entity class;
FIG. 7 shows a schematic diagram of a web page of a document of the specified type;
FIG. 8a shows a schematic diagram of a bolded interface;
FIG. 8b shows a schematic diagram of a fine-scale interface.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The design idea of the invention is that, unlike traditional search results, a "refined abstract" is provided for Q&A query terms, so that users can obtain an accurate answer on the search result page without clicking through to a web page. For example, the search engine can extract a definite answer to the user's question and place it prominently at the top of the returned results, which saves the user much of the time spent finding the answer and thereby improves user experience and search quality. Because in this process the search engine helps the user distill the answer, the result is referred to herein as a "refined abstract".
It should be noted that, although the technical solution of the present invention provides refined abstracts for a search engine, the generation process can be independent of the existing pipeline of a general search engine. This loose coupling improves adaptability to various search engines and makes integration easier.
FIG. 1 illustrates a flow diagram of a method for generating a refined abstract for a search engine according to one embodiment of the invention. As shown in FIG. 1, the method includes:
Step 110, identifying Q&A query terms from the search log data.
Every user of a search engine generates a large amount of search log data during daily searches, for example which query terms the user searched for and which sites the user clicked in the returned results. Statistics and research show that refined abstracts improve the search experience most for Q&A query terms, so the technical scheme of the invention mainly generates refined abstracts for Q&A query terms.
Step 120, obtaining the search results corresponding to each Q&A query term. This step may be accomplished, for example, by calling an existing search engine interface.
Step 130, if the search results do not contain a document of the specified type, taking the Q&A query term and its search results as input data, and outputting candidate answers for annotation corresponding to the Q&A query term according to the machine reading comprehension model.
For documents of the specified type, content can be extracted with templates or similar means because the type is known. For other web pages, the difficulty lies in obtaining the answer-related information that the Q&A query term requires. To address this, the embodiment of the present invention adopts MRC (Machine Reading Comprehension): the Q&A query term and its search results are used as input data, and candidate answers for annotation are output according to a machine reading comprehension model.
Since the accuracy of the answers may fall short of what users actually need (for example, verification showed that the model performs poorly at cold start and that the generated answers are not accurate enough to go online directly), the embodiment of the present invention, in step 140, obtains an annotated answer corresponding to the candidate answers based on active learning, and uses the annotated answer as the refined abstract of the corresponding Q&A query term in the search engine. Traditional annotation improves the model only slowly, and annotating refined abstracts places high demands on the annotator's text comprehension and knowledge, so annotation is difficult. The embodiment of the invention therefore improves accuracy further through active learning.
It can be seen that, for the identified Q&A query terms, the method shown in FIG. 1 combines machine reading comprehension with active learning to provide the search engine with a refined abstract in which the user can view the required content directly on the search result page. It balances query cost against answer accuracy and improves user experience and search quality.
FIG. 5a shows a prior-art search result page for a Q&A query term; FIG. 5b shows a search result page obtained for a Q&A query term according to an embodiment of the present invention. The first search result in FIG. 5b is a refined abstract, so the user can get most of the answer without clicking through to the source web page, or can click through to the original page for more information.
In an embodiment of the present invention, the method includes: applying preset types of processing to the search log data and extracting the query terms that meet the requirements; and taking those query terms as input data and outputting Q&A query terms according to the query term classification model.
Users search for a massive number of query terms, which vary in form and in popularity. For the sake of efficiency, the query terms that meet the requirements can be screened out first and then classified with a query term classification model obtained by machine learning, yielding the Q&A query terms. For example, "why is seawater blue" is a Q&A query term, whereas "1933" is not a Q&A query term but a year query term.
In an embodiment of the present invention, the method includes: sorting the query terms in the search log data by page views (PV) and extracting a number of query terms in descending order; and/or normalizing the query terms to remove punctuation and/or whitespace.
PV reflects the popularity of a query term. For a less popular query term, feedback on the effect of a refined abstract is slow and not clear enough even if one is provided, so for efficiency the more popular query terms can be extracted first. In addition, because query terms are not written consistently, removing punctuation and/or whitespace through normalization simplifies subsequent processing.
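The PV ranking and normalization described above can be sketched as follows. This is only a minimal illustration under stated assumptions: each log entry is taken to be one raw query string that counts as one page view, normalization is applied before counting, and the function names normalize_query and top_queries_by_pv are hypothetical, not taken from the patent.

```python
import unicodedata
from collections import Counter


def normalize_query(query):
    """Remove whitespace and punctuation (ASCII and full-width) from a query term."""
    return "".join(
        ch for ch in query
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def top_queries_by_pv(log_queries, top_n):
    """Count page views per normalized query term and return the top_n most viewed.

    Assumption: every element of log_queries is one search, i.e. one page view.
    """
    pv = Counter(normalize_query(q) for q in log_queries)
    pv.pop("", None)  # drop queries that were nothing but punctuation or spaces
    return [query for query, _ in pv.most_common(top_n)]
```

Normalizing before counting merges variants of the same query term so that their page views accumulate on a single entry; whether the original system orders the two operations this way is not stated in the text.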
In an embodiment of the present invention, the method further includes preparing training data for the query term classification model as follows: acquiring the search log data generated in a preset time period as sample data; counting, for each query term in the sample data, the total number of clicks and the number of clicks on Q&A sites, and calculating the Q&A click ratio of each query term over the preset time period; and taking the query terms whose Q&A click ratio is greater than a first threshold as positive examples and the query terms whose Q&A click ratio is lower than a second threshold as negative examples, to obtain the training data.
Here, whether a query term is a Q&A query term is identified from users' click behavior, and the identification is implemented with a query term classification model; this embodiment gives an example of generating the training data. Two thresholds are preset; the total number of clicks and the number of clicks on Q&A sites are counted for each query term in the sample data; and the Q&A click ratio of each query term over the preset time period is calculated. Intuitively, if a term is a Q&A query term, users will most likely click a Q&A site after searching for it; if it is not, users will rarely click a Q&A site. Positive and negative examples are generated on this basis. In a specific scenario, the first threshold may be 0.8 and the second threshold 0.1, values that performed well in verification.
In an embodiment of the present invention, the method further includes training the query term classification model as follows: dividing the training data into a training set, a validation set, and a test set according to a preset ratio, and training a textCNN-based model to obtain the query term classification model.
For example, the training data is divided into a training set, a validation set, and a test set in a ratio of 7:2:1, and the training set is then fed into a textCNN model for training. textCNN is a deep learning model that applies a CNN to text classification and performs well on that task.
In an embodiment of the present invention, the method includes: taking the query terms in the sample data that contain a specified word as positive examples. For example, a query term containing an interrogative word such as "how" can almost always be judged to be a Q&A query term, so query terms containing a specified word can be used directly as positive examples, and any query term containing a specified word can be removed from the negative examples derived from the click-ratio statistics.
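The click-ratio labeling rule above can be sketched as follows. This is a minimal illustration, not the patented implementation: the input layout (a mapping from query term to a pair of total clicks and Q&A-site clicks) and the function name label_by_qa_click_ratio are assumptions; the default thresholds 0.8 and 0.1 are the example values from the text.

```python
def label_by_qa_click_ratio(click_stats, first_threshold=0.8, second_threshold=0.1):
    """Split query terms into positive and negative training examples.

    click_stats maps query term -> (total_clicks, qa_site_clicks); this layout
    is an assumption made for the sketch.
    """
    positives, negatives = [], []
    for query, (total_clicks, qa_clicks) in click_stats.items():
        if total_clicks == 0:
            continue  # no clicks at all, so there is no evidence either way
        ratio = qa_clicks / total_clicks
        if ratio > first_threshold:
            positives.append(query)
        elif ratio < second_threshold:
            negatives.append(query)
        # ratios between the two thresholds are ambiguous and left unlabeled
    return positives, negatives
```

Leaving the middle band unlabeled keeps ambiguous query terms out of the training data, which is the practical effect of using two separate thresholds rather than one.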
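A 7:2:1 split of the training data might look like the sketch below. Shuffling before the split and the fixed random seed are assumptions added for reproducibility; the text only specifies the ratio. The function name split_dataset is hypothetical.

```python
import random


def split_dataset(examples, ratios=(7, 2, 1), seed=0):
    """Shuffle examples and split them into training, validation, and test sets."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # assumption: shuffle to avoid ordering bias
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test
```

With 100 examples this yields 70, 20, and 10 items; when the size is not divisible by the ratio total, the leftover items end up in the test set.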
In an embodiment of the present invention, in the method, Q&A sites are determined according to their URL patterns. For example, for the Q&A query term "how to make scrambled eggs with tomatoes", corresponding answers can be found on different Q&A sites, and the URLs of the answers are, for example, as follows:
Zhihu URL: https://www.zhihu.com/query/19576438
Baidu Zhidao URL: https://zhidao.baidu.com/query/572659771.html
360 Wenda URL: https://wenda.so.com/q/1532241908216013
Statistics show that the URLs of the questions and answers on each Q&A site follow a fixed URL pattern: the Zhihu pattern is https://www.zhihu.com/query/……, the Baidu Zhidao pattern is https://zhidao.baidu.com/query/…….html, and the 360 Wenda pattern is https://wenda.so.com/q/……. Q&A sites can therefore be determined from the URL pattern.
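Matching clicked URLs against these patterns can be sketched with regular expressions. The three patterns below are copied from the addresses as printed in the text; treating the variable part as a run of digits is an assumption drawn from the three example URLs, and the names QA_URL_PATTERNS and is_qa_site_url are hypothetical.

```python
import re

# One compiled pattern per Q&A site, following the address patterns given in the text.
# Assumption: the variable part of each address is a numeric identifier.
QA_URL_PATTERNS = [
    re.compile(r"^https://www\.zhihu\.com/query/\d+$"),
    re.compile(r"^https://zhidao\.baidu\.com/query/\d+\.html$"),
    re.compile(r"^https://wenda\.so\.com/q/\d+$"),
]


def is_qa_site_url(url):
    """Return True if the URL matches one of the known Q&A address patterns."""
    return any(pattern.match(url) for pattern in QA_URL_PATTERNS)
```

A click counter built on is_qa_site_url would supply the Q&A-site click counts needed for the click-ratio statistics.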
In an embodiment of the present invention, the method includes: preprocessing the identified Q&A query terms by removing duplicate Q&A query terms and removing Q&A query terms that already have annotated answers. Deduplication avoids generating the same refined abstract repeatedly and improves efficiency.
In an embodiment of the present invention, the method includes: classifying the identified Q&A query terms and filtering out those outside the preset types. This further removes Q&A query terms that are not suitable for refined abstract generation.
In an embodiment of the present invention, the method includes: classifying the identified Q&A query terms according to at least one of a topic classification model, a query term classification model, and an answer classification model. As the names suggest, the topic classification model classifies by the topic of the Q&A query term, the query term classification model classifies by the attributes of the query term, and the answer classification model classifies by the attributes of the answer to the Q&A query term.
In an embodiment of the present invention, in the above method, the topic types include at least one of: mobile phones and digital products, life, games, education and science, leisure and hobbies, culture and art, financial management, society and livelihood, sports, and regions; the query term types include a fact class and/or an opinion class; the answer types include a description class and/or an entity class.
The topic types listed above are those users search for most often, and generating refined abstracts for Q&A query terms of these types has a clear effect. The attribute analysis of Q&A query terms and answers focuses on four types of user queries, which are considered the most helpful for improving the user's search experience. These four types can be distinguished from two perspectives: by the nature of the query, queries are divided into a fact class and an opinion class; by the length of the answer, they are divided into an entity class and a description class. For example, FIG. 6a shows a Q&A query term of the fact-description class; FIG. 6b shows a Q&A query term of the fact-entity class; FIG. 6c shows a Q&A query term of the opinion-description class; FIG. 6d shows a Q&A query term of the opinion-entity class.
In an embodiment of the present invention, the method includes: if the answer type of the Q&A query term is the entity class, calling a machine reading comprehension model that does not include the ranking algorithm and outputting multiple candidate answers for annotation; otherwise, calling the complete machine reading comprehension model and outputting one candidate answer for annotation.
Because the answers to entity-class Q&A query terms are short, the annotation cost is low, and multiple candidate answers can be output for annotation. For other Q&A query terms, the complete machine reading comprehension model, which includes the ranking algorithm, can be called to output only the highest-scoring candidate answer, which reduces the annotation cost.
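The branching on answer type can be sketched as below. The internal structure of the machine reading comprehension model is not given in the text, so this sketch assumes it can be split into two callables: a hypothetical extract_spans stage that returns scored answer texts and a hypothetical rank stage that orders them best first. Both names, and the "entity" label string, are illustrative assumptions.

```python
def candidate_answers(query, documents, answer_type, extract_spans, rank):
    """Return the candidate answers to hand to annotators.

    extract_spans(query, documents) -> list of (answer_text, score)  [hypothetical]
    rank(query, scored_answers)     -> same list, best answer first  [hypothetical]
    """
    scored = extract_spans(query, documents)
    if answer_type == "entity":
        # Short answers are cheap to annotate: skip ranking, return all candidates.
        return [text for text, _ in scored]
    # Otherwise run the full pipeline and keep only the highest-scoring answer.
    ranked = rank(query, scored)
    return [ranked[0][0]] if ranked else []
```

Passing the two stages in as parameters keeps the sketch independent of any particular model library.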
Before the machine reading comprehension model is called to generate candidate answers for annotation, the data format is adjusted: this step builds the input format the model requires, mainly from the data obtained in the previous steps. A specific example follows:
Model input: a query and several documents, as follows:
{ "query": "why the sky is blue",
the sunlight consists of seven kinds of light including red, orange, yellow, green, cyan, blue and purple. The seven lights have shorter wavelengths of cyan, blue and violet, and are easily scattered by air molecules and dust. Atmospheric molecules and atmospheric dust have a much higher scattering power for blue light of shorter wavelength than for photons of longer wavelength. ",
"shorter wavelength light, such as violet and blue, is more readily absorbed by air molecules than longer wavelength light (i.e., red, orange and yellow bands of the spectrum). The air molecules then radiate violet and blue light in different directions, saturating the sky. "
……
]
}
Model output: the span of the answer within the document is placed in the answer_spans field.
{ "query": "why the sky is blue",
the sunlight consists of seven kinds of light including red, orange, yellow, green, cyan, blue and purple. The seven lights have shorter wavelengths of cyan, blue and violet, and are easily scattered by air molecules and dust. Atmospheric molecules and atmospheric dust have a much higher scattering power for blue light of shorter wavelength than for photons of longer wavelength. ",
"shorter wavelength light, such as violet and blue, is more readily absorbed by air molecules than longer wavelength light (i.e., red, orange and yellow bands of the spectrum). The air molecules then radiate violet and blue light in different directions, saturating the sky. "
……
],
"answer_spans": [44, 77]
}
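Reading the answer back out of the model output can be sketched as follows. Several points are assumptions, because the example does not state them: the key holding the document list is called documents here purely for illustration, the two numbers in answer_spans are treated as character offsets with an inclusive end, and the span is taken to refer to the first document unless another index is given.

```python
def extract_answer(model_output, doc_index=0):
    """Slice the answer text out of a document using the answer_spans field.

    Assumptions: the "documents" key name is hypothetical; answer_spans holds
    [start, end] character offsets, with end inclusive.
    """
    start, end = model_output["answer_spans"]
    document = model_output["documents"][doc_index]
    return document[start:end + 1]
```

If the real model reports token offsets rather than character offsets, the same slicing would apply to the tokenized document instead of the raw string.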
In an embodiment of the present invention, in the above method, the topic classification model is a multi-class text classification model, and the method further includes preparing training data for the topic classification model as follows: acquiring the search log data generated in a preset time period as sample data; counting the click ratio of each query term in the sample data on sites of the different topic types; counting the page views of each query term in the sample data and separating out the high-frequency query terms according to page views; and taking the topic type of the sites with the highest click ratio for a high-frequency query term as the topic type of that query term.
For example, a user who searches for "how is the iPhone XS configured" often clicks mobile and digital sites in the search results, such as Zhongguancun Online, so the topic type of a query term can be determined from the click ratio. Verification showed, however, that this works well for high-frequency terms, that is, query terms with a high page view count (PV), and somewhat less well for medium- and low-frequency terms. The invention therefore also gives examples of preparing training data for medium- and low-frequency terms.
In an embodiment of the present invention, in the method, the step of preparing the training data of the topic classification model further includes: separating out the medium-frequency query terms from the query terms of the sample data according to page views; and using the high-frequency query terms whose topic types have been determined as a training set to train a Support Vector Machine (SVM) model, and determining the topic types of the medium-frequency query terms with the trained SVM model. An SVM is a generalized linear classifier that performs binary classification of data by supervised learning; its decision boundary is the maximum-margin hyperplane solved over the training samples. Verification showed that this method classifies medium-frequency terms with higher accuracy.
In an embodiment of the present invention, in the method, the step of preparing the training data of the topic classification model further includes: separating out the low-frequency query terms from the query terms of the sample data according to page views; and determining the topic type of each low-frequency query term according to its syntactic dependency tree. Syntactic dependency parsing (DP) reveals the syntactic structure of a linguistic unit by analyzing the dependency relationships between its components. Intuitively, dependency parsing identifies grammatical components in a sentence such as subject, predicate, and object, or attributive, adverbial, and complement, and analyzes the relationships between them, usually producing a tree structure. Verification showed that this method classifies low-frequency terms with higher accuracy.
In one embodiment, query terms with a daily average PV greater than 50 are high-frequency terms, those between 5 and 50 are medium-frequency terms, and those below 5 are low-frequency terms.
In an embodiment of the present invention, in the method, the query term classification model and the answer classification model are both multi-class text classification models; the query term classification model is trained on features describing the nature of the query terms, and the answer classification model is trained on features describing the length of the answers. Following the example above, fact-class query terms tend to have a single consistent answer, while opinion-class query terms tend to have many different answers; answers to entity-class query terms are short, and answers to description-class query terms are long.
In an embodiment of the present invention, in the method, identifying Q&A query terms from the search log data further includes: filtering the identified Q&A query terms along several specified dimensions to obtain the filtered Q&A query terms. Specifically, in an embodiment of the present invention, filtering the identified Q&A query terms along several specified dimensions includes: filtering each specified dimension with a corresponding text classification model, where each text classification model is trained based on SVM and/or fastText. Different text classification models can be selected according to the type of dimension. The specified dimensions may cover Q&A query terms that do not comply with relevant laws and regulations, sensitive query terms, and so on; Q&A query terms that involve persons and Q&A query terms that are not fluent can also be filtered out.
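The frequency buckets and the click-ratio rule for high-frequency terms can be sketched as follows. The thresholds 50 and 5 come from the embodiment above; treating the boundary values 5 and 50 as medium frequency, the input layout, and the function names frequency_bucket and topic_from_clicks are assumptions made for illustration.

```python
def frequency_bucket(daily_avg_pv, high=50, low=5):
    """Classify a query term as high, medium, or low frequency by daily average PV."""
    if daily_avg_pv > high:
        return "high"
    if daily_avg_pv >= low:  # assumption: the boundary values count as medium
        return "medium"
    return "low"


def topic_from_clicks(topic_clicks):
    """Pick the topic type whose sites received the largest share of clicks.

    topic_clicks maps topic type -> number of clicks on sites of that topic
    for one query term (an assumed layout). Returns None without click data.
    """
    total = sum(topic_clicks.values())
    if total == 0:
        return None
    return max(topic_clicks, key=lambda topic: topic_clicks[topic] / total)
```

In the scheme described above, only terms in the "high" bucket would be labeled with topic_from_clicks; medium-frequency terms go to the trained SVM model and low-frequency terms to the dependency-tree method.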
In an embodiment of the present invention, in the method, obtaining a search result corresponding to a question-answer type query term includes: calling a search engine interface, and acquiring a first number of natural search results corresponding to the question-answer query words according to the search result sequence; and adjusting the natural search results according to a preset algorithm, and selecting a second number of natural search results as search results corresponding to the corresponding question and answer query words.
In this technical scheme, the search results can be obtained by calling the interface of an existing search engine. Because a search engine does not necessarily order its results by semantic relevance and similar factors, a number of higher-ranked natural search results are screened out first and then adjusted with a preset algorithm.
In addition, many search engines rewrite the query word, for example rewriting pinyin into Chinese characters, so the finally generated selected abstract may correspond to the rewritten query word.
In an embodiment of the present invention, in the method, adjusting the natural search results according to a preset algorithm includes at least one of: filtering out, from the natural search results, document sites whose web page content cannot be acquired; and promoting the ranking of sites whose confidence level is higher than a first preset value.
Document sites whose web page content cannot be acquired are filtered out directly, because their content could not be captured in the subsequent steps. In addition, the ranking of authoritative sites may be promoted.
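The adjustment described above (drop unfetchable document sites, promote trusted sites, keep a second number of results) can be sketched as follows. The `Result` fields, the default values and the function name are all assumptions made for illustration; the text does not specify how confidence is computed or what the preset values are.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    url: str
    site: str
    fetchable: bool    # hypothetical flag: can the page content be acquired?
    confidence: float  # hypothetical site confidence in [0, 1]


def adjust_results(results: List[Result],
                   first_preset: float = 0.8,
                   second_number: int = 5) -> List[Result]:
    """Filter and reorder natural search results (illustrative only)."""
    # 1) Filter out document sites whose content cannot be acquired.
    kept = [r for r in results if r.fetchable]
    # 2) Promote sites above the confidence threshold. Python's sort is
    #    stable, so the engine's original order is preserved within each group.
    kept.sort(key=lambda r: r.confidence <= first_preset)
    # 3) Keep the second number of results.
    return kept[:second_number]
```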
In an embodiment of the present invention, in the method, obtaining a search result corresponding to the question-answer query word further includes: filtering the question-answer query words according to the natural search results. Specifically, in an embodiment of the present invention, in the method, filtering the question-answer query words according to the natural search results includes at least one of: filtering out question-answer query words whose natural search results contain an application box; filtering out question-answer query words for which the titles of the natural search results contain illegal words; and filtering out question-answer query words that, according to semantic matching, lack high-quality natural search results.
The application box (onebox) is a mature way of displaying search results and can already provide ideal search results for related query words such as stocks, movies, TV dramas and weather, so no selected abstract needs to be generated for these query words. Because the machine reading understanding model is used for subsequent processing, if semantic matching shows that high-quality natural search results are lacking, the expected answer cannot be obtained even by machine reading understanding, so these question-answer query words are filtered out. Illegal words may be, for example, words that do not comply with the relevant laws and regulations.
In an embodiment of the present invention, the above method further includes: if the search result contains a document of the specified type, extracting the labeled candidate answer directly from the document of the specified type. Specifically, in an embodiment of the present invention, in the method, the document of the specified type is an html document containing a plurality of pieces of step description information, and extracting the labeled candidate answer directly from the document of the specified type includes: parsing the html document, extracting the pieces of step description information according to field matching, and splicing them to obtain the labeled candidate answer.
FIG. 7 shows a schematic view of a web page of a document of the specified type. As shown in FIG. 7, the page contains several steps, and the labeled candidate answer can be obtained by extracting these steps. One way is to use the Beautiful Soup library (Beautiful Soup is a Python package whose functions include parsing html and xml documents and repairing documents with errors such as unclosed tags, often called "tag soup"): the html is parsed into objects for processing, the relevant fields are matched to locate the content following each step in the web page, and the pieces are then spliced to form the labeled candidate answer of the question-answer query word.
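A minimal sketch of the step extraction follows. The text names Beautiful Soup; to stay dependency-free, this sketch substitutes the standard-library `html.parser` module, which does not repair tag soup. The `step` class name used for field matching, the numbering of the spliced steps and all identifiers are assumptions, since the actual page markup of FIG. 7 is not given.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class StepExtractor(HTMLParser):
    """Collect the text of every element whose class list contains step_class."""

    def __init__(self, step_class: str = "step"):
        super().__init__()
        self.step_class = step_class  # hypothetical field used for matching
        self.steps = []
        self._depth = 0               # > 0 while inside a step element
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif self.step_class in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Join the text fragments and collapse runs of whitespace.
            text = " ".join(" ".join(self._buf).split())
            if text:
                self.steps.append(text)

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


def extract_candidate_answer(html: str, step_class: str = "step") -> str:
    """Splice the extracted step descriptions into one candidate answer."""
    parser = StepExtractor(step_class)
    parser.feed(html)
    parser.close()
    return "\n".join(f"{i}. {s}" for i, s in enumerate(parser.steps, 1))
```

With Beautiful Soup the same matching would typically be a single selector query; the hand-written parser here only keeps the example runnable without third-party packages.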
In an embodiment of the present invention, in the method, taking the question-answer query words and the corresponding search results as input data includes: calculating semantic relevance scores between the question-answer query word and the web page titles of the search results according to a semantic matching model, and ranking the search results according to the semantic relevance scores.
In order to further improve the efficiency and accuracy of answer generation, before answer generation is actually performed, a semantic matching model scores the question-answer query word against the web page titles of the candidate documents. The candidate documents are reordered according to these scores, so that highly relevant candidate documents are weighted up and placed nearer the front, and the candidate documents thus contain the correct answer as far as possible.
In an embodiment of the present invention, the above method further includes: if every calculated semantic relevance score is lower than a second preset value, the step of outputting the labeled candidate answer corresponding to the question-answer query word according to the machine reading understanding model and the subsequent steps are not executed, and the selected abstract of the question-answer query word is instead captured directly from a question-answer site.
If the semantic matching scores between a question-answer query word and all of its candidate documents are low, the query word has no high-quality documents; answers generated directly from these documents would very probably be of poor quality, so the subsequent answer generation steps are not performed and the "knowledge site capture" process is entered instead.
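The reordering and the fallback decision can be sketched together. In the sketch below, `toy_score` is a word-overlap stand-in for the BERT-based semantic matching model, and the default threshold, the dictionary layout of a result and the function names are assumptions for illustration only.

```python
from typing import Callable, Dict, List, Optional


def toy_score(query: str, title: str) -> float:
    """Stand-in scorer (Jaccard word overlap); NOT the patent's BERT model."""
    q, t = set(query.lower().split()), set(title.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def rerank_or_fallback(query: str,
                       results: List[Dict[str, str]],
                       score_fn: Callable[[str, str], float] = toy_score,
                       second_preset: float = 0.5) -> Optional[List[Dict[str, str]]]:
    """Return results sorted by title relevance, or None to signal fallback.

    None means every score is below the second preset value, so the caller
    should skip machine reading comprehension and capture the abstract from
    a question-answer site instead.
    """
    scored = sorted(((score_fn(query, r["title"]), r) for r in results),
                    key=lambda pair: pair[0], reverse=True)
    if not scored or all(score < second_preset for score, _ in scored):
        return None
    return [r for _, r in scored]
```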
In an embodiment of the present invention, the above method further includes training the semantic matching model through the following steps: obtaining pairs of question-answer query words and web page titles labeled as positive examples and negative examples, and constructing these pairs into training data through a processor dictionary; and fine-tuning on the basis of the BERT pre-trained model and the training data to obtain the semantic matching model.
1) Positive examples and negative examples are obtained through a separate labeling process, for example:
Positive example: why is the sky blue; why is the sky blue, then
Negative example: why is the sky blue; why is sea water salty
2) A processor dictionary is constructed, the data processing flow is built, and the data is put into the format required by the model, for example:
Positive example: 1\twhy is the sky blue\twhy is the sky blue, then
Negative example: 0\twhy is the sky blue\twhy is sea water salty
3) Fine-tuning is carried out: run_classifier.py is run for model training.
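The tab-separated format in step 2) can be produced with a few lines of code. This is a sketch under assumptions: the column order (label, query word, title) is read off the example rows above, the helper name is hypothetical, and the actual processor class registered for run_classifier.py is not described in the text.

```python
from typing import Iterable, List, Tuple


def to_tsv_line(label: int, query: str, title: str) -> str:
    """Format one labeled pair as `label<TAB>query<TAB>title`.

    Whitespace inside each field (including stray tabs and newlines) is
    collapsed to single spaces so it cannot break the column layout.
    """
    def clean(text: str) -> str:
        return " ".join(text.split())

    return f"{label}\t{clean(query)}\t{clean(title)}"


def to_tsv(pairs: Iterable[Tuple[int, str, str]]) -> List[str]:
    """Format a batch of (label, query, title) pairs."""
    return [to_tsv_line(label, query, title) for label, query, title in pairs]
```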
In an embodiment of the present invention, the above method further includes: for question-answer query words whose PV is lower than a third preset value, capturing the selected abstract directly from a question-answer site.
Few users search for question-answer query words with low PV, so capturing the selected abstract from a question-answer site saves resources.
In an embodiment of the present invention, the above method further includes: generalizing the question-answer query words to obtain semantically similar query words; and taking the labeled answers as the selected abstracts of the semantically similar query words in the search engine.
On the one hand, semantically similar query words share the same answer, so the answer need not be generated again; on the other hand, to raise the recall of selected abstracts on the search engine, the query words need to be generalized. Hence "coverage extension", which relies on an accumulated stock of high-quality query-answer data (the query word-answer library, or Q-A library).
In an embodiment of the present invention, in the method, generalizing the question-answer query words to obtain semantically similar query words includes: mining candidate query words corresponding to the question-answer query words based on query word presentation, user click behavior and co-click behavior; and calculating semantic relevance scores between the question-answer query words and the candidate query words according to the semantic matching model, and taking the candidate query words whose semantic relevance score is higher than a fourth preset value as semantically similar query words.
For example, a batch of new queries that are semantically similar to the queries in the Q-A library is mined from massive search logs using information such as query presentation, user click behavior and co-click behavior, and serves as the candidate set for coverage extension. The BERT-based semantic matching model (whose training process is described in the foregoing embodiment) is then called to score the query-query pairs; the higher the score, the closer the semantics of the two queries. For example, queries with a score higher than 0.9 are selected as synonymous queries, and coverage extension succeeds for them.
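Coverage extension by score threshold might look like the sketch below. The mined candidate sets and the scoring function are passed in as inputs because the log mining and the BERT model are outside the scope of the example; all names are hypothetical, and 0.9 is the example value from the text.

```python
from typing import Callable, Dict, List


def extend_coverage(qa_library: Dict[str, str],
                    candidates: Dict[str, List[str]],
                    score_fn: Callable[[str, str], float],
                    fourth_preset: float = 0.9) -> Dict[str, str]:
    """Map semantically similar queries to existing answers.

    qa_library: query -> labeled answer (the Q-A library).
    candidates: library query -> new queries mined from the search logs.
    Returns new query -> reused answer for every pair scoring above the preset.
    """
    extended: Dict[str, str] = {}
    for seed, answer in qa_library.items():
        for cand in candidates.get(seed, []):
            if cand in qa_library:
                continue  # already has its own labeled answer
            if score_fn(seed, cand) > fourth_preset:
                extended[cand] = answer
    return extended
```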
In an embodiment of the present invention, in the method, generalizing the question-answer query terms to obtain query terms with similar semantics includes: performing vector representation on the query words, and determining candidate query words corresponding to the question-answer query words by calculating cosine similarity of each vector; and calculating semantic relevance scores of the question-answer query words and the candidate query words according to the semantic matching model, and if the highest semantic relevance score is larger than a fifth preset value, taking the candidate query word corresponding to the highest semantic relevance score as a query word with similar semantics.
This approach can be understood as a purely semantic approach, called indexing matching. To decide whether a new query can trigger a selected abstract, the semantically most similar query in the library is found in two stages: recall and matching. The recall stage mainly serves to screen out an initial candidate set: for example, a sentence2vector (sen2vec) model maps each query to a vector representation, and the 200 queries with the highest semantic similarity are then obtained from the library by calculating the cosine similarity between vectors. In the matching stage, each of these 200 queries is paired with the new query to form a query-query pair, the semantic matching model is called to score each pair, and the pair with the highest score is selected as the final synonym pair, provided that its score is higher than a certain threshold.
One specific example is as follows:
Recall:
1) Vector representation of the query: the sen2vec model is called to represent the input query as a vector.
2) Cosine similarity calculation: the cosine of the angle between two vectors in the vector space is used as the measure of the difference between the two queries.
3) Sorting: the query pairs are sorted by cosine similarity, and the top 200 by similarity are selected as the candidate set for extension triggering.
Matching:
1) The semantic matching model is called for the 200 query pairs in the candidate set, and the query pair with the highest semantic score is screened out.
2) Whether the score of this query pair is higher than a certain threshold is judged; if so, the pair is taken as the final extension triggering result.
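The recall and matching stages above can be sketched in plain Python. The query vectors are assumed to come from some sentence-embedding model (sen2vec in the text) and the scorer from the semantic matching model; neither is implemented here, and the function names and default threshold are illustrative assumptions.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recall(new_vec: Sequence[float],
           library_vecs: Dict[str, Sequence[float]],
           top_k: int = 200) -> List[str]:
    """Recall stage: the top_k library queries by cosine similarity."""
    ranked = sorted(library_vecs,
                    key=lambda q: cosine(new_vec, library_vecs[q]),
                    reverse=True)
    return ranked[:top_k]


def match(new_query: str,
          candidates: List[str],
          score_fn: Callable[[str, str], float],
          threshold: float = 0.9) -> Optional[str]:
    """Matching stage: best-scoring candidate, if it clears the threshold."""
    if not candidates:
        return None
    scores = {c: score_fn(new_query, c) for c in candidates}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None
```

A brute-force sort is adequate for a sketch; for a large Q-A library an approximate nearest-neighbour index would be the usual substitute for the recall stage.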
In an embodiment of the present invention, in the method, obtaining the labeled answer corresponding to the labeled candidate answer based on active learning includes: providing a fine-labeling interface and a coarse-labeling interface; displaying a single labeled candidate answer through the coarse-labeling interface, receiving returned correctness evaluation information, and determining the labeled answer according to the correctness evaluation information; and displaying a plurality of labeled candidate answers through the fine-labeling interface, and receiving a returned labeled answer.
Verification showed that, whether the answer is generated by extracting steps from a document of the specified type or by the machine reading understanding model, its accuracy is far below 95% and cannot meet the requirement for going online in the search engine. The generated answers are therefore labeled through active learning to improve accuracy. To maximize the labeling benefit and reduce the labeling cost while also improving the model, the solution applies the idea of active learning and sets up two labeling tasks: coarse labeling and fine labeling.
Firstly, coarse labeling judges whether a question-answer pair is correct, that is, whether the answer addresses the question well, and classifies the pair into one of the following four categories: answer correct, answer wrong, query wrong, and unable to judge. If the labeling result is that the answer is correct, the data can go online directly; if the answer is wrong, the machine reading understanding model has made a mistake on this question, and the model's input data for the query is sent to fine labeling, where the annotator circles the best answer from the query and the several documents, or fills in the best answer directly if no suitable answer can be selected from the documents. The finely labeled data can go online directly and can also serve as training samples for the machine reading understanding model, correcting the model's errors and improving the model as quickly as possible.
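The routing between coarse and fine labeling can be expressed as a small decision function. The four outcomes are assumed here to be answer correct, answer wrong, query wrong and unable to judge. The text only says what happens in the first two cases, so discarding the other two is an assumption of this sketch, as are all identifiers.

```python
from enum import Enum


class CoarseLabel(Enum):
    ANSWER_CORRECT = "answer_correct"
    ANSWER_WRONG = "answer_wrong"
    QUERY_WRONG = "query_wrong"
    CANNOT_JUDGE = "cannot_judge"


def route(label: CoarseLabel) -> str:
    """Decide the next step for a question-answer pair after coarse labeling."""
    if label is CoarseLabel.ANSWER_CORRECT:
        return "publish"        # data can go online directly
    if label is CoarseLabel.ANSWER_WRONG:
        return "fine_labeling"  # annotator picks or writes the best answer
    # Assumption: the text does not specify handling for the remaining cases.
    return "discard"
```

Pairs that come back from fine labeling would both go online and be appended to the training set of the machine reading understanding model, which is what makes the loop an active-learning one.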
Combining fine labeling with coarse labeling keeps the labeling cost in check, raises the rate at which data goes online, and helps the model iterate and improve quickly. The labeling pages of the coarse-labeling and fine-labeling platforms are shown below.
Fig. 8a shows a schematic diagram of a coarse interface and fig. 8b shows a schematic diagram of a fine interface.
In an embodiment of the present invention, in the above method, taking the labeled answer as the selected abstract of the corresponding question-answer query word in the search engine includes: saving the selected abstract in xml format. The data in xml format can be put online through a stable interface and finally configured into the database of the search engine.
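Saving selected abstracts as xml can be sketched with the standard-library `xml.etree.ElementTree` module. The element names (`selected_abstracts`, `item`, `query`, `answer`, `source`) are hypothetical; the text does not disclose the schema expected by the search engine's interface.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def abstracts_to_xml(entries: Iterable[Tuple[str, str, str]]) -> str:
    """Serialize (query, answer, source_url) triples to an xml string.

    ElementTree escapes special characters such as '<' and '&' in the text.
    """
    root = ET.Element("selected_abstracts")  # hypothetical schema
    for query, answer, source_url in entries:
        item = ET.SubElement(root, "item")
        ET.SubElement(item, "query").text = query
        ET.SubElement(item, "answer").text = answer
        ET.SubElement(item, "source").text = source_url
    return ET.tostring(root, encoding="unicode")
```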
Fig. 2 is a schematic structural diagram of an apparatus for generating a search engine refined abstract according to an embodiment of the present invention. As shown in fig. 2, the apparatus 200 for generating a search engine refined abstract includes:
the identifying unit 210 is adapted to identify question and answer query words from the search log data.
Each user of a search engine can generate a large amount of search log data during daily searches, for example, which query terms the user has searched for, which websites in which search results have been clicked after the search results are obtained, and so on. Through statistics and research, the selected abstract is most helpful for improving the search experience of the user for question and answer query words, so that the technical scheme of the invention mainly generates the corresponding selected abstract for the question and answer query words.
The searching unit 220 is adapted to obtain a search result corresponding to the question-answer type query term. The method can be realized by calling the existing search engine interface and the like.
The candidate unit 230 is adapted to, if the search result does not include a document of the specified type, take the question-answer query word and the corresponding search result as input data, and output the labeled candidate answer corresponding to the question-answer query word according to the machine reading understanding model.
For the documents of the specified type, the content extraction can be carried out in a mode of templates and the like because the type is known, and for other web pages, the difficulty is how to obtain the information related to the answers required by the question-answer type query words. In contrast, in the embodiment of the present invention, an MRC (Machine Reading Comprehension) mode is adopted, and a candidate annotation answer corresponding to a question and answer type query term is output according to a Machine Reading Comprehension model by using the question and answer type query term and a corresponding search result as input data.
Since the accuracy of the answer may fall short of the user's actual requirement (for example, verification showed that the cold-start effect of the model is poor and the accuracy of the generated answers is not sufficient for going online directly), the embodiment of the present invention uses the selected abstract generating unit 240 to obtain the labeled answer corresponding to the labeled candidate answer based on active learning, and takes the labeled answer as the selected abstract of the corresponding question-answer query word in the search engine. Traditional labeling improves the model slowly, and labeling selected abstracts places high demands on the annotator's reading comprehension and stock of knowledge, which makes labeling difficult; the embodiment of the present invention therefore further improves accuracy through active learning.
It can be seen that, for the identified question-answer query words, the apparatus shown in FIG. 2 combines machine reading understanding with active learning to provide the search engine with a selected abstract that lets the user view the required content directly in the search result page, which balances query cost against answer accuracy and improves the user experience and the search quality.
In an embodiment of the present invention, in the above apparatus, the identifying unit 210 is adapted to perform a preset type of processing on the search log data, and extract a query term meeting a requirement; and taking the query words as input data, and outputting question and answer query words according to the query word classification model.
In an embodiment of the present invention, in the above apparatus, the identifying unit 210 is adapted to sort the query terms in the search log data according to the page view amount PV, and extract a plurality of query terms from high to low in the sort; and/or normalizing the query term to remove punctuation and/or blank spaces in the query term.
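The normalization and PV-based extraction can be sketched as follows. The sketch treats the number of occurrences of a normalized query in the log as its PV, which is an assumption, and the function names are hypothetical. Removing spaces is sensible for Chinese queries; for space-delimited languages one would more likely collapse whitespace than remove it.

```python
import unicodedata
from collections import Counter
from typing import Iterable, List


def normalize(query: str) -> str:
    """Remove whitespace and Unicode punctuation from a query word."""
    return "".join(
        ch for ch in query
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def top_queries(log_queries: Iterable[str], n: int) -> List[str]:
    """Return the n normalized queries with the highest PV (log frequency)."""
    pv = Counter(norm for q in log_queries if (norm := normalize(q)))
    return [q for q, _ in pv.most_common(n)]
```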
In an embodiment of the present invention, the apparatus further includes: the training unit is suitable for acquiring search log data generated in a preset time period as sample data; counting the total clicks of all the query words in the sample data and the clicks of the question and answer sites, and calculating the question and answer click ratio of all the query words in the sample data in a preset time period; and taking the query words with the question-answer click ratio larger than the first threshold value as positive examples, and taking the query words with the question-answer click ratio lower than the second threshold value as negative examples to obtain training data.
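The click-ratio labeling performed by the training unit can be sketched as below. The log is modeled as (query, clicked site) pairs; the two threshold values, the site names and the function name are illustrative assumptions, since the text does not give concrete values.

```python
from collections import defaultdict
from typing import Iterable, List, Set, Tuple


def build_training_data(click_log: Iterable[Tuple[str, str]],
                        qa_sites: Set[str],
                        first_threshold: float = 0.6,
                        second_threshold: float = 0.1
                        ) -> Tuple[List[str], List[str]]:
    """Split queries into positive and negative examples by QA click ratio.

    ratio = clicks on question-answer sites / total clicks for the query.
    Queries between the two thresholds are left unlabeled.
    """
    total = defaultdict(int)
    qa = defaultdict(int)
    for query, site in click_log:
        total[query] += 1
        if site in qa_sites:
            qa[query] += 1

    positives, negatives = [], []
    for query, n in total.items():
        ratio = qa[query] / n
        if ratio > first_threshold:
            positives.append(query)
        elif ratio < second_threshold:
            negatives.append(query)
    return positives, negatives
```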
In an embodiment of the present invention, in the apparatus, the training unit is adapted to divide the training data into a training set, a verification set, and a test set according to a preset proportion, and perform training based on the textCNN model to obtain the query term classification model.
In an embodiment of the present invention, in the apparatus, the training unit is adapted to take a query word containing a specified term in the sample data as a positive example.
In an embodiment of the present invention, in the above apparatus, the question-answer sites are determined according to a URL pattern.
In an embodiment of the present invention, in the above apparatus, the identifying unit 210 is adapted to perform preprocessing on the identified question-answer query terms, remove repeated question-answer query terms, and remove question-answer query terms with labeled answers.
In an embodiment of the present invention, in the above apparatus, the identifying unit 210 is adapted to classify the identified question-answer query terms, and filter out question-answer query terms outside the preset type.
In an embodiment of the present invention, in the above apparatus, the identifying unit 210 is adapted to classify the identified question-answer query terms according to at least one of a topic classification model, a query term classification model and an answer classification model.
In one embodiment of the present invention, in the above apparatus, the topic type includes at least one of: mobile phones and digital products, life, games, education and science, leisure and hobbies, culture and art, finance and financial management, society and livelihood, sports, and regions; the query word type includes: fact class and/or opinion class; the answer type includes: description class and/or entity class.
In an embodiment of the present invention, in the above apparatus, the candidate unit 230 is adapted to, if the answer type of the question-answer query term is an entity class, invoke a machine reading understanding model that does not include a ranking algorithm to output a plurality of labeled candidate answers; otherwise, calling a complete machine reading understanding model and outputting a label candidate answer.
In an embodiment of the present invention, in the above apparatus, the topic classification model is a text multi-classification model, and the apparatus further includes: the training unit is suitable for acquiring search log data generated in a preset time period as sample data; counting the click ratio of each query word in the sample data on different topic type sites; counting page browsing amount of each query word in sample data, and dividing high-frequency query words from the query words in the sample data according to the page browsing amount; and taking the topic type of the site with the highest click ratio of the high-frequency query word as the topic type of the corresponding high-frequency query word.
In an embodiment of the present invention, in the apparatus, the training unit is adapted to divide the intermediate frequency query term from the query terms of the sample data according to the page browsing amount; and taking the high-frequency query words with the determined theme types as a training set, training a Support Vector Machine (SVM) model, and determining the theme types of the medium-frequency query words according to the trained SVM model.
In an embodiment of the present invention, in the apparatus, the training unit is adapted to divide a low-frequency query term from query terms of sample data according to a page browsing amount; and determining the topic type of the low-frequency query word according to the syntactic dependency tree.
In an embodiment of the present invention, in the apparatus, the query term classification model and the answer classification model are both text multi-classification models; the query term classification model is obtained by training based on the property characteristics of the query terms; the answer classification model is obtained by training based on the length characteristics of the answers.
In an embodiment of the present invention, in the above apparatus, the identifying unit 210 is adapted to filter the identified question-answer query terms according to a plurality of specified dimensions, so as to obtain the filtered question-answer query terms.
In an embodiment of the present invention, in the above apparatus, the identifying unit 210 is adapted to filter each specified dimension by using a corresponding text classification model; the text classification model is obtained by training based on SVM and/or fastText respectively.
In an embodiment of the present invention, in the above apparatus, the searching unit 220 is adapted to invoke a search engine interface, and obtain a first number of natural search results corresponding to the question-answer query terms according to the search result sequence; and adjusting the natural search results according to a preset algorithm, and selecting a second number of natural search results as search results corresponding to the corresponding question and answer query words.
In an embodiment of the present invention, in the above apparatus, the searching unit 220 is adapted to filter out, from the natural search results, document sites whose web page content cannot be acquired; and to promote the ranking of sites whose confidence level is higher than the first preset value.
In an embodiment of the present invention, in the above apparatus, the searching unit 220 is adapted to filter the question-answer query terms according to the natural search result.
In an embodiment of the present invention, in the above apparatus, the searching unit 220 is adapted to filter out question-answer query terms containing an application box from natural search results; and/or filtering out question-answer type query words containing illegal words in the titles of the natural search results; and/or filtering out question-answer query words lacking high-quality natural search results according to semantic matching.
In an embodiment of the present invention, in the apparatus, the candidate unit 230 is further adapted to extract the annotation candidate answer directly from the document of the specified type if the search result includes the document of the specified type.
In an embodiment of the present invention, in the apparatus, the document of the specified type is an html document including a plurality of pieces of step description information, and the candidate unit 230 is adapted to parse the html document, extract the plurality of pieces of step description information according to field matching, and obtain the labeling candidate answer by concatenation.
In an embodiment of the present invention, in the above apparatus, the candidate unit 230 is adapted to calculate a semantic relevance score of the question-answer query word and a web page title of the search result according to a semantic matching model, and rank the search result according to the semantic relevance score.
In an embodiment of the present invention, in the above apparatus, the candidate unit 230 is adapted to, if every calculated semantic relevance score is lower than the second preset value, not execute the step of outputting the labeled candidate answer corresponding to the question-answer query word according to the machine reading understanding model or the subsequent steps; and the selected abstract generating unit 240 is adapted to capture the selected abstract of the question-answer query word directly from a question-answer site.
In an embodiment of the present invention, the apparatus further includes: a training unit adapted to obtain pairs of question-answer query words and web page titles labeled as positive examples and negative examples, and to construct these pairs into training data through a processor dictionary; and to fine-tune on the basis of the BERT pre-trained model and the training data to obtain the semantic matching model.
In an embodiment of the present invention, in the above apparatus, the selected abstract generating unit 240 is adapted to capture the selected abstract directly from a question-answer site for question-answer query words whose PV is lower than the third preset value.
In an embodiment of the present invention, in the above apparatus, the selected abstract generating unit 240 is further adapted to generalize the question-answer query words to obtain semantically similar query words; and to take the labeled answers as the selected abstracts of the semantically similar query words in the search engine.
In an embodiment of the present invention, in the apparatus, the selected abstract generating unit 240 is adapted to mine candidate query words corresponding to the question-answer query words based on query word presentation, user click behavior and co-click behavior; and to calculate semantic relevance scores between the question-answer query words and the candidate query words according to the semantic matching model, and take the candidate query words whose semantic relevance score is higher than the fourth preset value as semantically similar query words.
In an embodiment of the present invention, in the above apparatus, the selected abstract generating unit 240 is adapted to represent the query words as vectors, and determine candidate query words corresponding to the question-answer query words by calculating the cosine similarity between the vectors; and to calculate semantic relevance scores between the question-answer query words and the candidate query words according to the semantic matching model, and, if the highest semantic relevance score is greater than the fifth preset value, take the candidate query word corresponding to the highest semantic relevance score as the semantically similar query word.
In an embodiment of the present invention, in the above apparatus, the selected abstract generating unit 240 is adapted to provide a fine-labeling interface and a coarse-labeling interface; to display a single labeled candidate answer through the coarse-labeling interface, receive returned correctness evaluation information, and determine the labeled answer according to the correctness evaluation information; and to display a plurality of labeled candidate answers through the fine-labeling interface, and receive a returned labeled answer.
In one embodiment of the present invention, in the apparatus described above, the selected abstract generating unit 240 is adapted to save the selected abstract in xml format.
It should be noted that, for the specific implementation of each apparatus embodiment, reference may be made to the specific implementation of the corresponding method embodiment, which is not described herein again.
In summary, the technical scheme of the invention identifies question-answer query words from the search log data; obtains the search result corresponding to each question-answer query word; if the search result does not contain a document of the specified type, takes the question-answer query word and the corresponding search result as input data and outputs the labeled candidate answer corresponding to the question-answer query word according to a machine reading understanding model; and obtains the labeled answer corresponding to the labeled candidate answer based on active learning and takes the labeled answer as the selected abstract of the corresponding question-answer query word in the search engine. By combining machine reading understanding with active learning, this scheme provides, for the identified question-answer query words, a selected abstract that lets the user view the required content directly in the search result page, which balances query cost against answer accuracy and improves the user experience and the search quality.
It should be noted that:
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The structure required to construct such a device will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages above are provided to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the apparatus for generating a selected abstract of a search engine in accordance with embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
For example, FIG. 3 shows a schematic structural diagram of an electronic device according to an embodiment of the invention. The electronic device 300 comprises a processor 310 and a memory 320 arranged to store computer-executable instructions (computer-readable program code). The memory 320 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. The memory 320 has a storage space 330 storing computer-readable program code 331 for performing any of the method steps described above. For example, the storage space 330 may comprise respective computer-readable program codes 331 for implementing the various steps of the above method. The computer-readable program code 331 may be read from or written to one or more computer program products. These computer program products comprise a program code carrier such as a hard disk, a compact disc (CD), a memory card or a floppy disk. Such a computer program product is typically a computer-readable storage medium such as that shown in FIG. 4. FIG. 4 shows a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention. The computer-readable storage medium 400 stores the computer-readable program code 331 for performing the steps of the method according to the invention, which is readable by the processor 310 of the electronic device 300. When executed by the electronic device 300, the computer-readable program code 331 causes the electronic device 300 to perform the steps of the method described above; in particular, the computer-readable program code 331 stored on the computer-readable storage medium may perform the method shown in any of the embodiments described above. The computer-readable program code 331 may be compressed in a suitable form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering. These words may be interpreted as names.
Embodiments of the invention disclose A1, a method for generating a selected abstract of a search engine, comprising:
identifying question and answer query words from the search log data;
obtaining a search result corresponding to the question-answer query word;
if the search result does not contain the document of the specified type, the question-answer query words and the corresponding search results are used as input data, and label candidate answers corresponding to the question-answer query words are output according to a machine reading understanding model;
and acquiring a labeled answer corresponding to the labeled candidate answer based on active learning, and taking the labeled answer as a selected abstract of a corresponding question-answer query word in a search engine.
A2, the method of A1, wherein the identifying question-answer query words from the search log data comprises:
performing preset types of processing on the search log data, and extracting query words that meet requirements;
and taking the query words as input data, and outputting question-answer query words according to a query word classification model.
A3, the method as in A2, wherein the performing preset types of processing on the search log data and extracting query words that meet requirements comprises:
sorting the query words in the search log data by page view count (PV), and extracting a number of query words in descending order of PV;
and/or
normalizing the query words by removing punctuation and/or spaces from the query words.
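The two processing steps of A3 can be sketched as follows. This is a minimal illustration only: the function names `normalize_query` and `top_queries_by_pv` are hypothetical, each log entry is assumed to be one page view of a raw query string, and normalizing before counting is an assumption about how the two steps are combined.

```python
import unicodedata
from collections import Counter


def normalize_query(query: str) -> str:
    """Remove whitespace and punctuation (ASCII and full-width) from a query word."""
    return "".join(
        ch for ch in query
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def top_queries_by_pv(log_queries, top_n):
    """Count page views per normalized query and return the top_n most viewed."""
    pv = Counter(normalize_query(q) for q in log_queries)
    pv.pop("", None)  # drop queries that were only punctuation or spaces
    return [query for query, _ in pv.most_common(top_n)]
```

Normalizing first means that variants differing only in punctuation or spacing share one PV count, which is one plausible reading of the "and/or" in A3.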
A4, the method as in A2, wherein the method further comprises producing training data for the query word classification model as follows:
acquiring search log data generated in a preset time period as sample data;
counting, for each query word in the sample data, the total number of clicks and the number of clicks on question-answer sites, and calculating the question-answer click ratio of each query word over the preset time period;
and taking the query words whose question-answer click ratio is higher than a first threshold as positive examples, and the query words whose question-answer click ratio is lower than a second threshold as negative examples, to obtain the training data.
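The click-ratio labeling of A4 can be sketched as below. The function name, the input layout (a mapping from query word to total clicks and question-answer site clicks) and the threshold values 0.6 and 0.1 are illustrative assumptions; the source only states that a first and a second threshold exist.

```python
def build_training_data(click_stats, first_threshold=0.6, second_threshold=0.1):
    """Split query words into positive and negative examples by question-answer click ratio.

    click_stats maps query -> (total_clicks, qa_site_clicks).
    Thresholds are hypothetical placeholders for the first and second thresholds.
    """
    positives, negatives = [], []
    for query, (total, qa_clicks) in click_stats.items():
        if total == 0:
            continue  # no clicks, so no ratio can be computed
        ratio = qa_clicks / total
        if ratio > first_threshold:
            positives.append(query)
        elif ratio < second_threshold:
            negatives.append(query)
        # query words between the two thresholds are left unlabeled
    return positives, negatives
```

Leaving a gap between the two thresholds keeps ambiguous query words out of the training data.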
A5, the method according to A4, wherein the method further comprises training the query word classification model as follows:
dividing the training data into a training set, a validation set and a test set according to a preset proportion, and training a TextCNN model to obtain the query word classification model.
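The data split in A5 might look like the following sketch. The 8:1:1 proportion, the fixed seed and the name `split_dataset` are assumptions for illustration; the TextCNN training itself is not shown.

```python
import random


def split_dataset(examples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle examples and split them into training, validation and test sets."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # deterministic shuffle for reproducibility
    n_train = int(len(items) * ratios[0])
    n_val = int(len(items) * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```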
A6, the method as in a4, wherein the step of making training data for the query term classification model further comprises:
and taking the query word containing the specified word in the sample data as a positive example.
A7, the method according to A4, wherein the question-answer sites are determined according to URL patterns.
A8, the method of a1, wherein the identifying question-answer type query terms from search log data further comprises:
and preprocessing the identified question-answer query words, removing repeated question-answer query words and removing the question-answer query words with marked answers.
A9, the method of a1, wherein the identifying question-answer type query terms from search log data further comprises:
and classifying the identified question and answer query words, and filtering out the question and answer query words outside the preset types.
A10, the method as in a9, wherein the classifying the identified question-answer type query words comprises:
and classifying the identified question-answer query words according to at least one of the topic classification model, the query word classification model and the answer classification model.
A11, the method of A10, wherein,
the topic types include at least one of: mobile phone digital, life, game, education science, leisure hobbies, cultural art, financial management, social livelihood, sports, and region;
the query word types include: fact class and/or opinion class;
the answer types include: description class and/or entity class.
A12, the method as in A11, wherein the taking the question-answer query words and the corresponding search results as input data and outputting labeling candidate answers corresponding to the question-answer query words according to a machine reading understanding model comprises:
if the answer type of the question-answer query word is the entity class, calling a machine reading understanding model that does not contain a ranking algorithm to output a plurality of labeling candidate answers; otherwise, calling the complete machine reading understanding model to output one labeling candidate answer.
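The branching in A12 can be sketched as a small dispatcher. Everything here is hypothetical: `extract_spans` stands in for the machine reading understanding model and is assumed to return (answer text, score) pairs, and "without a ranking algorithm" is interpreted as returning several spans in extraction order rather than the single best one.

```python
def candidate_answers(answer_type, extract_spans, query, documents, max_candidates=3):
    """Return labeling candidate answers according to the answer type.

    extract_spans(query, documents) -> list of (answer_text, score); a stand-in
    for the reading comprehension model, not a real library call.
    """
    spans = extract_spans(query, documents)
    if answer_type == "entity":
        # No ranking step: keep several spans in extraction order.
        return [text for text, _ in spans[:max_candidates]]
    if not spans:
        return []
    # Complete model: rank by score and keep the single best span.
    best_text, _ = max(spans, key=lambda span: span[1])
    return [best_text]
```

Returning several candidates for entity answers matches the fine labeling interface of A32, where an annotator picks among them.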
A13, the method according to a10, wherein the topic classification model is a text multi-classification model, the method further comprising the following steps of making training data of the topic classification model:
acquiring search log data generated in a preset time period as sample data;
counting the click proportion of each query word in the sample data on different topic type sites;
counting page browsing amount of each query word in the sample data, and dividing a high-frequency query word from the query words in the sample data according to the page browsing amount;
and taking the topic type of the site with the highest click ratio of the high-frequency query word as the topic type of the corresponding high-frequency query word.
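The labeling rule of A13 for high-frequency query words can be sketched as follows. The input layout (per-query click counts by topic type of the clicked site), the PV cutoff and the function name are assumptions for illustration.

```python
def label_high_frequency(topic_clicks, page_views, high_pv_threshold):
    """Assign each high-frequency query word the topic type it clicks on most.

    topic_clicks maps query -> {topic_type: clicks on sites of that topic}.
    page_views maps query -> page view count (PV).
    """
    labels = {}
    for query, clicks in topic_clicks.items():
        if page_views.get(query, 0) < high_pv_threshold or not clicks:
            continue  # not high-frequency, or no click data
        # The topic with the most clicks is also the one with the highest click ratio.
        labels[query] = max(clicks, key=clicks.get)
    return labels
```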
A14, the method of A13, wherein the step of making training data for the topic classification model further comprises:
dividing intermediate frequency query words from the query words of the sample data according to the page browsing amount;
and taking the high-frequency query words with the determined subject types as a training set, training a Support Vector Machine (SVM) model, and determining the subject types of the medium-frequency query words according to the trained SVM model.
A15, the method of A13, wherein the step of making training data for the topic classification model further comprises:
dividing low-frequency query words from the query words of the sample data according to the page browsing amount;
and determining the topic type of the low-frequency query word according to the syntactic dependency tree.
A16, the method of A10, wherein the query term classification model and the answer classification model are both text multi-classification models;
the query term classification model is obtained by training based on the property characteristics of the query terms;
the answer classification model is obtained by training based on the length features of the answers.
A17, the method of a1, wherein the identifying question-answer type query terms from search log data further comprises:
and filtering the identified question and answer query words according to a plurality of specified dimensions to obtain the filtered question and answer query words.
A18, the method as in a17, wherein the filtering the identified question-answer type query terms by a number of specified dimensions comprises:
filtering each appointed dimension by adopting a corresponding text classification model; the text classification model is obtained by training based on SVM and/or fastText respectively.
A19, the method as in a1, wherein the obtaining search results corresponding to the question-answer type query terms comprises:
calling a search engine interface, and acquiring a first number of natural search results corresponding to the question-answer query words according to the search result sequence;
and adjusting the natural search results according to a preset algorithm, and selecting a second number of natural search results as search results corresponding to the corresponding question and answer query words.
A20, the method of A19, wherein the adjusting the natural search results according to a preset algorithm comprises at least one of:
filtering document sites which cannot acquire webpage contents from the natural search results;
and raising the ranking of sites whose confidence is higher than a first preset value.
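The adjustment in A19 and A20 can be sketched as a filter followed by a stable re-ranking. The `Result` record, the `site_confidence` mapping and the parameter names are hypothetical; the source does not specify how confidence is computed or how far trusted sites are promoted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    url: str
    site: str
    content: Optional[str]  # None when the page content could not be fetched


def adjust_results(results, site_confidence, first_preset, second_number):
    """Drop unfetchable documents, promote trusted sites, keep second_number results."""
    fetchable = [r for r in results if r.content]
    # Stable sort: sites above the confidence threshold come first,
    # and the original search order is preserved within each group.
    ranked = sorted(
        fetchable,
        key=lambda r: site_confidence.get(r.site, 0.0) <= first_preset,
    )
    return ranked[:second_number]
```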
A21, the method as in a19, wherein the obtaining search results corresponding to the question-answer type query terms further comprises:
and filtering the question and answer query words according to the natural search result.
A22, the method as in A21, wherein the filtering the question-answer type query terms according to the natural search results comprises at least one of:
filtering out question and answer query words containing the application box in the natural search result;
filtering out question and answer query words containing illegal words in the titles of the natural search results;
and filtering out question and answer query words lacking high-quality natural search results according to semantic matching.
A23, the method of a1, wherein the method further comprises:
and if the search result contains the document with the specified type, directly extracting the label candidate answer from the document with the specified type.
A24, the method as in A23, wherein the specified type of document is an html document containing several pieces of step description information, and the extracting labeling candidate answers directly from the specified type of document comprises:
parsing the html document, extracting the pieces of step description information by field matching, and concatenating them to obtain the labeling candidate answer.
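A minimal sketch of the extraction in A24, using only the Python standard library. The field being matched is an assumption: steps are taken to be `<li>` elements whose `class` attribute contains `step`. Real step-by-step pages would need their own matching rule.

```python
from html.parser import HTMLParser


class StepExtractor(HTMLParser):
    """Collect the text of elements matching a (hypothetical) tag/class field."""

    def __init__(self, tag="li", css_class="step"):
        super().__init__()
        self.tag, self.css_class = tag, css_class
        self.steps = []
        self._depth = 0  # > 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            if tag == self.tag:
                self._depth += 1  # nested element of the same tag
        elif tag == self.tag and self.css_class in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth and tag == self.tag:
            self._depth -= 1
            if self._depth == 0:
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.steps.append(text)

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_steps_answer(html: str) -> str:
    """Concatenate the matched step descriptions into one numbered candidate answer."""
    parser = StepExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(f"{i}. {step}" for i, step in enumerate(parser.steps, 1))
```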
A25, the method of a1, wherein the taking the question-answer type query terms and corresponding search results as input data comprises:
and calculating semantic relevance scores of the question-answer query words and the webpage titles of the search results according to a semantic matching model, and sequencing the search results according to the semantic relevance scores.
A26, the method of a25, wherein the method further comprises:
if all of the calculated semantic relevance scores are lower than a second preset value, the step of outputting the labeling candidate answers corresponding to the question-answer query words according to a machine reading understanding model and the subsequent steps are not executed, and the selected abstract of the question-answer query word is captured directly from a question-answer site.
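The ranking of A25 and the fallback of A26 can be sketched together. The scoring function is passed in as a parameter; the token-overlap scorer below is only a toy stand-in for the BERT-based semantic matching model, and all names are hypothetical.

```python
def token_overlap(query: str, title: str) -> float:
    """Toy relevance score (Jaccard overlap); a stand-in for the semantic matching model."""
    q, t = set(query.lower().split()), set(title.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def rank_or_fallback(query, results, score, second_preset):
    """Rank (title, url) results by relevance to the query.

    Returns None when every score is below second_preset, signalling that the
    reading comprehension step should be skipped and the abstract captured
    from a question-answer site instead.
    """
    scored = sorted(
        ((score(query, title), title, url) for title, url in results),
        key=lambda item: item[0],
        reverse=True,
    )
    if not scored or all(s < second_preset for s, _, _ in scored):
        return None
    return [(title, url) for _, title, url in scored]
```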
A27, the method of A25, wherein the method further comprises the step of training the semantic matching model as follows:
obtaining question-answer query words and webpage title pairs marked with positive examples and negative examples, and constructing the question-answer query words and the webpage title pairs as training data through a processor dictionary;
and fine-tuning a pre-trained BERT model on the training data to obtain the semantic matching model.
A28, the method of a1, wherein the method further comprises:
directly capturing the selected abstract from a question-answer site for question-answer query words whose PV is lower than a third preset value.
A29, the method of a1, wherein the method further comprises:
generalizing the question and answer query words to obtain query words with similar semantics;
and taking the marked answers as the selected abstracts of the query words with similar semantemes in a search engine.
A30, the method as in a29, wherein the generalizing the question-answer query terms to obtain semantically similar query terms includes:
mining candidate query words corresponding to the question-answer query words based on query word impressions, user click behavior and co-click behavior;
and calculating semantic relevance scores of the question-answer query words and the candidate query words according to a semantic matching model, and taking the candidate query words with the semantic relevance scores higher than a fourth preset value as query words with similar semantics.
A31, the method as in a29, wherein the generalizing the question-answer query terms to obtain semantically similar query terms includes:
performing vector representation on the query words, and determining candidate query words corresponding to the question-answer query words by calculating cosine similarity of each vector;
and calculating semantic relevance scores of the question-answer query words and the candidate query words according to a semantic matching model, and if the highest semantic relevance score is larger than a fifth preset value, taking the candidate query word corresponding to the highest semantic relevance score as a query word with similar semantics.
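The two-stage generalization of A31 can be sketched as below: cosine similarity shortlists candidates, then the semantic matching score decides whether the best candidate is accepted. The vectors, the `semantic_score` callable, `top_k` and the threshold are all assumed inputs; how query vectors are produced is not specified in the source.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_query(query, vectors, semantic_score, top_k, fifth_preset):
    """Return the semantically similar query word for `query`, or None.

    vectors maps query word -> vector; semantic_score(query, candidate) -> float
    is a stand-in for the semantic matching model.
    """
    target = vectors[query]
    candidates = sorted(
        (q for q in vectors if q != query),
        key=lambda q: cosine(target, vectors[q]),
        reverse=True,
    )[:top_k]
    if not candidates:
        return None
    best = max(candidates, key=lambda q: semantic_score(query, q))
    return best if semantic_score(query, best) > fifth_preset else None
```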
A32, the method as in A1, wherein the obtaining labeled answers corresponding to the labeled candidate answers based on active learning comprises:
providing a fine labeling interface and a rough labeling interface;
displaying a single labeling candidate answer through the rough labeling interface, receiving returned correctness evaluation information, and determining the labeled answer according to the correctness evaluation information;
and
displaying a plurality of labeling candidate answers through the fine labeling interface, and receiving the returned labeled answer.
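How the two interfaces of A32 might feed back into a labeled answer can be sketched as follows. The routing rule (one candidate goes to the rough interface, several go to the fine interface) and the feedback encoding (a boolean for correctness, an index for the chosen candidate) are assumptions, not details given in the source.

```python
def resolve_label(candidates, feedback):
    """Turn annotator feedback into a labeled answer, or None if nothing is accepted.

    Rough interface: one candidate, feedback is True/False (correctness evaluation).
    Fine interface: several candidates, feedback is the index of the chosen one.
    """
    if len(candidates) == 1:
        return candidates[0] if feedback is True else None
    is_index = isinstance(feedback, int) and not isinstance(feedback, bool)
    if is_index and 0 <= feedback < len(candidates):
        return candidates[feedback]
    return None
```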
A33, the method as in any one of A1-A32, wherein the taking the labeled answer as the selected abstract of the corresponding question-answer query word in a search engine comprises:
saving the selected abstract in XML format.
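Saving a selected abstract as XML, as in A33, could look like this sketch. The element names (`selected_abstract`, `query`, `answer`, `source`) are invented for illustration; the source does not define a schema.

```python
import xml.etree.ElementTree as ET


def abstract_to_xml(query: str, answer: str, source_url: str) -> str:
    """Serialize one selected abstract to an XML string (hypothetical schema)."""
    root = ET.Element("selected_abstract")
    ET.SubElement(root, "query").text = query
    ET.SubElement(root, "answer").text = answer
    ET.SubElement(root, "source").text = source_url
    # ElementTree escapes characters such as & and < in text content.
    return ET.tostring(root, encoding="unicode")
```

Using an XML library rather than string concatenation keeps the output well-formed when answers contain reserved characters.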
Embodiments of the invention also disclose B34, a device for generating a selected abstract of a search engine, comprising:
the identification unit is suitable for identifying question and answer query words from the search log data;
the searching unit is suitable for obtaining a searching result corresponding to the question-answer type query word;
the candidate unit is suitable for taking the question-answer query words and the corresponding search results as input data if the search results do not contain the specified types of documents, and outputting labeled candidate answers corresponding to the question-answer query words according to a machine reading understanding model;
and the selected abstract unit is suitable for acquiring the labeled answers corresponding to the labeling candidate answers based on active learning and taking the labeled answers as the selected abstracts of the corresponding question-answer query words in a search engine.
B35, the device of B34, wherein,
the identification unit is suitable for processing the search log data in a preset type and extracting query words meeting requirements; and taking the query words as input data, and outputting question and answer query words according to the query word classification model.
B36, the device of B35, wherein,
the identification unit is suitable for sequencing the query words in the search log data according to the page browsing amount PV and extracting a plurality of query words from high to low according to the sequencing; and/or normalizing the query term to remove punctuation and/or blank spaces in the query term.
B37, the apparatus of B35, wherein the apparatus further comprises:
the training unit is suitable for acquiring search log data generated in a preset time period as sample data; counting the total clicks of all the query words in the sample data and the clicks of the question and answer sites, and calculating the question and answer click ratio of all the query words in the sample data in a preset time period; and taking the query words with the question-answer click ratio larger than the first threshold value as positive examples, and taking the query words with the question-answer click ratio lower than the second threshold value as negative examples to obtain training data.
B38, the device of B37, wherein,
the training unit is suitable for dividing the training data into a training set, a verification set and a test set according to a preset proportion, and training the training set based on a textCNN model to obtain the query word classification model.
B39, the device of B37, wherein,
the training unit is suitable for taking the query word containing the specified word in the sample data as a positive example.
B40, the device as B37, wherein the question-answering site is determined according to the website address mode.
B41, the device of B34, wherein,
the identification unit is suitable for preprocessing the identified question-answer query words, removing repeated question-answer query words and removing the question-answer query words marked with answers.
B42, the device of B34, wherein,
the identification unit is suitable for classifying the identified question and answer query words and filtering the question and answer query words out of the preset types.
B43, the device of B42, wherein,
the identification unit is suitable for classifying the identified question-answer query words according to at least one of the theme classification model, the query word classification model and the answer classification model.
B44, the apparatus as in B43, wherein the topic types include at least one of: mobile phone digital, life, game, education science, leisure hobbies, cultural art, financial management, social livelihood, sports, and region; the query word types include: fact class and/or opinion class; the answer types include: description class and/or entity class.
B45, the device of B44, wherein,
the candidate unit is suitable for calling a machine reading understanding model which does not comprise a sorting algorithm to output a plurality of labeled candidate answers if the answer type of the question-answer type query word is an entity type; otherwise, calling a complete machine reading understanding model and outputting a label candidate answer.
B46, the apparatus according to B43, wherein the topic classification model is a text multi-classification model, the apparatus further comprising: the training unit is suitable for acquiring search log data generated in a preset time period as sample data; counting the click proportion of each query word in the sample data on different topic type sites; counting page browsing amount of each query word in the sample data, and dividing a high-frequency query word from the query words in the sample data according to the page browsing amount; and taking the topic type of the site with the highest click ratio of the high-frequency query word as the topic type of the corresponding high-frequency query word.
B47, the device of B46, wherein,
the training unit is suitable for dividing intermediate frequency query words from the query words of the sample data according to the page browsing amount; and taking the high-frequency query words with the determined subject types as a training set, training a Support Vector Machine (SVM) model, and determining the subject types of the medium-frequency query words according to the trained SVM model.
B48, the device of B46, wherein,
the training unit is suitable for dividing low-frequency query words from the query words of the sample data according to the page browsing amount; and determining the topic type of the low-frequency query word according to the syntactic dependency tree.
B49, the device as B43, wherein the query word classification model and the answer classification model are both text multi-classification models; the query term classification model is obtained by training based on the property characteristics of the query terms; the answer classification model is obtained by training based on the length features of the answers.
B50, the device of B34, wherein,
the recognition unit is suitable for filtering the recognized question and answer query words according to a plurality of specified dimensions to obtain the filtered question and answer query words.
B51, the device of B50, wherein,
the recognition unit is suitable for filtering each specified dimension by adopting a corresponding text classification model; the text classification model is obtained by training based on SVM and/or fastText respectively.
B52, the device of B34, wherein,
the search unit is suitable for calling a search engine interface and acquiring a first number of natural search results corresponding to the question-answer query words according to the search result sequence; and adjusting the natural search results according to a preset algorithm, and selecting a second number of natural search results as search results corresponding to the corresponding question and answer query words.
B53, the device of B52, wherein,
the search unit is suitable for filtering out, from the natural search results, document sites whose webpage content cannot be acquired; and raising the ranking of sites whose confidence is higher than a first preset value.
B54, the device of B52, wherein,
and the searching unit is suitable for filtering the question and answer query words according to the natural searching result.
B55, the device of B54, wherein,
the search unit is suitable for filtering out question and answer query words containing the application box in the natural search result; and/or filtering out question-answer type query words containing illegal words in the titles of the natural search results; and/or filtering out question-answer query words lacking high-quality natural search results according to semantic matching.
B56, the device of B34, wherein,
and the candidate unit is also suitable for directly extracting the labeling candidate answers from the documents of the specified type if the search results contain the documents of the specified type.
B57, the device as in B56, wherein the document of the specified type is an html document including a plurality of pieces of step description information, and the candidate unit is suitable for parsing the html document, extracting the pieces of step description information by field matching, and concatenating them to obtain the labeling candidate answer.
B58, the device of B34, wherein,
and the candidate unit is suitable for calculating the semantic relevance scores of the question-answer query words and the webpage titles of the search results according to a semantic matching model and sequencing the search results according to the semantic relevance scores.
B59, the device of B58, wherein,
the candidate unit is suitable for skipping the step of outputting the labeling candidate answers corresponding to the question-answer query words according to the machine reading understanding model, and the subsequent steps, if all of the calculated semantic relevance scores are lower than a second preset value;
the selected abstract unit is suitable for directly capturing the selected abstract of the question-answer query word from a question-answer site.
B60, the apparatus of B58, wherein the apparatus further comprises:
the training unit is suitable for acquiring question-answer query word and webpage title pairs labeled as positive examples and negative examples, and constructing the pairs into training data through a processor dictionary; and fine-tuning a pre-trained BERT model on the training data to obtain the semantic matching model.
B61, the device of B34, wherein,
the choice abstract unit is suitable for directly grabbing choice abstract from question and answer sites for question and answer query words with PV lower than a third preset value.
B62, the device of B34, wherein,
the carefully chosen abstract unit is also suitable for generalizing the question-answer query words to obtain query words with similar semantics; and taking the marked answers as the selected abstracts of the query words with similar semantemes in a search engine.
B63, the device of B62, wherein,
the carefully chosen abstract unit is suitable for excavating candidate query words corresponding to the question-answer query words based on the display condition of the query words, the click behaviors of the users and the co-click behaviors; and calculating semantic relevance scores of the question-answer query words and the candidate query words according to a semantic matching model, and taking the candidate query words with the semantic relevance scores higher than a fourth preset value as query words with similar semantics.
B64, the device of B62, wherein,
the carefully chosen abstract unit is suitable for expressing the query words in vectors and determining candidate query words corresponding to the question-answer query words by calculating the cosine similarity of the vectors; and calculating semantic relevance scores of the question-answer query words and the candidate query words according to a semantic matching model, and if the highest semantic relevance score is larger than a fifth preset value, taking the candidate query word corresponding to the highest semantic relevance score as a query word with similar semantics.
B65, the device of B34, wherein,
the selected abstract unit is suitable for providing a fine labeling interface and a rough labeling interface; displaying a single labeling candidate answer through the rough labeling interface, receiving returned correctness evaluation information, and determining a labeled answer according to the correctness evaluation information; and displaying a plurality of labeling candidate answers through the fine labeling interface, and receiving a returned labeled answer.
B66, the device of any one of B34-B65, wherein,
the selected abstract unit is suitable for storing the selected abstract in XML format.
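As an illustration of B66, the following Python sketch stores one selected abstract as XML and reads it back. The element names (selected_abstract, query, answer, source) and the example URL are assumptions made for this sketch; the embodiments only state that the selected abstract is stored in XML format and do not define a schema.

```python
import xml.etree.ElementTree as ET


def abstract_to_xml(query: str, answer: str, source_url: str) -> str:
    """Serialize one selected abstract as an XML string.

    The element names are hypothetical; the patent text does not give a schema.
    """
    root = ET.Element("selected_abstract")
    ET.SubElement(root, "query").text = query
    ET.SubElement(root, "answer").text = answer
    ET.SubElement(root, "source").text = source_url
    # encoding="unicode" returns a str; special characters are escaped.
    return ET.tostring(root, encoding="unicode")


def xml_to_abstract(xml_text: str) -> dict:
    """Parse the XML string back into a {tag: text} dictionary."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


record = abstract_to_xml(
    "how tall is mount everest",
    "About 8,849 m",
    "https://example.com/everest",  # placeholder URL
)
print(record)
print(xml_to_abstract(record))
```

Keeping one record per query word makes it straightforward for a serving layer to look up the abstract by the normalized query, although the storage layout actually used is not described in the text.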
An embodiment of the invention also discloses C67, an electronic device, wherein the electronic device comprises: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the method of any one of A1-A33.
Embodiments of the invention also disclose D68, a computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method of any one of A1-A33.
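The query generalization of B64 can be sketched in Python as follows: candidate query words are picked by cosine similarity of their vectors, rescored by a semantic matching model, and the best candidate is kept only if its score exceeds the fifth preset value. This is a minimal sketch under stated assumptions: the toy vectors, the top_k cutoff, the threshold value and the score_fn callable (a stand-in for the semantic matching model) are all hypothetical and not taken from the embodiments.

```python
import math
from typing import Callable, Dict, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def generalize(
    query: str,
    vectors: Dict[str, Sequence[float]],
    score_fn: Callable[[str, str], float],  # stand-in for the semantic matching model
    top_k: int = 3,                         # hypothetical candidate cutoff
    fifth_preset: float = 0.8,              # hypothetical threshold value
) -> Optional[str]:
    """Return the query word judged semantically similar to `query`, or None."""
    q_vec = vectors[query]
    # Step 1: rank the other query words by cosine similarity to the query.
    ranked = sorted(
        (w for w in vectors if w != query),
        key=lambda w: cosine(q_vec, vectors[w]),
        reverse=True,
    )
    candidates = ranked[:top_k]
    if not candidates:
        return None
    # Step 2: rescore candidates with the semantic matching model.
    best = max(candidates, key=lambda w: score_fn(query, w))
    # Step 3: accept only if the highest score exceeds the preset value.
    return best if score_fn(query, best) > fifth_preset else None


# Toy data for illustration only.
toy_vectors = {
    "how tall is everest": [1.0, 0.0, 0.2],
    "everest height": [0.9, 0.1, 0.2],
    "cheap flights": [0.0, 1.0, 0.0],
}
toy_score = lambda a, b: 0.9 if "everest" in a and "everest" in b else 0.1
print(generalize("how tall is everest", toy_vectors, toy_score, top_k=2))
```

The labeled answer of the original query word would then be reused as the selected abstract of the returned query word. The two-stage design keeps the more expensive model scoring limited to a small candidate set.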

Claims (10)

1. A method for generating a selected abstract for a search engine, comprising the following steps:
identifying question-answer query words from search log data;
obtaining a search result corresponding to the question-answer query words;
if the search result does not contain a document of a specified type, taking the question-answer query words and the corresponding search results as input data, and outputting labeling candidate answers corresponding to the question-answer query words according to a machine reading comprehension model;
and acquiring a labeled answer corresponding to the labeling candidate answers based on active learning, and taking the labeled answer as the selected abstract of the corresponding question-answer query word in a search engine.
2. The method of claim 1, wherein the identifying question-answer query words from the search log data comprises:
performing a preset type of processing on the search log data, and extracting query words that meet requirements;
and taking the query words as input data, and outputting question-answer query words according to a query word classification model.
3. The method of claim 2, wherein the performing a preset type of processing on the search log data and extracting query words that meet requirements comprises:
sorting the query words in the search log data by page views (PV), and extracting a plurality of query words in descending order of PV;
and/or,
normalizing the query words by removing punctuation and/or spaces from the query words.
4. The method of claim 2, wherein the method further comprises producing training data for the query word classification model by:
acquiring search log data generated in a preset time period as sample data;
counting, for each query word in the sample data, the total number of clicks and the number of clicks on question-and-answer sites, and calculating the question-answer click ratio of each query word in the sample data within the preset time period;
and taking the query words whose question-answer click ratio is greater than a first threshold as positive examples, and taking the query words whose question-answer click ratio is lower than a second threshold as negative examples, to obtain the training data.
5. An apparatus for generating a selected abstract for a search engine, comprising:
the identification unit is suitable for identifying question-answer query words from search log data;
the searching unit is suitable for obtaining a search result corresponding to the question-answer query words;
the candidate unit is suitable for taking the question-answer query words and the corresponding search results as input data if the search results do not contain a document of a specified type, and outputting labeling candidate answers corresponding to the question-answer query words according to a machine reading comprehension model;
and the selected abstract generating unit is suitable for acquiring the labeled answers corresponding to the labeling candidate answers based on active learning and taking the labeled answers as the selected abstracts of the corresponding question-answer query words in a search engine.
6. The apparatus of claim 5, wherein,
the identification unit is suitable for performing a preset type of processing on the search log data and extracting query words that meet requirements; and taking the query words as input data, and outputting question-answer query words according to a query word classification model.
7. The apparatus of claim 6, wherein,
the identification unit is suitable for sorting the query words in the search log data by page views (PV) and extracting a plurality of query words in descending order of PV; and/or normalizing the query words by removing punctuation and/or spaces from the query words.
8. The apparatus of claim 6, wherein the apparatus further comprises:
the training unit is suitable for acquiring search log data generated in a preset time period as sample data; counting, for each query word in the sample data, the total number of clicks and the number of clicks on question-and-answer sites, and calculating the question-answer click ratio of each query word in the sample data within the preset time period; and taking the query words whose question-answer click ratio is greater than a first threshold as positive examples, and taking the query words whose question-answer click ratio is lower than a second threshold as negative examples, to obtain training data.
9. An electronic device, wherein the electronic device comprises: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the method of any one of claims 1-4.
10. A computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method of any of claims 1-4.
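The normalization of claim 3 and the training-data construction of claim 4 can be sketched in Python as follows. This is an illustrative sketch only: the click log layout (one (query, clicked_site) tuple per click), the site names and the two threshold values are assumptions, since the claims do not specify them.

```python
import unicodedata
from collections import defaultdict
from typing import Iterable, List, Set, Tuple


def normalize(query: str) -> str:
    """Remove punctuation and whitespace from a query word.

    Unicode categories starting with "P" cover ASCII and full-width
    punctuation such as "?" and "？".
    """
    return "".join(
        ch for ch in query
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def label_by_click_ratio(
    click_log: Iterable[Tuple[str, str]],  # assumed layout: (query, clicked_site)
    qa_sites: Set[str],                    # hypothetical list of Q&A site hosts
    first_threshold: float = 0.6,          # hypothetical value
    second_threshold: float = 0.1,         # hypothetical value
) -> Tuple[List[str], List[str]]:
    """Split query words into positive and negative examples by Q&A click ratio."""
    total = defaultdict(int)
    qa = defaultdict(int)
    for query, site in click_log:
        q = normalize(query)
        total[q] += 1
        if site in qa_sites:
            qa[q] += 1
    positives, negatives = [], []
    for q, n in total.items():
        ratio = qa[q] / n  # question-answer click ratio for this query word
        if ratio > first_threshold:
            positives.append(q)
        elif ratio < second_threshold:
            negatives.append(q)
        # Query words between the two thresholds are left unlabeled.
    return positives, negatives


log = [
    ("what is bert?", "qa.example.com"),
    ("what is bert", "qa.example.com"),
    ("what is bert", "news.example.com"),
    ("weather", "news.example.com"),
    ("weather", "news.example.com"),
]
print(label_by_click_ratio(log, {"qa.example.com"}))
```

Using two separate thresholds leaves a band of ambiguous query words out of the training data, which is consistent with the claim wording of "greater than the first threshold" and "lower than the second threshold".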
CN201910707322.2A 2019-08-01 2019-08-01 Method and device for generating fine selection abstract of search engine Pending CN112307314A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910707322.2A CN112307314A (en) 2019-08-01 2019-08-01 Method and device for generating fine selection abstract of search engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910707322.2A CN112307314A (en) 2019-08-01 2019-08-01 Method and device for generating fine selection abstract of search engine

Publications (1)

Publication Number Publication Date
CN112307314A true CN112307314A (en) 2021-02-02

Family

ID=74486345

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910707322.2A Pending CN112307314A (en) 2019-08-01 2019-08-01 Method and device for generating fine selection abstract of search engine

Country Status (1)

Country Link
CN (1) CN112307314A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113609296A (en) * 2021-08-23 2021-11-05 南京擎盾信息科技有限公司 Data processing method and device for public opinion data identification
CN113609296B (en) * 2021-08-23 2022-09-06 南京擎盾信息科技有限公司 Data processing method and device for public opinion data identification
CN115130022A (en) * 2022-07-04 2022-09-30 北京字跳网络技术有限公司 Content search method, device, equipment and medium
CN116501960A (en) * 2023-04-18 2023-07-28 百度在线网络技术(北京)有限公司 Content retrieval method, device, equipment and medium
CN116501960B (en) * 2023-04-18 2024-03-15 百度在线网络技术(北京)有限公司 Content retrieval method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN110399457B (en) Intelligent question answering method and system
US10565533B2 (en) Systems and methods for similarity and context measures for trademark and service mark analysis and repository searches
CN109145153B (en) Intention category identification method and device
CN106649818B (en) Application search intention identification method and device, application search method and server
US11100124B2 (en) Systems and methods for similarity and context measures for trademark and service mark analysis and repository searches
US9767144B2 (en) Search system with query refinement
CN103514299B (en) Information search method and device
WO2020237856A1 (en) Smart question and answer method and apparatus based on knowledge graph, and computer storage medium
CN111767716B (en) Method and device for determining enterprise multi-level industry information and computer equipment
CN110222045A (en) A kind of data sheet acquisition methods, device and computer equipment, storage medium
CN102402604A (en) Effective Forward Ordering Of Search Engine
CN102495892A (en) Webpage information extraction method
CN109472022B (en) New word recognition method based on machine learning and terminal equipment
US9569525B2 (en) Techniques for entity-level technology recommendation
CN110765761A (en) Contract sensitive word checking method and device based on artificial intelligence and storage medium
CA3138556A1 (en) Apparatuses, storage medium and method of querying data based on vertical search
CN112307314A (en) Method and device for generating fine selection abstract of search engine
CN112256845A (en) Intention recognition method, device, electronic equipment and computer readable storage medium
CN113468339B (en) Label extraction method and system based on knowledge graph, electronic equipment and medium
CN105956181A (en) Searching method and apparatus
CN113988057A (en) Title generation method, device, equipment and medium based on concept extraction
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN112711666B (en) Futures label extraction method and device
CN117350302A (en) Semantic analysis-based language writing text error correction method, system and man-machine interaction device
CN110851560B (en) Information retrieval method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination